* [RFC PATCH 0/5] The power allocator thermal governor
@ 2014-05-06 12:06 Javi Merino
2014-05-06 12:06 ` [RFC PATCH 1/5] thermal: let governors have private data for each thermal zone Javi Merino
` (5 more replies)
0 siblings, 6 replies; 17+ messages in thread
From: Javi Merino @ 2014-05-06 12:06 UTC (permalink / raw)
To: linux-pm; +Cc: Punit.Agrawal, Javi Merino
Hi,
This RFC is our first stab at implementing the ideas we proposed in
[1]. The power allocator governor allocates device power to control
temperature. This requires transforming performance requests into
requested power, which we do with the aid of power models. Patch 3
(thermal: add a basic cpu power actor) implements a simple power model
for cpus. The division of power between the actors ensures that power
is allocated where it is needed the most, based on the current
workload.
[1] http://article.gmane.org/gmane.linux.power-management.general/43243
Patches 1 and 2 were previously sent as separate patches: [2] and
[3]. There was a similar patch to "thermal: provide a way for the
governors to have private data for each thermal zone" posted by Wei Ni
but we can't use it as it is because it doesn't allow the governor's
bind() function (start() in Wei's patch) to modify the governor_data.
Other than that, this patch incorporates Rui's suggestion of keeping
the original governor if the new governor fails to bind.
[2] http://article.gmane.org/gmane.linux.power-management.general/43624
[3] https://patchwork.kernel.org/patch/3859721/
It's an RFC because we still need to tune it to achieve the
temperature stability that we are aiming for. This RFC is just meant
to remove major road blocks as early as possible (if any).
We'll provide integration instructions soon so that people can test it
in their platforms.
Cheers,
Javi & Punit
Javi Merino (5):
thermal: let governors have private data for each thermal zone
thermal: let cooling devices operate on other units other than state
thermal: add a basic cpu power actor
thermal: introduce the Power Allocator governor
thermal: add trace events to the power allocator governor
drivers/acpi/fan.c | 22 +-
drivers/acpi/processor_thermal.c | 23 +-
drivers/acpi/video.c | 22 +-
drivers/platform/x86/acerhdf.c | 24 +-
drivers/platform/x86/intel_menlow.c | 26 +-
drivers/power/power_supply_core.c | 24 +-
drivers/thermal/Kconfig | 14 +
drivers/thermal/Makefile | 1 +
drivers/thermal/cpu_cooling.c | 427 +++++++++++++++++++++++++++++--
drivers/thermal/db8500_thermal.c | 2 +-
drivers/thermal/fair_share.c | 2 +-
drivers/thermal/intel_powerclamp.c | 24 +-
drivers/thermal/power_allocator.c | 413 ++++++++++++++++++++++++++++++
drivers/thermal/step_wise.c | 2 +-
drivers/thermal/thermal_core.c | 91 +++++--
drivers/thermal/thermal_core.h | 8 +
include/linux/cpu_cooling.h | 28 ++
include/linux/thermal.h | 25 +-
include/trace/events/thermal.h | 34 +++
include/trace/events/thermal_governor.h | 35 +++
20 files changed, 1153 insertions(+), 94 deletions(-)
create mode 100644 drivers/thermal/power_allocator.c
create mode 100644 include/trace/events/thermal.h
create mode 100644 include/trace/events/thermal_governor.h
--
1.7.9.5
^ permalink raw reply [flat|nested] 17+ messages in thread
* [RFC PATCH 1/5] thermal: let governors have private data for each thermal zone
2014-05-06 12:06 [RFC PATCH 0/5] The power allocator thermal governor Javi Merino
@ 2014-05-06 12:06 ` Javi Merino
2014-05-13 14:31 ` edubezval
2014-05-06 12:06 ` [RFC PATCH 2/5] thermal: let cooling devices operate on other units other than state Javi Merino
` (4 subsequent siblings)
5 siblings, 1 reply; 17+ messages in thread
From: Javi Merino @ 2014-05-06 12:06 UTC (permalink / raw)
To: linux-pm; +Cc: Punit.Agrawal, Javi Merino, Zhang Rui, Eduardo Valentin
A governor may need to store its current state between calls to
throttle(). That state depends on the thermal zone, so store it as
private data in struct thermal_zone_device.
The governors may have two new ops: bind_to_tz() and unbind_from_tz().
When provided, these functions let governors do some initialization
and teardown when they are bound/unbound to a tz and possibly store that
information in the governor_data field of the struct
thermal_zone_device.
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Eduardo Valentin <edubezval@gmail.com>
Signed-off-by: Javi Merino <javi.merino@arm.com>
---
drivers/thermal/thermal_core.c | 67 +++++++++++++++++++++++++++++++++++-----
include/linux/thermal.h | 3 ++
2 files changed, 62 insertions(+), 8 deletions(-)
diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index 71b0ec0c370d..ac8e5990a63e 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -72,6 +72,45 @@ static struct thermal_governor *__find_governor(const char *name)
return NULL;
}
+/**
+ * thermal_set_governor() - Switch to another governor
+ * @tz: a valid pointer to a struct thermal_zone_device
+ * @new_gov: pointer to the new governor
+ *
+ * Change the governor of thermal zone @tz.
+ *
+ * Returns 0 on success, an error if the new governor's bind_to_tz() failed.
+ */
+static int thermal_set_governor(struct thermal_zone_device *tz,
+ struct thermal_governor *new_gov)
+{
+ int ret = 0;
+
+ if (tz->governor && tz->governor->unbind_from_tz)
+ tz->governor->unbind_from_tz(tz);
+
+ if (new_gov && new_gov->bind_to_tz) {
+ ret = new_gov->bind_to_tz(tz);
+ if (ret) {
+ /* Try to register the old governor again */
+ if (tz->governor && tz->governor->bind_to_tz) {
+ if (tz->governor->bind_to_tz(tz))
+ /*
+ * The new governor failed to register
+ * and the previous one failed as well
+ */
+ tz->governor = NULL;
+ }
+
+ return ret;
+ }
+ }
+
+ tz->governor = new_gov;
+
+ return ret;
+}
+
int thermal_register_governor(struct thermal_governor *governor)
{
int err;
@@ -104,8 +143,12 @@ int thermal_register_governor(struct thermal_governor *governor)
name = pos->tzp->governor_name;
- if (!strnicmp(name, governor->name, THERMAL_NAME_LENGTH))
- pos->governor = governor;
+ if (!strnicmp(name, governor->name, THERMAL_NAME_LENGTH)) {
+ int ret = thermal_set_governor(pos, governor);
+ if (ret)
+ pr_warn("Failed to set governor %s for zone %d: %d\n",
+ governor->name, pos->id, ret);
+ }
}
mutex_unlock(&thermal_list_lock);
@@ -131,7 +174,7 @@ void thermal_unregister_governor(struct thermal_governor *governor)
list_for_each_entry(pos, &thermal_tz_list, node) {
if (!strnicmp(pos->governor->name, governor->name,
THERMAL_NAME_LENGTH))
- pos->governor = NULL;
+ thermal_set_governor(pos, NULL);
}
mutex_unlock(&thermal_list_lock);
@@ -756,8 +799,9 @@ policy_store(struct device *dev, struct device_attribute *attr,
if (!gov)
goto exit;
- tz->governor = gov;
- ret = count;
+ ret = thermal_set_governor(tz, gov);
+ if (!ret)
+ ret = count;
exit:
mutex_unlock(&thermal_governor_lock);
@@ -1452,6 +1496,7 @@ struct thermal_zone_device *thermal_zone_device_register(const char *type,
int result;
int count;
int passive = 0;
+ struct thermal_governor *governor;
if (type && strlen(type) >= THERMAL_NAME_LENGTH)
return ERR_PTR(-EINVAL);
@@ -1542,9 +1587,15 @@ struct thermal_zone_device *thermal_zone_device_register(const char *type,
mutex_lock(&thermal_governor_lock);
if (tz->tzp)
- tz->governor = __find_governor(tz->tzp->governor_name);
+ governor = __find_governor(tz->tzp->governor_name);
else
- tz->governor = def_governor;
+ governor = def_governor;
+
+ result = thermal_set_governor(tz, governor);
+ if (result) {
+ mutex_unlock(&thermal_governor_lock);
+ goto unregister;
+ }
mutex_unlock(&thermal_governor_lock);
@@ -1634,7 +1685,7 @@ void thermal_zone_device_unregister(struct thermal_zone_device *tz)
device_remove_file(&tz->device, &dev_attr_mode);
device_remove_file(&tz->device, &dev_attr_policy);
remove_trip_attrs(tz);
- tz->governor = NULL;
+ thermal_set_governor(tz, NULL);
thermal_remove_hwmon_sysfs(tz);
release_idr(&thermal_tz_idr, &thermal_idr_lock, tz->id);
diff --git a/include/linux/thermal.h b/include/linux/thermal.h
index f7e11c7ea7d9..baac212f815e 100644
--- a/include/linux/thermal.h
+++ b/include/linux/thermal.h
@@ -177,6 +177,7 @@ struct thermal_zone_device {
struct thermal_zone_device_ops *ops;
const struct thermal_zone_params *tzp;
struct thermal_governor *governor;
+ void *governor_data;
struct list_head thermal_instances;
struct idr idr;
struct mutex lock; /* protect thermal_instances list */
@@ -187,6 +188,8 @@ struct thermal_zone_device {
/* Structure that holds thermal governor information */
struct thermal_governor {
char name[THERMAL_NAME_LENGTH];
+ int (*bind_to_tz)(struct thermal_zone_device *tz);
+ void (*unbind_from_tz)(struct thermal_zone_device *tz);
int (*throttle)(struct thermal_zone_device *tz, int trip);
struct list_head governor_list;
};
--
1.7.9.5
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [RFC PATCH 2/5] thermal: let cooling devices operate on other units other than state
2014-05-06 12:06 [RFC PATCH 0/5] The power allocator thermal governor Javi Merino
2014-05-06 12:06 ` [RFC PATCH 1/5] thermal: let governors have private data for each thermal zone Javi Merino
@ 2014-05-06 12:06 ` Javi Merino
2014-05-13 15:22 ` edubezval
2014-05-06 12:06 ` [RFC PATCH 3/5] thermal: add a basic cpu power actor Javi Merino
` (3 subsequent siblings)
5 siblings, 1 reply; 17+ messages in thread
From: Javi Merino @ 2014-05-06 12:06 UTC (permalink / raw)
To: linux-pm; +Cc: Punit.Agrawal, Javi Merino, Zhang Rui
Currently, cooling devices operate on state, even though some (like
the cpufreq cooling device) have exported functions to convert
frequency to state. Generalize this interface so that governors can
operate on other units (e.g. frequency, power,...).
Suggested-by: Eduardo Valentin <edubezval@gmail.com>
Cc: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Javi Merino <javi.merino@arm.com>
---
drivers/acpi/fan.c | 22 ++++++++++++++------
drivers/acpi/processor_thermal.c | 23 +++++++++++++++------
drivers/acpi/video.c | 22 ++++++++++++++------
drivers/platform/x86/acerhdf.c | 24 ++++++++++++++++------
drivers/platform/x86/intel_menlow.c | 26 +++++++++++++++++-------
drivers/power/power_supply_core.c | 24 ++++++++++++++++------
drivers/thermal/cpu_cooling.c | 38 ++++++++++++++++++++++++++---------
drivers/thermal/db8500_thermal.c | 2 +-
drivers/thermal/fair_share.c | 2 +-
drivers/thermal/intel_powerclamp.c | 24 ++++++++++++++++------
drivers/thermal/step_wise.c | 2 +-
drivers/thermal/thermal_core.c | 13 ++++++------
include/linux/thermal.h | 13 +++++++++---
13 files changed, 169 insertions(+), 66 deletions(-)
diff --git a/drivers/acpi/fan.c b/drivers/acpi/fan.c
index 8acf53e62966..55346deb8202 100644
--- a/drivers/acpi/fan.c
+++ b/drivers/acpi/fan.c
@@ -79,20 +79,26 @@ static struct acpi_driver acpi_fan_driver = {
/* thermal cooling device callbacks */
static int fan_get_max_state(struct thermal_cooling_device *cdev, unsigned long
- *state)
+ *state, enum thermal_cooling_unit unit)
{
+ if (unit != THERMAL_UNIT_STATE)
+ return -EINVAL;
+
/* ACPI fan device only support two states: ON/OFF */
*state = 1;
return 0;
}
static int fan_get_cur_state(struct thermal_cooling_device *cdev, unsigned long
- *state)
+ *state, enum thermal_cooling_unit unit)
{
struct acpi_device *device = cdev->devdata;
int result;
int acpi_state = ACPI_STATE_D0;
+ if (unit != THERMAL_UNIT_STATE)
+ return -EINVAL;
+
if (!device)
return -EINVAL;
@@ -106,11 +112,15 @@ static int fan_get_cur_state(struct thermal_cooling_device *cdev, unsigned long
}
static int
-fan_set_cur_state(struct thermal_cooling_device *cdev, unsigned long state)
+fan_set_cur_state(struct thermal_cooling_device *cdev, unsigned long state,
+ enum thermal_cooling_unit unit)
{
struct acpi_device *device = cdev->devdata;
int result;
+ if (unit != THERMAL_UNIT_STATE)
+ return -EINVAL;
+
if (!device || (state != 0 && state != 1))
return -EINVAL;
@@ -121,9 +131,9 @@ fan_set_cur_state(struct thermal_cooling_device *cdev, unsigned long state)
}
static const struct thermal_cooling_device_ops fan_cooling_ops = {
- .get_max_state = fan_get_max_state,
- .get_cur_state = fan_get_cur_state,
- .set_cur_state = fan_set_cur_state,
+ .get_max = fan_get_max_state,
+ .get_cur = fan_get_cur_state,
+ .set_cur = fan_set_cur_state,
};
/* --------------------------------------------------------------------------
diff --git a/drivers/acpi/processor_thermal.c b/drivers/acpi/processor_thermal.c
index e003663b2f8e..48179ed9afe5 100644
--- a/drivers/acpi/processor_thermal.c
+++ b/drivers/acpi/processor_thermal.c
@@ -200,11 +200,15 @@ static int acpi_processor_max_state(struct acpi_processor *pr)
}
static int
processor_get_max_state(struct thermal_cooling_device *cdev,
- unsigned long *state)
+ unsigned long *state,
+ enum thermal_cooling_unit unit)
{
struct acpi_device *device = cdev->devdata;
struct acpi_processor *pr;
+ if (unit != THERMAL_UNIT_STATE)
+ return -EINVAL;
+
if (!device)
return -EINVAL;
@@ -218,11 +222,15 @@ processor_get_max_state(struct thermal_cooling_device *cdev,
static int
processor_get_cur_state(struct thermal_cooling_device *cdev,
- unsigned long *cur_state)
+ unsigned long *cur_state,
+ enum thermal_cooling_unit unit)
{
struct acpi_device *device = cdev->devdata;
struct acpi_processor *pr;
+ if (unit != THERMAL_UNIT_STATE)
+ return -EINVAL;
+
if (!device)
return -EINVAL;
@@ -238,13 +246,16 @@ processor_get_cur_state(struct thermal_cooling_device *cdev,
static int
processor_set_cur_state(struct thermal_cooling_device *cdev,
- unsigned long state)
+ unsigned long state, enum thermal_cooling_unit unit)
{
struct acpi_device *device = cdev->devdata;
struct acpi_processor *pr;
int result = 0;
int max_pstate;
+ if (unit != THERMAL_UNIT_STATE)
+ return -EINVAL;
+
if (!device)
return -EINVAL;
@@ -270,7 +281,7 @@ processor_set_cur_state(struct thermal_cooling_device *cdev,
}
const struct thermal_cooling_device_ops processor_cooling_ops = {
- .get_max_state = processor_get_max_state,
- .get_cur_state = processor_get_cur_state,
- .set_cur_state = processor_set_cur_state,
+ .get_max = processor_get_max_state,
+ .get_cur = processor_get_cur_state,
+ .set_cur = processor_set_cur_state,
};
diff --git a/drivers/acpi/video.c b/drivers/acpi/video.c
index 8b6990e417ec..f0aeb4d1ba91 100644
--- a/drivers/acpi/video.c
+++ b/drivers/acpi/video.c
@@ -282,23 +282,29 @@ static const struct backlight_ops acpi_backlight_ops = {
/* thermal cooling device callbacks */
static int video_get_max_state(struct thermal_cooling_device *cooling_dev, unsigned
- long *state)
+ long *state, enum thermal_cooling_unit unit)
{
struct acpi_device *device = cooling_dev->devdata;
struct acpi_video_device *video = acpi_driver_data(device);
+ if (unit != THERMAL_UNIT_STATE)
+ return -EINVAL;
+
*state = video->brightness->count - 3;
return 0;
}
static int video_get_cur_state(struct thermal_cooling_device *cooling_dev, unsigned
- long *state)
+ long *state, enum thermal_cooling_unit unit)
{
struct acpi_device *device = cooling_dev->devdata;
struct acpi_video_device *video = acpi_driver_data(device);
unsigned long long level;
int offset;
+ if (unit != THERMAL_UNIT_STATE)
+ return -EINVAL;
+
if (acpi_video_device_lcd_get_level_current(video, &level, false))
return -EINVAL;
for (offset = 2; offset < video->brightness->count; offset++)
@@ -311,12 +317,16 @@ static int video_get_cur_state(struct thermal_cooling_device *cooling_dev, unsig
}
static int
-video_set_cur_state(struct thermal_cooling_device *cooling_dev, unsigned long state)
+video_set_cur_state(struct thermal_cooling_device *cooling_dev,
+ unsigned long state, enum thermal_cooling_unit unit)
{
struct acpi_device *device = cooling_dev->devdata;
struct acpi_video_device *video = acpi_driver_data(device);
int level;
+ if (unit != THERMAL_UNIT_STATE)
+ return -EINVAL;
+
if (state >= video->brightness->count - 2)
return -EINVAL;
@@ -326,9 +336,9 @@ video_set_cur_state(struct thermal_cooling_device *cooling_dev, unsigned long st
}
static const struct thermal_cooling_device_ops video_cooling_ops = {
- .get_max_state = video_get_max_state,
- .get_cur_state = video_get_cur_state,
- .set_cur_state = video_set_cur_state,
+ .get_max = video_get_max_state,
+ .get_cur = video_get_cur_state,
+ .set_cur = video_set_cur_state,
};
/*
diff --git a/drivers/platform/x86/acerhdf.c b/drivers/platform/x86/acerhdf.c
index f94467c05225..0bb85e240d93 100644
--- a/drivers/platform/x86/acerhdf.c
+++ b/drivers/platform/x86/acerhdf.c
@@ -439,18 +439,26 @@ static struct thermal_zone_device_ops acerhdf_dev_ops = {
* get maximal fan cooling state
*/
static int acerhdf_get_max_state(struct thermal_cooling_device *cdev,
- unsigned long *state)
+ unsigned long *state,
+ enum thermal_cooling_unit unit)
{
+ if (unit != THERMAL_UNIT_STATE)
+ return -EINVAL;
+
*state = 1;
return 0;
}
static int acerhdf_get_cur_state(struct thermal_cooling_device *cdev,
- unsigned long *state)
+ unsigned long *state,
+ enum thermal_cooling_unit unit)
{
int err = 0, tmp;
+ if (unit != THERMAL_UNIT_STATE)
+ return -EINVAL;
+
err = acerhdf_get_fanstate(&tmp);
if (err)
return err;
@@ -461,10 +469,14 @@ static int acerhdf_get_cur_state(struct thermal_cooling_device *cdev,
/* change current fan state - is overwritten when running in kernel mode */
static int acerhdf_set_cur_state(struct thermal_cooling_device *cdev,
- unsigned long state)
+ unsigned long state,
+ enum thermal_cooling_unit unit)
{
int cur_temp, cur_state, err = 0;
+ if (unit != THERMAL_UNIT_STATE)
+ return -EINVAL;
+
if (!kernelmode)
return 0;
@@ -498,9 +510,9 @@ err_out:
/* bind fan callbacks to fan device */
static struct thermal_cooling_device_ops acerhdf_cooling_ops = {
- .get_max_state = acerhdf_get_max_state,
- .get_cur_state = acerhdf_get_cur_state,
- .set_cur_state = acerhdf_set_cur_state,
+ .get_max = acerhdf_get_max_state,
+ .get_cur = acerhdf_get_cur_state,
+ .set_cur = acerhdf_set_cur_state,
};
/* suspend / resume functionality */
diff --git a/drivers/platform/x86/intel_menlow.c b/drivers/platform/x86/intel_menlow.c
index e8b46d2c468c..9792da407c46 100644
--- a/drivers/platform/x86/intel_menlow.c
+++ b/drivers/platform/x86/intel_menlow.c
@@ -61,7 +61,8 @@ static void intel_menlow_unregister_sensor(void);
* GTHS returning '0' would mean that no bandwidth control states are supported
*/
static int memory_get_max_bandwidth(struct thermal_cooling_device *cdev,
- unsigned long *max_state)
+ unsigned long *max_state,
+ enum thermal_cooling_unit unit)
{
struct acpi_device *device = cdev->devdata;
acpi_handle handle = device->handle;
@@ -70,6 +71,9 @@ static int memory_get_max_bandwidth(struct thermal_cooling_device *cdev,
union acpi_object arg;
acpi_status status = AE_OK;
+ if (unit != THERMAL_UNIT_STATE)
+ return -EINVAL;
+
arg_list.count = 1;
arg_list.pointer = &arg;
arg.type = ACPI_TYPE_INTEGER;
@@ -87,7 +91,8 @@ static int memory_get_max_bandwidth(struct thermal_cooling_device *cdev,
}
static int memory_get_cur_bandwidth(struct thermal_cooling_device *cdev,
- unsigned long *value)
+ unsigned long *value,
+ enum thermal_cooling_unit unit)
{
struct acpi_device *device = cdev->devdata;
acpi_handle handle = device->handle;
@@ -96,6 +101,9 @@ static int memory_get_cur_bandwidth(struct thermal_cooling_device *cdev,
union acpi_object arg;
acpi_status status = AE_OK;
+ if (unit != THERMAL_UNIT_STATE)
+ return -EINVAL;
+
arg_list.count = 1;
arg_list.pointer = &arg;
arg.type = ACPI_TYPE_INTEGER;
@@ -110,7 +118,8 @@ static int memory_get_cur_bandwidth(struct thermal_cooling_device *cdev,
}
static int memory_set_cur_bandwidth(struct thermal_cooling_device *cdev,
- unsigned long state)
+ unsigned long state,
+ enum thermal_cooling_unit unit)
{
struct acpi_device *device = cdev->devdata;
acpi_handle handle = device->handle;
@@ -120,7 +129,10 @@ static int memory_set_cur_bandwidth(struct thermal_cooling_device *cdev,
unsigned long long temp;
unsigned long max_state;
- if (memory_get_max_bandwidth(cdev, &max_state))
+ if (unit != THERMAL_UNIT_STATE)
+ return -EINVAL;
+
+ if (memory_get_max_bandwidth(cdev, &max_state, THERMAL_UNIT_STATE))
return -EFAULT;
if (state > max_state)
@@ -143,9 +155,9 @@ static int memory_set_cur_bandwidth(struct thermal_cooling_device *cdev,
}
static struct thermal_cooling_device_ops memory_cooling_ops = {
- .get_max_state = memory_get_max_bandwidth,
- .get_cur_state = memory_get_cur_bandwidth,
- .set_cur_state = memory_set_cur_bandwidth,
+ .get_max = memory_get_max_bandwidth,
+ .get_cur = memory_get_cur_bandwidth,
+ .set_cur = memory_set_cur_bandwidth,
};
/*
diff --git a/drivers/power/power_supply_core.c b/drivers/power/power_supply_core.c
index 26606641fe44..69dc2901a788 100644
--- a/drivers/power/power_supply_core.c
+++ b/drivers/power/power_supply_core.c
@@ -440,12 +440,16 @@ static void psy_unregister_thermal(struct power_supply *psy)
/* thermal cooling device callbacks */
static int ps_get_max_charge_cntl_limit(struct thermal_cooling_device *tcd,
- unsigned long *state)
+ unsigned long *state,
+ enum thermal_cooling_unit unit)
{
struct power_supply *psy;
union power_supply_propval val;
int ret;
+ if (unit != THERMAL_UNIT_STATE)
+ return -EINVAL;
+
psy = tcd->devdata;
ret = psy->get_property(psy,
POWER_SUPPLY_PROP_CHARGE_CONTROL_LIMIT_MAX, &val);
@@ -456,12 +460,16 @@ static int ps_get_max_charge_cntl_limit(struct thermal_cooling_device *tcd,
}
static int ps_get_cur_chrage_cntl_limit(struct thermal_cooling_device *tcd,
- unsigned long *state)
+ unsigned long *state,
+ enum thermal_cooling_unit unit)
{
struct power_supply *psy;
union power_supply_propval val;
int ret;
+ if (unit != THERMAL_UNIT_STATE)
+ return -EINVAL;
+
psy = tcd->devdata;
ret = psy->get_property(psy,
POWER_SUPPLY_PROP_CHARGE_CONTROL_LIMIT, &val);
@@ -472,12 +480,16 @@ static int ps_get_cur_chrage_cntl_limit(struct thermal_cooling_device *tcd,
}
static int ps_set_cur_charge_cntl_limit(struct thermal_cooling_device *tcd,
- unsigned long state)
+ unsigned long state,
+ enum thermal_cooling_unit unit)
{
struct power_supply *psy;
union power_supply_propval val;
int ret;
+ if (unit != THERMAL_UNIT_STATE)
+ return -EINVAL;
+
psy = tcd->devdata;
val.intval = state;
ret = psy->set_property(psy,
@@ -487,9 +499,9 @@ static int ps_set_cur_charge_cntl_limit(struct thermal_cooling_device *tcd,
}
static struct thermal_cooling_device_ops psy_tcd_ops = {
- .get_max_state = ps_get_max_charge_cntl_limit,
- .get_cur_state = ps_get_cur_chrage_cntl_limit,
- .set_cur_state = ps_set_cur_charge_cntl_limit,
+ .get_max = ps_get_max_charge_cntl_limit,
+ .get_cur = ps_get_cur_chrage_cntl_limit,
+ .set_cur = ps_set_cur_charge_cntl_limit,
};
static int psy_register_cooler(struct power_supply *psy)
diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
index 4246262c4bd2..f3f4e6b5798e 100644
--- a/drivers/thermal/cpu_cooling.c
+++ b/drivers/thermal/cpu_cooling.c
@@ -348,14 +348,17 @@ static int cpufreq_thermal_notifier(struct notifier_block *nb,
* cpufreq_get_max_state - callback function to get the max cooling state.
* @cdev: thermal cooling device pointer.
* @state: fill this variable with the max cooling state.
+ * @unit: the units in which @state should be in.
*
- * Callback for the thermal cooling device to return the cpufreq
- * max cooling state.
+ * Callback for the thermal cooling device to return the cpufreq max
+ * cooling state. Currently only THERMAL_UNIT_STATE is valid for
+ * @unit.
*
* Return: 0 on success, an error code otherwise.
*/
static int cpufreq_get_max_state(struct thermal_cooling_device *cdev,
- unsigned long *state)
+ unsigned long *state,
+ enum thermal_cooling_unit unit)
{
struct cpufreq_cooling_device *cpufreq_device = cdev->devdata;
struct cpumask *mask = &cpufreq_device->allowed_cpus;
@@ -363,6 +366,9 @@ static int cpufreq_get_max_state(struct thermal_cooling_device *cdev,
unsigned int count = 0;
int ret;
+ if (unit != THERMAL_UNIT_STATE)
+ return -EINVAL;
+
cpu = cpumask_any(mask);
ret = get_property(cpu, 0, &count, GET_MAXL);
@@ -377,17 +383,23 @@ static int cpufreq_get_max_state(struct thermal_cooling_device *cdev,
* cpufreq_get_cur_state - callback function to get the current cooling state.
* @cdev: thermal cooling device pointer.
* @state: fill this variable with the current cooling state.
+ * @unit: the units in which state should be in
*
* Callback for the thermal cooling device to return the cpufreq
- * current cooling state.
+ * current cooling state. Currently only THERMAL_UNIT_STATE is valid
+ * for @unit.
*
* Return: 0 on success, an error code otherwise.
*/
static int cpufreq_get_cur_state(struct thermal_cooling_device *cdev,
- unsigned long *state)
+ unsigned long *state,
+ enum thermal_cooling_unit unit)
{
struct cpufreq_cooling_device *cpufreq_device = cdev->devdata;
+ if (unit != THERMAL_UNIT_STATE)
+ return -EINVAL;
+
*state = cpufreq_device->cpufreq_state;
return 0;
@@ -397,25 +409,31 @@ static int cpufreq_get_cur_state(struct thermal_cooling_device *cdev,
* cpufreq_set_cur_state - callback function to set the current cooling state.
* @cdev: thermal cooling device pointer.
* @state: set this variable to the current cooling state.
+ * @unit: the units in which @state should be in.
*
* Callback for the thermal cooling device to change the cpufreq
- * current cooling state.
+ * current cooling state. Currently only THERMAL_UNIT_STATE is valid for
+ * @unit.
*
* Return: 0 on success, an error code otherwise.
*/
static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
- unsigned long state)
+ unsigned long state,
+ enum thermal_cooling_unit unit)
{
struct cpufreq_cooling_device *cpufreq_device = cdev->devdata;
+ if (unit != THERMAL_UNIT_STATE)
+ return -EINVAL;
+
return cpufreq_apply_cooling(cpufreq_device, state);
}
/* Bind cpufreq callbacks to thermal cooling device ops */
static struct thermal_cooling_device_ops const cpufreq_cooling_ops = {
- .get_max_state = cpufreq_get_max_state,
- .get_cur_state = cpufreq_get_cur_state,
- .set_cur_state = cpufreq_set_cur_state,
+ .get_max = cpufreq_get_max_state,
+ .get_cur = cpufreq_get_cur_state,
+ .set_cur = cpufreq_set_cur_state,
};
/* Notifier for cpufreq policy change */
diff --git a/drivers/thermal/db8500_thermal.c b/drivers/thermal/db8500_thermal.c
index 1e3b3bf9f993..0842a2ef88b4 100644
--- a/drivers/thermal/db8500_thermal.c
+++ b/drivers/thermal/db8500_thermal.c
@@ -67,7 +67,7 @@ static int db8500_cdev_bind(struct thermal_zone_device *thermal,
unsigned long max_state, upper, lower;
int i, ret = -EINVAL;
- cdev->ops->get_max_state(cdev, &max_state);
+ cdev->ops->get_max(cdev, &max_state, THERMAL_UNIT_STATE);
for (i = 0; i < ptrips->num_trips; i++) {
if (db8500_thermal_match_cdev(cdev, &ptrips->trip_points[i]))
diff --git a/drivers/thermal/fair_share.c b/drivers/thermal/fair_share.c
index 944ba2f340c8..8e6eb4883bb6 100644
--- a/drivers/thermal/fair_share.c
+++ b/drivers/thermal/fair_share.c
@@ -51,7 +51,7 @@ static long get_target_state(struct thermal_zone_device *tz,
{
unsigned long max_state;
- cdev->ops->get_max_state(cdev, &max_state);
+ cdev->ops->get_max(cdev, &max_state, THERMAL_UNIT_STATE);
return (long)(weight * level * max_state) / (100 * tz->trips);
}
diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
index a084325f1386..c32bee16b17a 100644
--- a/drivers/thermal/intel_powerclamp.c
+++ b/drivers/thermal/intel_powerclamp.c
@@ -616,16 +616,24 @@ static struct notifier_block powerclamp_cpu_notifier = {
};
static int powerclamp_get_max_state(struct thermal_cooling_device *cdev,
- unsigned long *state)
+ unsigned long *state,
+ enum thermal_cooling_unit unit)
{
+ if (unit != THERMAL_UNIT_STATE)
+ return -EINVAL;
+
*state = MAX_TARGET_RATIO;
return 0;
}
static int powerclamp_get_cur_state(struct thermal_cooling_device *cdev,
- unsigned long *state)
+ unsigned long *state,
+ enum thermal_cooling_unit unit)
{
+ if (unit != THERMAL_UNIT_STATE)
+ return -EINVAL;
+
if (true == clamping)
*state = pkg_cstate_ratio_cur;
else
@@ -636,10 +644,14 @@ static int powerclamp_get_cur_state(struct thermal_cooling_device *cdev,
}
static int powerclamp_set_cur_state(struct thermal_cooling_device *cdev,
- unsigned long new_target_ratio)
+ unsigned long new_target_ratio,
+ enum thermal_cooling_unit unit)
{
int ret = 0;
+ if (unit != THERMAL_UNIT_STATE)
+ return -EINVAL;
+
new_target_ratio = clamp(new_target_ratio, 0UL,
(unsigned long) (MAX_TARGET_RATIO-1));
if (set_target_ratio == 0 && new_target_ratio > 0) {
@@ -663,9 +675,9 @@ exit_set:
/* bind to generic thermal layer as cooling device*/
static struct thermal_cooling_device_ops powerclamp_cooling_ops = {
- .get_max_state = powerclamp_get_max_state,
- .get_cur_state = powerclamp_get_cur_state,
- .set_cur_state = powerclamp_set_cur_state,
+ .get_max = powerclamp_get_max_state,
+ .get_cur = powerclamp_get_cur_state,
+ .set_cur = powerclamp_set_cur_state,
};
/* runs on Nehalem and later */
diff --git a/drivers/thermal/step_wise.c b/drivers/thermal/step_wise.c
index f251521baaa2..2a2cd4eba3c3 100644
--- a/drivers/thermal/step_wise.c
+++ b/drivers/thermal/step_wise.c
@@ -58,7 +58,7 @@ static unsigned long get_target_state(struct thermal_instance *instance,
* Otherwise, we use the current state of the
* cdev in use to determine the next_target.
*/
- cdev->ops->get_cur_state(cdev, &cur_state);
+ cdev->ops->get_cur(cdev, &cur_state, THERMAL_UNIT_STATE);
next_target = instance->target;
dev_dbg(&cdev->device, "cur_state=%ld\n", cur_state);
diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index ac8e5990a63e..5b46603cc7cd 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -871,7 +871,7 @@ thermal_cooling_device_max_state_show(struct device *dev,
unsigned long state;
int ret;
- ret = cdev->ops->get_max_state(cdev, &state);
+ ret = cdev->ops->get_max(cdev, &state, THERMAL_UNIT_STATE);
if (ret)
return ret;
return sprintf(buf, "%ld\n", state);
@@ -885,7 +885,7 @@ thermal_cooling_device_cur_state_show(struct device *dev,
unsigned long state;
int ret;
- ret = cdev->ops->get_cur_state(cdev, &state);
+ ret = cdev->ops->get_cur(cdev, &state, THERMAL_UNIT_STATE);
if (ret)
return ret;
return sprintf(buf, "%ld\n", state);
@@ -906,7 +906,7 @@ thermal_cooling_device_cur_state_store(struct device *dev,
if ((long)state < 0)
return -EINVAL;
- result = cdev->ops->set_cur_state(cdev, state);
+ result = cdev->ops->set_cur(cdev, state, THERMAL_UNIT_STATE);
if (result)
return result;
return count;
@@ -983,7 +983,7 @@ int thermal_zone_bind_cooling_device(struct thermal_zone_device *tz,
if (tz != pos1 || cdev != pos2)
return -EINVAL;
- cdev->ops->get_max_state(cdev, &max_state);
+ cdev->ops->get_max(cdev, &max_state, THERMAL_UNIT_STATE);
/* lower default 0, upper default max_state */
lower = lower == THERMAL_NO_LIMIT ? 0 : lower;
@@ -1143,8 +1143,7 @@ __thermal_cooling_device_register(struct device_node *np,
if (type && strlen(type) >= THERMAL_NAME_LENGTH)
return ERR_PTR(-EINVAL);
- if (!ops || !ops->get_max_state || !ops->get_cur_state ||
- !ops->set_cur_state)
+ if (!ops || !ops->get_max || !ops->get_cur || !ops->set_cur)
return ERR_PTR(-EINVAL);
cdev = kzalloc(sizeof(struct thermal_cooling_device), GFP_KERNEL);
@@ -1329,7 +1328,7 @@ void thermal_cdev_update(struct thermal_cooling_device *cdev)
target = instance->target;
}
mutex_unlock(&cdev->lock);
- cdev->ops->set_cur_state(cdev, target);
+ cdev->ops->set_cur(cdev, target, THERMAL_UNIT_STATE);
cdev->updated = true;
dev_dbg(&cdev->device, "set to state %lu\n", target);
}
diff --git a/include/linux/thermal.h b/include/linux/thermal.h
index baac212f815e..8d183b8255eb 100644
--- a/include/linux/thermal.h
+++ b/include/linux/thermal.h
@@ -106,6 +106,10 @@ enum {
};
#define THERMAL_GENL_CMD_MAX (__THERMAL_GENL_CMD_MAX - 1)
+enum thermal_cooling_unit {
+ THERMAL_UNIT_STATE,
+};
+
struct thermal_zone_device_ops {
int (*bind) (struct thermal_zone_device *,
struct thermal_cooling_device *);
@@ -135,9 +139,12 @@ struct thermal_zone_device_ops {
};
struct thermal_cooling_device_ops {
- int (*get_max_state) (struct thermal_cooling_device *, unsigned long *);
- int (*get_cur_state) (struct thermal_cooling_device *, unsigned long *);
- int (*set_cur_state) (struct thermal_cooling_device *, unsigned long);
+ int (*get_max)(struct thermal_cooling_device *, unsigned long *,
+ enum thermal_cooling_unit);
+ int (*get_cur)(struct thermal_cooling_device *, unsigned long *,
+ enum thermal_cooling_unit);
+ int (*set_cur)(struct thermal_cooling_device *, unsigned long,
+ enum thermal_cooling_unit);
};
struct thermal_cooling_device {
--
1.7.9.5
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [RFC PATCH 3/5] thermal: add a basic cpu power actor
2014-05-06 12:06 [RFC PATCH 0/5] The power allocator thermal governor Javi Merino
2014-05-06 12:06 ` [RFC PATCH 1/5] thermal: let governors have private data for each thermal zone Javi Merino
2014-05-06 12:06 ` [RFC PATCH 2/5] thermal: let cooling devices operate on other units other than state Javi Merino
@ 2014-05-06 12:06 ` Javi Merino
2014-05-13 22:57 ` Eduardo Valentin
2014-05-06 12:06 ` [RFC PATCH 4/5] thermal: introduce the Power Allocator governor Javi Merino
` (2 subsequent siblings)
5 siblings, 1 reply; 17+ messages in thread
From: Javi Merino @ 2014-05-06 12:06 UTC (permalink / raw)
To: linux-pm
Cc: Punit.Agrawal, Javi Merino, Zhang Rui, Eduardo Valentin,
Punit Agrawal
Introduce a new thermal_cooling_unit: THERMAL_UNIT_POWER and let
cpu_cooling device operate using it as well as "states". This allows
governors that call cpufreq_get_max/cpufreq_get_cur/cpufreq_set_cur()
with THERMAL_UNIT_POWER to use cpufreq cooling devices.
If the cpu cooling device is registered using
{of_,}cpufreq_power_actor_register() instead of
{of_,}cpufreq_cooling_register() it uses the current frequency (as
reported by cpufreq) as well as load and OPPs for the power
calculations. The cpus must have registered their OPPs in the OPP
library.
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Eduardo Valentin <edubezval@gmail.com>
Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Signed-off-by: Javi Merino <javi.merino@arm.com>
---
drivers/thermal/cpu_cooling.c | 430 ++++++++++++++++++++++++++++++++++++----
drivers/thermal/thermal_core.c | 4 +-
include/linux/cpu_cooling.h | 28 +++
include/linux/thermal.h | 4 +
4 files changed, 422 insertions(+), 44 deletions(-)
diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
index f3f4e6b5798e..8052cd1f4a39 100644
--- a/drivers/thermal/cpu_cooling.c
+++ b/drivers/thermal/cpu_cooling.c
@@ -24,10 +24,27 @@
#include <linux/thermal.h>
#include <linux/cpufreq.h>
#include <linux/err.h>
+#include <linux/pm_opp.h>
#include <linux/slab.h>
#include <linux/cpu.h>
#include <linux/cpu_cooling.h>
+/* Maximum number of opps we can track */
+#define MAX_NUM_OPPS 32
+
+/**
+ * struct power_table - frequency to power conversion
+ * @frequency: frequency in KHz
+ * @power: power in mW
+ *
+ * This structure is built when the cooling device registers and helps
+ * in translating frequency to power and viceversa.
+ */
+struct power_table {
+ u32 frequency;
+ u32 power;
+};
+
/**
* struct cpufreq_cooling_device - data for cooling device with cpufreq
* @id: unique integer value corresponding to each cpufreq_cooling_device
@@ -39,6 +56,14 @@
* @cpufreq_val: integer value representing the absolute value of the clipped
* frequency.
* @allowed_cpus: all the cpus involved for this cpufreq_cooling_device.
+ * @power_table: array of struct power_table for frequency to power conversion
+ * @power_table_entries: number of entries in the @power_table array
+ * @time_in_idle: previous reading of the absolute time that this cpu was idle
+ * @time_in_idle_timestamp: wall time of the last invocation of
+ * get_cpu_idle_time_us()
+ * @last_load: load measured by the latest call to cpufreq_get_cur()
+ * @freq: frequency in KHz of the cpus represented by the cooling device
+ * @capacitance: the dynamic power coefficient of these cpus
*
* This structure is required for keeping information of each
* cpufreq_cooling_device registered. In order to prevent corruption of this a
@@ -50,11 +75,19 @@ struct cpufreq_cooling_device {
unsigned int cpufreq_state;
unsigned int cpufreq_val;
struct cpumask allowed_cpus;
+ struct power_table power_table[MAX_NUM_OPPS];
+ int power_table_entries;
+ u64 time_in_idle[NR_CPUS];
+ u64 time_in_idle_timestamp[NR_CPUS];
+ u32 last_load;
+ u32 freq;
+ u32 capacitance;
};
static DEFINE_IDR(cpufreq_idr);
static DEFINE_MUTEX(cooling_cpufreq_lock);
static unsigned int cpufreq_dev_count;
+static bool power_actor_notifier_registered;
/* notify_table passes value to the CPUFREQ_ADJUST callback function. */
#define NOTIFY_INVALID NULL
@@ -115,6 +148,59 @@ static int is_cpufreq_valid(int cpu)
return !cpufreq_get_policy(&policy, cpu);
}
+static u32 cpu_freq_to_power(struct cpufreq_cooling_device *cpufreq_device,
+ u32 freq)
+{
+ int i;
+ struct power_table *pt = cpufreq_device->power_table;
+
+ for (i = 0; i < cpufreq_device->power_table_entries - 1; i++)
+ if (freq <= pt[i].frequency)
+ break;
+
+ return pt[i].power;
+}
+
+static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_device,
+ u32 power)
+{
+ int i;
+ struct power_table *pt = cpufreq_device->power_table;
+
+ for (i = 0; i < cpufreq_device->power_table_entries - 1; i++)
+ if (power <= pt[i].power)
+ break;
+
+ return pt[i].frequency;
+}
+
+/**
+ * get_load - get load for a cpu since last updated
+ * @cpufreq_device: struct cpufreq_cooling_device for this cooling device
+ * @cpu: cpu number
+ *
+ * Return the average load of cpu @cpu in percentage since this
+ * function was last called.
+ */
+static u32 get_load(struct cpufreq_cooling_device *cpufreq_device, int cpu)
+{
+ u32 load;
+ u64 now, now_idle, delta_time, delta_idle;
+
+ now_idle = get_cpu_idle_time(cpu, &now, 0);
+ delta_idle = now_idle - cpufreq_device->time_in_idle[cpu];
+ delta_time = now - cpufreq_device->time_in_idle_timestamp[cpu];
+
+ if (delta_time <= delta_idle)
+ load = 0;
+ else
+ load = div64_u64(100 * (delta_time - delta_idle), delta_time);
+
+ cpufreq_device->time_in_idle[cpu] = now_idle;
+ cpufreq_device->time_in_idle_timestamp[cpu] = now;
+ return load;
+}
+
enum cpufreq_cooling_property {
GET_LEVEL,
GET_FREQ,
@@ -342,98 +428,193 @@ static int cpufreq_thermal_notifier(struct notifier_block *nb,
return 0;
}
+/**
+ * cpufreq_frequency_change - notifier callback for cpufreq frequency changes
+ * @nb: struct notifier_block * with callback info
+ * @event: value showing cpufreq event for which this function invoked
+ * @data: callback-specific data
+ *
+ * Callback to get notifications of frequency changes. In the
+ * CPUFREQ_POSTCHANGE @event we store the new frequency so that
+ * cpufreq_get_cur() knows the current frequency and can convert it
+ * into power.
+ */
+static int cpufreq_frequency_change(struct notifier_block *nb,
+ unsigned long event, void *data)
+{
+ struct thermal_cooling_device *cdev;
+ struct cpufreq_freqs *freqs = data;
+
+ /* Only update frequency on postchange */
+ if (event != CPUFREQ_POSTCHANGE)
+ return NOTIFY_DONE;
+
+ mutex_lock(&thermal_list_lock);
+
+ list_for_each_entry(cdev, &thermal_cdev_list, node) {
+ struct cpufreq_cooling_device *cpufreq_device;
+
+ if (strncmp(cdev->type, "thermal-cpufreq", THERMAL_NAME_LENGTH))
+ continue;
+
+ cpufreq_device = cdev->devdata;
+
+ if (cpumask_test_cpu(freqs->cpu, &cpufreq_device->allowed_cpus))
+ cpufreq_device->freq = freqs->new;
+ }
+
+ mutex_unlock(&thermal_list_lock);
+
+ return NOTIFY_OK;
+}
+
/* cpufreq cooling device callback functions are defined below */
/**
- * cpufreq_get_max_state - callback function to get the max cooling state.
+ * cpufreq_get_max - callback function to get the max cooling state/power.
* @cdev: thermal cooling device pointer.
- * @state: fill this variable with the max cooling state.
- * @unit: the units in which @state should be in.
+ * @val: fill this variable with the max cooling state/power.
+ * @unit: whether to put state or power in @val.
*
* Callback for the thermal cooling device to return the cpufreq max
- * cooling state. Currently only THERMAL_UNIT_STATE is valid for
- * @unit.
+ * cooling state/power. If @unit is THERMAL_UNIT_STATE, @val contains
+ * the maximum cooling state; if @unit is THERMAL_UNIT_POWER, @val
+ * maximum is power in milliwatts.
*
* Return: 0 on success, an error code otherwise.
*/
-static int cpufreq_get_max_state(struct thermal_cooling_device *cdev,
- unsigned long *state,
- enum thermal_cooling_unit unit)
+static int cpufreq_get_max(struct thermal_cooling_device *cdev,
+ unsigned long *val, enum thermal_cooling_unit unit)
{
struct cpufreq_cooling_device *cpufreq_device = cdev->devdata;
struct cpumask *mask = &cpufreq_device->allowed_cpus;
unsigned int cpu;
unsigned int count = 0;
- int ret;
-
- if (unit != THERMAL_UNIT_STATE)
- return -EINVAL;
+ int ret = 0;
cpu = cpumask_any(mask);
- ret = get_property(cpu, 0, &count, GET_MAXL);
+ if ((unit == THERMAL_UNIT_POWER) &&
+ (cpufreq_device->power_table_entries)) {
+ unsigned int num_cpus;
+ unsigned long max_freq;
+
+ max_freq = get_cpu_frequency(cpu, 0);
+ if (!max_freq)
+ return -EINVAL;
- if (count > 0)
- *state = count;
+ num_cpus = cpumask_weight(mask);
+ *val = cpu_freq_to_power(cdev->devdata, max_freq) * num_cpus;
+ } else if (unit == THERMAL_UNIT_STATE) {
+ ret = get_property(cpu, 0, &count, GET_MAXL);
+
+ if (count > 0)
+ *val = count;
+ } else {
+ ret = -EINVAL;
+ }
return ret;
}
/**
- * cpufreq_get_cur_state - callback function to get the current cooling state.
+ * cpufreq_get_cur - callback function to get the current cooling state/power.
* @cdev: thermal cooling device pointer.
- * @state: fill this variable with the current cooling state.
- * @unit: the units in which state should be in
+ * @val: fill this variable with the current cooling state/power.
+ * @unit: whether to put state or power in @val.
*
* Callback for the thermal cooling device to return the cpufreq
- * current cooling state. Currently only THERMAL_UNIT_STATE is valid
- * for @unit.
+ * current cooling state/power. If @unit is THERMAL_UNIT_STATE, @val
+ * contains the current cooling state; if @unit is THERMAL_UNIT_POWER,
+ * @val is current power consumption in milliwatts.
*
* Return: 0 on success, an error code otherwise.
*/
-static int cpufreq_get_cur_state(struct thermal_cooling_device *cdev,
- unsigned long *state,
- enum thermal_cooling_unit unit)
+static int cpufreq_get_cur(struct thermal_cooling_device *cdev,
+ unsigned long *val, enum thermal_cooling_unit unit)
{
struct cpufreq_cooling_device *cpufreq_device = cdev->devdata;
- if (unit != THERMAL_UNIT_STATE)
- return -EINVAL;
+ if ((unit == THERMAL_UNIT_POWER) &&
+ (cpufreq_device->power_table_entries)) {
+ int cpu;
+ u32 total_load = 0, raw_cpu_power, power = 0;
+
+ raw_cpu_power = cpu_freq_to_power(cpufreq_device,
+ cpufreq_device->freq);
+
+ for_each_cpu(cpu, &cpufreq_device->allowed_cpus) {
+ u32 load;
- *state = cpufreq_device->cpufreq_state;
+ if (!cpu_online(cpu))
+ continue;
+
+ load = get_load(cpufreq_device, cpu);
+ power += (raw_cpu_power * load) / 100;
+ total_load += load;
+ }
+
+ cpufreq_device->last_load = total_load;
+ *val = power;
+ } else if (unit == THERMAL_UNIT_STATE) {
+ *val = cpufreq_device->cpufreq_state;
+ } else {
+ return -EINVAL;
+ }
return 0;
}
/**
- * cpufreq_set_cur_state - callback function to set the current cooling state.
+ * cpufreq_set_cur - callback function to set the current cooling state/power.
* @cdev: thermal cooling device pointer.
- * @state: set this variable to the current cooling state.
- * @unit: the units in which @state should be in.
+ * @val: set this variable to the current cooling state/power.
+ * @unit: whether @val is a cooling state or power.
*
* Callback for the thermal cooling device to change the cpufreq
- * current cooling state. Currently only THERMAL_UNIT_STATE is valid for
- * @unit.
+ * current cooling state/power. If @unit is THERMAL_UNIT_STATE, @val
+ * is a cooling state; if @unit is THERMAL_UNIT_POWER, @val is the target
+ * power consumption in milliwatts.
*
* Return: 0 on success, an error code otherwise.
*/
-static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
- unsigned long state,
- enum thermal_cooling_unit unit)
+static int cpufreq_set_cur(struct thermal_cooling_device *cdev,
+ unsigned long val, enum thermal_cooling_unit unit)
{
struct cpufreq_cooling_device *cpufreq_device = cdev->devdata;
-
- if (unit != THERMAL_UNIT_STATE)
+ unsigned long state;
+
+ if ((unit == THERMAL_UNIT_POWER) &&
+ (cpufreq_device->power_table_entries)) {
+ unsigned int cpu, freq;
+ u32 normalised_power, last_load;
+
+ cpu = cpumask_any(&cpufreq_device->allowed_cpus);
+ last_load = cpufreq_device->last_load ?
+ cpufreq_device->last_load : 1;
+ normalised_power = (val * 100) / last_load;
+ freq = cpu_power_to_freq(cpufreq_device, normalised_power);
+
+ state = cpufreq_cooling_get_level(cpu, freq);
+ if (state == THERMAL_CSTATE_INVALID) {
+ pr_err("Failed to convert %dKHz for cpu %d into a cdev state\n",
+ freq, cpu);
+ return -EINVAL;
+ }
+ } else if (unit == THERMAL_UNIT_STATE) {
+ state = val;
+ } else {
return -EINVAL;
+ }
return cpufreq_apply_cooling(cpufreq_device, state);
}
/* Bind cpufreq callbacks to thermal cooling device ops */
static struct thermal_cooling_device_ops const cpufreq_cooling_ops = {
- .get_max = cpufreq_get_max_state,
- .get_cur = cpufreq_get_cur_state,
- .set_cur = cpufreq_set_cur_state,
+ .get_max = cpufreq_get_max,
+ .get_cur = cpufreq_get_cur,
+ .set_cur = cpufreq_set_cur,
};
/* Notifier for cpufreq policy change */
@@ -441,10 +622,95 @@ static struct notifier_block thermal_cpufreq_notifier_block = {
.notifier_call = cpufreq_thermal_notifier,
};
+/* Notifier for cpufreq frequency changes */
+struct notifier_block cpufreq_transition_notifier = {
+ .notifier_call = cpufreq_frequency_change,
+};
+
+/**
+ * build_cpu_power_table - create a power to frequency table
+ * @cpufreq_dev: the cpufreq_cooling_device in which to store the table
+ *
+ * Build a power to frequency table for this cpu and store it in
+ * @cpufreq_dev. This table will be used in cpu_power_to_freq() and
+ * cpu_freq_to_power() to convert between power and frequency
+ * efficiently. Power is stored in mW, frequency in KHz. The
+ * resulting table is in ascending order.
+ *
+ * Returns 0 on success, -E* on error.
+ */
+static int build_cpu_power_table(struct cpufreq_cooling_device *cpufreq_dev)
+{
+ struct dev_pm_opp *opp;
+ struct device *dev = NULL;
+ int num_opps, cpu, i, ret = 0;
+ unsigned long freq;
+
+ num_opps = 0;
+
+ rcu_read_lock();
+
+ for_each_cpu(cpu, &cpufreq_dev->allowed_cpus) {
+ dev = get_cpu_device(cpu);
+ if (!dev)
+ continue;
+
+ num_opps = dev_pm_opp_get_opp_count(dev);
+ if (num_opps > 0) {
+ break;
+ } else if (num_opps < 0) {
+ ret = num_opps;
+ goto unlock;
+ }
+ }
+
+ if ((num_opps == 0) || (num_opps > MAX_NUM_OPPS)) {
+ ret = -EINVAL;
+ goto unlock;
+ }
+
+ i = 0;
+ for (freq = 0;
+ opp = dev_pm_opp_find_freq_ceil(dev, &freq), !IS_ERR(opp);
+ freq++) {
+ u32 freq_mhz, voltage_mv;
+ u64 power;
+
+ freq_mhz = freq / 1000000;
+ voltage_mv = dev_pm_opp_get_voltage(opp) / 1000;
+
+ /*
+ * Do the multiplication with MHz and millivolt so as
+ * to not overflow.
+ */
+ power = (u64)cpufreq_dev->capacitance * freq_mhz *
+ voltage_mv * voltage_mv;
+ do_div(power, 1000000000);
+
+ /* frequency is stored in power_table in KHz */
+ cpufreq_dev->power_table[i].frequency = freq / 1000;
+ cpufreq_dev->power_table[i].power = power;
+
+ i++;
+ }
+
+ if (i == 0) {
+ ret = PTR_ERR(opp);
+ goto unlock;
+ }
+
+ cpufreq_dev->power_table_entries = i;
+
+unlock:
+ rcu_read_unlock();
+ return ret;
+}
+
/**
* __cpufreq_cooling_register - helper function to create cpufreq cooling device
* @np: a valid struct device_node to the cooling device device tree node
* @clip_cpus: cpumask of cpus where the frequency constraints will happen.
+ * @capacitance: dynamic power coefficient for these cpus
*
* This interface function registers the cpufreq cooling device with the name
* "thermal-cpufreq-%x". This api can support multiple instances of cpufreq
@@ -456,7 +722,7 @@ static struct notifier_block thermal_cpufreq_notifier_block = {
*/
static struct thermal_cooling_device *
__cpufreq_cooling_register(struct device_node *np,
- const struct cpumask *clip_cpus)
+ const struct cpumask *clip_cpus, u32 capacitance)
{
struct thermal_cooling_device *cool_dev;
struct cpufreq_cooling_device *cpufreq_dev = NULL;
@@ -485,6 +751,7 @@ __cpufreq_cooling_register(struct device_node *np,
return ERR_PTR(-ENOMEM);
cpumask_copy(&cpufreq_dev->allowed_cpus, clip_cpus);
+ cpufreq_dev->capacitance = capacitance;
ret = get_idr(&cpufreq_idr, &cpufreq_dev->id);
if (ret) {
@@ -531,7 +798,7 @@ __cpufreq_cooling_register(struct device_node *np,
struct thermal_cooling_device *
cpufreq_cooling_register(const struct cpumask *clip_cpus)
{
- return __cpufreq_cooling_register(NULL, clip_cpus);
+ return __cpufreq_cooling_register(NULL, clip_cpus, 0);
}
EXPORT_SYMBOL_GPL(cpufreq_cooling_register);
@@ -555,11 +822,90 @@ of_cpufreq_cooling_register(struct device_node *np,
if (!np)
return ERR_PTR(-EINVAL);
- return __cpufreq_cooling_register(np, clip_cpus);
+ return __cpufreq_cooling_register(np, clip_cpus, 0);
}
EXPORT_SYMBOL_GPL(of_cpufreq_cooling_register);
/**
+ * of_cpufreq_power_actor_register - function to create a cpu cooling device with power capabilities
+ * @np: a valid struct device_node to the cooling device tree node
+ * @clip_cpus: cpumask of cpus where the frequency constraints will happen
+ * @capacitance: the dynamic power coefficient for these cpus
+ *
+ * Similar to of_cpufreq_cooling_register() this function registers a
+ * cooling device linked to the device tree node provided. It has a
+ * simple power model so that get_cur/get_max/set_cur can operate with
+ * THERMAL_UNIT_POWER
+ *
+ * The cpus must have registered their OPPs in the OPP library.
+ *
+ * Return: a valid struct thermal_cooling_device pointer on success,
+ * on failure, it returns a corresponding ERR_PTR().
+ */
+struct thermal_cooling_device *
+of_cpufreq_power_actor_register(struct device_node *np,
+ const struct cpumask *clip_cpus,
+ u32 capacitance)
+{
+ struct thermal_cooling_device *cdev, *err_ret;
+ int ret;
+
+ cdev = __cpufreq_cooling_register(np, clip_cpus, capacitance);
+ if (IS_ERR(cdev))
+ return cdev;
+
+ ret = build_cpu_power_table(cdev->devdata);
+ if (ret) {
+ err_ret = ERR_PTR(ret);
+ goto unregister;
+ }
+
+ /*
+ * You can't register multiple times the same notifier_block.
+ * The first power actor registered is the only one that
+ * registers the notifier.
+ */
+ if (!power_actor_notifier_registered) {
+ ret = cpufreq_register_notifier(&cpufreq_transition_notifier,
+ CPUFREQ_TRANSITION_NOTIFIER);
+ if (ret) {
+ err_ret = ERR_PTR(ret);
+ goto unregister;
+ }
+ power_actor_notifier_registered = true;
+ }
+
+ return cdev;
+
+unregister:
+ cpufreq_cooling_unregister(cdev);
+ return err_ret;
+}
+EXPORT_SYMBOL_GPL(of_cpufreq_power_actor_register);
+
+/**
+ * cpufreq_power_actor_register - function to create a cpu cooling device with power capabilities
+ * @clip_cpus: cpumask of cpus where the frequency constraints will happen
+ * @capacitance: the dynamic power coefficient for these cpus
+ *
+ * Similar to cpufreq_cooling_register() this function registers a
+ * cpufreq cooling device that has a simple power model so that
+ * get_cur/get_max/set_cur can operate with THERMAL_UNIT_POWER.
+ *
+ * See of_cpufreq_power_actor_register() for notes about the
+ * requisites for this to work.
+ *
+ * Return: a valid struct thermal_cooling_device pointer on success,
+ * on failure, it returns a corresponding ERR_PTR().
+ */
+struct thermal_cooling_device *
+cpufreq_power_actor_register(const struct cpumask *clip_cpus, u32 capacitance)
+{
+ return of_cpufreq_power_actor_register(NULL, clip_cpus, capacitance);
+}
+EXPORT_SYMBOL_GPL(cpufreq_power_actor_register);
+
+/**
* cpufreq_cooling_unregister - function to remove cpufreq cooling device.
* @cdev: thermal cooling device pointer.
*
diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index 5b46603cc7cd..1efaadf436fa 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -50,10 +50,10 @@ static DEFINE_IDR(thermal_cdev_idr);
static DEFINE_MUTEX(thermal_idr_lock);
static LIST_HEAD(thermal_tz_list);
-static LIST_HEAD(thermal_cdev_list);
+LIST_HEAD(thermal_cdev_list);
static LIST_HEAD(thermal_governor_list);
-static DEFINE_MUTEX(thermal_list_lock);
+DEFINE_MUTEX(thermal_list_lock);
static DEFINE_MUTEX(thermal_governor_lock);
static struct thermal_governor *def_governor;
diff --git a/include/linux/cpu_cooling.h b/include/linux/cpu_cooling.h
index c303d383def1..b8d92ac4266b 100644
--- a/include/linux/cpu_cooling.h
+++ b/include/linux/cpu_cooling.h
@@ -36,6 +36,9 @@
struct thermal_cooling_device *
cpufreq_cooling_register(const struct cpumask *clip_cpus);
+struct thermal_cooling_device *
+cpufreq_power_actor_register(const struct cpumask *clip_cpus, u32 capacitance);
+
/**
* of_cpufreq_cooling_register - create cpufreq cooling device based on DT.
* @np: a valid struct device_node to the cooling device device tree node.
@@ -45,6 +48,11 @@ cpufreq_cooling_register(const struct cpumask *clip_cpus);
struct thermal_cooling_device *
of_cpufreq_cooling_register(struct device_node *np,
const struct cpumask *clip_cpus);
+
+struct thermal_cooling_device *
+of_cpufreq_power_actor_register(struct device_node *np,
+ const struct cpumask *clip_cpus,
+ u32 capacitance);
#else
static inline struct thermal_cooling_device *
of_cpufreq_cooling_register(struct device_node *np,
@@ -52,6 +60,14 @@ of_cpufreq_cooling_register(struct device_node *np,
{
return NULL;
}
+
+static inline struct thermal_cooling_device *
+of_cpufreq_power_actor_register(struct device_node *np,
+ const struct cpumask *clip_cpus,
+ u32 capacitance);
+{
+ return ERR_PTR(-ENOSYS);
+}
#endif
/**
@@ -73,6 +89,18 @@ of_cpufreq_cooling_register(struct device_node *np,
{
return NULL;
}
+static inline struct thermal_cooling_device *
+cpufreq_power_actor_register(const struct cpumask *clip_cpus, u32 capacitance)
+{
+ return ERR_PTR(-ENOSYS);
+}
+static inline struct thermal_cooling_device *
+of_cpufreq_power_actor_register(struct device_node *np,
+ const struct cpumask *clip_cpus,
+ u32 capacitance)
+{
+ return ERR_PTR(-ENOSYS);
+}
static inline
void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
{
diff --git a/include/linux/thermal.h b/include/linux/thermal.h
index 8d183b8255eb..5986f546ca98 100644
--- a/include/linux/thermal.h
+++ b/include/linux/thermal.h
@@ -59,6 +59,9 @@
#define DEFAULT_THERMAL_GOVERNOR "user_space"
#endif
+extern struct list_head thermal_cdev_list;
+extern struct mutex thermal_list_lock;
+
struct thermal_zone_device;
struct thermal_cooling_device;
@@ -108,6 +111,7 @@ enum {
enum thermal_cooling_unit {
THERMAL_UNIT_STATE,
+ THERMAL_UNIT_POWER,
};
struct thermal_zone_device_ops {
--
1.7.9.5
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [RFC PATCH 4/5] thermal: introduce the Power Allocator governor
2014-05-06 12:06 [RFC PATCH 0/5] The power allocator thermal governor Javi Merino
` (2 preceding siblings ...)
2014-05-06 12:06 ` [RFC PATCH 3/5] thermal: add a basic cpu power actor Javi Merino
@ 2014-05-06 12:06 ` Javi Merino
2014-05-13 23:20 ` Eduardo Valentin
2014-05-06 12:06 ` [RFC PATCH 5/5] thermal: add trace events to the power allocator governor Javi Merino
2014-05-14 10:05 ` [RFC PATCH 0/5] The power allocator thermal governor James King
5 siblings, 1 reply; 17+ messages in thread
From: Javi Merino @ 2014-05-06 12:06 UTC (permalink / raw)
To: linux-pm
Cc: Punit.Agrawal, Javi Merino, Zhang Rui, Eduardo Valentin,
Punit Agrawal
The power allocator governor is a thermal governor that controls system
and device power allocation to control temperature. Conceptually, the
implementation takes a system view of heat dissipation by managing
multiple heat sources.
This governor relies on power-aware cooling devices (power actors) to
operate. That is, cooling devices whose thermal_cooling_device_ops
accept THERMAL_UNIT_POWER.
It uses a Proportional Integral (PI) controller driven by the
temperature of the thermal zone. This budget is then allocated to
each cooling device that can have bearing on the temperature we are
trying to control. It decides how much power to give each cooling
device based on the performance they are requesting. The PI
controller ensures that the total power budget does not exceed the
control temperature.
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Eduardo Valentin <edubezval@gmail.com>
Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Signed-off-by: Javi Merino <javi.merino@arm.com>
---
drivers/thermal/Kconfig | 14 ++
drivers/thermal/Makefile | 1 +
drivers/thermal/power_allocator.c | 403 +++++++++++++++++++++++++++++++++++++
drivers/thermal/thermal_core.c | 7 +-
drivers/thermal/thermal_core.h | 8 +
include/linux/thermal.h | 5 +
6 files changed, 437 insertions(+), 1 deletion(-)
create mode 100644 drivers/thermal/power_allocator.c
diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
index 2d51912a6e40..f5e6b693fc19 100644
--- a/drivers/thermal/Kconfig
+++ b/drivers/thermal/Kconfig
@@ -71,6 +71,14 @@ config THERMAL_DEFAULT_GOV_USER_SPACE
Select this if you want to let the user space manage the
platform thermals.
+config THERMAL_DEFAULT_GOV_POWER_ALLOCATOR
+ bool "power_allocator"
+ select THERMAL_GOV_POWER_ALLOCATOR
+ help
+ Select this if you want to control temperature based on
+ system and device power allocation. This governor relies on
+ power-aware cooling devices (power actors) to operate.
+
endchoice
config THERMAL_GOV_FAIR_SHARE
@@ -89,6 +97,12 @@ config THERMAL_GOV_USER_SPACE
help
Enable this to let the user space manage the platform thermals.
+config THERMAL_GOV_POWER_ALLOCATOR
+ bool "Power allocator thermal governor"
+ help
+ Enable this to manage platform thermals by dynamically
+ allocating and limiting power to devices.
+
config CPU_THERMAL
bool "generic cpu cooling support"
depends on CPU_FREQ
diff --git a/drivers/thermal/Makefile b/drivers/thermal/Makefile
index 54e4ec9eb5df..1cb7d5fec8dc 100644
--- a/drivers/thermal/Makefile
+++ b/drivers/thermal/Makefile
@@ -13,6 +13,7 @@ thermal_sys-$(CONFIG_THERMAL_OF) += of-thermal.o
thermal_sys-$(CONFIG_THERMAL_GOV_FAIR_SHARE) += fair_share.o
thermal_sys-$(CONFIG_THERMAL_GOV_STEP_WISE) += step_wise.o
thermal_sys-$(CONFIG_THERMAL_GOV_USER_SPACE) += user_space.o
+thermal_sys-$(CONFIG_THERMAL_GOV_POWER_ALLOCATOR) += power_allocator.o
# cpufreq cooling
thermal_sys-$(CONFIG_CPU_THERMAL) += cpu_cooling.o
diff --git a/drivers/thermal/power_allocator.c b/drivers/thermal/power_allocator.c
new file mode 100644
index 000000000000..39c425abf10a
--- /dev/null
+++ b/drivers/thermal/power_allocator.c
@@ -0,0 +1,403 @@
+/*
+ * A power allocator to manage temperature
+ *
+ * Copyright (C) 2014 ARM Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed "as is" WITHOUT ANY WARRANTY of any
+ * kind, whether express or implied; without even the implied warranty
+ * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#define pr_fmt(fmt) "Power allocator: " fmt
+
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/thermal.h>
+
+#include "thermal_core.h"
+
+#define MAX_NUM_ACTORS 8
+
+#define FRAC_BITS 8
+#define int_to_frac(x) ((x) << FRAC_BITS)
+#define frac_to_int(x) ((x) >> FRAC_BITS)
+
+/**
+ * mul_frac - multiply two fixed-point numbers
+ * @x: first multiplicand
+ * @y: second multiplicand
+ *
+ * Returns the result of multiplying two fixed-point numbers. The
+ * result is also a fixed-point number.
+ */
+static inline s64 mul_frac(s64 x, s64 y)
+{
+ return (x * y) >> FRAC_BITS;
+}
+
+enum power_allocator_trip_levels {
+ TRIP_SWITCH_ON = 0, /* Switch on PI controller */
+ TRIP_MAX_DESIRED_TEMPERATURE, /* Temperature we are controlling for */
+};
+
+/**
+ * struct power_allocator_params - parameters for the power allocator governor
+ * @k_po: P parameter of the PI controller when overshooting (i.e., when
+ * temperature is below the target)
+ * @k_pi: P parameter of the PI controller when undershooting
+ * @k_i: I parameter of the PI controller
+ * @integral_cutoff: threshold below which the error is no longer accumulated
+ in the PI controller
+ * @err_integral: Accumulated error in the PI controller.
+ */
+struct power_allocator_params {
+ s32 k_po;
+ s32 k_pu;
+ s32 k_i;
+ s32 integral_cutoff;
+ s32 err_integral;
+};
+
+/**
+ * pi_controller() - PI controller
+ * @tz: thermal zone we are operating in
+ * @control_temp: The target temperature
+ * @max_allocatable_power: maximum allocatable power for this thermal zone
+ *
+ * This PI controller increases the available power budget so that the
+ * temperature of the thermal zone gets as close as possible to
+ * @control_temp and limits the power if it exceeds it. k_po is the
+ * proportional term when we are overshooting, k_pu is the
+ * proportional term when we are undershooting. integral_cutoff is a
+ * threshold below which we stop accumulating the error. The
+ * accumulated error is only valid if the requested power will make
+ * the system warmer. If the system is mostly idle, there's no point
+ * in accumulating positive error.
+ *
+ * It returns the power budget for the next period.
+ */
+static u32 pi_controller(struct thermal_zone_device *tz,
+ unsigned long current_temp, unsigned long control_temp,
+ unsigned long max_allocatable_power)
+{
+ s64 p, i, power_range;
+ s32 err;
+ struct power_allocator_params *params = tz->governor_data;
+
+ err = ((s32)control_temp - (s32)current_temp) / 1000;
+ err = int_to_frac(err);
+
+ /* Calculate the proportional term */
+ p = mul_frac(err < 0 ? params->k_po : params->k_pu, err);
+
+ /*
+ * Calculate the integral term
+ *
+ * if the error s less than cut off allow integration (but
+ * the integral is limited to max power)
+ */
+ i = mul_frac(params->k_i, params->err_integral);
+
+ if (err < int_to_frac(params->integral_cutoff)) {
+ s64 tmpi = mul_frac(params->k_i, err);
+ tmpi += i;
+ if (tmpi <= int_to_frac(max_allocatable_power)) {
+ i = tmpi;
+ params->err_integral += err;
+ }
+ }
+
+ power_range = p + i;
+
+ /* feed-forward the known maximum dissipatable power */
+ power_range = tz->tzp->max_dissipatable_power +
+ frac_to_int(power_range);
+
+ /* output should not exceed max power */
+ if (power_range > max_allocatable_power)
+ power_range = max_allocatable_power;
+
+ return power_range >= 0 ? power_range : 0;
+}
+
+static void divvy_up_power(struct thermal_zone_device *tz,
+ unsigned long req_power[MAX_NUM_ACTORS],
+ unsigned long max_power[MAX_NUM_ACTORS],
+ int num_actors, unsigned long total_req_power,
+ u32 power_range,
+ unsigned long granted_power[MAX_NUM_ACTORS])
+{
+ unsigned long extra_power, capped_extra_power;
+ unsigned long extra_actor_power[MAX_NUM_ACTORS];
+ int i;
+
+ if (!total_req_power) {
+ /*
+ * Nobody requested anything, so just give everybody
+ * the maximum power
+ */
+ for (i = 0; i < num_actors; i++)
+ granted_power[i] = max_power[i];
+
+ return;
+ }
+
+ capped_extra_power = 0;
+ extra_power = 0;
+ for (i = 0; i < num_actors; i++) {
+ u64 req_range = req_power[i] * power_range;
+
+ granted_power[i] = div_u64(req_range, total_req_power);
+
+ if (granted_power[i] > max_power[i]) {
+ extra_power += granted_power[i] - max_power[i];
+ granted_power[i] = max_power[i];
+ }
+
+ extra_actor_power[i] = max_power[i] - granted_power[i];
+ capped_extra_power += extra_actor_power[i];
+ }
+
+ if (!extra_power)
+ return;
+
+ /*
+ * Re-divvy the reclaimed extra among actors based on
+ * how far they are from the max
+ */
+ extra_power = min(extra_power, capped_extra_power);
+ if (capped_extra_power > 0)
+ for (i = 0; i < num_actors; i++)
+ granted_power[i] += (extra_actor_power[i] *
+ extra_power) / capped_extra_power;
+}
+
+static int allocate_power(struct thermal_zone_device *tz,
+ unsigned long current_temp, unsigned long control_temp)
+{
+ struct thermal_instance *instance;
+ unsigned long req_power[MAX_NUM_ACTORS], max_power[MAX_NUM_ACTORS];
+ unsigned long granted_power[MAX_NUM_ACTORS];
+ unsigned long total_req_power, max_allocatable_power;
+ u32 power_range;
+ int i, num_actors, ret = 0;
+
+ i = 0;
+ total_req_power = 0;
+ max_allocatable_power = 0;
+
+ mutex_lock(&tz->lock);
+
+ list_for_each_entry(instance, &tz->thermal_instances, tz_node) {
+ struct thermal_cooling_device *cdev = instance->cdev;
+
+ if (instance->trip != TRIP_MAX_DESIRED_TEMPERATURE)
+ continue;
+
+ if (i >= MAX_NUM_ACTORS) {
+ pr_warn("Too many actors (%d) for this governor\n", i);
+ ret = -EINVAL;
+ goto unlock;
+ }
+
+ ret = cdev->ops->get_cur(cdev, &req_power[i],
+ THERMAL_UNIT_POWER);
+ if (ret) {
+ pr_warn_once("Couldn't get current power for %s: %d\n",
+ cdev->type, ret);
+ goto unlock;
+ }
+
+ total_req_power += req_power[i];
+
+ cdev->ops->get_max(cdev, &max_power[i], THERMAL_UNIT_POWER);
+ max_allocatable_power += max_power[i];
+
+ i++;
+ }
+ num_actors = i;
+
+ power_range = pi_controller(tz, current_temp, control_temp,
+ max_allocatable_power);
+
+ divvy_up_power(tz, req_power, max_power, num_actors, total_req_power,
+ power_range, granted_power);
+
+ i = 0;
+ list_for_each_entry(instance, &tz->thermal_instances, tz_node) {
+ struct thermal_cooling_device *cdev = instance->cdev;
+
+ if (instance->trip != TRIP_MAX_DESIRED_TEMPERATURE)
+ continue;
+
+ cdev->ops->set_cur(cdev, granted_power[i], THERMAL_UNIT_POWER);
+ i++;
+ }
+
+unlock:
+ mutex_unlock(&tz->lock);
+
+ return ret;
+}
+
+static int check_trips(struct thermal_zone_device *tz)
+{
+ int ret;
+ enum thermal_trip_type type;
+
+ if (tz->trips < 2)
+ return -EINVAL;
+
+ ret = tz->ops->get_trip_type(tz, TRIP_SWITCH_ON, &type);
+ if (ret)
+ return ret;
+
+ if ((type != THERMAL_TRIP_PASSIVE) && (type != THERMAL_TRIP_ACTIVE))
+ return -EINVAL;
+
+ ret = tz->ops->get_trip_type(tz, TRIP_MAX_DESIRED_TEMPERATURE, &type);
+ if (ret)
+ return ret;
+
+ if ((type != THERMAL_TRIP_PASSIVE) && (type != THERMAL_TRIP_ACTIVE))
+ return -EINVAL;
+
+ return ret;
+}
+
+static void reset_pi_controller(struct power_allocator_params *params)
+{
+ params->err_integral = 0;
+}
+
+/**
+ * power_allocator_bind - bind the power_allocator governor to a thermal zone
+ * @tz: thermal zone to bind it to
+ *
+ * Check that the thermal zone is valid for this governor: has two
+ * thermal trips. If so, initialize the PI controller parameters and
+ * bind it to the thermal zone.
+ *
+ * Returns 0 on success, -EINVAL if the trips were invalid or -ENOMEM
+ * if we ran out of memory.
+ */
+static int power_allocator_bind(struct thermal_zone_device *tz)
+{
+ int ret;
+ struct power_allocator_params *params;
+ unsigned long switch_on_temp, control_temp;
+ u32 temperature_threshold;
+
+ ret = check_trips(tz);
+ if (ret) {
+ pr_err("thermal zone %s has the wrong number of trips for this governor\n",
+ tz->type);
+ return ret;
+ }
+
+ if (!tz->tzp || !tz->tzp->max_dissipatable_power) {
+ pr_err("Trying to bind the power_allocator governor to a thermal zone without the max_dissipatable_power parameter\n");
+ return -EINVAL;
+ }
+
+ params = kzalloc(sizeof(*params), GFP_KERNEL);
+ if (!params)
+ return -ENOMEM;
+
+ ret = tz->ops->get_trip_temp(tz, TRIP_SWITCH_ON, &switch_on_temp);
+ if (ret)
+ goto free;
+
+ ret = tz->ops->get_trip_temp(tz, TRIP_MAX_DESIRED_TEMPERATURE,
+ &control_temp);
+ if (ret)
+ goto free;
+
+ temperature_threshold = (control_temp - switch_on_temp) / 1000;
+
+ params->k_po = int_to_frac(tz->tzp->max_dissipatable_power) /
+ temperature_threshold;
+ params->k_pu = int_to_frac(2 * tz->tzp->max_dissipatable_power) /
+ temperature_threshold;
+ params->k_i = int_to_frac(1);
+ params->integral_cutoff = 0;
+
+ reset_pi_controller(params);
+
+ tz->governor_data = params;
+ return 0;
+
+free:
+ kfree(params);
+ return ret;
+}
+
+static void power_allocator_unbind(struct thermal_zone_device *tz)
+{
+ pr_debug("Unbinding from thermal zone %d\n", tz->id);
+ kfree(tz->governor_data);
+ tz->governor_data = NULL;
+}
+
+static int power_allocator_throttle(struct thermal_zone_device *tz, int trip)
+{
+ int ret;
+ unsigned long switch_on_temp, control_temp, current_temp;
+ struct power_allocator_params *params = tz->governor_data;
+
+ /*
+ * We get called for every trip point but we only need to do
+ * our calculations once
+ */
+ if (trip != TRIP_MAX_DESIRED_TEMPERATURE)
+ return 0;
+
+ ret = thermal_zone_get_temp(tz, ¤t_temp);
+ if (ret) {
+ pr_warn("Failed to get temperature: %d\n", ret);
+ return ret;
+ }
+
+ ret = tz->ops->get_trip_temp(tz, TRIP_SWITCH_ON, &switch_on_temp);
+ if (ret) {
+ pr_warn("Failed to get switch on temperature: %d\n", ret);
+ return ret;
+ }
+
+ if (current_temp < switch_on_temp) {
+ reset_pi_controller(params);
+ return 0;
+ }
+
+ ret = tz->ops->get_trip_temp(tz, TRIP_MAX_DESIRED_TEMPERATURE,
+ &control_temp);
+ if (ret) {
+ pr_warn("Failed to get the maximum desired temperature: %d\n",
+ ret);
+ return ret;
+ }
+
+ return allocate_power(tz, current_temp, control_temp);
+}
+
+static struct thermal_governor thermal_gov_power_allocator = {
+ .name = "power_allocator",
+ .bind_to_tz = power_allocator_bind,
+ .unbind_from_tz = power_allocator_unbind,
+ .throttle = power_allocator_throttle,
+};
+
+int thermal_gov_power_allocator_register(void)
+{
+ return thermal_register_governor(&thermal_gov_power_allocator);
+}
+
+void thermal_gov_power_allocator_unregister(void)
+{
+ thermal_unregister_governor(&thermal_gov_power_allocator);
+}
diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index 1efaadf436fa..bed3d2ef8ef3 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -1840,7 +1840,11 @@ static int __init thermal_register_governors(void)
if (result)
return result;
- return thermal_gov_user_space_register();
+ result = thermal_gov_user_space_register();
+ if (result)
+ return result;
+
+ return thermal_gov_power_allocator_register();
}
static void thermal_unregister_governors(void)
@@ -1848,6 +1852,7 @@ static void thermal_unregister_governors(void)
thermal_gov_step_wise_unregister();
thermal_gov_fair_share_unregister();
thermal_gov_user_space_unregister();
+ thermal_gov_power_allocator_unregister();
}
static int __init thermal_init(void)
diff --git a/drivers/thermal/thermal_core.h b/drivers/thermal/thermal_core.h
index 3db339fb636f..b24cde2c71cc 100644
--- a/drivers/thermal/thermal_core.h
+++ b/drivers/thermal/thermal_core.h
@@ -77,6 +77,14 @@ static inline int thermal_gov_user_space_register(void) { return 0; }
static inline void thermal_gov_user_space_unregister(void) {}
#endif /* CONFIG_THERMAL_GOV_USER_SPACE */
+#ifdef CONFIG_THERMAL_GOV_POWER_ALLOCATOR
+int thermal_gov_power_allocator_register(void);
+void thermal_gov_power_allocator_unregister(void);
+#else
+static inline int thermal_gov_power_allocator_register(void) { return 0; }
+static inline void thermal_gov_power_allocator_unregister(void) {}
+#endif /* CONFIG_THERMAL_GOV_POWER_ALLOCATOR */
+
/* device tree support */
#ifdef CONFIG_THERMAL_OF
int of_parse_thermal_zones(void);
diff --git a/include/linux/thermal.h b/include/linux/thermal.h
index 5986f546ca98..3861cb443520 100644
--- a/include/linux/thermal.h
+++ b/include/linux/thermal.h
@@ -57,6 +57,8 @@
#define DEFAULT_THERMAL_GOVERNOR "fair_share"
#elif defined(CONFIG_THERMAL_DEFAULT_GOV_USER_SPACE)
#define DEFAULT_THERMAL_GOVERNOR "user_space"
+#elif defined(CONFIG_THERMAL_DEFAULT_GOV_POWER_ALLOCATOR)
+#define DEFAULT_THERMAL_GOVERNOR "power_allocator"
#endif
extern struct list_head thermal_cdev_list;
@@ -250,6 +252,9 @@ struct thermal_zone_params {
int num_tbps; /* Number of tbp entries */
struct thermal_bind_params *tbp;
+
+ /* Maximum power (heat) that this thermal zone can dissipate in mW */
+ u32 max_dissipatable_power;
};
struct thermal_genl_event {
--
1.7.9.5
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [RFC PATCH 5/5] thermal: add trace events to the power allocator governor
2014-05-06 12:06 [RFC PATCH 0/5] The power allocator thermal governor Javi Merino
` (3 preceding siblings ...)
2014-05-06 12:06 ` [RFC PATCH 4/5] thermal: introduce the Power Allocator governor Javi Merino
@ 2014-05-06 12:06 ` Javi Merino
2014-05-06 12:28 ` Steven Rostedt
2014-05-14 10:05 ` [RFC PATCH 0/5] The power allocator thermal governor James King
5 siblings, 1 reply; 17+ messages in thread
From: Javi Merino @ 2014-05-06 12:06 UTC (permalink / raw)
To: linux-pm
Cc: Punit.Agrawal, Javi Merino, Zhang Rui, Eduardo Valentin,
Steven Rostedt, Frederic Weisbecker, Ingo Molnar
Add trace events for the power allocator governor and the power actor
interface of the cpu cooling device.
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Eduardo Valentin <edubezval@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Javi Merino <javi.merino@arm.com>
---
Hi trace maintainers,
We would like to add cpumask to a trace point. We currently allocate
a string and print the mask there with cpumask_scnprintf() which is a
bit heavyweight. Will it be possible to add this capability to ftrace
so that we can add the cpumask to the trace buffer and then trace-cmd
prints the mask when reporting?
drivers/thermal/cpu_cooling.c | 9 ++++++++
drivers/thermal/power_allocator.c | 12 ++++++++++-
include/trace/events/thermal.h | 34 ++++++++++++++++++++++++++++++
include/trace/events/thermal_governor.h | 35 +++++++++++++++++++++++++++++++
4 files changed, 89 insertions(+), 1 deletion(-)
create mode 100644 include/trace/events/thermal.h
create mode 100644 include/trace/events/thermal_governor.h
diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
index 8052cd1f4a39..3520e7a6b5c3 100644
--- a/drivers/thermal/cpu_cooling.c
+++ b/drivers/thermal/cpu_cooling.c
@@ -29,6 +29,9 @@
#include <linux/cpu.h>
#include <linux/cpu_cooling.h>
+#define CREATE_TRACE_POINTS
+#include <trace/events/thermal.h>
+
/* Maximum number of opps we can track */
#define MAX_NUM_OPPS 32
@@ -565,6 +568,8 @@ static int cpufreq_get_cur(struct thermal_cooling_device *cdev,
return 0;
}
+static char trace_cpus_str[DIV_ROUND_UP(NR_CPUS, 4) + 1];
+
/**
* cpufreq_set_cur - callback function to set the current cooling state/power.
* @cdev: thermal cooling device pointer.
@@ -601,6 +606,10 @@ static int cpufreq_set_cur(struct thermal_cooling_device *cdev,
freq, cpu);
return -EINVAL;
}
+
+ cpumask_scnprintf(trace_cpus_str, ARRAY_SIZE(trace_cpus_str),
+ &cpufreq_device->allowed_cpus);
+ trace_thermal_power_limit(trace_cpus_str, freq, state, val);
} else if (unit == THERMAL_UNIT_STATE) {
state = val;
} else {
diff --git a/drivers/thermal/power_allocator.c b/drivers/thermal/power_allocator.c
index 39c425abf10a..cde1fa1bd58a 100644
--- a/drivers/thermal/power_allocator.c
+++ b/drivers/thermal/power_allocator.c
@@ -19,6 +19,9 @@
#include <linux/slab.h>
#include <linux/thermal.h>
+#define CREATE_TRACE_POINTS
+#include <trace/events/thermal_governor.h>
+
#include "thermal_core.h"
#define MAX_NUM_ACTORS 8
@@ -122,7 +125,14 @@ static u32 pi_controller(struct thermal_zone_device *tz,
if (power_range > max_allocatable_power)
power_range = max_allocatable_power;
- return power_range >= 0 ? power_range : 0;
+ power_range = power_range >= 0 ? power_range : 0;
+
+ trace_thermal_power_allocator_pi(frac_to_int(err),
+ frac_to_int(params->err_integral),
+ frac_to_int(p), frac_to_int(i),
+ power_range);
+
+ return power_range;
}
static void divvy_up_power(struct thermal_zone_device *tz,
diff --git a/include/trace/events/thermal.h b/include/trace/events/thermal.h
new file mode 100644
index 000000000000..9e2c2c65e81b
--- /dev/null
+++ b/include/trace/events/thermal.h
@@ -0,0 +1,34 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM thermal
+
+#if !defined(_TRACE_THERMAL_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_THERMAL_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(thermal_power_limit,
+ TP_PROTO(const char *cpus, unsigned int freq, unsigned long cdev_state,
+ unsigned long power),
+ TP_ARGS(cpus, freq, cdev_state, power),
+ TP_STRUCT__entry(
+ __string(cpus, cpus)
+ __field(unsigned int, freq )
+ __field(unsigned long, cdev_state)
+ __field(unsigned long, power )
+ ),
+ TP_fast_assign(
+ __assign_str(cpus, cpus);
+ __entry->freq = freq;
+ __entry->cdev_state = cdev_state;
+ __entry->power = power;
+ ),
+
+ TP_printk("cpus=%s freq=%u cdev_state=%lu power=%lu",
+ __get_str(cpus), __entry->freq, __entry->cdev_state,
+ __entry->power)
+);
+
+#endif /* _TRACE_THERMAL_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/include/trace/events/thermal_governor.h b/include/trace/events/thermal_governor.h
new file mode 100644
index 000000000000..1fbf5c91f659
--- /dev/null
+++ b/include/trace/events/thermal_governor.h
@@ -0,0 +1,35 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM thermal_governor
+
+#if !defined(_TRACE_THERMAL_GOVERNOR_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_THERMAL_GOVERNOR_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(thermal_power_allocator_pi,
+ TP_PROTO(s32 err, s32 err_integral, s64 p, s64 i, s32 output),
+ TP_ARGS(err, err_integral, p, i, output),
+ TP_STRUCT__entry(
+ __field(s32, err )
+ __field(s32, err_integral)
+ __field(s64, p )
+ __field(s64, i )
+ __field(s32, output )
+ ),
+ TP_fast_assign(
+ __entry->err = err;
+ __entry->err_integral = err_integral;
+ __entry->p = p;
+ __entry->i = i;
+ __entry->output = output;
+ ),
+
+ TP_printk("err=%d err_integral=%d p=%lld i=%lld output=%d",
+ __entry->err, __entry->err_integral,
+ __entry->p, __entry->i, __entry->output)
+);
+
+#endif /* _TRACE_THERMAL_GOVERNOR_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
--
1.7.9.5
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 5/5] thermal: add trace events to the power allocator governor
2014-05-06 12:06 ` [RFC PATCH 5/5] thermal: add trace events to the power allocator governor Javi Merino
@ 2014-05-06 12:28 ` Steven Rostedt
0 siblings, 0 replies; 17+ messages in thread
From: Steven Rostedt @ 2014-05-06 12:28 UTC (permalink / raw)
To: Javi Merino
Cc: linux-pm, Punit.Agrawal, Zhang Rui, Eduardo Valentin,
Frederic Weisbecker, Ingo Molnar
On Tue, 6 May 2014 13:06:38 +0100
Javi Merino <javi.merino@arm.com> wrote:
> Hi trace maintainers,
Hi Javi,
>
> We would like to add cpumask to a trace point. We currently allocate
> a string and print the mask there with cpumask_scnprintf() which is a
> bit heavyweight. Will it be possible to add this capability to ftrace
> so that we can add the cpumask to the trace buffer and then trace-cmd
> prints the mask when reporting?
Hmm, that shouldn't be too hard. I would suggest making it work like a
dynamic array, as we don't want to store more bits than necessary.
I'll see if I can whip something up.
-- Steve
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 1/5] thermal: let governors have private data for each thermal zone
2014-05-06 12:06 ` [RFC PATCH 1/5] thermal: let governors have private data for each thermal zone Javi Merino
@ 2014-05-13 14:31 ` edubezval
2014-05-15 14:30 ` Javi Merino
0 siblings, 1 reply; 17+ messages in thread
From: edubezval @ 2014-05-13 14:31 UTC (permalink / raw)
To: Javi Merino; +Cc: linux-pm, Punit Agrawal, Zhang Rui
Hello Javi,
Thanks for your patch. Find some comment inline.
On Tue, May 06, 2014 at 01:06:34PM +0100, Javi Merino wrote:
> A governor may need to store its current state between calls to
> throttle(). That state depends on the thermal zone, so store it as
> private data in struct thermal_zone_device.
>
> The governors may have two new ops: bind_to_tz() and unbind_from_tz().
> When provided, these functions let governors do some initialization
> and teardown when they are bound/unbound to a tz and possibly store that
> information in the governor_data field of the struct
> thermal_zone_device.
Overall the idea is fine to me, although so far we have no case of such
need in the current governors. I will have a look on your proposed
governor to check futher on this API.
>
> Cc: Zhang Rui <rui.zhang@intel.com>
> Cc: Eduardo Valentin <edubezval@gmail.com>
> Signed-off-by: Javi Merino <javi.merino@arm.com>
> ---
> drivers/thermal/thermal_core.c | 67 +++++++++++++++++++++++++++++++++++-----
> include/linux/thermal.h | 3 ++
> 2 files changed, 62 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> index 71b0ec0c370d..ac8e5990a63e 100644
> --- a/drivers/thermal/thermal_core.c
> +++ b/drivers/thermal/thermal_core.c
> @@ -72,6 +72,45 @@ static struct thermal_governor *__find_governor(const char *name)
> return NULL;
> }
>
> +/**
> + * thermal_set_governor() - Switch to another governor
> + * @tz: a valid pointer to a struct thermal_zone_device
> + * @new_gov: pointer to the new governor
> + *
> + * Change the governor of thermal zone @tz.
> + *
> + * Returns 0 on success, an error if the new governor's bind_to_tz() failed.
> + */
> +static int thermal_set_governor(struct thermal_zone_device *tz,
> + struct thermal_governor *new_gov)
> +{
> + int ret = 0;
> +
> + if (tz->governor && tz->governor->unbind_from_tz)
> + tz->governor->unbind_from_tz(tz);
> +
> + if (new_gov && new_gov->bind_to_tz) {
> + ret = new_gov->bind_to_tz(tz);
> + if (ret) {
> + /* Try to register the old governor again */
I suggest to state in the Documentation that a governor may be called
to bind and unbind to same thermal zone several times, as per above code.
> + if (tz->governor && tz->governor->bind_to_tz) {
> + if (tz->governor->bind_to_tz(tz))
> + /*
> + * The new governor failed to register
> + * and the previous one failed as well
> + */
I suggest having a dev_warn here to notify that a zone is now orphan due to
switching from one governor to another.
> + tz->governor = NULL;
> + }
> +
> + return ret;
> + }
> + }
> +
> + tz->governor = new_gov;
> +
> + return ret;
> +}
> +
> int thermal_register_governor(struct thermal_governor *governor)
> {
> int err;
> @@ -104,8 +143,12 @@ int thermal_register_governor(struct thermal_governor *governor)
>
> name = pos->tzp->governor_name;
>
> - if (!strnicmp(name, governor->name, THERMAL_NAME_LENGTH))
> - pos->governor = governor;
> + if (!strnicmp(name, governor->name, THERMAL_NAME_LENGTH)) {
> + int ret = thermal_set_governor(pos, governor);
nip: add a blank line between declarations and first statement.
> + if (ret)
> + pr_warn("Failed to set governor %s for zone %d: %d\n",
> + governor->name, pos->id, ret);
dev_* family of logging functions is preferred in this file.
> + }
> }
>
> mutex_unlock(&thermal_list_lock);
> @@ -131,7 +174,7 @@ void thermal_unregister_governor(struct thermal_governor *governor)
> list_for_each_entry(pos, &thermal_tz_list, node) {
> if (!strnicmp(pos->governor->name, governor->name,
> THERMAL_NAME_LENGTH))
> - pos->governor = NULL;
> + thermal_set_governor(pos, NULL);
> }
>
> mutex_unlock(&thermal_list_lock);
> @@ -756,8 +799,9 @@ policy_store(struct device *dev, struct device_attribute *attr,
> if (!gov)
> goto exit;
>
> - tz->governor = gov;
> - ret = count;
> + ret = thermal_set_governor(tz, gov);
> + if (!ret)
> + ret = count;
>
> exit:
> mutex_unlock(&thermal_governor_lock);
> @@ -1452,6 +1496,7 @@ struct thermal_zone_device *thermal_zone_device_register(const char *type,
> int result;
> int count;
> int passive = 0;
> + struct thermal_governor *governor;
>
> if (type && strlen(type) >= THERMAL_NAME_LENGTH)
> return ERR_PTR(-EINVAL);
> @@ -1542,9 +1587,15 @@ struct thermal_zone_device *thermal_zone_device_register(const char *type,
> mutex_lock(&thermal_governor_lock);
>
> if (tz->tzp)
> - tz->governor = __find_governor(tz->tzp->governor_name);
> + governor = __find_governor(tz->tzp->governor_name);
> else
> - tz->governor = def_governor;
> + governor = def_governor;
> +
> + result = thermal_set_governor(tz, governor);
> + if (result) {
> + mutex_unlock(&thermal_governor_lock);
> + goto unregister;
> + }
>
> mutex_unlock(&thermal_governor_lock);
>
> @@ -1634,7 +1685,7 @@ void thermal_zone_device_unregister(struct thermal_zone_device *tz)
> device_remove_file(&tz->device, &dev_attr_mode);
> device_remove_file(&tz->device, &dev_attr_policy);
> remove_trip_attrs(tz);
> - tz->governor = NULL;
> + thermal_set_governor(tz, NULL);
>
> thermal_remove_hwmon_sysfs(tz);
> release_idr(&thermal_tz_idr, &thermal_idr_lock, tz->id);
> diff --git a/include/linux/thermal.h b/include/linux/thermal.h
> index f7e11c7ea7d9..baac212f815e 100644
> --- a/include/linux/thermal.h
> +++ b/include/linux/thermal.h
> @@ -177,6 +177,7 @@ struct thermal_zone_device {
> struct thermal_zone_device_ops *ops;
> const struct thermal_zone_params *tzp;
> struct thermal_governor *governor;
> + void *governor_data;
Please document this entry.
> struct list_head thermal_instances;
> struct idr idr;
> struct mutex lock; /* protect thermal_instances list */
> @@ -187,6 +188,8 @@ struct thermal_zone_device {
> /* Structure that holds thermal governor information */
> struct thermal_governor {
> char name[THERMAL_NAME_LENGTH];
> + int (*bind_to_tz)(struct thermal_zone_device *tz);
> + void (*unbind_from_tz)(struct thermal_zone_device *tz);
a governor cannot prevent unbinding?
Please, document the proposed operations.
> int (*throttle)(struct thermal_zone_device *tz, int trip);
> struct list_head governor_list;
> };
> --
> 1.7.9.5
>
>
I request that you update Documentation/thermal/sysfs-api.txt too.
BR,
Eduardo Valentin,
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 2/5] thermal: let cooling devices operate on other units other than state
2014-05-06 12:06 ` [RFC PATCH 2/5] thermal: let cooling devices operate on other units other than state Javi Merino
@ 2014-05-13 15:22 ` edubezval
0 siblings, 0 replies; 17+ messages in thread
From: edubezval @ 2014-05-13 15:22 UTC (permalink / raw)
To: Javi Merino; +Cc: linux-pm, Punit Agrawal, Zhang Rui
Hello Javi,
On Tue, May 06, 2014 at 01:06:35PM +0100, Javi Merino wrote:
> Currently, cooling devices operate on state, even though some (like
> the cpufreq cooling device) have exported functions to convert
> frequency to state. Generalize this interface so that governors can
> operate on other units (e.g. frequency, power,...).
>
> Suggested-by: Eduardo Valentin <edubezval@gmail.com>
> Cc: Zhang Rui <rui.zhang@intel.com>
> Signed-off-by: Javi Merino <javi.merino@arm.com>
> ---
> drivers/acpi/fan.c | 22 ++++++++++++++------
> drivers/acpi/processor_thermal.c | 23 +++++++++++++++------
> drivers/acpi/video.c | 22 ++++++++++++++------
> drivers/platform/x86/acerhdf.c | 24 ++++++++++++++++------
> drivers/platform/x86/intel_menlow.c | 26 +++++++++++++++++-------
> drivers/power/power_supply_core.c | 24 ++++++++++++++++------
> drivers/thermal/cpu_cooling.c | 38 ++++++++++++++++++++++++++---------
> drivers/thermal/db8500_thermal.c | 2 +-
> drivers/thermal/fair_share.c | 2 +-
> drivers/thermal/intel_powerclamp.c | 24 ++++++++++++++++------
> drivers/thermal/step_wise.c | 2 +-
> drivers/thermal/thermal_core.c | 13 ++++++------
> include/linux/thermal.h | 13 +++++++++---
> 13 files changed, 169 insertions(+), 66 deletions(-)
>
This change must be reflected under Documentation/thermal/sysfs-api.txt
too.
> diff --git a/drivers/acpi/fan.c b/drivers/acpi/fan.c
> index 8acf53e62966..55346deb8202 100644
> --- a/drivers/acpi/fan.c
> +++ b/drivers/acpi/fan.c
> @@ -79,20 +79,26 @@ static struct acpi_driver acpi_fan_driver = {
>
> /* thermal cooling device callbacks */
> static int fan_get_max_state(struct thermal_cooling_device *cdev, unsigned long
> - *state)
> + *state, enum thermal_cooling_unit unit)
> {
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
> /* ACPI fan device only support two states: ON/OFF */
> *state = 1;
> return 0;
> }
>
> static int fan_get_cur_state(struct thermal_cooling_device *cdev, unsigned long
> - *state)
> + *state, enum thermal_cooling_unit unit)
> {
> struct acpi_device *device = cdev->devdata;
> int result;
> int acpi_state = ACPI_STATE_D0;
>
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
> if (!device)
> return -EINVAL;
>
> @@ -106,11 +112,15 @@ static int fan_get_cur_state(struct thermal_cooling_device *cdev, unsigned long
> }
>
> static int
> -fan_set_cur_state(struct thermal_cooling_device *cdev, unsigned long state)
> +fan_set_cur_state(struct thermal_cooling_device *cdev, unsigned long state,
> + enum thermal_cooling_unit unit)
> {
> struct acpi_device *device = cdev->devdata;
> int result;
>
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
> if (!device || (state != 0 && state != 1))
> return -EINVAL;
>
> @@ -121,9 +131,9 @@ fan_set_cur_state(struct thermal_cooling_device *cdev, unsigned long state)
> }
>
> static const struct thermal_cooling_device_ops fan_cooling_ops = {
> - .get_max_state = fan_get_max_state,
> - .get_cur_state = fan_get_cur_state,
> - .set_cur_state = fan_set_cur_state,
> + .get_max = fan_get_max_state,
> + .get_cur = fan_get_cur_state,
> + .set_cur = fan_set_cur_state,
> };
>
> /* --------------------------------------------------------------------------
> diff --git a/drivers/acpi/processor_thermal.c b/drivers/acpi/processor_thermal.c
> index e003663b2f8e..48179ed9afe5 100644
> --- a/drivers/acpi/processor_thermal.c
> +++ b/drivers/acpi/processor_thermal.c
> @@ -200,11 +200,15 @@ static int acpi_processor_max_state(struct acpi_processor *pr)
> }
> static int
> processor_get_max_state(struct thermal_cooling_device *cdev,
> - unsigned long *state)
> + unsigned long *state,
> + enum thermal_cooling_unit unit)
> {
> struct acpi_device *device = cdev->devdata;
> struct acpi_processor *pr;
>
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
> if (!device)
> return -EINVAL;
>
> @@ -218,11 +222,15 @@ processor_get_max_state(struct thermal_cooling_device *cdev,
>
> static int
> processor_get_cur_state(struct thermal_cooling_device *cdev,
> - unsigned long *cur_state)
> + unsigned long *cur_state,
> + enum thermal_cooling_unit unit)
> {
> struct acpi_device *device = cdev->devdata;
> struct acpi_processor *pr;
>
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
> if (!device)
> return -EINVAL;
>
> @@ -238,13 +246,16 @@ processor_get_cur_state(struct thermal_cooling_device *cdev,
>
> static int
> processor_set_cur_state(struct thermal_cooling_device *cdev,
> - unsigned long state)
> + unsigned long state, enum thermal_cooling_unit unit)
> {
> struct acpi_device *device = cdev->devdata;
> struct acpi_processor *pr;
> int result = 0;
> int max_pstate;
>
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
> if (!device)
> return -EINVAL;
>
> @@ -270,7 +281,7 @@ processor_set_cur_state(struct thermal_cooling_device *cdev,
> }
>
> const struct thermal_cooling_device_ops processor_cooling_ops = {
> - .get_max_state = processor_get_max_state,
> - .get_cur_state = processor_get_cur_state,
> - .set_cur_state = processor_set_cur_state,
> + .get_max = processor_get_max_state,
> + .get_cur = processor_get_cur_state,
> + .set_cur = processor_set_cur_state,
> };
> diff --git a/drivers/acpi/video.c b/drivers/acpi/video.c
> index 8b6990e417ec..f0aeb4d1ba91 100644
> --- a/drivers/acpi/video.c
> +++ b/drivers/acpi/video.c
> @@ -282,23 +282,29 @@ static const struct backlight_ops acpi_backlight_ops = {
>
> /* thermal cooling device callbacks */
> static int video_get_max_state(struct thermal_cooling_device *cooling_dev, unsigned
> - long *state)
> + long *state, enum thermal_cooling_unit unit)
> {
> struct acpi_device *device = cooling_dev->devdata;
> struct acpi_video_device *video = acpi_driver_data(device);
>
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
> *state = video->brightness->count - 3;
> return 0;
> }
>
> static int video_get_cur_state(struct thermal_cooling_device *cooling_dev, unsigned
> - long *state)
> + long *state, enum thermal_cooling_unit unit)
> {
> struct acpi_device *device = cooling_dev->devdata;
> struct acpi_video_device *video = acpi_driver_data(device);
> unsigned long long level;
> int offset;
>
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
> if (acpi_video_device_lcd_get_level_current(video, &level, false))
> return -EINVAL;
> for (offset = 2; offset < video->brightness->count; offset++)
> @@ -311,12 +317,16 @@ static int video_get_cur_state(struct thermal_cooling_device *cooling_dev, unsig
> }
>
> static int
> -video_set_cur_state(struct thermal_cooling_device *cooling_dev, unsigned long state)
> +video_set_cur_state(struct thermal_cooling_device *cooling_dev,
> + unsigned long state, enum thermal_cooling_unit unit)
> {
> struct acpi_device *device = cooling_dev->devdata;
> struct acpi_video_device *video = acpi_driver_data(device);
> int level;
>
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
> if (state >= video->brightness->count - 2)
> return -EINVAL;
>
> @@ -326,9 +336,9 @@ video_set_cur_state(struct thermal_cooling_device *cooling_dev, unsigned long st
> }
>
> static const struct thermal_cooling_device_ops video_cooling_ops = {
> - .get_max_state = video_get_max_state,
> - .get_cur_state = video_get_cur_state,
> - .set_cur_state = video_set_cur_state,
> + .get_max = video_get_max_state,
> + .get_cur = video_get_cur_state,
> + .set_cur = video_set_cur_state,
> };
>
> /*
> diff --git a/drivers/platform/x86/acerhdf.c b/drivers/platform/x86/acerhdf.c
> index f94467c05225..0bb85e240d93 100644
> --- a/drivers/platform/x86/acerhdf.c
> +++ b/drivers/platform/x86/acerhdf.c
> @@ -439,18 +439,26 @@ static struct thermal_zone_device_ops acerhdf_dev_ops = {
> * get maximal fan cooling state
> */
> static int acerhdf_get_max_state(struct thermal_cooling_device *cdev,
> - unsigned long *state)
> + unsigned long *state,
> + enum thermal_cooling_unit unit)
> {
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
> *state = 1;
>
> return 0;
> }
>
> static int acerhdf_get_cur_state(struct thermal_cooling_device *cdev,
> - unsigned long *state)
> + unsigned long *state,
> + enum thermal_cooling_unit unit)
> {
> int err = 0, tmp;
>
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
> err = acerhdf_get_fanstate(&tmp);
> if (err)
> return err;
> @@ -461,10 +469,14 @@ static int acerhdf_get_cur_state(struct thermal_cooling_device *cdev,
>
> /* change current fan state - is overwritten when running in kernel mode */
> static int acerhdf_set_cur_state(struct thermal_cooling_device *cdev,
> - unsigned long state)
> + unsigned long state,
> + enum thermal_cooling_unit unit)
> {
> int cur_temp, cur_state, err = 0;
>
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
> if (!kernelmode)
> return 0;
>
> @@ -498,9 +510,9 @@ err_out:
>
> /* bind fan callbacks to fan device */
> static struct thermal_cooling_device_ops acerhdf_cooling_ops = {
> - .get_max_state = acerhdf_get_max_state,
> - .get_cur_state = acerhdf_get_cur_state,
> - .set_cur_state = acerhdf_set_cur_state,
> + .get_max = acerhdf_get_max_state,
> + .get_cur = acerhdf_get_cur_state,
> + .set_cur = acerhdf_set_cur_state,
> };
>
> /* suspend / resume functionality */
> diff --git a/drivers/platform/x86/intel_menlow.c b/drivers/platform/x86/intel_menlow.c
> index e8b46d2c468c..9792da407c46 100644
> --- a/drivers/platform/x86/intel_menlow.c
> +++ b/drivers/platform/x86/intel_menlow.c
> @@ -61,7 +61,8 @@ static void intel_menlow_unregister_sensor(void);
> * GTHS returning '0' would mean that no bandwidth control states are supported
> */
> static int memory_get_max_bandwidth(struct thermal_cooling_device *cdev,
> - unsigned long *max_state)
> + unsigned long *max_state,
> + enum thermal_cooling_unit unit)
> {
> struct acpi_device *device = cdev->devdata;
> acpi_handle handle = device->handle;
> @@ -70,6 +71,9 @@ static int memory_get_max_bandwidth(struct thermal_cooling_device *cdev,
> union acpi_object arg;
> acpi_status status = AE_OK;
>
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
> arg_list.count = 1;
> arg_list.pointer = &arg;
> arg.type = ACPI_TYPE_INTEGER;
> @@ -87,7 +91,8 @@ static int memory_get_max_bandwidth(struct thermal_cooling_device *cdev,
> }
>
> static int memory_get_cur_bandwidth(struct thermal_cooling_device *cdev,
> - unsigned long *value)
> + unsigned long *value,
> + enum thermal_cooling_unit unit)
> {
> struct acpi_device *device = cdev->devdata;
> acpi_handle handle = device->handle;
> @@ -96,6 +101,9 @@ static int memory_get_cur_bandwidth(struct thermal_cooling_device *cdev,
> union acpi_object arg;
> acpi_status status = AE_OK;
>
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
> arg_list.count = 1;
> arg_list.pointer = &arg;
> arg.type = ACPI_TYPE_INTEGER;
> @@ -110,7 +118,8 @@ static int memory_get_cur_bandwidth(struct thermal_cooling_device *cdev,
> }
>
> static int memory_set_cur_bandwidth(struct thermal_cooling_device *cdev,
> - unsigned long state)
> + unsigned long state,
> + enum thermal_cooling_unit unit)
> {
> struct acpi_device *device = cdev->devdata;
> acpi_handle handle = device->handle;
> @@ -120,7 +129,10 @@ static int memory_set_cur_bandwidth(struct thermal_cooling_device *cdev,
> unsigned long long temp;
> unsigned long max_state;
>
> - if (memory_get_max_bandwidth(cdev, &max_state))
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
> + if (memory_get_max_bandwidth(cdev, &max_state, THERMAL_UNIT_STATE))
> return -EFAULT;
>
> if (state > max_state)
> @@ -143,9 +155,9 @@ static int memory_set_cur_bandwidth(struct thermal_cooling_device *cdev,
> }
>
> static struct thermal_cooling_device_ops memory_cooling_ops = {
> - .get_max_state = memory_get_max_bandwidth,
> - .get_cur_state = memory_get_cur_bandwidth,
> - .set_cur_state = memory_set_cur_bandwidth,
> + .get_max = memory_get_max_bandwidth,
> + .get_cur = memory_get_cur_bandwidth,
> + .set_cur = memory_set_cur_bandwidth,
> };
>
> /*
> diff --git a/drivers/power/power_supply_core.c b/drivers/power/power_supply_core.c
> index 26606641fe44..69dc2901a788 100644
> --- a/drivers/power/power_supply_core.c
> +++ b/drivers/power/power_supply_core.c
> @@ -440,12 +440,16 @@ static void psy_unregister_thermal(struct power_supply *psy)
>
> /* thermal cooling device callbacks */
> static int ps_get_max_charge_cntl_limit(struct thermal_cooling_device *tcd,
> - unsigned long *state)
> + unsigned long *state,
> + enum thermal_cooling_unit unit)
> {
> struct power_supply *psy;
> union power_supply_propval val;
> int ret;
>
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
> psy = tcd->devdata;
> ret = psy->get_property(psy,
> POWER_SUPPLY_PROP_CHARGE_CONTROL_LIMIT_MAX, &val);
> @@ -456,12 +460,16 @@ static int ps_get_max_charge_cntl_limit(struct thermal_cooling_device *tcd,
> }
>
> static int ps_get_cur_chrage_cntl_limit(struct thermal_cooling_device *tcd,
> - unsigned long *state)
> + unsigned long *state,
> + enum thermal_cooling_unit unit)
> {
> struct power_supply *psy;
> union power_supply_propval val;
> int ret;
>
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
> psy = tcd->devdata;
> ret = psy->get_property(psy,
> POWER_SUPPLY_PROP_CHARGE_CONTROL_LIMIT, &val);
> @@ -472,12 +480,16 @@ static int ps_get_cur_chrage_cntl_limit(struct thermal_cooling_device *tcd,
> }
>
> static int ps_set_cur_charge_cntl_limit(struct thermal_cooling_device *tcd,
> - unsigned long state)
> + unsigned long state,
> + enum thermal_cooling_unit unit)
> {
> struct power_supply *psy;
> union power_supply_propval val;
> int ret;
>
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
> psy = tcd->devdata;
> val.intval = state;
> ret = psy->set_property(psy,
> @@ -487,9 +499,9 @@ static int ps_set_cur_charge_cntl_limit(struct thermal_cooling_device *tcd,
> }
>
> static struct thermal_cooling_device_ops psy_tcd_ops = {
> - .get_max_state = ps_get_max_charge_cntl_limit,
> - .get_cur_state = ps_get_cur_chrage_cntl_limit,
> - .set_cur_state = ps_set_cur_charge_cntl_limit,
> + .get_max = ps_get_max_charge_cntl_limit,
> + .get_cur = ps_get_cur_chrage_cntl_limit,
> + .set_cur = ps_set_cur_charge_cntl_limit,
> };
>
> static int psy_register_cooler(struct power_supply *psy)
> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
> index 4246262c4bd2..f3f4e6b5798e 100644
> --- a/drivers/thermal/cpu_cooling.c
> +++ b/drivers/thermal/cpu_cooling.c
> @@ -348,14 +348,17 @@ static int cpufreq_thermal_notifier(struct notifier_block *nb,
> * cpufreq_get_max_state - callback function to get the max cooling state.
> * @cdev: thermal cooling device pointer.
> * @state: fill this variable with the max cooling state.
> + * @unit: the units in which @state should be in.
> *
> - * Callback for the thermal cooling device to return the cpufreq
> - * max cooling state.
> + * Callback for the thermal cooling device to return the cpufreq max
> + * cooling state. Currently only THERMAL_UNIT_STATE is valid for
> + * @unit.
> *
> * Return: 0 on success, an error code otherwise.
> */
> static int cpufreq_get_max_state(struct thermal_cooling_device *cdev,
> - unsigned long *state)
> + unsigned long *state,
> + enum thermal_cooling_unit unit)
> {
> struct cpufreq_cooling_device *cpufreq_device = cdev->devdata;
> struct cpumask *mask = &cpufreq_device->allowed_cpus;
> @@ -363,6 +366,9 @@ static int cpufreq_get_max_state(struct thermal_cooling_device *cdev,
> unsigned int count = 0;
> int ret;
>
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
> cpu = cpumask_any(mask);
>
> ret = get_property(cpu, 0, &count, GET_MAXL);
> @@ -377,17 +383,23 @@ static int cpufreq_get_max_state(struct thermal_cooling_device *cdev,
> * cpufreq_get_cur_state - callback function to get the current cooling state.
> * @cdev: thermal cooling device pointer.
> * @state: fill this variable with the current cooling state.
> + * @unit: the units in which state should be in
> *
> * Callback for the thermal cooling device to return the cpufreq
> - * current cooling state.
> + * current cooling state. Currently only THERMAL_UNIT_STATE is valid
> + * for @unit.
> *
> * Return: 0 on success, an error code otherwise.
> */
> static int cpufreq_get_cur_state(struct thermal_cooling_device *cdev,
> - unsigned long *state)
> + unsigned long *state,
> + enum thermal_cooling_unit unit)
> {
> struct cpufreq_cooling_device *cpufreq_device = cdev->devdata;
>
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
> *state = cpufreq_device->cpufreq_state;
>
> return 0;
> @@ -397,25 +409,31 @@ static int cpufreq_get_cur_state(struct thermal_cooling_device *cdev,
> * cpufreq_set_cur_state - callback function to set the current cooling state.
> * @cdev: thermal cooling device pointer.
> * @state: set this variable to the current cooling state.
> + * @unit: the units in which @state should be in.
> *
> * Callback for the thermal cooling device to change the cpufreq
> - * current cooling state.
> + * current cooling state. Currently only THERMAL_UNIT_STATE is valid for
> + * @unit.
> *
> * Return: 0 on success, an error code otherwise.
> */
> static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
> - unsigned long state)
> + unsigned long state,
> + enum thermal_cooling_unit unit)
> {
> struct cpufreq_cooling_device *cpufreq_device = cdev->devdata;
>
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
> return cpufreq_apply_cooling(cpufreq_device, state);
> }
>
> /* Bind cpufreq callbacks to thermal cooling device ops */
> static struct thermal_cooling_device_ops const cpufreq_cooling_ops = {
> - .get_max_state = cpufreq_get_max_state,
> - .get_cur_state = cpufreq_get_cur_state,
> - .set_cur_state = cpufreq_set_cur_state,
> + .get_max = cpufreq_get_max_state,
> + .get_cur = cpufreq_get_cur_state,
> + .set_cur = cpufreq_set_cur_state,
> };
>
> /* Notifier for cpufreq policy change */
> diff --git a/drivers/thermal/db8500_thermal.c b/drivers/thermal/db8500_thermal.c
> index 1e3b3bf9f993..0842a2ef88b4 100644
> --- a/drivers/thermal/db8500_thermal.c
> +++ b/drivers/thermal/db8500_thermal.c
> @@ -67,7 +67,7 @@ static int db8500_cdev_bind(struct thermal_zone_device *thermal,
> unsigned long max_state, upper, lower;
> int i, ret = -EINVAL;
>
> - cdev->ops->get_max_state(cdev, &max_state);
> + cdev->ops->get_max(cdev, &max_state, THERMAL_UNIT_STATE);
>
> for (i = 0; i < ptrips->num_trips; i++) {
> if (db8500_thermal_match_cdev(cdev, &ptrips->trip_points[i]))
> diff --git a/drivers/thermal/fair_share.c b/drivers/thermal/fair_share.c
> index 944ba2f340c8..8e6eb4883bb6 100644
> --- a/drivers/thermal/fair_share.c
> +++ b/drivers/thermal/fair_share.c
> @@ -51,7 +51,7 @@ static long get_target_state(struct thermal_zone_device *tz,
> {
> unsigned long max_state;
>
> - cdev->ops->get_max_state(cdev, &max_state);
> + cdev->ops->get_max(cdev, &max_state, THERMAL_UNIT_STATE);
>
> return (long)(weight * level * max_state) / (100 * tz->trips);
> }
> diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
> index a084325f1386..c32bee16b17a 100644
> --- a/drivers/thermal/intel_powerclamp.c
> +++ b/drivers/thermal/intel_powerclamp.c
> @@ -616,16 +616,24 @@ static struct notifier_block powerclamp_cpu_notifier = {
> };
>
> static int powerclamp_get_max_state(struct thermal_cooling_device *cdev,
> - unsigned long *state)
> + unsigned long *state,
> + enum thermal_cooling_unit unit)
> {
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
> *state = MAX_TARGET_RATIO;
>
> return 0;
> }
>
> static int powerclamp_get_cur_state(struct thermal_cooling_device *cdev,
> - unsigned long *state)
> + unsigned long *state,
> + enum thermal_cooling_unit unit)
> {
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
> if (true == clamping)
> *state = pkg_cstate_ratio_cur;
> else
> @@ -636,10 +644,14 @@ static int powerclamp_get_cur_state(struct thermal_cooling_device *cdev,
> }
>
> static int powerclamp_set_cur_state(struct thermal_cooling_device *cdev,
> - unsigned long new_target_ratio)
> + unsigned long new_target_ratio,
> + enum thermal_cooling_unit unit)
> {
> int ret = 0;
>
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
> new_target_ratio = clamp(new_target_ratio, 0UL,
> (unsigned long) (MAX_TARGET_RATIO-1));
> if (set_target_ratio == 0 && new_target_ratio > 0) {
> @@ -663,9 +675,9 @@ exit_set:
>
> /* bind to generic thermal layer as cooling device*/
> static struct thermal_cooling_device_ops powerclamp_cooling_ops = {
> - .get_max_state = powerclamp_get_max_state,
> - .get_cur_state = powerclamp_get_cur_state,
> - .set_cur_state = powerclamp_set_cur_state,
> + .get_max = powerclamp_get_max_state,
> + .get_cur = powerclamp_get_cur_state,
> + .set_cur = powerclamp_set_cur_state,
> };
>
> /* runs on Nehalem and later */
> diff --git a/drivers/thermal/step_wise.c b/drivers/thermal/step_wise.c
> index f251521baaa2..2a2cd4eba3c3 100644
> --- a/drivers/thermal/step_wise.c
> +++ b/drivers/thermal/step_wise.c
> @@ -58,7 +58,7 @@ static unsigned long get_target_state(struct thermal_instance *instance,
> * Otherwise, we use the current state of the
> * cdev in use to determine the next_target.
> */
> - cdev->ops->get_cur_state(cdev, &cur_state);
> + cdev->ops->get_cur(cdev, &cur_state, THERMAL_UNIT_STATE);
> next_target = instance->target;
> dev_dbg(&cdev->device, "cur_state=%ld\n", cur_state);
>
> diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> index ac8e5990a63e..5b46603cc7cd 100644
> --- a/drivers/thermal/thermal_core.c
> +++ b/drivers/thermal/thermal_core.c
> @@ -871,7 +871,7 @@ thermal_cooling_device_max_state_show(struct device *dev,
> unsigned long state;
> int ret;
>
> - ret = cdev->ops->get_max_state(cdev, &state);
> + ret = cdev->ops->get_max(cdev, &state, THERMAL_UNIT_STATE);
> if (ret)
> return ret;
> return sprintf(buf, "%ld\n", state);
> @@ -885,7 +885,7 @@ thermal_cooling_device_cur_state_show(struct device *dev,
> unsigned long state;
> int ret;
>
> - ret = cdev->ops->get_cur_state(cdev, &state);
> + ret = cdev->ops->get_cur(cdev, &state, THERMAL_UNIT_STATE);
> if (ret)
> return ret;
> return sprintf(buf, "%ld\n", state);
> @@ -906,7 +906,7 @@ thermal_cooling_device_cur_state_store(struct device *dev,
> if ((long)state < 0)
> return -EINVAL;
>
> - result = cdev->ops->set_cur_state(cdev, state);
> + result = cdev->ops->set_cur(cdev, state, THERMAL_UNIT_STATE);
> if (result)
> return result;
> return count;
> @@ -983,7 +983,7 @@ int thermal_zone_bind_cooling_device(struct thermal_zone_device *tz,
> if (tz != pos1 || cdev != pos2)
> return -EINVAL;
>
> - cdev->ops->get_max_state(cdev, &max_state);
> + cdev->ops->get_max(cdev, &max_state, THERMAL_UNIT_STATE);
>
> /* lower default 0, upper default max_state */
> lower = lower == THERMAL_NO_LIMIT ? 0 : lower;
> @@ -1143,8 +1143,7 @@ __thermal_cooling_device_register(struct device_node *np,
> if (type && strlen(type) >= THERMAL_NAME_LENGTH)
> return ERR_PTR(-EINVAL);
>
> - if (!ops || !ops->get_max_state || !ops->get_cur_state ||
> - !ops->set_cur_state)
> + if (!ops || !ops->get_max || !ops->get_cur || !ops->set_cur)
> return ERR_PTR(-EINVAL);
>
> cdev = kzalloc(sizeof(struct thermal_cooling_device), GFP_KERNEL);
> @@ -1329,7 +1328,7 @@ void thermal_cdev_update(struct thermal_cooling_device *cdev)
> target = instance->target;
> }
> mutex_unlock(&cdev->lock);
> - cdev->ops->set_cur_state(cdev, target);
> + cdev->ops->set_cur(cdev, target, THERMAL_UNIT_STATE);
> cdev->updated = true;
> dev_dbg(&cdev->device, "set to state %lu\n", target);
> }
I believe the sysfs API needs to be refactored too. Here we would need
to have a new entry 'unit' to represent what unit the cooling device
supports.
> diff --git a/include/linux/thermal.h b/include/linux/thermal.h
> index baac212f815e..8d183b8255eb 100644
> --- a/include/linux/thermal.h
> +++ b/include/linux/thermal.h
> @@ -106,6 +106,10 @@ enum {
> };
> #define THERMAL_GENL_CMD_MAX (__THERMAL_GENL_CMD_MAX - 1)
>
> +enum thermal_cooling_unit {
> + THERMAL_UNIT_STATE,
> +};
> +
> struct thermal_zone_device_ops {
> int (*bind) (struct thermal_zone_device *,
> struct thermal_cooling_device *);
> @@ -135,9 +139,12 @@ struct thermal_zone_device_ops {
> };
>
> struct thermal_cooling_device_ops {
> - int (*get_max_state) (struct thermal_cooling_device *, unsigned long *);
> - int (*get_cur_state) (struct thermal_cooling_device *, unsigned long *);
> - int (*set_cur_state) (struct thermal_cooling_device *, unsigned long);
> + int (*get_max)(struct thermal_cooling_device *, unsigned long *,
> + enum thermal_cooling_unit);
> + int (*get_cur)(struct thermal_cooling_device *, unsigned long *,
> + enum thermal_cooling_unit);
> + int (*set_cur)(struct thermal_cooling_device *, unsigned long,
> + enum thermal_cooling_unit);
> };
Now, reading this patch again, I realize that this API may be improved.
How about doing a handshake between the cooling device and the governor
instead of sending the required unit on every message
That is, the thermal core would ask the cooling device its supported
unit and the governor supported unit, then the bind would be called
only if they match units. I am assuming we are limiting to one supported
unit per cooling device / governor, right?
>
> struct thermal_cooling_device {
> --
> 1.7.9.5
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 3/5] thermal: add a basic cpu power actor
2014-05-06 12:06 ` [RFC PATCH 3/5] thermal: add a basic cpu power actor Javi Merino
@ 2014-05-13 22:57 ` Eduardo Valentin
2014-05-15 17:02 ` Javi Merino
0 siblings, 1 reply; 17+ messages in thread
From: Eduardo Valentin @ 2014-05-13 22:57 UTC (permalink / raw)
To: Javi Merino; +Cc: linux-pm, Punit.Agrawal, Zhang Rui
Hello Javi,
On Tue, May 06, 2014 at 01:06:36PM +0100, Javi Merino wrote:
> Introduce a new thermal_cooling_unit: THERMAL_UNIT_POWER and let
> cpu_cooling device operate using it as well as "states". This allows
> governors that call cpufreq_get_max/cpufreq_get_cur/cpufreq_set_cur()
> with THERMAL_UNIT_POWER to use cpufreq cooling devices.
>
> If the cpu cooling device is registered using
> {of_,}cpufreq_power_actor_register() instead of
> {of_,}cpufreq_cooling_register() it uses the current frequency (as
> reported by cpufreq) as well as load and OPPs for the power
> calculations. The cpus must have registered their OPPs in the OPP
> library.
>
> Cc: Zhang Rui <rui.zhang@intel.com>
> Cc: Eduardo Valentin <edubezval@gmail.com>
> Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
> Signed-off-by: Javi Merino <javi.merino@arm.com>
> ---
> drivers/thermal/cpu_cooling.c | 430 ++++++++++++++++++++++++++++++++++++----
> drivers/thermal/thermal_core.c | 4 +-
> include/linux/cpu_cooling.h | 28 +++
> include/linux/thermal.h | 4 +
> 4 files changed, 422 insertions(+), 44 deletions(-)
Could you please also document under Documentation/thermal/sysfs-api.txt the exposed APIs ?
>
> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
> index f3f4e6b5798e..8052cd1f4a39 100644
> --- a/drivers/thermal/cpu_cooling.c
> +++ b/drivers/thermal/cpu_cooling.c
> @@ -24,10 +24,27 @@
> #include <linux/thermal.h>
> #include <linux/cpufreq.h>
> #include <linux/err.h>
> +#include <linux/pm_opp.h>
> #include <linux/slab.h>
> #include <linux/cpu.h>
> #include <linux/cpu_cooling.h>
>
> +/* Maximum number of opps we can track */
> +#define MAX_NUM_OPPS 32
> +
> +/**
> + * struct power_table - frequency to power conversion
> + * @frequency: frequency in KHz
> + * @power: power in mW
> + *
> + * This structure is built when the cooling device registers and helps
> + * in translating frequency to power and viceversa.
> + */
> +struct power_table {
> + u32 frequency;
> + u32 power;
> +};
> +
> /**
> * struct cpufreq_cooling_device - data for cooling device with cpufreq
> * @id: unique integer value corresponding to each cpufreq_cooling_device
> @@ -39,6 +56,14 @@
> * @cpufreq_val: integer value representing the absolute value of the clipped
> * frequency.
> * @allowed_cpus: all the cpus involved for this cpufreq_cooling_device.
> + * @power_table: array of struct power_table for frequency to power conversion
> + * @power_table_entries: number of entries in the @power_table array
> + * @time_in_idle: previous reading of the absolute time that this cpu was idle
> + * @time_in_idle_timestamp: wall time of the last invocation of
> + * get_cpu_idle_time_us()
> + * @last_load: load measured by the latest call to cpufreq_get_cur()
> + * @freq: frequency in KHz of the cpus represented by the cooling device
> + * @capacitance: the dynamic power coefficient of these cpus
> *
> * This structure is required for keeping information of each
> * cpufreq_cooling_device registered. In order to prevent corruption of this a
> @@ -50,11 +75,19 @@ struct cpufreq_cooling_device {
> unsigned int cpufreq_state;
> unsigned int cpufreq_val;
> struct cpumask allowed_cpus;
> + struct power_table power_table[MAX_NUM_OPPS];
Can we build this table dynamically?
I believe we should be able to fetch this table from opp layer. Besides,
should not be trouble of this code to estimate the max number of opps.
> + int power_table_entries;
> + u64 time_in_idle[NR_CPUS];
> + u64 time_in_idle_timestamp[NR_CPUS];
> + u32 last_load;
> + u32 freq;
> + u32 capacitance;
> };
> static DEFINE_IDR(cpufreq_idr);
> static DEFINE_MUTEX(cooling_cpufreq_lock);
>
> static unsigned int cpufreq_dev_count;
> +static bool power_actor_notifier_registered;
Are you sure you need this bool? Shouldn' t you be using a counter?
You probably need locking on this.
>
> /* notify_table passes value to the CPUFREQ_ADJUST callback function. */
> #define NOTIFY_INVALID NULL
> @@ -115,6 +148,59 @@ static int is_cpufreq_valid(int cpu)
> return !cpufreq_get_policy(&policy, cpu);
> }
>
> +static u32 cpu_freq_to_power(struct cpufreq_cooling_device *cpufreq_device,
> + u32 freq)
> +{
> + int i;
> + struct power_table *pt = cpufreq_device->power_table;
> +
> + for (i = 0; i < cpufreq_device->power_table_entries - 1; i++)
> + if (freq <= pt[i].frequency)
> + break;
> +
> + return pt[i].power;
> +}
> +
> +static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_device,
> + u32 power)
> +{
> + int i;
> + struct power_table *pt = cpufreq_device->power_table;
> +
> + for (i = 0; i < cpufreq_device->power_table_entries - 1; i++)
> + if (power <= pt[i].power)
> + break;
> +
> + return pt[i].frequency;
> +}
> +
> +/**
> + * get_load - get load for a cpu since last updated
> + * @cpufreq_device: struct cpufreq_cooling_device for this cooling device
> + * @cpu: cpu number
> + *
> + * Return the average load of cpu @cpu in percentage since this
> + * function was last called.
> + */
> +static u32 get_load(struct cpufreq_cooling_device *cpufreq_device, int cpu)
> +{
> + u32 load;
> + u64 now, now_idle, delta_time, delta_idle;
> +
> + now_idle = get_cpu_idle_time(cpu, &now, 0);
> + delta_idle = now_idle - cpufreq_device->time_in_idle[cpu];
> + delta_time = now - cpufreq_device->time_in_idle_timestamp[cpu];
> +
> + if (delta_time <= delta_idle)
> + load = 0;
> + else
> + load = div64_u64(100 * (delta_time - delta_idle), delta_time);
> +
> + cpufreq_device->time_in_idle[cpu] = now_idle;
> + cpufreq_device->time_in_idle_timestamp[cpu] = now;
nip: blank line here.
> + return load;
> +}
> +
> enum cpufreq_cooling_property {
> GET_LEVEL,
> GET_FREQ,
> @@ -342,98 +428,193 @@ static int cpufreq_thermal_notifier(struct notifier_block *nb,
> return 0;
> }
>
> +/**
> + * cpufreq_frequency_change - notifier callback for cpufreq frequency changes
> + * @nb: struct notifier_block * with callback info
> + * @event: value showing cpufreq event for which this function invoked
> + * @data: callback-specific data
> + *
> + * Callback to get notifications of frequency changes. In the
> + * CPUFREQ_POSTCHANGE @event we store the new frequency so that
> + * cpufreq_get_cur() knows the current frequency and can convert it
> + * into power.
> + */
> +static int cpufreq_frequency_change(struct notifier_block *nb,
> + unsigned long event, void *data)
> +{
> + struct thermal_cooling_device *cdev;
> + struct cpufreq_freqs *freqs = data;
> +
> + /* Only update frequency on postchange */
> + if (event != CPUFREQ_POSTCHANGE)
> + return NOTIFY_DONE;
> +
> + mutex_lock(&thermal_list_lock);
> +
> + list_for_each_entry(cdev, &thermal_cdev_list, node) {
> + struct cpufreq_cooling_device *cpufreq_device;
> +
> + if (strncmp(cdev->type, "thermal-cpufreq", THERMAL_NAME_LENGTH))
> + continue;
> +
I am not sure I follow why you need to go through the list of cooling devices
here. Can you please elaborate a bit more?
> + cpufreq_device = cdev->devdata;
> +
> + if (cpumask_test_cpu(freqs->cpu, &cpufreq_device->allowed_cpus))
> + cpufreq_device->freq = freqs->new;
> + }
> +
> + mutex_unlock(&thermal_list_lock);
> +
> + return NOTIFY_OK;
> +}
> +
> /* cpufreq cooling device callback functions are defined below */
>
> /**
> - * cpufreq_get_max_state - callback function to get the max cooling state.
> + * cpufreq_get_max - callback function to get the max cooling state/power.
> * @cdev: thermal cooling device pointer.
> - * @state: fill this variable with the max cooling state.
> - * @unit: the units in which @state should be in.
> + * @val: fill this variable with the max cooling state/power.
> + * @unit: whether to put state or power in @val.
> *
> * Callback for the thermal cooling device to return the cpufreq max
> - * cooling state. Currently only THERMAL_UNIT_STATE is valid for
> - * @unit.
> + * cooling state/power. If @unit is THERMAL_UNIT_STATE, @val contains
> + * the maximum cooling state; if @unit is THERMAL_UNIT_POWER, @val
> + * maximum is power in milliwatts.
> *
> * Return: 0 on success, an error code otherwise.
> */
> -static int cpufreq_get_max_state(struct thermal_cooling_device *cdev,
> - unsigned long *state,
> - enum thermal_cooling_unit unit)
> +static int cpufreq_get_max(struct thermal_cooling_device *cdev,
> + unsigned long *val, enum thermal_cooling_unit unit)
Shouldn't this change be part of previous patch?
> {
> struct cpufreq_cooling_device *cpufreq_device = cdev->devdata;
> struct cpumask *mask = &cpufreq_device->allowed_cpus;
> unsigned int cpu;
> unsigned int count = 0;
> - int ret;
> -
> - if (unit != THERMAL_UNIT_STATE)
> - return -EINVAL;
What happens to this code if tomorrow we add yet another unit?
> + int ret = 0;
>
> cpu = cpumask_any(mask);
>
> - ret = get_property(cpu, 0, &count, GET_MAXL);
> + if ((unit == THERMAL_UNIT_POWER) &&
> + (cpufreq_device->power_table_entries)) {
I think a cleaner code would use a switch case statement. You should
deal with the power_table_entries restriction separatedly.
> + unsigned int num_cpus;
> + unsigned long max_freq;
> +
> + max_freq = get_cpu_frequency(cpu, 0);
> + if (!max_freq)
> + return -EINVAL;
>
> - if (count > 0)
> - *state = count;
> + num_cpus = cpumask_weight(mask);
> + *val = cpu_freq_to_power(cdev->devdata, max_freq) * num_cpus;
> + } else if (unit == THERMAL_UNIT_STATE) {
> + ret = get_property(cpu, 0, &count, GET_MAXL);
> +
> + if (count > 0)
> + *val = count;
> + } else {
> + ret = -EINVAL;
> + }
>
> return ret;
> }
>
> /**
> - * cpufreq_get_cur_state - callback function to get the current cooling state.
> + * cpufreq_get_cur - callback function to get the current cooling state/power.
> * @cdev: thermal cooling device pointer.
> - * @state: fill this variable with the current cooling state.
> - * @unit: the units in which state should be in
> + * @val: fill this variable with the current cooling state/power.
> + * @unit: whether to put state or power in @val.
> *
> * Callback for the thermal cooling device to return the cpufreq
> - * current cooling state. Currently only THERMAL_UNIT_STATE is valid
> - * for @unit.
> + * current cooling state/power. If @unit is THERMAL_UNIT_STATE, @val
> + * contains the current cooling state; if @unit is THERMAL_UNIT_POWER,
> + * @val is current power consumption in milliwatts.
> *
> * Return: 0 on success, an error code otherwise.
> */
> -static int cpufreq_get_cur_state(struct thermal_cooling_device *cdev,
> - unsigned long *state,
> - enum thermal_cooling_unit unit)
> +static int cpufreq_get_cur(struct thermal_cooling_device *cdev,
> + unsigned long *val, enum thermal_cooling_unit unit)
> {
> struct cpufreq_cooling_device *cpufreq_device = cdev->devdata;
>
> - if (unit != THERMAL_UNIT_STATE)
> - return -EINVAL;
> + if ((unit == THERMAL_UNIT_POWER) &&
> + (cpufreq_device->power_table_entries)) {
ditto.
> + int cpu;
> + u32 total_load = 0, raw_cpu_power, power = 0;
> +
> + raw_cpu_power = cpu_freq_to_power(cpufreq_device,
> + cpufreq_device->freq);
> +
> + for_each_cpu(cpu, &cpufreq_device->allowed_cpus) {
> + u32 load;
>
> - *state = cpufreq_device->cpufreq_state;
> + if (!cpu_online(cpu))
> + continue;
> +
> + load = get_load(cpufreq_device, cpu);
> + power += (raw_cpu_power * load) / 100;
> + total_load += load;
> + }
> +
> + cpufreq_device->last_load = total_load;
> + *val = power;
> + } else if (unit == THERMAL_UNIT_STATE) {
> + *val = cpufreq_device->cpufreq_state;
> + } else {
> + return -EINVAL;
> + }
>
> return 0;
> }
>
> /**
> - * cpufreq_set_cur_state - callback function to set the current cooling state.
> + * cpufreq_set_cur - callback function to set the current cooling state/power.
> * @cdev: thermal cooling device pointer.
> - * @state: set this variable to the current cooling state.
> - * @unit: the units in which @state should be in.
> + * @val: set this variable to the current cooling state/power.
> + * @unit: whether @val is a cooling state or power.
> *
> * Callback for the thermal cooling device to change the cpufreq
> - * current cooling state. Currently only THERMAL_UNIT_STATE is valid for
> - * @unit.
> + * current cooling state/power. If @unit is THERMAL_UNIT_STATE, @val
> + * is a cooling state; if @unit is THERMAL_UNIT_POWER, @val is the target
> + * power consumption in milliwatts.
> *
> * Return: 0 on success, an error code otherwise.
> */
> -static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
> - unsigned long state,
> - enum thermal_cooling_unit unit)
> +static int cpufreq_set_cur(struct thermal_cooling_device *cdev,
> + unsigned long val, enum thermal_cooling_unit unit)
> {
> struct cpufreq_cooling_device *cpufreq_device = cdev->devdata;
> -
> - if (unit != THERMAL_UNIT_STATE)
> + unsigned long state;
> +
> + if ((unit == THERMAL_UNIT_POWER) &&
> + (cpufreq_device->power_table_entries)) {
ditto.
> + unsigned int cpu, freq;
> + u32 normalised_power, last_load;
> +
> + cpu = cpumask_any(&cpufreq_device->allowed_cpus);
> + last_load = cpufreq_device->last_load ?
> + cpufreq_device->last_load : 1;
> + normalised_power = (val * 100) / last_load;
> + freq = cpu_power_to_freq(cpufreq_device, normalised_power);
> +
> + state = cpufreq_cooling_get_level(cpu, freq);
> + if (state == THERMAL_CSTATE_INVALID) {
> + pr_err("Failed to convert %dKHz for cpu %d into a cdev state\n",
> + freq, cpu);
> + return -EINVAL;
> + }
> + } else if (unit == THERMAL_UNIT_STATE) {
> + state = val;
> + } else {
> return -EINVAL;
> + }
>
> return cpufreq_apply_cooling(cpufreq_device, state);
> }
>
> /* Bind cpufreq callbacks to thermal cooling device ops */
> static struct thermal_cooling_device_ops const cpufreq_cooling_ops = {
> - .get_max = cpufreq_get_max_state,
> - .get_cur = cpufreq_get_cur_state,
> - .set_cur = cpufreq_set_cur_state,
> + .get_max = cpufreq_get_max,
> + .get_cur = cpufreq_get_cur,
> + .set_cur = cpufreq_set_cur,
> };
>
> /* Notifier for cpufreq policy change */
> @@ -441,10 +622,95 @@ static struct notifier_block thermal_cpufreq_notifier_block = {
> .notifier_call = cpufreq_thermal_notifier,
> };
>
> +/* Notifier for cpufreq frequency changes */
> +struct notifier_block cpufreq_transition_notifier = {
> + .notifier_call = cpufreq_frequency_change,
> +};
> +
> +/**
> + * build_cpu_power_table - create a power to frequency table
> + * @cpufreq_dev: the cpufreq_cooling_device in which to store the table
> + *
> + * Build a power to frequency table for this cpu and store it in
> + * @cpufreq_dev. This table will be used in cpu_power_to_freq() and
> + * cpu_freq_to_power() to convert between power and frequency
> + * efficiently. Power is stored in mW, frequency in KHz. The
> + * resulting table is in ascending order.
> + *
> + * Returns 0 on success, -E* on error.
> + */
> +static int build_cpu_power_table(struct cpufreq_cooling_device *cpufreq_dev)
> +{
> + struct dev_pm_opp *opp;
> + struct device *dev = NULL;
> + int num_opps, cpu, i, ret = 0;
> + unsigned long freq;
> +
> + num_opps = 0;
> +
> + rcu_read_lock();
> +
> + for_each_cpu(cpu, &cpufreq_dev->allowed_cpus) {
> + dev = get_cpu_device(cpu);
> + if (!dev)
> + continue;
> +
> + num_opps = dev_pm_opp_get_opp_count(dev);
> + if (num_opps > 0) {
> + break;
> + } else if (num_opps < 0) {
> + ret = num_opps;
> + goto unlock;
> + }
> + }
> +
> + if ((num_opps == 0) || (num_opps > MAX_NUM_OPPS)) {
hmm.. Does it mean that if an existing platform has more than 32
opps, we won't support it? I would preffer we dynamically allocate
the table here and cover any number of opps.
> + ret = -EINVAL;
> + goto unlock;
> + }
> +
> + i = 0;
> + for (freq = 0;
> + opp = dev_pm_opp_find_freq_ceil(dev, &freq), !IS_ERR(opp);
> + freq++) {
> + u32 freq_mhz, voltage_mv;
> + u64 power;
> +
> + freq_mhz = freq / 1000000;
> + voltage_mv = dev_pm_opp_get_voltage(opp) / 1000;
> +
> + /*
> + * Do the multiplication with MHz and millivolt so as
> + * to not overflow.
> + */
> + power = (u64)cpufreq_dev->capacitance * freq_mhz *
> + voltage_mv * voltage_mv;
Essentially: P = C x F x V^2 right? Your model covers for dynamic power
only.
I believe this power model neglets leakage. Static power consumption
cannot be negleted nowadays, specially on thermal constrained CPUs.
(which happens to be the target of this subsystem).
My suggestion is that you model power as:
P = P_{dyn} + P_{static}
P_{dyn} can be what you propose above, for instance, or some other
better increasing convex function. But we cannot neglect P_{static}.
I suggest allowing platform device driver writer to input
power models here somehow. Either by using callbacks to
implement their own power model, or by feeding the API with
their own table.
> + do_div(power, 1000000000);
> +
> + /* frequency is stored in power_table in KHz */
> + cpufreq_dev->power_table[i].frequency = freq / 1000;
> + cpufreq_dev->power_table[i].power = power;
> +
> + i++;
> + }
> +
> + if (i == 0) {
> + ret = PTR_ERR(opp);
> + goto unlock;
> + }
> +
> + cpufreq_dev->power_table_entries = i;
> +
> +unlock:
> + rcu_read_unlock();
> + return ret;
> +}
> +
> /**
> * __cpufreq_cooling_register - helper function to create cpufreq cooling device
> * @np: a valid struct device_node to the cooling device device tree node
> * @clip_cpus: cpumask of cpus where the frequency constraints will happen.
> + * @capacitance: dynamic power coefficient for these cpus
> *
> * This interface function registers the cpufreq cooling device with the name
> * "thermal-cpufreq-%x". This api can support multiple instances of cpufreq
> @@ -456,7 +722,7 @@ static struct notifier_block thermal_cpufreq_notifier_block = {
> */
> static struct thermal_cooling_device *
> __cpufreq_cooling_register(struct device_node *np,
> - const struct cpumask *clip_cpus)
> + const struct cpumask *clip_cpus, u32 capacitance)
> {
> struct thermal_cooling_device *cool_dev;
> struct cpufreq_cooling_device *cpufreq_dev = NULL;
> @@ -485,6 +751,7 @@ __cpufreq_cooling_register(struct device_node *np,
> return ERR_PTR(-ENOMEM);
>
> cpumask_copy(&cpufreq_dev->allowed_cpus, clip_cpus);
> + cpufreq_dev->capacitance = capacitance;
>
> ret = get_idr(&cpufreq_idr, &cpufreq_dev->id);
> if (ret) {
> @@ -531,7 +798,7 @@ __cpufreq_cooling_register(struct device_node *np,
> struct thermal_cooling_device *
> cpufreq_cooling_register(const struct cpumask *clip_cpus)
> {
> - return __cpufreq_cooling_register(NULL, clip_cpus);
> + return __cpufreq_cooling_register(NULL, clip_cpus, 0);
> }
> EXPORT_SYMBOL_GPL(cpufreq_cooling_register);
>
> @@ -555,11 +822,90 @@ of_cpufreq_cooling_register(struct device_node *np,
> if (!np)
> return ERR_PTR(-EINVAL);
>
> - return __cpufreq_cooling_register(np, clip_cpus);
> + return __cpufreq_cooling_register(np, clip_cpus, 0);
> }
> EXPORT_SYMBOL_GPL(of_cpufreq_cooling_register);
>
> /**
> + * of_cpufreq_power_actor_register - function to create a cpu cooling device with power capabilities
> + * @np: a valid struct device_node to the cooling device tree node
> + * @clip_cpus: cpumask of cpus where the frequency constraints will happen
> + * @capacitance: the dynamic power coefficient for these cpus
> + *
> + * Similar to of_cpufreq_cooling_register() this function registers a
> + * cooling device linked to the device tree node provided. It has a
> + * simple power model so that get_cur/get_max/set_cur can operate with
> + * THERMAL_UNIT_POWER
> + *
> + * The cpus must have registered their OPPs in the OPP library.
> + *
> + * Return: a valid struct thermal_cooling_device pointer on success,
> + * on failure, it returns a corresponding ERR_PTR().
> + */
> +struct thermal_cooling_device *
> +of_cpufreq_power_actor_register(struct device_node *np,
> + const struct cpumask *clip_cpus,
> + u32 capacitance)
> +{
> + struct thermal_cooling_device *cdev, *err_ret;
> + int ret;
> +
> + cdev = __cpufreq_cooling_register(np, clip_cpus, capacitance);
> + if (IS_ERR(cdev))
> + return cdev;
> +
> + ret = build_cpu_power_table(cdev->devdata);
> + if (ret) {
> + err_ret = ERR_PTR(ret);
> + goto unregister;
> + }
> +
> + /*
> + * You can't register multiple times the same notifier_block.
> + * The first power actor registered is the only one that
> + * registers the notifier.
> + */
> + if (!power_actor_notifier_registered) {
> + ret = cpufreq_register_notifier(&cpufreq_transition_notifier,
> + CPUFREQ_TRANSITION_NOTIFIER);
> + if (ret) {
> + err_ret = ERR_PTR(ret);
> + goto unregister;
> + }
> + power_actor_notifier_registered = true;
I believe you need locking to update this global variable.
> + }
> +
> + return cdev;
> +
> +unregister:
> + cpufreq_cooling_unregister(cdev);
> + return err_ret;
> +}
> +EXPORT_SYMBOL_GPL(of_cpufreq_power_actor_register);
This needs documentation under Documentation/thermal/sysfs-api.txt.
> +
> +/**
> + * cpufreq_power_actor_register - function to create a cpu cooling device with power capabilities
> + * @clip_cpus: cpumask of cpus where the frequency constraints will happen
> + * @capacitance: the dynamic power coefficient for these cpus
> + *
> + * Similar to cpufreq_cooling_register() this function registers a
> + * cpufreq cooling device that has a simple power model so that
> + * get_cur/get_max/set_cur can operate with THERMAL_UNIT_POWER.
> + *
> + * See of_cpufreq_power_actor_register() for notes about the
> + * requisites for this to work.
> + *
> + * Return: a valid struct thermal_cooling_device pointer on success,
> + * on failure, it returns a corresponding ERR_PTR().
> + */
> +struct thermal_cooling_device *
> +cpufreq_power_actor_register(const struct cpumask *clip_cpus, u32 capacitance)
> +{
> + return of_cpufreq_power_actor_register(NULL, clip_cpus, capacitance);
> +}
> +EXPORT_SYMBOL_GPL(cpufreq_power_actor_register);
This needs documentation under Documentation/thermal/sysfs-api.txt.
> +
> +/**
> * cpufreq_cooling_unregister - function to remove cpufreq cooling device.
> * @cdev: thermal cooling device pointer.
> *
> diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> index 5b46603cc7cd..1efaadf436fa 100644
> --- a/drivers/thermal/thermal_core.c
> +++ b/drivers/thermal/thermal_core.c
> @@ -50,10 +50,10 @@ static DEFINE_IDR(thermal_cdev_idr);
> static DEFINE_MUTEX(thermal_idr_lock);
>
> static LIST_HEAD(thermal_tz_list);
> -static LIST_HEAD(thermal_cdev_list);
> +LIST_HEAD(thermal_cdev_list);
> static LIST_HEAD(thermal_governor_list);
>
> -static DEFINE_MUTEX(thermal_list_lock);
> +DEFINE_MUTEX(thermal_list_lock);
> static DEFINE_MUTEX(thermal_governor_lock);
>
> static struct thermal_governor *def_governor;
> diff --git a/include/linux/cpu_cooling.h b/include/linux/cpu_cooling.h
> index c303d383def1..b8d92ac4266b 100644
> --- a/include/linux/cpu_cooling.h
> +++ b/include/linux/cpu_cooling.h
> @@ -36,6 +36,9 @@
> struct thermal_cooling_device *
> cpufreq_cooling_register(const struct cpumask *clip_cpus);
>
> +struct thermal_cooling_device *
> +cpufreq_power_actor_register(const struct cpumask *clip_cpus, u32 capacitance);
> +
> /**
> * of_cpufreq_cooling_register - create cpufreq cooling device based on DT.
> * @np: a valid struct device_node to the cooling device device tree node.
> @@ -45,6 +48,11 @@ cpufreq_cooling_register(const struct cpumask *clip_cpus);
> struct thermal_cooling_device *
> of_cpufreq_cooling_register(struct device_node *np,
> const struct cpumask *clip_cpus);
> +
> +struct thermal_cooling_device *
> +of_cpufreq_power_actor_register(struct device_node *np,
> + const struct cpumask *clip_cpus,
> + u32 capacitance);
> #else
> static inline struct thermal_cooling_device *
> of_cpufreq_cooling_register(struct device_node *np,
> @@ -52,6 +60,14 @@ of_cpufreq_cooling_register(struct device_node *np,
> {
> return NULL;
> }
> +
> +static inline struct thermal_cooling_device *
> +of_cpufreq_power_actor_register(struct device_node *np,
> + const struct cpumask *clip_cpus,
> + u32 capacitance);
> +{
> + return ERR_PTR(-ENOSYS);
> +}
> #endif
>
> /**
> @@ -73,6 +89,18 @@ of_cpufreq_cooling_register(struct device_node *np,
> {
> return NULL;
> }
> +static inline struct thermal_cooling_device *
> +cpufreq_power_actor_register(const struct cpumask *clip_cpus, u32 capacitance)
> +{
> + return ERR_PTR(-ENOSYS);
> +}
> +static inline struct thermal_cooling_device *
> +of_cpufreq_power_actor_register(struct device_node *np,
> + const struct cpumask *clip_cpus,
> + u32 capacitance)
> +{
> + return ERR_PTR(-ENOSYS);
> +}
> static inline
> void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
> {
> diff --git a/include/linux/thermal.h b/include/linux/thermal.h
> index 8d183b8255eb..5986f546ca98 100644
> --- a/include/linux/thermal.h
> +++ b/include/linux/thermal.h
> @@ -59,6 +59,9 @@
> #define DEFAULT_THERMAL_GOVERNOR "user_space"
> #endif
>
> +extern struct list_head thermal_cdev_list;
> +extern struct mutex thermal_list_lock;
> +
> struct thermal_zone_device;
> struct thermal_cooling_device;
>
> @@ -108,6 +111,7 @@ enum {
>
> enum thermal_cooling_unit {
> THERMAL_UNIT_STATE,
> + THERMAL_UNIT_POWER,
> };
>
> struct thermal_zone_device_ops {
> --
> 1.7.9.5
>
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 4/5] thermal: introduce the Power Allocator governor
2014-05-06 12:06 ` [RFC PATCH 4/5] thermal: introduce the Power Allocator governor Javi Merino
@ 2014-05-13 23:20 ` Eduardo Valentin
2014-05-15 18:24 ` Javi Merino
0 siblings, 1 reply; 17+ messages in thread
From: Eduardo Valentin @ 2014-05-13 23:20 UTC (permalink / raw)
To: Javi Merino; +Cc: linux-pm, Punit.Agrawal, Zhang Rui
Hello Javi,
Here are a series of initial comments. I will be probably reviewing the
code several times and coming back again with more questions.
On Tue, May 06, 2014 at 01:06:37PM +0100, Javi Merino wrote:
> The power allocator governor is a thermal governor that controls system
> and device power allocation to control temperature. Conceptually, the
> implementation takes a system view of heat dissipation by managing
> multiple heat sources.
Can it be able to control several thermal zones then?
>
> This governor relies on power-aware cooling devices (power actors) to
> operate. That is, cooling devices whose thermal_cooling_device_ops
> accept THERMAL_UNIT_POWER.
>
> It uses a Proportional Integral (PI) controller driven by the
> temperature of the thermal zone. This budget is then allocated to
> each cooling device that can have bearing on the temperature we are
> trying to control. It decides how much power to give each cooling
> device based on the performance they are requesting. The PI
> controller ensures that the total power budget does not exceed the
> control temperature.
>
> Cc: Zhang Rui <rui.zhang@intel.com>
> Cc: Eduardo Valentin <edubezval@gmail.com>
> Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
> Signed-off-by: Javi Merino <javi.merino@arm.com>
> ---
> drivers/thermal/Kconfig | 14 ++
> drivers/thermal/Makefile | 1 +
> drivers/thermal/power_allocator.c | 403 +++++++++++++++++++++++++++++++++++++
> drivers/thermal/thermal_core.c | 7 +-
> drivers/thermal/thermal_core.h | 8 +
> include/linux/thermal.h | 5 +
We need documentation under Documentation/thermal to understand
how to use this governor. Specially on how to accomodate
several actors and domains under its control.
> 6 files changed, 437 insertions(+), 1 deletion(-)
> create mode 100644 drivers/thermal/power_allocator.c
>
> diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
> index 2d51912a6e40..f5e6b693fc19 100644
> --- a/drivers/thermal/Kconfig
> +++ b/drivers/thermal/Kconfig
> @@ -71,6 +71,14 @@ config THERMAL_DEFAULT_GOV_USER_SPACE
> Select this if you want to let the user space manage the
> platform thermals.
>
> +config THERMAL_DEFAULT_GOV_POWER_ALLOCATOR
> + bool "power_allocator"
> + select THERMAL_GOV_POWER_ALLOCATOR
> + help
> + Select this if you want to control temperature based on
> + system and device power allocation. This governor relies on
> + power-aware cooling devices (power actors) to operate.
> +
> endchoice
>
> config THERMAL_GOV_FAIR_SHARE
> @@ -89,6 +97,12 @@ config THERMAL_GOV_USER_SPACE
> help
> Enable this to let the user space manage the platform thermals.
>
> +config THERMAL_GOV_POWER_ALLOCATOR
> + bool "Power allocator thermal governor"
> + help
> + Enable this to manage platform thermals by dynamically
> + allocating and limiting power to devices.
> +
> config CPU_THERMAL
> bool "generic cpu cooling support"
> depends on CPU_FREQ
> diff --git a/drivers/thermal/Makefile b/drivers/thermal/Makefile
> index 54e4ec9eb5df..1cb7d5fec8dc 100644
> --- a/drivers/thermal/Makefile
> +++ b/drivers/thermal/Makefile
> @@ -13,6 +13,7 @@ thermal_sys-$(CONFIG_THERMAL_OF) += of-thermal.o
> thermal_sys-$(CONFIG_THERMAL_GOV_FAIR_SHARE) += fair_share.o
> thermal_sys-$(CONFIG_THERMAL_GOV_STEP_WISE) += step_wise.o
> thermal_sys-$(CONFIG_THERMAL_GOV_USER_SPACE) += user_space.o
> +thermal_sys-$(CONFIG_THERMAL_GOV_POWER_ALLOCATOR) += power_allocator.o
>
> # cpufreq cooling
> thermal_sys-$(CONFIG_CPU_THERMAL) += cpu_cooling.o
> diff --git a/drivers/thermal/power_allocator.c b/drivers/thermal/power_allocator.c
> new file mode 100644
> index 000000000000..39c425abf10a
> --- /dev/null
> +++ b/drivers/thermal/power_allocator.c
> @@ -0,0 +1,403 @@
> +/*
> + * A power allocator to manage temperature
> + *
> + * Copyright (C) 2014 ARM Ltd.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed "as is" WITHOUT ANY WARRANTY of any
> + * kind, whether express or implied; without even the implied warranty
> + * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + */
> +
> +#define pr_fmt(fmt) "Power allocator: " fmt
> +
> +#include <linux/list.h>
> +#include <linux/slab.h>
> +#include <linux/thermal.h>
> +
> +#include "thermal_core.h"
> +
> +#define MAX_NUM_ACTORS 8
hmm... Why do we need to limit the amount of actors?
> +
> +#define FRAC_BITS 8
> +#define int_to_frac(x) ((x) << FRAC_BITS)
> +#define frac_to_int(x) ((x) >> FRAC_BITS)
> +
> +/**
> + * mul_frac - multiply two fixed-point numbers
> + * @x: first multiplicand
> + * @y: second multiplicand
> + *
> + * Returns the result of multiplying two fixed-point numbers. The
> + * result is also a fixed-point number.
> + */
> +static inline s64 mul_frac(s64 x, s64 y)
> +{
> + return (x * y) >> FRAC_BITS;
> +}
> +
> +enum power_allocator_trip_levels {
> + TRIP_SWITCH_ON = 0, /* Switch on PI controller */
> + TRIP_MAX_DESIRED_TEMPERATURE, /* Temperature we are controlling for */
> +};
Can we reuse the trip types we already have?
> +
> +/**
> + * struct power_allocator_params - parameters for the power allocator governor
> + * @k_po: P parameter of the PI controller when overshooting (i.e., when
> + * temperature is below the target)
> + * @k_pi: P parameter of the PI controller when undershooting
> + * @k_i: I parameter of the PI controller
> + * @integral_cutoff: threshold below which the error is no longer accumulated
> + in the PI controller
> + * @err_integral: Accumulated error in the PI controller.
> + */
> +struct power_allocator_params {
> + s32 k_po;
> + s32 k_pu;
> + s32 k_i;
> + s32 integral_cutoff;
> + s32 err_integral;
> +};
We would need a documentation on how to define these values.
> +
> +/**
> + * pi_controller() - PI controller
> + * @tz: thermal zone we are operating in
> + * @control_temp: The target temperature
> + * @max_allocatable_power: maximum allocatable power for this thermal zone
> + *
> + * This PI controller increases the available power budget so that the
> + * temperature of the thermal zone gets as close as possible to
> + * @control_temp and limits the power if it exceeds it. k_po is the
> + * proportional term when we are overshooting, k_pu is the
> + * proportional term when we are undershooting. integral_cutoff is a
> + * threshold below which we stop accumulating the error. The
> + * accumulated error is only valid if the requested power will make
> + * the system warmer. If the system is mostly idle, there's no point
> + * in accumulating positive error.
> + *
> + * It returns the power budget for the next period.
> + */
> +static u32 pi_controller(struct thermal_zone_device *tz,
> + unsigned long current_temp, unsigned long control_temp,
> + unsigned long max_allocatable_power)
> +{
> + s64 p, i, power_range;
> + s32 err;
> + struct power_allocator_params *params = tz->governor_data;
> +
> + err = ((s32)control_temp - (s32)current_temp) / 1000;
> + err = int_to_frac(err);
> +
> + /* Calculate the proportional term */
> + p = mul_frac(err < 0 ? params->k_po : params->k_pu, err);
> +
> + /*
> + * Calculate the integral term
> + *
> + * if the error s less than cut off allow integration (but
> + * the integral is limited to max power)
> + */
> + i = mul_frac(params->k_i, params->err_integral);
> +
> + if (err < int_to_frac(params->integral_cutoff)) {
> + s64 tmpi = mul_frac(params->k_i, err);
> + tmpi += i;
> + if (tmpi <= int_to_frac(max_allocatable_power)) {
> + i = tmpi;
> + params->err_integral += err;
> + }
> + }
> +
> + power_range = p + i;
> +
> + /* feed-forward the known maximum dissipatable power */
> + power_range = tz->tzp->max_dissipatable_power +
> + frac_to_int(power_range);
> +
> + /* output should not exceed max power */
> + if (power_range > max_allocatable_power)
> + power_range = max_allocatable_power;
> +
> + return power_range >= 0 ? power_range : 0;
What does it mean return 0? The above if does not make completelly sense
to me, as power_range can still be 0. power_range < 0 means error?
what if the value at power_range overflows the u32?
> +}
> +
> +static void divvy_up_power(struct thermal_zone_device *tz,
Can you please document how you perform this power split?
> + unsigned long req_power[MAX_NUM_ACTORS],
> + unsigned long max_power[MAX_NUM_ACTORS],
> + int num_actors, unsigned long total_req_power,
> + u32 power_range,
> + unsigned long granted_power[MAX_NUM_ACTORS])
> +{
> + unsigned long extra_power, capped_extra_power;
> + unsigned long extra_actor_power[MAX_NUM_ACTORS];
> + int i;
> +
> + if (!total_req_power) {
> + /*
> + * Nobody requested anything, so just give everybody
> + * the maximum power
> + */
> + for (i = 0; i < num_actors; i++)
> + granted_power[i] = max_power[i];
> +
> + return;
> + }
> +
> + capped_extra_power = 0;
> + extra_power = 0;
> + for (i = 0; i < num_actors; i++) {
> + u64 req_range = req_power[i] * power_range;
> +
> + granted_power[i] = div_u64(req_range, total_req_power);
> +
> + if (granted_power[i] > max_power[i]) {
> + extra_power += granted_power[i] - max_power[i];
> + granted_power[i] = max_power[i];
> + }
> +
> + extra_actor_power[i] = max_power[i] - granted_power[i];
> + capped_extra_power += extra_actor_power[i];
> + }
> +
> + if (!extra_power)
> + return;
> +
> + /*
> + * Re-divvy the reclaimed extra among actors based on
> + * how far they are from the max
> + */
> + extra_power = min(extra_power, capped_extra_power);
> + if (capped_extra_power > 0)
> + for (i = 0; i < num_actors; i++)
> + granted_power[i] += (extra_actor_power[i] *
> + extra_power) / capped_extra_power;
> +}
> +
> +static int allocate_power(struct thermal_zone_device *tz,
> + unsigned long current_temp, unsigned long control_temp)
> +{
> + struct thermal_instance *instance;
> + unsigned long req_power[MAX_NUM_ACTORS], max_power[MAX_NUM_ACTORS];
> + unsigned long granted_power[MAX_NUM_ACTORS];
> + unsigned long total_req_power, max_allocatable_power;
> + u32 power_range;
> + int i, num_actors, ret = 0;
> +
> + i = 0;
> + total_req_power = 0;
> + max_allocatable_power = 0;
> +
> + mutex_lock(&tz->lock);
> +
> + list_for_each_entry(instance, &tz->thermal_instances, tz_node) {
> + struct thermal_cooling_device *cdev = instance->cdev;
> +
> + if (instance->trip != TRIP_MAX_DESIRED_TEMPERATURE)
How does one instance->trip gets the value of
TRIP_MAX_DESIRED_TEMPERATURE?
> + continue;
> +
> + if (i >= MAX_NUM_ACTORS) {
> + pr_warn("Too many actors (%d) for this governor\n", i);
dev_* family is preferrable over pr_* family. Same applies to other
patches.
> + ret = -EINVAL;
> + goto unlock;
> + }
> +
> + ret = cdev->ops->get_cur(cdev, &req_power[i],
> + THERMAL_UNIT_POWER);
> + if (ret) {
> + pr_warn_once("Couldn't get current power for %s: %d\n",
> + cdev->type, ret);
> + goto unlock;
> + }
> +
> + total_req_power += req_power[i];
> +
> + cdev->ops->get_max(cdev, &max_power[i], THERMAL_UNIT_POWER);
> + max_allocatable_power += max_power[i];
> +
> + i++;
> + }
> + num_actors = i;
> +
> + power_range = pi_controller(tz, current_temp, control_temp,
> + max_allocatable_power);
> +
> + divvy_up_power(tz, req_power, max_power, num_actors, total_req_power,
> + power_range, granted_power);
> +
> + i = 0;
> + list_for_each_entry(instance, &tz->thermal_instances, tz_node) {
> + struct thermal_cooling_device *cdev = instance->cdev;
> +
> + if (instance->trip != TRIP_MAX_DESIRED_TEMPERATURE)
> + continue;
> +
> + cdev->ops->set_cur(cdev, granted_power[i], THERMAL_UNIT_POWER);
> + i++;
> + }
> +
> +unlock:
> + mutex_unlock(&tz->lock);
> +
> + return ret;
> +}
> +
> +static int check_trips(struct thermal_zone_device *tz)
> +{
> + int ret;
> + enum thermal_trip_type type;
> +
> + if (tz->trips < 2)
> + return -EINVAL;
> +
> + ret = tz->ops->get_trip_type(tz, TRIP_SWITCH_ON, &type);
> + if (ret)
> + return ret;
> +
> + if ((type != THERMAL_TRIP_PASSIVE) && (type != THERMAL_TRIP_ACTIVE))
> + return -EINVAL;
> +
> + ret = tz->ops->get_trip_type(tz, TRIP_MAX_DESIRED_TEMPERATURE, &type);
> + if (ret)
> + return ret;
> +
> + if ((type != THERMAL_TRIP_PASSIVE) && (type != THERMAL_TRIP_ACTIVE))
> + return -EINVAL;
> +
> + return ret;
> +}
> +
> +static void reset_pi_controller(struct power_allocator_params *params)
> +{
> + params->err_integral = 0;
> +}
> +
> +/**
> + * power_allocator_bind - bind the power_allocator governor to a thermal zone
> + * @tz: thermal zone to bind it to
> + *
> + * Check that the thermal zone is valid for this governor: has two
> + * thermal trips. If so, initialize the PI controller parameters and
> + * bind it to the thermal zone.
> + *
> + * Returns 0 on success, -EINVAL if the trips were invalid or -ENOMEM
> + * if we ran out of memory.
> + */
> +static int power_allocator_bind(struct thermal_zone_device *tz)
> +{
> + int ret;
> + struct power_allocator_params *params;
> + unsigned long switch_on_temp, control_temp;
> + u32 temperature_threshold;
> +
> + ret = check_trips(tz);
> + if (ret) {
> + pr_err("thermal zone %s has the wrong number of trips for this governor\n",
> + tz->type);
> + return ret;
> + }
> +
> + if (!tz->tzp || !tz->tzp->max_dissipatable_power) {
> + pr_err("Trying to bind the power_allocator governor to a thermal zone without the max_dissipatable_power parameter\n");
Can you please break the above line?
> + return -EINVAL;
> + }
> +
> + params = kzalloc(sizeof(*params), GFP_KERNEL);
can we use devm_kzalloc?
> + if (!params)
> + return -ENOMEM;
> +
> + ret = tz->ops->get_trip_temp(tz, TRIP_SWITCH_ON, &switch_on_temp);
> + if (ret)
> + goto free;
> +
> + ret = tz->ops->get_trip_temp(tz, TRIP_MAX_DESIRED_TEMPERATURE,
> + &control_temp);
thermal zone drivers are not aware of these trip types, are they?
> + if (ret)
> + goto free;
> +
> + temperature_threshold = (control_temp - switch_on_temp) / 1000;
> +
> + params->k_po = int_to_frac(tz->tzp->max_dissipatable_power) /
> + temperature_threshold;
> + params->k_pu = int_to_frac(2 * tz->tzp->max_dissipatable_power) /
> + temperature_threshold;
> + params->k_i = int_to_frac(1);
I suppose we need a way to input k_po, k_pu and k_i parameters from
platform code to this governor right? Or are they always defined as the
above code? always based on max_dissipated_power?
> + params->integral_cutoff = 0;
> +
> + reset_pi_controller(params);
> +
> + tz->governor_data = params;
> + return 0;
> +
> +free:
> + kfree(params);
> + return ret;
> +}
> +
> +static void power_allocator_unbind(struct thermal_zone_device *tz)
> +{
> + pr_debug("Unbinding from thermal zone %d\n", tz->id);
> + kfree(tz->governor_data);
> + tz->governor_data = NULL;
> +}
> +
> +static int power_allocator_throttle(struct thermal_zone_device *tz, int trip)
> +{
> + int ret;
> + unsigned long switch_on_temp, control_temp, current_temp;
> + struct power_allocator_params *params = tz->governor_data;
> +
> + /*
> + * We get called for every trip point but we only need to do
> + * our calculations once
> + */
> + if (trip != TRIP_MAX_DESIRED_TEMPERATURE)
> + return 0;
> +
> + ret = thermal_zone_get_temp(tz, ¤t_temp);
> + if (ret) {
> + pr_warn("Failed to get temperature: %d\n", ret);
> + return ret;
> + }
> +
> + ret = tz->ops->get_trip_temp(tz, TRIP_SWITCH_ON, &switch_on_temp);
> + if (ret) {
> + pr_warn("Failed to get switch on temperature: %d\n", ret);
> + return ret;
> + }
> +
> + if (current_temp < switch_on_temp) {
> + reset_pi_controller(params);
> + return 0;
> + }
> +
> + ret = tz->ops->get_trip_temp(tz, TRIP_MAX_DESIRED_TEMPERATURE,
> + &control_temp);
> + if (ret) {
> + pr_warn("Failed to get the maximum desired temperature: %d\n",
> + ret);
> + return ret;
> + }
> +
> + return allocate_power(tz, current_temp, control_temp);
> +}
> +
> +static struct thermal_governor thermal_gov_power_allocator = {
> + .name = "power_allocator",
> + .bind_to_tz = power_allocator_bind,
> + .unbind_from_tz = power_allocator_unbind,
> + .throttle = power_allocator_throttle,
> +};
> +
> +int thermal_gov_power_allocator_register(void)
> +{
> + return thermal_register_governor(&thermal_gov_power_allocator);
> +}
> +
> +void thermal_gov_power_allocator_unregister(void)
> +{
> + thermal_unregister_governor(&thermal_gov_power_allocator);
> +}
> diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> index 1efaadf436fa..bed3d2ef8ef3 100644
> --- a/drivers/thermal/thermal_core.c
> +++ b/drivers/thermal/thermal_core.c
> @@ -1840,7 +1840,11 @@ static int __init thermal_register_governors(void)
> if (result)
> return result;
>
> - return thermal_gov_user_space_register();
> + result = thermal_gov_user_space_register();
> + if (result)
> + return result;
> +
> + return thermal_gov_power_allocator_register();
> }
>
> static void thermal_unregister_governors(void)
> @@ -1848,6 +1852,7 @@ static void thermal_unregister_governors(void)
> thermal_gov_step_wise_unregister();
> thermal_gov_fair_share_unregister();
> thermal_gov_user_space_unregister();
> + thermal_gov_power_allocator_unregister();
> }
>
> static int __init thermal_init(void)
> diff --git a/drivers/thermal/thermal_core.h b/drivers/thermal/thermal_core.h
> index 3db339fb636f..b24cde2c71cc 100644
> --- a/drivers/thermal/thermal_core.h
> +++ b/drivers/thermal/thermal_core.h
> @@ -77,6 +77,14 @@ static inline int thermal_gov_user_space_register(void) { return 0; }
> static inline void thermal_gov_user_space_unregister(void) {}
> #endif /* CONFIG_THERMAL_GOV_USER_SPACE */
>
> +#ifdef CONFIG_THERMAL_GOV_POWER_ALLOCATOR
> +int thermal_gov_power_allocator_register(void);
> +void thermal_gov_power_allocator_unregister(void);
> +#else
> +static inline int thermal_gov_power_allocator_register(void) { return 0; }
> +static inline void thermal_gov_power_allocator_unregister(void) {}
> +#endif /* CONFIG_THERMAL_GOV_POWER_ALLOCATOR */
> +
> /* device tree support */
> #ifdef CONFIG_THERMAL_OF
> int of_parse_thermal_zones(void);
> diff --git a/include/linux/thermal.h b/include/linux/thermal.h
> index 5986f546ca98..3861cb443520 100644
> --- a/include/linux/thermal.h
> +++ b/include/linux/thermal.h
> @@ -57,6 +57,8 @@
> #define DEFAULT_THERMAL_GOVERNOR "fair_share"
> #elif defined(CONFIG_THERMAL_DEFAULT_GOV_USER_SPACE)
> #define DEFAULT_THERMAL_GOVERNOR "user_space"
> +#elif defined(CONFIG_THERMAL_DEFAULT_GOV_POWER_ALLOCATOR)
> +#define DEFAULT_THERMAL_GOVERNOR "power_allocator"
> #endif
>
> extern struct list_head thermal_cdev_list;
> @@ -250,6 +252,9 @@ struct thermal_zone_params {
>
> int num_tbps; /* Number of tbp entries */
> struct thermal_bind_params *tbp;
> +
> + /* Maximum power (heat) that this thermal zone can dissipate in mW */
> + u32 max_dissipatable_power;
> };
>
> struct thermal_genl_event {
> --
> 1.7.9.5
>
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 0/5] The power allocator thermal governor
2014-05-06 12:06 [RFC PATCH 0/5] The power allocator thermal governor Javi Merino
` (4 preceding siblings ...)
2014-05-06 12:06 ` [RFC PATCH 5/5] thermal: add trace events to the power allocator governor Javi Merino
@ 2014-05-14 10:05 ` James King
5 siblings, 0 replies; 17+ messages in thread
From: James King @ 2014-05-14 10:05 UTC (permalink / raw)
To: linux-pm
Hi Javi and Punit,
We have integrated this patchset onto an internal dual ARM cluster SoC and
it does do a good job of holding the SoC temperature steady while running
high performance benchmarks. Happy to be added to all the patches as "verified
by".
Thanks, James
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 1/5] thermal: let governors have private data for each thermal zone
2014-05-13 14:31 ` edubezval
@ 2014-05-15 14:30 ` Javi Merino
2014-05-15 15:10 ` Eduardo Valentin
0 siblings, 1 reply; 17+ messages in thread
From: Javi Merino @ 2014-05-15 14:30 UTC (permalink / raw)
To: edubezval@gmail.com; +Cc: linux-pm@vger.kernel.org, Punit Agrawal, Zhang Rui
On Tue, May 13, 2014 at 03:31:08PM +0100, edubezval@gmail.com wrote:
> Hello Javi,
>
> Thanks for your patch. Find some comment inline.
>
> On Tue, May 06, 2014 at 01:06:34PM +0100, Javi Merino wrote:
> > A governor may need to store its current state between calls to
> > throttle(). That state depends on the thermal zone, so store it as
> > private data in struct thermal_zone_device.
> >
> > The governors may have two new ops: bind_to_tz() and unbind_from_tz().
> > When provided, these functions let governors do some initialization
> > and teardown when they are bound/unbound to a tz and possibly store that
> > information in the governor_data field of the struct
> > thermal_zone_device.
>
>
> Overall the idea is fine to me, although so far we have no case of such
> need in the current governors. I will have a look on your proposed
> governor to check futher on this API.
>
> >
> > Cc: Zhang Rui <rui.zhang@intel.com>
> > Cc: Eduardo Valentin <edubezval@gmail.com>
> > Signed-off-by: Javi Merino <javi.merino@arm.com>
> > ---
> > drivers/thermal/thermal_core.c | 67 +++++++++++++++++++++++++++++++++++-----
> > include/linux/thermal.h | 3 ++
> > 2 files changed, 62 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> > index 71b0ec0c370d..ac8e5990a63e 100644
> > --- a/drivers/thermal/thermal_core.c
> > +++ b/drivers/thermal/thermal_core.c
> > @@ -72,6 +72,45 @@ static struct thermal_governor *__find_governor(const char *name)
> > return NULL;
> > }
> >
> > +/**
> > + * thermal_set_governor() - Switch to another governor
> > + * @tz: a valid pointer to a struct thermal_zone_device
> > + * @new_gov: pointer to the new governor
> > + *
> > + * Change the governor of thermal zone @tz.
> > + *
> > + * Returns 0 on success, an error if the new governor's bind_to_tz() failed.
> > + */
> > +static int thermal_set_governor(struct thermal_zone_device *tz,
> > + struct thermal_governor *new_gov)
> > +{
> > + int ret = 0;
> > +
> > + if (tz->governor && tz->governor->unbind_from_tz)
> > + tz->governor->unbind_from_tz(tz);
> > +
> > + if (new_gov && new_gov->bind_to_tz) {
> > + ret = new_gov->bind_to_tz(tz);
> > + if (ret) {
> > + /* Try to register the old governor again */
>
> I suggest to state in the Documentation that a governor may be called
> to bind and unbind to same thermal zone several times, as per above code.
Will do, but I find that the other governor op (throttle) is not
documented. Where do you suggest we should put it?
> > + if (tz->governor && tz->governor->bind_to_tz) {
> > + if (tz->governor->bind_to_tz(tz))
> > + /*
> > + * The new governor failed to register
> > + * and the previous one failed as well
> > + */
>
> I suggest having a dev_warn here to notify that a zone is now orphan due to
> switching from one governor to another.
Done
> > + tz->governor = NULL;
> > + }
> > +
> > + return ret;
> > + }
> > + }
> > +
> > + tz->governor = new_gov;
> > +
> > + return ret;
> > +}
> > +
> > int thermal_register_governor(struct thermal_governor *governor)
> > {
> > int err;
> > @@ -104,8 +143,12 @@ int thermal_register_governor(struct thermal_governor *governor)
> >
> > name = pos->tzp->governor_name;
> >
> > - if (!strnicmp(name, governor->name, THERMAL_NAME_LENGTH))
> > - pos->governor = governor;
> > + if (!strnicmp(name, governor->name, THERMAL_NAME_LENGTH)) {
> > + int ret = thermal_set_governor(pos, governor);
>
> nip: add a blank line between declarations and first statement.
Ok
> > + if (ret)
> > + pr_warn("Failed to set governor %s for zone %d: %d\n",
> > + governor->name, pos->id, ret);
>
> dev_* family of logging functions is preferred in this file.
Done
> > + }
> > }
> >
> > mutex_unlock(&thermal_list_lock);
> > @@ -131,7 +174,7 @@ void thermal_unregister_governor(struct thermal_governor *governor)
> > list_for_each_entry(pos, &thermal_tz_list, node) {
> > if (!strnicmp(pos->governor->name, governor->name,
> > THERMAL_NAME_LENGTH))
> > - pos->governor = NULL;
> > + thermal_set_governor(pos, NULL);
> > }
> >
> > mutex_unlock(&thermal_list_lock);
> > @@ -756,8 +799,9 @@ policy_store(struct device *dev, struct device_attribute *attr,
> > if (!gov)
> > goto exit;
> >
> > - tz->governor = gov;
> > - ret = count;
> > + ret = thermal_set_governor(tz, gov);
> > + if (!ret)
> > + ret = count;
> >
> > exit:
> > mutex_unlock(&thermal_governor_lock);
> > @@ -1452,6 +1496,7 @@ struct thermal_zone_device *thermal_zone_device_register(const char *type,
> > int result;
> > int count;
> > int passive = 0;
> > + struct thermal_governor *governor;
> >
> > if (type && strlen(type) >= THERMAL_NAME_LENGTH)
> > return ERR_PTR(-EINVAL);
> > @@ -1542,9 +1587,15 @@ struct thermal_zone_device *thermal_zone_device_register(const char *type,
> > mutex_lock(&thermal_governor_lock);
> >
> > if (tz->tzp)
> > - tz->governor = __find_governor(tz->tzp->governor_name);
> > + governor = __find_governor(tz->tzp->governor_name);
> > else
> > - tz->governor = def_governor;
> > + governor = def_governor;
> > +
> > + result = thermal_set_governor(tz, governor);
> > + if (result) {
> > + mutex_unlock(&thermal_governor_lock);
> > + goto unregister;
> > + }
> >
> > mutex_unlock(&thermal_governor_lock);
> >
> > @@ -1634,7 +1685,7 @@ void thermal_zone_device_unregister(struct thermal_zone_device *tz)
> > device_remove_file(&tz->device, &dev_attr_mode);
> > device_remove_file(&tz->device, &dev_attr_policy);
> > remove_trip_attrs(tz);
> > - tz->governor = NULL;
> > + thermal_set_governor(tz, NULL);
> >
> > thermal_remove_hwmon_sysfs(tz);
> > release_idr(&thermal_tz_idr, &thermal_idr_lock, tz->id);
> > diff --git a/include/linux/thermal.h b/include/linux/thermal.h
> > index f7e11c7ea7d9..baac212f815e 100644
> > --- a/include/linux/thermal.h
> > +++ b/include/linux/thermal.h
> > @@ -177,6 +177,7 @@ struct thermal_zone_device {
> > struct thermal_zone_device_ops *ops;
> > const struct thermal_zone_params *tzp;
> > struct thermal_governor *governor;
> > + void *governor_data;
>
> Please document this entry.
Where? None of these entries are documented. I guess I'll have to
add a patch that describes each one of them and then add ours.
> > struct list_head thermal_instances;
> > struct idr idr;
> > struct mutex lock; /* protect thermal_instances list */
> > @@ -187,6 +188,8 @@ struct thermal_zone_device {
> > /* Structure that holds thermal governor information */
> > struct thermal_governor {
> > char name[THERMAL_NAME_LENGTH];
> > + int (*bind_to_tz)(struct thermal_zone_device *tz);
> > + void (*unbind_from_tz)(struct thermal_zone_device *tz);
>
> a governor cannot prevent unbinding?
To me it's like open()/close() or malloc()/free(). You generally can
fail to bind, but not fail to unbind. I can change it to an int if
you think it is something that we may need.
> Please, document the proposed operations.
Same as before, none of the entries in this struct are documented so
I'll submit a patch that describes the current ones and then add our
new ones in our patch.
> > int (*throttle)(struct thermal_zone_device *tz, int trip);
> > struct list_head governor_list;
> > };
> > --
> > 1.7.9.5
>
> I request that you update Documentation/thermal/sysfs-api.txt too.
Will do.
Cheers,
Javi
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 1/5] thermal: let governors have private data for each thermal zone
2014-05-15 14:30 ` Javi Merino
@ 2014-05-15 15:10 ` Eduardo Valentin
0 siblings, 0 replies; 17+ messages in thread
From: Eduardo Valentin @ 2014-05-15 15:10 UTC (permalink / raw)
To: Javi Merino; +Cc: linux-pm@vger.kernel.org, Punit Agrawal, Zhang Rui
Javi,
On Thu, May 15, 2014 at 03:30:53PM +0100, Javi Merino wrote:
> On Tue, May 13, 2014 at 03:31:08PM +0100, edubezval@gmail.com wrote:
> > Hello Javi,
> >
> > Thanks for your patch. Find some comment inline.
> >
[cut]
>
> Will do, but I find that the other governor op (throttle) is not
> documented. Where do you suggest we should put it?
>
The main entry for thermal Documentation is Documentation/thermal/. The
sysfs-api.txt file has major info about the thermal framework. But for
this case, I agree with you, we deserve a better explanation. I think
the fact that the governor API is becoming more interesting is a hint
that we need better documentation. I suggest having another file
describing the governor API and how it deals with other entities such
as sensor inputs and cooling actions.
Rui, any suggestion here?
> > > + if (tz->governor && tz->governor->bind_to_tz) {
> > > + if (tz->governor->bind_to_tz(tz))
> > > + /*
> > > + * The new governor failed to register
> > > + * and the previous one failed as well
> > > + */
> >
> > I suggest having a dev_warn here to notify that a zone is now orphan due to
> > switching from one governor to another.
>
> Done
>
> > > + tz->governor = NULL;
> > > + }
> > > +
> > > + return ret;
> > > + }
> > > + }
> > > +
> > > + tz->governor = new_gov;
> > > +
> > > + return ret;
> > > +}
> > > +
> > > int thermal_register_governor(struct thermal_governor *governor)
> > > {
> > > int err;
> > > @@ -104,8 +143,12 @@ int thermal_register_governor(struct thermal_governor *governor)
> > >
> > > name = pos->tzp->governor_name;
> > >
> > > - if (!strnicmp(name, governor->name, THERMAL_NAME_LENGTH))
> > > - pos->governor = governor;
> > > + if (!strnicmp(name, governor->name, THERMAL_NAME_LENGTH)) {
> > > + int ret = thermal_set_governor(pos, governor);
> >
> > nip: add a blank line between declarations and first statement.
>
> Ok
>
> > > + if (ret)
> > > + pr_warn("Failed to set governor %s for zone %d: %d\n",
> > > + governor->name, pos->id, ret);
> >
> > dev_* family of logging functions is preferred in this file.
>
> Done
>
> > > + }
> > > }
> > >
> > > mutex_unlock(&thermal_list_lock);
> > > @@ -131,7 +174,7 @@ void thermal_unregister_governor(struct thermal_governor *governor)
> > > list_for_each_entry(pos, &thermal_tz_list, node) {
> > > if (!strnicmp(pos->governor->name, governor->name,
> > > THERMAL_NAME_LENGTH))
> > > - pos->governor = NULL;
> > > + thermal_set_governor(pos, NULL);
> > > }
> > >
> > > mutex_unlock(&thermal_list_lock);
> > > @@ -756,8 +799,9 @@ policy_store(struct device *dev, struct device_attribute *attr,
> > > if (!gov)
> > > goto exit;
> > >
> > > - tz->governor = gov;
> > > - ret = count;
> > > + ret = thermal_set_governor(tz, gov);
> > > + if (!ret)
> > > + ret = count;
> > >
> > > exit:
> > > mutex_unlock(&thermal_governor_lock);
> > > @@ -1452,6 +1496,7 @@ struct thermal_zone_device *thermal_zone_device_register(const char *type,
> > > int result;
> > > int count;
> > > int passive = 0;
> > > + struct thermal_governor *governor;
> > >
> > > if (type && strlen(type) >= THERMAL_NAME_LENGTH)
> > > return ERR_PTR(-EINVAL);
> > > @@ -1542,9 +1587,15 @@ struct thermal_zone_device *thermal_zone_device_register(const char *type,
> > > mutex_lock(&thermal_governor_lock);
> > >
> > > if (tz->tzp)
> > > - tz->governor = __find_governor(tz->tzp->governor_name);
> > > + governor = __find_governor(tz->tzp->governor_name);
> > > else
> > > - tz->governor = def_governor;
> > > + governor = def_governor;
> > > +
> > > + result = thermal_set_governor(tz, governor);
> > > + if (result) {
> > > + mutex_unlock(&thermal_governor_lock);
> > > + goto unregister;
> > > + }
> > >
> > > mutex_unlock(&thermal_governor_lock);
> > >
> > > @@ -1634,7 +1685,7 @@ void thermal_zone_device_unregister(struct thermal_zone_device *tz)
> > > device_remove_file(&tz->device, &dev_attr_mode);
> > > device_remove_file(&tz->device, &dev_attr_policy);
> > > remove_trip_attrs(tz);
> > > - tz->governor = NULL;
> > > + thermal_set_governor(tz, NULL);
> > >
> > > thermal_remove_hwmon_sysfs(tz);
> > > release_idr(&thermal_tz_idr, &thermal_idr_lock, tz->id);
> > > diff --git a/include/linux/thermal.h b/include/linux/thermal.h
> > > index f7e11c7ea7d9..baac212f815e 100644
> > > --- a/include/linux/thermal.h
> > > +++ b/include/linux/thermal.h
> > > @@ -177,6 +177,7 @@ struct thermal_zone_device {
> > > struct thermal_zone_device_ops *ops;
> > > const struct thermal_zone_params *tzp;
> > > struct thermal_governor *governor;
> > > + void *governor_data;
> >
> > Please document this entry.
>
> Where? None of these entries are documented. I guess I'll have to
> add a patch that describes each one of them and then add ours.
>
> > > struct list_head thermal_instances;
> > > struct idr idr;
> > > struct mutex lock; /* protect thermal_instances list */
> > > @@ -187,6 +188,8 @@ struct thermal_zone_device {
> > > /* Structure that holds thermal governor information */
> > > struct thermal_governor {
> > > char name[THERMAL_NAME_LENGTH];
> > > + int (*bind_to_tz)(struct thermal_zone_device *tz);
> > > + void (*unbind_from_tz)(struct thermal_zone_device *tz);
> >
> > a governor cannot prevent unbinding?
>
> To me it's like open()/close() or malloc()/free(). You generally can
> fail to bind, but not fail to unbind. I can change it to an int if
> you think it is something that we may need.
>
> > Please, document the proposed operations.
>
> Same as before, none of the entries in this struct are documented so
> I'll submit a patch that describes the current ones and then add our
> new ones in our patch.
>
> > > int (*throttle)(struct thermal_zone_device *tz, int trip);
> > > struct list_head governor_list;
> > > };
> > > --
> > > 1.7.9.5
> >
> > I request that you update Documentation/thermal/sysfs-api.txt too.
>
> Will do.
>
> Cheers,
> Javi
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 3/5] thermal: add a basic cpu power actor
2014-05-13 22:57 ` Eduardo Valentin
@ 2014-05-15 17:02 ` Javi Merino
0 siblings, 0 replies; 17+ messages in thread
From: Javi Merino @ 2014-05-15 17:02 UTC (permalink / raw)
To: Eduardo Valentin; +Cc: linux-pm@vger.kernel.org, Punit Agrawal, Zhang Rui
Hi Eduardo,
On Tue, May 13, 2014 at 11:57:52PM +0100, Eduardo Valentin wrote:
> On Tue, May 06, 2014 at 01:06:36PM +0100, Javi Merino wrote:
> > Introduce a new thermal_cooling_unit: THERMAL_UNIT_POWER and let
> > cpu_cooling device operate using it as well as "states". This allows
> > governors that call cpufreq_get_max/cpufreq_get_cur/cpufreq_set_cur()
> > with THERMAL_UNIT_POWER to use cpufreq cooling devices.
> >
> > If the cpu cooling device is registered using
> > {of_,}cpufreq_power_actor_register() instead of
> > {of_,}cpufreq_cooling_register() it uses the current frequency (as
> > reported by cpufreq) as well as load and OPPs for the power
> > calculations. The cpus must have registered their OPPs in the OPP
> > library.
> >
> > Cc: Zhang Rui <rui.zhang@intel.com>
> > Cc: Eduardo Valentin <edubezval@gmail.com>
> > Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
> > Signed-off-by: Javi Merino <javi.merino@arm.com>
> > ---
> > drivers/thermal/cpu_cooling.c | 430 ++++++++++++++++++++++++++++++++++++----
> > drivers/thermal/thermal_core.c | 4 +-
> > include/linux/cpu_cooling.h | 28 +++
> > include/linux/thermal.h | 4 +
> > 4 files changed, 422 insertions(+), 44 deletions(-)
>
> Could you please also document under Documentation/thermal/sysfs-api.txt the exposed APIs ?
>
>
> >
> > diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
> > index f3f4e6b5798e..8052cd1f4a39 100644
> > --- a/drivers/thermal/cpu_cooling.c
> > +++ b/drivers/thermal/cpu_cooling.c
> > @@ -24,10 +24,27 @@
> > #include <linux/thermal.h>
> > #include <linux/cpufreq.h>
> > #include <linux/err.h>
> > +#include <linux/pm_opp.h>
> > #include <linux/slab.h>
> > #include <linux/cpu.h>
> > #include <linux/cpu_cooling.h>
> >
> > +/* Maximum number of opps we can track */
> > +#define MAX_NUM_OPPS 32
> > +
> > +/**
> > + * struct power_table - frequency to power conversion
> > + * @frequency: frequency in KHz
> > + * @power: power in mW
> > + *
> > + * This structure is built when the cooling device registers and helps
> > + * in translating frequency to power and viceversa.
> > + */
> > +struct power_table {
> > + u32 frequency;
> > + u32 power;
> > +};
> > +
> > /**
> > * struct cpufreq_cooling_device - data for cooling device with cpufreq
> > * @id: unique integer value corresponding to each cpufreq_cooling_device
> > @@ -39,6 +56,14 @@
> > * @cpufreq_val: integer value representing the absolute value of the clipped
> > * frequency.
> > * @allowed_cpus: all the cpus involved for this cpufreq_cooling_device.
> > + * @power_table: array of struct power_table for frequency to power conversion
> > + * @power_table_entries: number of entries in the @power_table array
> > + * @time_in_idle: previous reading of the absolute time that this cpu was idle
> > + * @time_in_idle_timestamp: wall time of the last invocation of
> > + * get_cpu_idle_time_us()
> > + * @last_load: load measured by the latest call to cpufreq_get_cur()
> > + * @freq: frequency in KHz of the cpus represented by the cooling device
> > + * @capacitance: the dynamic power coefficient of these cpus
> > *
> > * This structure is required for keeping information of each
> > * cpufreq_cooling_device registered. In order to prevent corruption of this a
> > @@ -50,11 +75,19 @@ struct cpufreq_cooling_device {
> > unsigned int cpufreq_state;
> > unsigned int cpufreq_val;
> > struct cpumask allowed_cpus;
> > + struct power_table power_table[MAX_NUM_OPPS];
>
> Can we build this table dynamically?
>
> I believe we should be able to fetch this table from opp layer. Besides,
> should not be trouble of this code to estimate the max number of opps.
Ok, I'll change that as you suggested below and remove MAX_NUM_OPPS.
> > + int power_table_entries;
> > + u64 time_in_idle[NR_CPUS];
> > + u64 time_in_idle_timestamp[NR_CPUS];
> > + u32 last_load;
> > + u32 freq;
> > + u32 capacitance;
> > };
> > static DEFINE_IDR(cpufreq_idr);
> > static DEFINE_MUTEX(cooling_cpufreq_lock);
> >
> > static unsigned int cpufreq_dev_count;
> > +static bool power_actor_notifier_registered;
>
> Are you sure you need this bool? Shouldn' t you be using a counter?
Yes, that should be a counter, like cpufreq_dev_count. It's used
exactly like cpufreq_dev_count (to prevent registering the same
notifier_block more than once) so it should have the same semantics.
> You probably need locking on this.
True, added.
> > /* notify_table passes value to the CPUFREQ_ADJUST callback function. */
> > #define NOTIFY_INVALID NULL
> > @@ -115,6 +148,59 @@ static int is_cpufreq_valid(int cpu)
> > return !cpufreq_get_policy(&policy, cpu);
> > }
> >
> > +static u32 cpu_freq_to_power(struct cpufreq_cooling_device *cpufreq_device,
> > + u32 freq)
> > +{
> > + int i;
> > + struct power_table *pt = cpufreq_device->power_table;
> > +
> > + for (i = 0; i < cpufreq_device->power_table_entries - 1; i++)
> > + if (freq <= pt[i].frequency)
> > + break;
> > +
> > + return pt[i].power;
> > +}
> > +
> > +static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_device,
> > + u32 power)
> > +{
> > + int i;
> > + struct power_table *pt = cpufreq_device->power_table;
> > +
> > + for (i = 0; i < cpufreq_device->power_table_entries - 1; i++)
> > + if (power <= pt[i].power)
> > + break;
> > +
> > + return pt[i].frequency;
> > +}
> > +
> > +/**
> > + * get_load - get load for a cpu since last updated
> > + * @cpufreq_device: struct cpufreq_cooling_device for this cooling device
> > + * @cpu: cpu number
> > + *
> > + * Return the average load of cpu @cpu in percentage since this
> > + * function was last called.
> > + */
> > +static u32 get_load(struct cpufreq_cooling_device *cpufreq_device, int cpu)
> > +{
> > + u32 load;
> > + u64 now, now_idle, delta_time, delta_idle;
> > +
> > + now_idle = get_cpu_idle_time(cpu, &now, 0);
> > + delta_idle = now_idle - cpufreq_device->time_in_idle[cpu];
> > + delta_time = now - cpufreq_device->time_in_idle_timestamp[cpu];
> > +
> > + if (delta_time <= delta_idle)
> > + load = 0;
> > + else
> > + load = div64_u64(100 * (delta_time - delta_idle), delta_time);
> > +
> > + cpufreq_device->time_in_idle[cpu] = now_idle;
> > + cpufreq_device->time_in_idle_timestamp[cpu] = now;
>
> nip: blank line here.
Ok
> > + return load;
> > +}
> > +
> > enum cpufreq_cooling_property {
> > GET_LEVEL,
> > GET_FREQ,
> > @@ -342,98 +428,193 @@ static int cpufreq_thermal_notifier(struct notifier_block *nb,
> > return 0;
> > }
> >
> > +/**
> > + * cpufreq_frequency_change - notifier callback for cpufreq frequency changes
> > + * @nb: struct notifier_block * with callback info
> > + * @event: value showing cpufreq event for which this function invoked
> > + * @data: callback-specific data
> > + *
> > + * Callback to get notifications of frequency changes. In the
> > + * CPUFREQ_POSTCHANGE @event we store the new frequency so that
> > + * cpufreq_get_cur() knows the current frequency and can convert it
> > + * into power.
> > + */
> > +static int cpufreq_frequency_change(struct notifier_block *nb,
> > + unsigned long event, void *data)
> > +{
> > + struct thermal_cooling_device *cdev;
> > + struct cpufreq_freqs *freqs = data;
> > +
> > + /* Only update frequency on postchange */
> > + if (event != CPUFREQ_POSTCHANGE)
> > + return NOTIFY_DONE;
> > +
> > + mutex_lock(&thermal_list_lock);
> > +
> > + list_for_each_entry(cdev, &thermal_cdev_list, node) {
> > + struct cpufreq_cooling_device *cpufreq_device;
> > +
> > + if (strncmp(cdev->type, "thermal-cpufreq", THERMAL_NAME_LENGTH))
> > + continue;
> > +
>
> I am not sure I follow why you need to go through the list of cooling devices
> here. Can you please elaborate a bit more?
Because of the code below, you need to find the cpufreq_device for
this cpu to update its frequency. I've reworked most of this patch
and I think this loop is now clearer (and simpler) in the new version.
> > + cpufreq_device = cdev->devdata;
> > +
> > + if (cpumask_test_cpu(freqs->cpu, &cpufreq_device->allowed_cpus))
> > + cpufreq_device->freq = freqs->new;
> > + }
> > +
> > + mutex_unlock(&thermal_list_lock);
> > +
> > + return NOTIFY_OK;
> > +}
> > +
> > /* cpufreq cooling device callback functions are defined below */
> >
> > /**
> > - * cpufreq_get_max_state - callback function to get the max cooling state.
> > + * cpufreq_get_max - callback function to get the max cooling state/power.
> > * @cdev: thermal cooling device pointer.
> > - * @state: fill this variable with the max cooling state.
> > - * @unit: the units in which @state should be in.
> > + * @val: fill this variable with the max cooling state/power.
> > + * @unit: whether to put state or power in @val.
> > *
> > * Callback for the thermal cooling device to return the cpufreq max
> > - * cooling state. Currently only THERMAL_UNIT_STATE is valid for
> > - * @unit.
> > + * cooling state/power. If @unit is THERMAL_UNIT_STATE, @val contains
> > + * the maximum cooling state; if @unit is THERMAL_UNIT_POWER, @val
> > + * maximum is power in milliwatts.
> > *
> > * Return: 0 on success, an error code otherwise.
> > */
> > -static int cpufreq_get_max_state(struct thermal_cooling_device *cdev,
> > - unsigned long *state,
> > - enum thermal_cooling_unit unit)
> > +static int cpufreq_get_max(struct thermal_cooling_device *cdev,
> > + unsigned long *val, enum thermal_cooling_unit unit)
>
> Shouldn't this change be part of previous patch?
I've nuked most of this and moved it to a specific file. Bottom line
is, extending the cooling device interface is a bit of a stretch and
doesn't work as well as I thought. Take for example this function:
cpufreq_get_max(cdev, &val, THERMAL_UNIT_STATE) gives you the maximum
cooling state, i.e. minimum frequency. OTOH,
cpufreq_get_max(cdev, &val, THERMAL_UNIT_POWER) gives you the maximum
power, i.e. the power consumption at maximum frequency.
Not only that, the interface is to rigid to allow for
platform-specific power models. More on that below, as you've pointed
out some of the issues.
> > {
> > struct cpufreq_cooling_device *cpufreq_device = cdev->devdata;
> > struct cpumask *mask = &cpufreq_device->allowed_cpus;
> > unsigned int cpu;
> > unsigned int count = 0;
> > - int ret;
> > -
> > - if (unit != THERMAL_UNIT_STATE)
> > - return -EINVAL;
>
> What happens to this code if tomorrow we add yet another unit?
>
>
> > + int ret = 0;
> >
> > cpu = cpumask_any(mask);
> >
> > - ret = get_property(cpu, 0, &count, GET_MAXL);
> > + if ((unit == THERMAL_UNIT_POWER) &&
> > + (cpufreq_device->power_table_entries)) {
>
> I think a cleaner code would use a switch case statement. You should
> deal with the power_table_entries restriction separatedly.
>
> > + unsigned int num_cpus;
> > + unsigned long max_freq;
> > +
> > + max_freq = get_cpu_frequency(cpu, 0);
> > + if (!max_freq)
> > + return -EINVAL;
> >
> > - if (count > 0)
> > - *state = count;
> > + num_cpus = cpumask_weight(mask);
> > + *val = cpu_freq_to_power(cdev->devdata, max_freq) * num_cpus;
> > + } else if (unit == THERMAL_UNIT_STATE) {
> > + ret = get_property(cpu, 0, &count, GET_MAXL);
> > +
> > + if (count > 0)
> > + *val = count;
> > + } else {
> > + ret = -EINVAL;
> > + }
> >
> > return ret;
> > }
> >
> > /**
> > - * cpufreq_get_cur_state - callback function to get the current cooling state.
> > + * cpufreq_get_cur - callback function to get the current cooling state/power.
> > * @cdev: thermal cooling device pointer.
> > - * @state: fill this variable with the current cooling state.
> > - * @unit: the units in which state should be in
> > + * @val: fill this variable with the current cooling state/power.
> > + * @unit: whether to put state or power in @val.
> > *
> > * Callback for the thermal cooling device to return the cpufreq
> > - * current cooling state. Currently only THERMAL_UNIT_STATE is valid
> > - * for @unit.
> > + * current cooling state/power. If @unit is THERMAL_UNIT_STATE, @val
> > + * contains the current cooling state; if @unit is THERMAL_UNIT_POWER,
> > + * @val is current power consumption in milliwatts.
> > *
> > * Return: 0 on success, an error code otherwise.
> > */
> > -static int cpufreq_get_cur_state(struct thermal_cooling_device *cdev,
> > - unsigned long *state,
> > - enum thermal_cooling_unit unit)
> > +static int cpufreq_get_cur(struct thermal_cooling_device *cdev,
> > + unsigned long *val, enum thermal_cooling_unit unit)
> > {
> > struct cpufreq_cooling_device *cpufreq_device = cdev->devdata;
> >
> > - if (unit != THERMAL_UNIT_STATE)
> > - return -EINVAL;
> > + if ((unit == THERMAL_UNIT_POWER) &&
> > + (cpufreq_device->power_table_entries)) {
>
> ditto.
>
> > + int cpu;
> > + u32 total_load = 0, raw_cpu_power, power = 0;
> > +
> > + raw_cpu_power = cpu_freq_to_power(cpufreq_device,
> > + cpufreq_device->freq);
> > +
> > + for_each_cpu(cpu, &cpufreq_device->allowed_cpus) {
> > + u32 load;
> >
> > - *state = cpufreq_device->cpufreq_state;
> > + if (!cpu_online(cpu))
> > + continue;
> > +
> > + load = get_load(cpufreq_device, cpu);
> > + power += (raw_cpu_power * load) / 100;
> > + total_load += load;
> > + }
> > +
> > + cpufreq_device->last_load = total_load;
> > + *val = power;
> > + } else if (unit == THERMAL_UNIT_STATE) {
> > + *val = cpufreq_device->cpufreq_state;
> > + } else {
> > + return -EINVAL;
> > + }
> >
> > return 0;
> > }
> >
> > /**
> > - * cpufreq_set_cur_state - callback function to set the current cooling state.
> > + * cpufreq_set_cur - callback function to set the current cooling state/power.
> > * @cdev: thermal cooling device pointer.
> > - * @state: set this variable to the current cooling state.
> > - * @unit: the units in which @state should be in.
> > + * @val: set this variable to the current cooling state/power.
> > + * @unit: whether @val is a cooling state or power.
> > *
> > * Callback for the thermal cooling device to change the cpufreq
> > - * current cooling state. Currently only THERMAL_UNIT_STATE is valid for
> > - * @unit.
> > + * current cooling state/power. If @unit is THERMAL_UNIT_STATE, @val
> > + * is a cooling state; if @unit is THERMAL_UNIT_POWER, @val is the target
> > + * power consumption in milliwatts.
> > *
> > * Return: 0 on success, an error code otherwise.
> > */
> > -static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
> > - unsigned long state,
> > - enum thermal_cooling_unit unit)
> > +static int cpufreq_set_cur(struct thermal_cooling_device *cdev,
> > + unsigned long val, enum thermal_cooling_unit unit)
> > {
> > struct cpufreq_cooling_device *cpufreq_device = cdev->devdata;
> > -
> > - if (unit != THERMAL_UNIT_STATE)
> > + unsigned long state;
> > +
> > + if ((unit == THERMAL_UNIT_POWER) &&
> > + (cpufreq_device->power_table_entries)) {
>
> ditto.
>
> > + unsigned int cpu, freq;
> > + u32 normalised_power, last_load;
> > +
> > + cpu = cpumask_any(&cpufreq_device->allowed_cpus);
> > + last_load = cpufreq_device->last_load ?
> > + cpufreq_device->last_load : 1;
> > + normalised_power = (val * 100) / last_load;
> > + freq = cpu_power_to_freq(cpufreq_device, normalised_power);
> > +
> > + state = cpufreq_cooling_get_level(cpu, freq);
> > + if (state == THERMAL_CSTATE_INVALID) {
> > + pr_err("Failed to convert %dKHz for cpu %d into a cdev state\n",
> > + freq, cpu);
> > + return -EINVAL;
> > + }
> > + } else if (unit == THERMAL_UNIT_STATE) {
> > + state = val;
> > + } else {
> > return -EINVAL;
> > + }
> >
> > return cpufreq_apply_cooling(cpufreq_device, state);
> > }
> >
> > /* Bind cpufreq callbacks to thermal cooling device ops */
> > static struct thermal_cooling_device_ops const cpufreq_cooling_ops = {
> > - .get_max = cpufreq_get_max_state,
> > - .get_cur = cpufreq_get_cur_state,
> > - .set_cur = cpufreq_set_cur_state,
> > + .get_max = cpufreq_get_max,
> > + .get_cur = cpufreq_get_cur,
> > + .set_cur = cpufreq_set_cur,
> > };
> >
> > /* Notifier for cpufreq policy change */
> > @@ -441,10 +622,95 @@ static struct notifier_block thermal_cpufreq_notifier_block = {
> > .notifier_call = cpufreq_thermal_notifier,
> > };
> >
> > +/* Notifier for cpufreq frequency changes */
> > +struct notifier_block cpufreq_transition_notifier = {
> > + .notifier_call = cpufreq_frequency_change,
> > +};
> > +
> > +/**
> > + * build_cpu_power_table - create a power to frequency table
> > + * @cpufreq_dev: the cpufreq_cooling_device in which to store the table
> > + *
> > + * Build a power to frequency table for this cpu and store it in
> > + * @cpufreq_dev. This table will be used in cpu_power_to_freq() and
> > + * cpu_freq_to_power() to convert between power and frequency
> > + * efficiently. Power is stored in mW, frequency in KHz. The
> > + * resulting table is in ascending order.
> > + *
> > + * Returns 0 on success, -E* on error.
> > + */
> > +static int build_cpu_power_table(struct cpufreq_cooling_device *cpufreq_dev)
> > +{
> > + struct dev_pm_opp *opp;
> > + struct device *dev = NULL;
> > + int num_opps, cpu, i, ret = 0;
> > + unsigned long freq;
> > +
> > + num_opps = 0;
> > +
> > + rcu_read_lock();
> > +
> > + for_each_cpu(cpu, &cpufreq_dev->allowed_cpus) {
> > + dev = get_cpu_device(cpu);
> > + if (!dev)
> > + continue;
> > +
> > + num_opps = dev_pm_opp_get_opp_count(dev);
> > + if (num_opps > 0) {
> > + break;
> > + } else if (num_opps < 0) {
> > + ret = num_opps;
> > + goto unlock;
> > + }
> > + }
> > +
> > + if ((num_opps == 0) || (num_opps > MAX_NUM_OPPS)) {
>
> hmm.. Does it mean that if an existing platform has more than 32
> opps, we won't support it? I would preffer we dynamically allocate
> the table here and cover any number of opps.
Will do.
> > + ret = -EINVAL;
> > + goto unlock;
> > + }
> > +
> > + i = 0;
> > + for (freq = 0;
> > + opp = dev_pm_opp_find_freq_ceil(dev, &freq), !IS_ERR(opp);
> > + freq++) {
> > + u32 freq_mhz, voltage_mv;
> > + u64 power;
> > +
> > + freq_mhz = freq / 1000000;
> > + voltage_mv = dev_pm_opp_get_voltage(opp) / 1000;
> > +
> > + /*
> > + * Do the multiplication with MHz and millivolt so as
> > + * to not overflow.
> > + */
> > + power = (u64)cpufreq_dev->capacitance * freq_mhz *
> > + voltage_mv * voltage_mv;
>
> Essentially: P = C x F x V^2 right? Your model covers for dynamic power
> only.
>
> I believe this power model neglets leakage. Static power consumption
> cannot be negleted nowadays, specially on thermal constrained CPUs.
> (which happens to be the target of this subsystem).
>
> My suggestion is that you model power as:
> P = P_{dyn} + P_{static}
>
> P_{dyn} can be what you propose above, for instance, or some other
> better increasing convex function. But we cannot neglect P_{static}.
For P_{static} we need the current temperature, which is a pain to get
from here without passing more parameters to the cooling device.
That's why I've removed this extension to the cooling device API and
just created another API to deal with getting and setting power.
> I suggest allowing platform device driver writer to input
> power models here somehow. Either by using callbacks to
> implement their own power model, or by feeding the API with
> their own table.
True, the power model needs to be "pluggable" and platforms should be
able to add their own.
> > + do_div(power, 1000000000);
> > +
> > + /* frequency is stored in power_table in KHz */
> > + cpufreq_dev->power_table[i].frequency = freq / 1000;
> > + cpufreq_dev->power_table[i].power = power;
> > +
> > + i++;
> > + }
> > +
> > + if (i == 0) {
> > + ret = PTR_ERR(opp);
> > + goto unlock;
> > + }
> > +
> > + cpufreq_dev->power_table_entries = i;
> > +
> > +unlock:
> > + rcu_read_unlock();
> > + return ret;
> > +}
> > +
> > /**
> > * __cpufreq_cooling_register - helper function to create cpufreq cooling device
> > * @np: a valid struct device_node to the cooling device device tree node
> > * @clip_cpus: cpumask of cpus where the frequency constraints will happen.
> > + * @capacitance: dynamic power coefficient for these cpus
> > *
> > * This interface function registers the cpufreq cooling device with the name
> > * "thermal-cpufreq-%x". This api can support multiple instances of cpufreq
> > @@ -456,7 +722,7 @@ static struct notifier_block thermal_cpufreq_notifier_block = {
> > */
> > static struct thermal_cooling_device *
> > __cpufreq_cooling_register(struct device_node *np,
> > - const struct cpumask *clip_cpus)
> > + const struct cpumask *clip_cpus, u32 capacitance)
> > {
> > struct thermal_cooling_device *cool_dev;
> > struct cpufreq_cooling_device *cpufreq_dev = NULL;
> > @@ -485,6 +751,7 @@ __cpufreq_cooling_register(struct device_node *np,
> > return ERR_PTR(-ENOMEM);
> >
> > cpumask_copy(&cpufreq_dev->allowed_cpus, clip_cpus);
> > + cpufreq_dev->capacitance = capacitance;
> >
> > ret = get_idr(&cpufreq_idr, &cpufreq_dev->id);
> > if (ret) {
> > @@ -531,7 +798,7 @@ __cpufreq_cooling_register(struct device_node *np,
> > struct thermal_cooling_device *
> > cpufreq_cooling_register(const struct cpumask *clip_cpus)
> > {
> > - return __cpufreq_cooling_register(NULL, clip_cpus);
> > + return __cpufreq_cooling_register(NULL, clip_cpus, 0);
> > }
> > EXPORT_SYMBOL_GPL(cpufreq_cooling_register);
> >
> > @@ -555,11 +822,90 @@ of_cpufreq_cooling_register(struct device_node *np,
> > if (!np)
> > return ERR_PTR(-EINVAL);
> >
> > - return __cpufreq_cooling_register(np, clip_cpus);
> > + return __cpufreq_cooling_register(np, clip_cpus, 0);
> > }
> > EXPORT_SYMBOL_GPL(of_cpufreq_cooling_register);
> >
> > /**
> > + * of_cpufreq_power_actor_register - function to create a cpu cooling device with power capabilities
> > + * @np: a valid struct device_node to the cooling device tree node
> > + * @clip_cpus: cpumask of cpus where the frequency constraints will happen
> > + * @capacitance: the dynamic power coefficient for these cpus
> > + *
> > + * Similar to of_cpufreq_cooling_register() this function registers a
> > + * cooling device linked to the device tree node provided. It has a
> > + * simple power model so that get_cur/get_max/set_cur can operate with
> > + * THERMAL_UNIT_POWER
> > + *
> > + * The cpus must have registered their OPPs in the OPP library.
> > + *
> > + * Return: a valid struct thermal_cooling_device pointer on success,
> > + * on failure, it returns a corresponding ERR_PTR().
> > + */
> > +struct thermal_cooling_device *
> > +of_cpufreq_power_actor_register(struct device_node *np,
> > + const struct cpumask *clip_cpus,
> > + u32 capacitance)
> > +{
> > + struct thermal_cooling_device *cdev, *err_ret;
> > + int ret;
> > +
> > + cdev = __cpufreq_cooling_register(np, clip_cpus, capacitance);
> > + if (IS_ERR(cdev))
> > + return cdev;
> > +
> > + ret = build_cpu_power_table(cdev->devdata);
> > + if (ret) {
> > + err_ret = ERR_PTR(ret);
> > + goto unregister;
> > + }
> > +
> > + /*
> > + * You can't register multiple times the same notifier_block.
> > + * The first power actor registered is the only one that
> > + * registers the notifier.
> > + */
> > + if (!power_actor_notifier_registered) {
> > + ret = cpufreq_register_notifier(&cpufreq_transition_notifier,
> > + CPUFREQ_TRANSITION_NOTIFIER);
> > + if (ret) {
> > + err_ret = ERR_PTR(ret);
> > + goto unregister;
> > + }
> > + power_actor_notifier_registered = true;
>
> I believe you need locking to update this global variable.
Yes.
> > + }
> > +
> > + return cdev;
> > +
> > +unregister:
> > + cpufreq_cooling_unregister(cdev);
> > + return err_ret;
> > +}
> > +EXPORT_SYMBOL_GPL(of_cpufreq_power_actor_register);
>
> This needs documentation under Documentation/thermal/sysfs-api.txt.
It's gone, but the new functions will need to be documented, yes.
Thanks,
Javi
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 4/5] thermal: introduce the Power Allocator governor
2014-05-13 23:20 ` Eduardo Valentin
@ 2014-05-15 18:24 ` Javi Merino
2014-05-22 14:26 ` Eduardo Valentin
0 siblings, 1 reply; 17+ messages in thread
From: Javi Merino @ 2014-05-15 18:24 UTC (permalink / raw)
To: Eduardo Valentin; +Cc: linux-pm@vger.kernel.org, Punit Agrawal, Zhang Rui
Hi Eduardo,
On Wed, May 14, 2014 at 12:20:41AM +0100, Eduardo Valentin wrote:
> Hello Javi,
>
> Here are a series of initial comments. I will be probably reviewing the
> code several times and coming back again with more questions.
Ok, we'll send updates regularly as we incorporate feedback and tune
it.
> On Tue, May 06, 2014 at 01:06:37PM +0100, Javi Merino wrote:
> > The power allocator governor is a thermal governor that controls system
> > and device power allocation to control temperature. Conceptually, the
> > implementation takes a system view of heat dissipation by managing
> > multiple heat sources.
>
> Can it be able to control several thermal zones then?
Yes, that's why we added bind_to_tz() and unbind_to_tz() to put the
governor's private data in the struct thermal_zone_device.
> > This governor relies on power-aware cooling devices (power actors) to
> > operate. That is, cooling devices whose thermal_cooling_device_ops
> > accept THERMAL_UNIT_POWER.
> >
> > It uses a Proportional Integral (PI) controller driven by the
> > temperature of the thermal zone. This budget is then allocated to
> > each cooling device that can have bearing on the temperature we are
> > trying to control. It decides how much power to give each cooling
> > device based on the performance they are requesting. The PI
> > controller ensures that the total power budget does not exceed the
> > control temperature.
> >
> > Cc: Zhang Rui <rui.zhang@intel.com>
> > Cc: Eduardo Valentin <edubezval@gmail.com>
> > Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
> > Signed-off-by: Javi Merino <javi.merino@arm.com>
> > ---
> > drivers/thermal/Kconfig | 14 ++
> > drivers/thermal/Makefile | 1 +
> > drivers/thermal/power_allocator.c | 403 +++++++++++++++++++++++++++++++++++++
> > drivers/thermal/thermal_core.c | 7 +-
> > drivers/thermal/thermal_core.h | 8 +
> > include/linux/thermal.h | 5 +
>
> We need documentation under Documentation/thermal to understand
> how to use this governor. Specially on how to accomodate
> several actors and domains under its control.
Sure, the next series will have some documentation on how to integrate
it in a platform and how all the components interact.
> > 6 files changed, 437 insertions(+), 1 deletion(-)
> > create mode 100644 drivers/thermal/power_allocator.c
> >
> > diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
> > index 2d51912a6e40..f5e6b693fc19 100644
> > --- a/drivers/thermal/Kconfig
> > +++ b/drivers/thermal/Kconfig
> > @@ -71,6 +71,14 @@ config THERMAL_DEFAULT_GOV_USER_SPACE
> > Select this if you want to let the user space manage the
> > platform thermals.
> >
> > +config THERMAL_DEFAULT_GOV_POWER_ALLOCATOR
> > + bool "power_allocator"
> > + select THERMAL_GOV_POWER_ALLOCATOR
> > + help
> > + Select this if you want to control temperature based on
> > + system and device power allocation. This governor relies on
> > + power-aware cooling devices (power actors) to operate.
> > +
> > endchoice
> >
> > config THERMAL_GOV_FAIR_SHARE
> > @@ -89,6 +97,12 @@ config THERMAL_GOV_USER_SPACE
> > help
> > Enable this to let the user space manage the platform thermals.
> >
> > +config THERMAL_GOV_POWER_ALLOCATOR
> > + bool "Power allocator thermal governor"
> > + help
> > + Enable this to manage platform thermals by dynamically
> > + allocating and limiting power to devices.
> > +
> > config CPU_THERMAL
> > bool "generic cpu cooling support"
> > depends on CPU_FREQ
> > diff --git a/drivers/thermal/Makefile b/drivers/thermal/Makefile
> > index 54e4ec9eb5df..1cb7d5fec8dc 100644
> > --- a/drivers/thermal/Makefile
> > +++ b/drivers/thermal/Makefile
> > @@ -13,6 +13,7 @@ thermal_sys-$(CONFIG_THERMAL_OF) += of-thermal.o
> > thermal_sys-$(CONFIG_THERMAL_GOV_FAIR_SHARE) += fair_share.o
> > thermal_sys-$(CONFIG_THERMAL_GOV_STEP_WISE) += step_wise.o
> > thermal_sys-$(CONFIG_THERMAL_GOV_USER_SPACE) += user_space.o
> > +thermal_sys-$(CONFIG_THERMAL_GOV_POWER_ALLOCATOR) += power_allocator.o
> >
> > # cpufreq cooling
> > thermal_sys-$(CONFIG_CPU_THERMAL) += cpu_cooling.o
> > diff --git a/drivers/thermal/power_allocator.c b/drivers/thermal/power_allocator.c
> > new file mode 100644
> > index 000000000000..39c425abf10a
> > --- /dev/null
> > +++ b/drivers/thermal/power_allocator.c
> > @@ -0,0 +1,403 @@
> > +/*
> > + * A power allocator to manage temperature
> > + *
> > + * Copyright (C) 2014 ARM Ltd.
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + *
> > + * This program is distributed "as is" WITHOUT ANY WARRANTY of any
> > + * kind, whether express or implied; without even the implied warranty
> > + * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> > + * GNU General Public License for more details.
> > + */
> > +
> > +#define pr_fmt(fmt) "Power allocator: " fmt
> > +
> > +#include <linux/list.h>
> > +#include <linux/slab.h>
> > +#include <linux/thermal.h>
> > +
> > +#include "thermal_core.h"
> > +
> > +#define MAX_NUM_ACTORS 8
>
> hmm... Why do we need to limit the amount of actors?
Right, I could allocate the arrays dynamically.
> > +
> > +#define FRAC_BITS 8
> > +#define int_to_frac(x) ((x) << FRAC_BITS)
> > +#define frac_to_int(x) ((x) >> FRAC_BITS)
> > +
> > +/**
> > + * mul_frac - multiply two fixed-point numbers
> > + * @x: first multiplicand
> > + * @y: second multiplicand
> > + *
> > + * Returns the result of multiplying two fixed-point numbers. The
> > + * result is also a fixed-point number.
> > + */
> > +static inline s64 mul_frac(s64 x, s64 y)
> > +{
> > + return (x * y) >> FRAC_BITS;
> > +}
> > +
> > +enum power_allocator_trip_levels {
> > + TRIP_SWITCH_ON = 0, /* Switch on PI controller */
> > + TRIP_MAX_DESIRED_TEMPERATURE, /* Temperature we are controlling for */
> > +};
>
> Can we reuse the trip types we already have?
You mean using "passive" as a "switch on" and "active" as
MAX_DESIRED_TEMPERATURE? If you prefer it that way, I can do that.
> > +
> > +/**
> > + * struct power_allocator_params - parameters for the power allocator governor
> > + * @k_po: P parameter of the PI controller when overshooting (i.e., when
> > + * temperature is below the target)
> > + * @k_pi: P parameter of the PI controller when undershooting
> > + * @k_i: I parameter of the PI controller
> > + * @integral_cutoff: threshold below which the error is no longer accumulated
> > + in the PI controller
> > + * @err_integral: Accumulated error in the PI controller.
> > + */
> > +struct power_allocator_params {
> > + s32 k_po;
> > + s32 k_pu;
> > + s32 k_i;
> > + s32 integral_cutoff;
> > + s32 err_integral;
> > +};
>
> We would need a documentation on how to define these values.
Originally I thought that this should not need to be exposed to the
outside world and could have defaults that worked everywhere. I guess
people will like to fine tune them, so apart from the sane defaults,
there should be a way for platforms to override this, and that would
need documentation.
> > +
> > +/**
> > + * pi_controller() - PI controller
> > + * @tz: thermal zone we are operating in
> > + * @control_temp: The target temperature
> > + * @max_allocatable_power: maximum allocatable power for this thermal zone
> > + *
> > + * This PI controller increases the available power budget so that the
> > + * temperature of the thermal zone gets as close as possible to
> > + * @control_temp and limits the power if it exceeds it. k_po is the
> > + * proportional term when we are overshooting, k_pu is the
> > + * proportional term when we are undershooting. integral_cutoff is a
> > + * threshold below which we stop accumulating the error. The
> > + * accumulated error is only valid if the requested power will make
> > + * the system warmer. If the system is mostly idle, there's no point
> > + * in accumulating positive error.
> > + *
> > + * It returns the power budget for the next period.
> > + */
> > +static u32 pi_controller(struct thermal_zone_device *tz,
> > + unsigned long current_temp, unsigned long control_temp,
> > + unsigned long max_allocatable_power)
> > +{
> > + s64 p, i, power_range;
> > + s32 err;
> > + struct power_allocator_params *params = tz->governor_data;
> > +
> > + err = ((s32)control_temp - (s32)current_temp) / 1000;
> > + err = int_to_frac(err);
> > +
> > + /* Calculate the proportional term */
> > + p = mul_frac(err < 0 ? params->k_po : params->k_pu, err);
> > +
> > + /*
> > + * Calculate the integral term
> > + *
> > + * if the error s less than cut off allow integration (but
> > + * the integral is limited to max power)
> > + */
> > + i = mul_frac(params->k_i, params->err_integral);
> > +
> > + if (err < int_to_frac(params->integral_cutoff)) {
> > + s64 tmpi = mul_frac(params->k_i, err);
> > + tmpi += i;
> > + if (tmpi <= int_to_frac(max_allocatable_power)) {
> > + i = tmpi;
> > + params->err_integral += err;
> > + }
> > + }
> > +
> > + power_range = p + i;
> > +
> > + /* feed-forward the known maximum dissipatable power */
> > + power_range = tz->tzp->max_dissipatable_power +
> > + frac_to_int(power_range);
> > +
> > + /* output should not exceed max power */
> > + if (power_range > max_allocatable_power)
> > + power_range = max_allocatable_power;
> > +
> > + return power_range >= 0 ? power_range : 0;
>
> What does it mean return 0? The above if does not make completelly sense
> to me, as power_range can still be 0. power_range < 0 means error?
You return 0 if the calculated power range was negative. I've
rewritten the above code to be:
return clamp(power_range, (s64)0, (s64)max_allocatable_power);
which I think it's clear.
> what if the value at power_range overflows the u32?
Its maximum value is max_allocatable_power which in sane systems
shouldn't be greater than u32. I could put a BUG() somewhere if
max_allocatable_power > U32_MAX, do you think it's worth it?
> > +}
> > +
> > +static void divvy_up_power(struct thermal_zone_device *tz,
>
> Can you please document how you perform this power split?
Will do.
> > + unsigned long req_power[MAX_NUM_ACTORS],
> > + unsigned long max_power[MAX_NUM_ACTORS],
> > + int num_actors, unsigned long total_req_power,
> > + u32 power_range,
> > + unsigned long granted_power[MAX_NUM_ACTORS])
> > +{
> > + unsigned long extra_power, capped_extra_power;
> > + unsigned long extra_actor_power[MAX_NUM_ACTORS];
> > + int i;
> > +
> > + if (!total_req_power) {
> > + /*
> > + * Nobody requested anything, so just give everybody
> > + * the maximum power
> > + */
> > + for (i = 0; i < num_actors; i++)
> > + granted_power[i] = max_power[i];
> > +
> > + return;
> > + }
> > +
> > + capped_extra_power = 0;
> > + extra_power = 0;
> > + for (i = 0; i < num_actors; i++) {
> > + u64 req_range = req_power[i] * power_range;
> > +
> > + granted_power[i] = div_u64(req_range, total_req_power);
> > +
> > + if (granted_power[i] > max_power[i]) {
> > + extra_power += granted_power[i] - max_power[i];
> > + granted_power[i] = max_power[i];
> > + }
> > +
> > + extra_actor_power[i] = max_power[i] - granted_power[i];
> > + capped_extra_power += extra_actor_power[i];
> > + }
> > +
> > + if (!extra_power)
> > + return;
> > +
> > + /*
> > + * Re-divvy the reclaimed extra among actors based on
> > + * how far they are from the max
> > + */
> > + extra_power = min(extra_power, capped_extra_power);
> > + if (capped_extra_power > 0)
> > + for (i = 0; i < num_actors; i++)
> > + granted_power[i] += (extra_actor_power[i] *
> > + extra_power) / capped_extra_power;
> > +}
> > +
> > +static int allocate_power(struct thermal_zone_device *tz,
> > + unsigned long current_temp, unsigned long control_temp)
> > +{
> > + struct thermal_instance *instance;
> > + unsigned long req_power[MAX_NUM_ACTORS], max_power[MAX_NUM_ACTORS];
> > + unsigned long granted_power[MAX_NUM_ACTORS];
> > + unsigned long total_req_power, max_allocatable_power;
> > + u32 power_range;
> > + int i, num_actors, ret = 0;
> > +
> > + i = 0;
> > + total_req_power = 0;
> > + max_allocatable_power = 0;
> > +
> > + mutex_lock(&tz->lock);
> > +
> > + list_for_each_entry(instance, &tz->thermal_instances, tz_node) {
> > + struct thermal_cooling_device *cdev = instance->cdev;
> > +
> > + if (instance->trip != TRIP_MAX_DESIRED_TEMPERATURE)
>
> How does one instance->trip gets the value of
> TRIP_MAX_DESIRED_TEMPERATURE?
In power_allocator_bind() you fail to bind if the trip point doesn't
exist. This may be removed if we switch passive/active trip points.
> > + continue;
> > +
> > + if (i >= MAX_NUM_ACTORS) {
> > + pr_warn("Too many actors (%d) for this governor\n", i);
>
> dev_* family is preferrable over pr_* family. Same applies to other
> patches.
With the &tz->device as the device? Sure.
> > + ret = -EINVAL;
> > + goto unlock;
> > + }
> > +
> > + ret = cdev->ops->get_cur(cdev, &req_power[i],
> > + THERMAL_UNIT_POWER);
> > + if (ret) {
> > + pr_warn_once("Couldn't get current power for %s: %d\n",
> > + cdev->type, ret);
> > + goto unlock;
> > + }
> > +
> > + total_req_power += req_power[i];
> > +
> > + cdev->ops->get_max(cdev, &max_power[i], THERMAL_UNIT_POWER);
> > + max_allocatable_power += max_power[i];
> > +
> > + i++;
> > + }
> > + num_actors = i;
> > +
> > + power_range = pi_controller(tz, current_temp, control_temp,
> > + max_allocatable_power);
> > +
> > + divvy_up_power(tz, req_power, max_power, num_actors, total_req_power,
> > + power_range, granted_power);
> > +
> > + i = 0;
> > + list_for_each_entry(instance, &tz->thermal_instances, tz_node) {
> > + struct thermal_cooling_device *cdev = instance->cdev;
> > +
> > + if (instance->trip != TRIP_MAX_DESIRED_TEMPERATURE)
> > + continue;
> > +
> > + cdev->ops->set_cur(cdev, granted_power[i], THERMAL_UNIT_POWER);
> > + i++;
> > + }
> > +
> > +unlock:
> > + mutex_unlock(&tz->lock);
> > +
> > + return ret;
> > +}
> > +
> > +static int check_trips(struct thermal_zone_device *tz)
> > +{
> > + int ret;
> > + enum thermal_trip_type type;
> > +
> > + if (tz->trips < 2)
> > + return -EINVAL;
> > +
> > + ret = tz->ops->get_trip_type(tz, TRIP_SWITCH_ON, &type);
> > + if (ret)
> > + return ret;
> > +
> > + if ((type != THERMAL_TRIP_PASSIVE) && (type != THERMAL_TRIP_ACTIVE))
> > + return -EINVAL;
> > +
> > + ret = tz->ops->get_trip_type(tz, TRIP_MAX_DESIRED_TEMPERATURE, &type);
> > + if (ret)
> > + return ret;
> > +
> > + if ((type != THERMAL_TRIP_PASSIVE) && (type != THERMAL_TRIP_ACTIVE))
> > + return -EINVAL;
> > +
> > + return ret;
> > +}
> > +
> > +static void reset_pi_controller(struct power_allocator_params *params)
> > +{
> > + params->err_integral = 0;
> > +}
> > +
> > +/**
> > + * power_allocator_bind - bind the power_allocator governor to a thermal zone
> > + * @tz: thermal zone to bind it to
> > + *
> > + * Check that the thermal zone is valid for this governor: has two
> > + * thermal trips. If so, initialize the PI controller parameters and
> > + * bind it to the thermal zone.
> > + *
> > + * Returns 0 on success, -EINVAL if the trips were invalid or -ENOMEM
> > + * if we ran out of memory.
> > + */
> > +static int power_allocator_bind(struct thermal_zone_device *tz)
> > +{
> > + int ret;
> > + struct power_allocator_params *params;
> > + unsigned long switch_on_temp, control_temp;
> > + u32 temperature_threshold;
> > +
> > + ret = check_trips(tz);
> > + if (ret) {
> > + pr_err("thermal zone %s has the wrong number of trips for this governor\n",
> > + tz->type);
> > + return ret;
> > + }
> > +
> > + if (!tz->tzp || !tz->tzp->max_dissipatable_power) {
> > + pr_err("Trying to bind the power_allocator governor to a thermal zone without the max_dissipatable_power parameter\n");
>
> Can you please break the above line?
I prefer to keep it as it is. From Documentation/CodingStyle:
never break user-visible strings such as printk messages, because
that breaks the ability to grep for them.
The string could be improved though.
> > + return -EINVAL;
> > + }
> > +
> > + params = kzalloc(sizeof(*params), GFP_KERNEL);
>
> can we use devm_kzalloc?
Sure.
> > + if (!params)
> > + return -ENOMEM;
> > +
> > + ret = tz->ops->get_trip_temp(tz, TRIP_SWITCH_ON, &switch_on_temp);
> > + if (ret)
> > + goto free;
> > +
> > + ret = tz->ops->get_trip_temp(tz, TRIP_MAX_DESIRED_TEMPERATURE,
> > + &control_temp);
>
> thermal zone drivers are not aware of these trip types, are they?
I'll do what you suggested before and turn them into passive/active.
> > + if (ret)
> > + goto free;
> > +
> > + temperature_threshold = (control_temp - switch_on_temp) / 1000;
> > +
> > + params->k_po = int_to_frac(tz->tzp->max_dissipatable_power) /
> > + temperature_threshold;
> > + params->k_pu = int_to_frac(2 * tz->tzp->max_dissipatable_power) /
> > + temperature_threshold;
> > + params->k_i = int_to_frac(1);
>
> I suppose we need a way to input k_po, k_pu and k_i parameters from
> platform code to this governor right? Or are they always defined as the
> above code? always based on max_dissipated_power?
As I said above, I thought that we could just provide defaults and not
make life harder for platform code, but I guess we need to provide a
way for platform to override them.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 4/5] thermal: introduce the Power Allocator governor
2014-05-15 18:24 ` Javi Merino
@ 2014-05-22 14:26 ` Eduardo Valentin
0 siblings, 0 replies; 17+ messages in thread
From: Eduardo Valentin @ 2014-05-22 14:26 UTC (permalink / raw)
To: Javi Merino; +Cc: linux-pm@vger.kernel.org, Punit Agrawal, Zhang Rui
Hello again Javi,
I will answer couple of your questions here. But I will also review your
V2, no worries.
On Thu, May 15, 2014 at 07:24:44PM +0100, Javi Merino wrote:
> Hi Eduardo,
>
> On Wed, May 14, 2014 at 12:20:41AM +0100, Eduardo Valentin wrote:
> > Hello Javi,
> >
> > Here are a series of initial comments. I will be probably reviewing the
> > code several times and coming back again with more questions.
>
> Ok, we'll send updates regularly as we incorporate feedback and tune
> it.
great!
>
> > On Tue, May 06, 2014 at 01:06:37PM +0100, Javi Merino wrote:
> > > The power allocator governor is a thermal governor that controls system
> > > and device power allocation to control temperature. Conceptually, the
> > > implementation takes a system view of heat dissipation by managing
> > > multiple heat sources.
> >
> > Can it be able to control several thermal zones then?
>
> Yes, that's why we added bind_to_tz() and unbind_to_tz() to put the
> governor's private data in the struct thermal_zone_device.
I see. But I was more interested if you meant that this governor is
capable of controling several thermal zones at the same time. That is,
builds decision based on information taken from several thermal zones.
Was that what you meant?
>
> > > This governor relies on power-aware cooling devices (power actors) to
> > > operate. That is, cooling devices whose thermal_cooling_device_ops
> > > accept THERMAL_UNIT_POWER.
> > >
> > > It uses a Proportional Integral (PI) controller driven by the
> > > temperature of the thermal zone. This budget is then allocated to
> > > each cooling device that can have bearing on the temperature we are
> > > trying to control. It decides how much power to give each cooling
> > > device based on the performance they are requesting. The PI
> > > controller ensures that the total power budget does not exceed the
> > > control temperature.
> > >
> > > Cc: Zhang Rui <rui.zhang@intel.com>
> > > Cc: Eduardo Valentin <edubezval@gmail.com>
> > > Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
> > > Signed-off-by: Javi Merino <javi.merino@arm.com>
> > > ---
> > > drivers/thermal/Kconfig | 14 ++
> > > drivers/thermal/Makefile | 1 +
> > > drivers/thermal/power_allocator.c | 403 +++++++++++++++++++++++++++++++++++++
> > > drivers/thermal/thermal_core.c | 7 +-
> > > drivers/thermal/thermal_core.h | 8 +
> > > include/linux/thermal.h | 5 +
> >
> > We need documentation under Documentation/thermal to understand
> > how to use this governor. Specially on how to accomodate
> > several actors and domains under its control.
>
> Sure, the next series will have some documentation on how to integrate
> it in a platform and how all the components interact.
>
ok.
> > > 6 files changed, 437 insertions(+), 1 deletion(-)
> > > create mode 100644 drivers/thermal/power_allocator.c
> > >
> > > diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
> > > index 2d51912a6e40..f5e6b693fc19 100644
> > > --- a/drivers/thermal/Kconfig
> > > +++ b/drivers/thermal/Kconfig
> > > @@ -71,6 +71,14 @@ config THERMAL_DEFAULT_GOV_USER_SPACE
> > > Select this if you want to let the user space manage the
> > > platform thermals.
> > >
> > > +config THERMAL_DEFAULT_GOV_POWER_ALLOCATOR
> > > + bool "power_allocator"
> > > + select THERMAL_GOV_POWER_ALLOCATOR
> > > + help
> > > + Select this if you want to control temperature based on
> > > + system and device power allocation. This governor relies on
> > > + power-aware cooling devices (power actors) to operate.
> > > +
> > > endchoice
> > >
> > > config THERMAL_GOV_FAIR_SHARE
> > > @@ -89,6 +97,12 @@ config THERMAL_GOV_USER_SPACE
> > > help
> > > Enable this to let the user space manage the platform thermals.
> > >
> > > +config THERMAL_GOV_POWER_ALLOCATOR
> > > + bool "Power allocator thermal governor"
> > > + help
> > > + Enable this to manage platform thermals by dynamically
> > > + allocating and limiting power to devices.
> > > +
> > > config CPU_THERMAL
> > > bool "generic cpu cooling support"
> > > depends on CPU_FREQ
> > > diff --git a/drivers/thermal/Makefile b/drivers/thermal/Makefile
> > > index 54e4ec9eb5df..1cb7d5fec8dc 100644
> > > --- a/drivers/thermal/Makefile
> > > +++ b/drivers/thermal/Makefile
> > > @@ -13,6 +13,7 @@ thermal_sys-$(CONFIG_THERMAL_OF) += of-thermal.o
> > > thermal_sys-$(CONFIG_THERMAL_GOV_FAIR_SHARE) += fair_share.o
> > > thermal_sys-$(CONFIG_THERMAL_GOV_STEP_WISE) += step_wise.o
> > > thermal_sys-$(CONFIG_THERMAL_GOV_USER_SPACE) += user_space.o
> > > +thermal_sys-$(CONFIG_THERMAL_GOV_POWER_ALLOCATOR) += power_allocator.o
> > >
> > > # cpufreq cooling
> > > thermal_sys-$(CONFIG_CPU_THERMAL) += cpu_cooling.o
> > > diff --git a/drivers/thermal/power_allocator.c b/drivers/thermal/power_allocator.c
> > > new file mode 100644
> > > index 000000000000..39c425abf10a
> > > --- /dev/null
> > > +++ b/drivers/thermal/power_allocator.c
> > > @@ -0,0 +1,403 @@
> > > +/*
> > > + * A power allocator to manage temperature
> > > + *
> > > + * Copyright (C) 2014 ARM Ltd.
> > > + *
> > > + * This program is free software; you can redistribute it and/or modify
> > > + * it under the terms of the GNU General Public License version 2 as
> > > + * published by the Free Software Foundation.
> > > + *
> > > + * This program is distributed "as is" WITHOUT ANY WARRANTY of any
> > > + * kind, whether express or implied; without even the implied warranty
> > > + * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> > > + * GNU General Public License for more details.
> > > + */
> > > +
> > > +#define pr_fmt(fmt) "Power allocator: " fmt
> > > +
> > > +#include <linux/list.h>
> > > +#include <linux/slab.h>
> > > +#include <linux/thermal.h>
> > > +
> > > +#include "thermal_core.h"
> > > +
> > > +#define MAX_NUM_ACTORS 8
> >
> > hmm... Why do we need to limit the amount of actors?
>
> Right, I could allocate the arrays dynamically.
good.
>
> > > +
> > > +#define FRAC_BITS 8
> > > +#define int_to_frac(x) ((x) << FRAC_BITS)
> > > +#define frac_to_int(x) ((x) >> FRAC_BITS)
> > > +
> > > +/**
> > > + * mul_frac - multiply two fixed-point numbers
> > > + * @x: first multiplicand
> > > + * @y: second multiplicand
> > > + *
> > > + * Returns the result of multiplying two fixed-point numbers. The
> > > + * result is also a fixed-point number.
> > > + */
> > > +static inline s64 mul_frac(s64 x, s64 y)
> > > +{
> > > + return (x * y) >> FRAC_BITS;
> > > +}
> > > +
> > > +enum power_allocator_trip_levels {
> > > + TRIP_SWITCH_ON = 0, /* Switch on PI controller */
> > > + TRIP_MAX_DESIRED_TEMPERATURE, /* Temperature we are controlling for */
> > > +};
> >
> > Can we reuse the trip types we already have?
>
> You mean using "passive" as a "switch on" and "active" as
> MAX_DESIRED_TEMPERATURE? If you prefer it that way, I can do that.
hmmm.. well, "active", as per its original definition, targets points in
which the system performs (cooling) actions that burns more power to
remove heat. On the other hand, "passive" removes heat by reducing/modulating system's power consumption.
I believe you would need two passive trip points.
Another option is to use a passive trip point to map your switch on
event and a hot trip point to map your target temperature.
>
> > > +
> > > +/**
> > > + * struct power_allocator_params - parameters for the power allocator governor
> > > + * @k_po: P parameter of the PI controller when overshooting (i.e., when
> > > + * temperature is below the target)
> > > + * @k_pi: P parameter of the PI controller when undershooting
> > > + * @k_i: I parameter of the PI controller
> > > + * @integral_cutoff: threshold below which the error is no longer accumulated
> > > + in the PI controller
> > > + * @err_integral: Accumulated error in the PI controller.
> > > + */
> > > +struct power_allocator_params {
> > > + s32 k_po;
> > > + s32 k_pu;
> > > + s32 k_i;
> > > + s32 integral_cutoff;
> > > + s32 err_integral;
> > > +};
> >
> > We would need a documentation on how to define these values.
>
> Originally I thought that this should not need to be exposed to the
> outside world and could have defaults that worked everywhere. I guess
> people will like to fine tune them, so apart from the sane defaults,
> there should be a way for platforms to override this, and that would
> need documentation.
Can we achieve safe defaults applicable to every platforms?
>
> > > +
> > > +/**
> > > + * pi_controller() - PI controller
> > > + * @tz: thermal zone we are operating in
> > > + * @control_temp: The target temperature
> > > + * @max_allocatable_power: maximum allocatable power for this thermal zone
> > > + *
> > > + * This PI controller increases the available power budget so that the
> > > + * temperature of the thermal zone gets as close as possible to
> > > + * @control_temp and limits the power if it exceeds it. k_po is the
> > > + * proportional term when we are overshooting, k_pu is the
> > > + * proportional term when we are undershooting. integral_cutoff is a
> > > + * threshold below which we stop accumulating the error. The
> > > + * accumulated error is only valid if the requested power will make
> > > + * the system warmer. If the system is mostly idle, there's no point
> > > + * in accumulating positive error.
> > > + *
> > > + * It returns the power budget for the next period.
> > > + */
> > > +static u32 pi_controller(struct thermal_zone_device *tz,
> > > + unsigned long current_temp, unsigned long control_temp,
> > > + unsigned long max_allocatable_power)
> > > +{
> > > + s64 p, i, power_range;
> > > + s32 err;
> > > + struct power_allocator_params *params = tz->governor_data;
> > > +
> > > + err = ((s32)control_temp - (s32)current_temp) / 1000;
> > > + err = int_to_frac(err);
> > > +
> > > + /* Calculate the proportional term */
> > > + p = mul_frac(err < 0 ? params->k_po : params->k_pu, err);
> > > +
> > > + /*
> > > + * Calculate the integral term
> > > + *
> > > + * if the error s less than cut off allow integration (but
> > > + * the integral is limited to max power)
> > > + */
> > > + i = mul_frac(params->k_i, params->err_integral);
> > > +
> > > + if (err < int_to_frac(params->integral_cutoff)) {
> > > + s64 tmpi = mul_frac(params->k_i, err);
> > > + tmpi += i;
> > > + if (tmpi <= int_to_frac(max_allocatable_power)) {
> > > + i = tmpi;
> > > + params->err_integral += err;
> > > + }
> > > + }
> > > +
> > > + power_range = p + i;
> > > +
> > > + /* feed-forward the known maximum dissipatable power */
> > > + power_range = tz->tzp->max_dissipatable_power +
> > > + frac_to_int(power_range);
> > > +
> > > + /* output should not exceed max power */
> > > + if (power_range > max_allocatable_power)
> > > + power_range = max_allocatable_power;
> > > +
> > > + return power_range >= 0 ? power_range : 0;
> >
> > What does it mean return 0? The above if does not make completelly sense
> > to me, as power_range can still be 0. power_range < 0 means error?
>
> You return 0 if the calculated power range was negative. I've
> rewritten the above code to be:
>
> return clamp(power_range, (s64)0, (s64)max_allocatable_power);
>
> which I think it's clear.
sounds good to me.
>
> > what if the value at power_range overflows the u32?
>
> Its maximum value is max_allocatable_power which in sane systems
> shouldn't be greater than u32. I could put a BUG() somewhere if
> max_allocatable_power > U32_MAX, do you think it's worth it?
>
I agree to have a sane check at least in initialization phase to avoid
obscure overflow errors.
> > > +}
> > > +
> > > +static void divvy_up_power(struct thermal_zone_device *tz,
> >
> > Can you please document how you perform this power split?
>
> Will do.
good.
>
> > > + unsigned long req_power[MAX_NUM_ACTORS],
> > > + unsigned long max_power[MAX_NUM_ACTORS],
> > > + int num_actors, unsigned long total_req_power,
> > > + u32 power_range,
> > > + unsigned long granted_power[MAX_NUM_ACTORS])
> > > +{
> > > + unsigned long extra_power, capped_extra_power;
> > > + unsigned long extra_actor_power[MAX_NUM_ACTORS];
> > > + int i;
> > > +
> > > + if (!total_req_power) {
> > > + /*
> > > + * Nobody requested anything, so just give everybody
> > > + * the maximum power
> > > + */
> > > + for (i = 0; i < num_actors; i++)
> > > + granted_power[i] = max_power[i];
> > > +
> > > + return;
> > > + }
> > > +
> > > + capped_extra_power = 0;
> > > + extra_power = 0;
> > > + for (i = 0; i < num_actors; i++) {
> > > + u64 req_range = req_power[i] * power_range;
> > > +
> > > + granted_power[i] = div_u64(req_range, total_req_power);
> > > +
> > > + if (granted_power[i] > max_power[i]) {
> > > + extra_power += granted_power[i] - max_power[i];
> > > + granted_power[i] = max_power[i];
> > > + }
> > > +
> > > + extra_actor_power[i] = max_power[i] - granted_power[i];
> > > + capped_extra_power += extra_actor_power[i];
> > > + }
> > > +
> > > + if (!extra_power)
> > > + return;
> > > +
> > > + /*
> > > + * Re-divvy the reclaimed extra among actors based on
> > > + * how far they are from the max
> > > + */
> > > + extra_power = min(extra_power, capped_extra_power);
> > > + if (capped_extra_power > 0)
> > > + for (i = 0; i < num_actors; i++)
> > > + granted_power[i] += (extra_actor_power[i] *
> > > + extra_power) / capped_extra_power;
> > > +}
> > > +
> > > +static int allocate_power(struct thermal_zone_device *tz,
> > > + unsigned long current_temp, unsigned long control_temp)
> > > +{
> > > + struct thermal_instance *instance;
> > > + unsigned long req_power[MAX_NUM_ACTORS], max_power[MAX_NUM_ACTORS];
> > > + unsigned long granted_power[MAX_NUM_ACTORS];
> > > + unsigned long total_req_power, max_allocatable_power;
> > > + u32 power_range;
> > > + int i, num_actors, ret = 0;
> > > +
> > > + i = 0;
> > > + total_req_power = 0;
> > > + max_allocatable_power = 0;
> > > +
> > > + mutex_lock(&tz->lock);
> > > +
> > > + list_for_each_entry(instance, &tz->thermal_instances, tz_node) {
> > > + struct thermal_cooling_device *cdev = instance->cdev;
> > > +
> > > + if (instance->trip != TRIP_MAX_DESIRED_TEMPERATURE)
> >
> > How does one instance->trip gets the value of
> > TRIP_MAX_DESIRED_TEMPERATURE?
>
> In power_allocator_bind() you fail to bind if the trip point doesn't
> exist. This may be removed if we switch passive/active trip points.
yes, but you need a check while binding to a correct trip setup, I
agree.
>
> > > + continue;
> > > +
> > > + if (i >= MAX_NUM_ACTORS) {
> > > + pr_warn("Too many actors (%d) for this governor\n", i);
> >
> > dev_* family is preferrable over pr_* family. Same applies to other
> > patches.
>
> With the &tz->device as the device? Sure.
yes, use the thermal zone device when possible. If thermal zone device
is unreliable, then use pr_* family.
>
> > > + ret = -EINVAL;
> > > + goto unlock;
> > > + }
> > > +
> > > + ret = cdev->ops->get_cur(cdev, &req_power[i],
> > > + THERMAL_UNIT_POWER);
> > > + if (ret) {
> > > + pr_warn_once("Couldn't get current power for %s: %d\n",
> > > + cdev->type, ret);
> > > + goto unlock;
> > > + }
> > > +
> > > + total_req_power += req_power[i];
> > > +
> > > + cdev->ops->get_max(cdev, &max_power[i], THERMAL_UNIT_POWER);
> > > + max_allocatable_power += max_power[i];
> > > +
> > > + i++;
> > > + }
> > > + num_actors = i;
> > > +
> > > + power_range = pi_controller(tz, current_temp, control_temp,
> > > + max_allocatable_power);
> > > +
> > > + divvy_up_power(tz, req_power, max_power, num_actors, total_req_power,
> > > + power_range, granted_power);
> > > +
> > > + i = 0;
> > > + list_for_each_entry(instance, &tz->thermal_instances, tz_node) {
> > > + struct thermal_cooling_device *cdev = instance->cdev;
> > > +
> > > + if (instance->trip != TRIP_MAX_DESIRED_TEMPERATURE)
> > > + continue;
> > > +
> > > + cdev->ops->set_cur(cdev, granted_power[i], THERMAL_UNIT_POWER);
> > > + i++;
> > > + }
> > > +
> > > +unlock:
> > > + mutex_unlock(&tz->lock);
> > > +
> > > + return ret;
> > > +}
> > > +
> > > +static int check_trips(struct thermal_zone_device *tz)
> > > +{
> > > + int ret;
> > > + enum thermal_trip_type type;
> > > +
> > > + if (tz->trips < 2)
> > > + return -EINVAL;
> > > +
> > > + ret = tz->ops->get_trip_type(tz, TRIP_SWITCH_ON, &type);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + if ((type != THERMAL_TRIP_PASSIVE) && (type != THERMAL_TRIP_ACTIVE))
> > > + return -EINVAL;
> > > +
> > > + ret = tz->ops->get_trip_type(tz, TRIP_MAX_DESIRED_TEMPERATURE, &type);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + if ((type != THERMAL_TRIP_PASSIVE) && (type != THERMAL_TRIP_ACTIVE))
> > > + return -EINVAL;
> > > +
> > > + return ret;
> > > +}
> > > +
> > > +static void reset_pi_controller(struct power_allocator_params *params)
> > > +{
> > > + params->err_integral = 0;
> > > +}
> > > +
> > > +/**
> > > + * power_allocator_bind - bind the power_allocator governor to a thermal zone
> > > + * @tz: thermal zone to bind it to
> > > + *
> > > + * Check that the thermal zone is valid for this governor: has two
> > > + * thermal trips. If so, initialize the PI controller parameters and
> > > + * bind it to the thermal zone.
> > > + *
> > > + * Returns 0 on success, -EINVAL if the trips were invalid or -ENOMEM
> > > + * if we ran out of memory.
> > > + */
> > > +static int power_allocator_bind(struct thermal_zone_device *tz)
> > > +{
> > > + int ret;
> > > + struct power_allocator_params *params;
> > > + unsigned long switch_on_temp, control_temp;
> > > + u32 temperature_threshold;
> > > +
> > > + ret = check_trips(tz);
> > > + if (ret) {
> > > + pr_err("thermal zone %s has the wrong number of trips for this governor\n",
> > > + tz->type);
> > > + return ret;
> > > + }
> > > +
> > > + if (!tz->tzp || !tz->tzp->max_dissipatable_power) {
> > > + pr_err("Trying to bind the power_allocator governor to a thermal zone without the max_dissipatable_power parameter\n");
> >
> > Can you please break the above line?
>
> I prefer to keep it as it is. From Documentation/CodingStyle:
>
> never break user-visible strings such as printk messages, because
> that breaks the ability to grep for them.
>
> The string could be improved though.
Yeah, but the above message is not too descriptive for the system
administrator. Think of it, it does mention an error situation, but does
not mention what is the thermal zone.
Maybe a smaller message like this would be more helpfull:
+ if (!tz->tzp || !tz->tzp->max_dissipatable_power) {
+ dev_err(tz->device,
+ "power_allocator: missing max_dissipatable_power\n");
The dev_err includes the name of the thermal zone device.
>
> > > + return -EINVAL;
> > > + }
> > > +
> > > + params = kzalloc(sizeof(*params), GFP_KERNEL);
> >
> > can we use devm_kzalloc?
>
> Sure.
>
good.
> > > + if (!params)
> > > + return -ENOMEM;
> > > +
> > > + ret = tz->ops->get_trip_temp(tz, TRIP_SWITCH_ON, &switch_on_temp);
> > > + if (ret)
> > > + goto free;
> > > +
> > > + ret = tz->ops->get_trip_temp(tz, TRIP_MAX_DESIRED_TEMPERATURE,
> > > + &control_temp);
> >
> > thermal zone drivers are not aware of these trip types, are they?
>
> I'll do what you suggested before and turn them into passive/active.
good.
>
> > > + if (ret)
> > > + goto free;
> > > +
> > > + temperature_threshold = (control_temp - switch_on_temp) / 1000;
> > > +
> > > + params->k_po = int_to_frac(tz->tzp->max_dissipatable_power) /
> > > + temperature_threshold;
> > > + params->k_pu = int_to_frac(2 * tz->tzp->max_dissipatable_power) /
> > > + temperature_threshold;
> > > + params->k_i = int_to_frac(1);
> >
> > I suppose we need a way to input k_po, k_pu and k_i parameters from
> > platform code to this governor right? Or are they always defined as the
> > above code? always based on max_dissipated_power?
>
> As I said above, I thought that we could just provide defaults and not
> make life harder for platform code, but I guess we need to provide a
> way for platform to override them.
>
I see. If we can achieve a single set of constants, then we can add
debugfs entries for k_{po,pu,i} constant values. That would allow for
debugging and fine tunning.
^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2014-05-22 14:27 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-05-06 12:06 [RFC PATCH 0/5] The power allocator thermal governor Javi Merino
2014-05-06 12:06 ` [RFC PATCH 1/5] thermal: let governors have private data for each thermal zone Javi Merino
2014-05-13 14:31 ` edubezval
2014-05-15 14:30 ` Javi Merino
2014-05-15 15:10 ` Eduardo Valentin
2014-05-06 12:06 ` [RFC PATCH 2/5] thermal: let cooling devices operate on other units other than state Javi Merino
2014-05-13 15:22 ` edubezval
2014-05-06 12:06 ` [RFC PATCH 3/5] thermal: add a basic cpu power actor Javi Merino
2014-05-13 22:57 ` Eduardo Valentin
2014-05-15 17:02 ` Javi Merino
2014-05-06 12:06 ` [RFC PATCH 4/5] thermal: introduce the Power Allocator governor Javi Merino
2014-05-13 23:20 ` Eduardo Valentin
2014-05-15 18:24 ` Javi Merino
2014-05-22 14:26 ` Eduardo Valentin
2014-05-06 12:06 ` [RFC PATCH 5/5] thermal: add trace events to the power allocator governor Javi Merino
2014-05-06 12:28 ` Steven Rostedt
2014-05-14 10:05 ` [RFC PATCH 0/5] The power allocator thermal governor James King
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).