linux-pm.vger.kernel.org archive mirror
* [RFC PATCH v2 0/7] The power allocator thermal governor
@ 2014-05-20 14:10 Javi Merino
  2014-05-20 14:10 ` [RFC PATCH v2 1/7] tracing: Add __bitmask() macro to trace events to cpumasks and other bitmasks Javi Merino
                   ` (6 more replies)
  0 siblings, 7 replies; 11+ messages in thread
From: Javi Merino @ 2014-05-20 14:10 UTC
  To: linux-pm, linux-kernel; +Cc: Punit.Agrawal, Javi Merino

Hi linux-pm,

This is v2 of the RFC we sent in [1].  The power allocator governor
allocates device power to control temperature.  This requires
transforming performance requests into requested power, which we do
with the aid of power models.  Patch 5 (thermal: add a basic cpu power
actor) implements a simple power model for cpus.  The division of
power between the actors ensures that power is allocated where it is
needed the most, based on the current workload.

[1] http://thread.gmane.org/gmane.linux.power-management.general/45000 
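
As a rough illustration of the division of power described above, the
following user-space sketch splits a power budget between actors in
proportion to their requested power.  The proportional rule, the name
divide_power() and the use of plain arrays are assumptions made for
illustration only; the governor in this series may divide power
differently.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of this series: split @budget (in mW)
 * between @n actors in proportion to their requested power.
 */
void divide_power(const uint32_t *req, uint32_t *granted, size_t n,
		  uint32_t budget)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += req[i];

	for (i = 0; i < n; i++) {
		if (total <= budget)
			/* All requests fit in the budget: grant them as-is */
			granted[i] = req[i];
		else
			/* Scale each request down by budget / total */
			granted[i] = (uint32_t)((uint64_t)req[i] * budget / total);
	}
}
```

With requests of 3000 mW and 1000 mW and a budget of 2000 mW, this
sketch grants 1500 mW and 500 mW, so the busier actor keeps the larger
share.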

Patches 1 and 2 are not strictly part of this series and can be merged
separately.  Patch 1 (tracing: Add __bitmask() macro to trace events
to cpumasks and other bitmasks) is already in for-next.  Patch 2
(thermal: document struct thermal_zone_device and thermal_governor)
has already been submitted to linux-pm[2] and is generic.

[2] http://article.gmane.org/gmane.linux.power-management.general/45434

Changes since v1:
  - Fixed finding cpufreq cooling devices in cpufreq_frequency_change()
  - Replaced the reworked cooling device interface with a separate
    power actor API
  - Addressed most of Eduardo's comments
  - Incorporated ftrace support for bitmask to trace cpumasks

Todo:
  - Add static power to the cpu power model
  - Change the PI controller into a PID controller
  - Turn power actors into a device
  - Let platforms override the power allocator governor parameters
  - Add more tracing and provide scripts to evaluate the proposal
  - Tune it to achieve the temperature stability we are aiming for

Cheers,
Javi & Punit

Javi Merino (6):
  thermal: document struct thermal_zone_device and thermal_governor
  thermal: let governors have private data for each thermal zone
  thermal: introduce the Power Actor API
  thermal: add a basic cpu power actor
  thermal: introduce the Power Allocator governor
  thermal: add trace events to the power allocator governor

Steven Rostedt (Red Hat) (1):
  tracing: Add __bitmask() macro to trace events to cpumasks and other
    bitmasks

 Documentation/thermal/power_actor.txt     |   75 +++++
 Documentation/thermal/power_allocator.txt |   42 +++
 drivers/thermal/Kconfig                   |   23 ++
 drivers/thermal/Makefile                  |    3 +
 drivers/thermal/power_actor/Kconfig       |    9 +
 drivers/thermal/power_actor/Makefile      |    7 +
 drivers/thermal/power_actor/cpu_actor.c   |  424 +++++++++++++++++++++++++++
 drivers/thermal/power_actor/power_actor.c |   66 +++++
 drivers/thermal/power_actor/power_actor.h |   86 ++++++
 drivers/thermal/power_allocator.c         |  452 +++++++++++++++++++++++++++++
 drivers/thermal/thermal_core.c            |   90 +++++-
 drivers/thermal/thermal_core.h            |    8 +
 include/linux/ftrace_event.h              |    3 +
 include/linux/thermal.h                   |   58 +++-
 include/linux/trace_seq.h                 |   10 +
 include/trace/events/thermal.h            |   38 +++
 include/trace/events/thermal_governor.h   |   35 +++
 include/trace/ftrace.h                    |   57 +++-
 kernel/trace/trace_output.c               |   41 +++
 19 files changed, 1515 insertions(+), 12 deletions(-)
 create mode 100644 Documentation/thermal/power_actor.txt
 create mode 100644 Documentation/thermal/power_allocator.txt
 create mode 100644 drivers/thermal/power_actor/Kconfig
 create mode 100644 drivers/thermal/power_actor/Makefile
 create mode 100644 drivers/thermal/power_actor/cpu_actor.c
 create mode 100644 drivers/thermal/power_actor/power_actor.c
 create mode 100644 drivers/thermal/power_actor/power_actor.h
 create mode 100644 drivers/thermal/power_allocator.c
 create mode 100644 include/trace/events/thermal.h
 create mode 100644 include/trace/events/thermal_governor.h

-- 
1.7.9.5




* [RFC PATCH v2 1/7] tracing: Add __bitmask() macro to trace events to cpumasks and other bitmasks
  2014-05-20 14:10 [RFC PATCH v2 0/7] The power allocator thermal governor Javi Merino
@ 2014-05-20 14:10 ` Javi Merino
  2014-05-21  2:07   ` Steven Rostedt
  2014-05-20 14:10 ` [RFC PATCH v2 2/7] thermal: document struct thermal_zone_device and thermal_governor Javi Merino
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 11+ messages in thread
From: Javi Merino @ 2014-05-20 14:10 UTC
  To: linux-pm, linux-kernel; +Cc: Punit.Agrawal, Steven Rostedt (Red Hat)

From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>

Being able to show a cpumask in an event can be useful, as some events
may affect only some CPUs. There is no standard way to record the
cpumask, and converting it to a string at trace time is rather
expensive because traces happen in hot paths. It is better to record
the raw mask and parse it at print time.

The following macros were added for use with the TRACE_EVENT() macro:

  __bitmask()
  __assign_bitmask()
  __get_bitmask()

To test this, I added a cpumask to the sched_migrate_task event, which
then looked like this:

TRACE_EVENT(sched_migrate_task,

	TP_PROTO(struct task_struct *p, int dest_cpu, const struct cpumask *cpus),

	TP_ARGS(p, dest_cpu, cpus),

	TP_STRUCT__entry(
		__array(	char,	comm,	TASK_COMM_LEN	)
		__field(	pid_t,	pid			)
		__field(	int,	prio			)
		__field(	int,	orig_cpu		)
		__field(	int,	dest_cpu		)
		__bitmask(	cpumask, num_possible_cpus()	)
	),

	TP_fast_assign(
		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
		__entry->pid		= p->pid;
		__entry->prio		= p->prio;
		__entry->orig_cpu	= task_cpu(p);
		__entry->dest_cpu	= dest_cpu;
		__assign_bitmask(cpumask, cpumask_bits(cpus), num_possible_cpus());
	),

	TP_printk("comm=%s pid=%d prio=%d orig_cpu=%d dest_cpu=%d cpumask=%s",
		  __entry->comm, __entry->pid, __entry->prio,
		  __entry->orig_cpu, __entry->dest_cpu,
		  __get_bitmask(cpumask))
);

With the output of:

        ksmtuned-3613  [003] d..2   485.220508: sched_migrate_task: comm=ksmtuned pid=3615 prio=120 orig_cpu=3 dest_cpu=2 cpumask=00000000,0000000f
     migration/1-13    [001] d..5   485.221202: sched_migrate_task: comm=ksmtuned pid=3614 prio=120 orig_cpu=1 dest_cpu=0 cpumask=00000000,0000000f
             awk-3615  [002] d.H5   485.221747: sched_migrate_task: comm=rcu_preempt pid=7 prio=120 orig_cpu=0 dest_cpu=1 cpumask=00000000,000000ff
     migration/2-18    [002] d..5   485.222062: sched_migrate_task: comm=ksmtuned pid=3615 prio=120 orig_cpu=2 dest_cpu=3 cpumask=00000000,0000000f
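
The 64-bit wide cpumask in the output above follows from the patch
storing the mask padded to a whole number of longs and printing
bitmask_size * 8 bits.  Below is a minimal user-space sketch of that
rounding and of the comma-separated hex layout, assuming 64-bit longs.
bitmask_bytes() and format_mask() are illustrative stand-ins, not
kernel functions, and format_mask() only handles whole 32-bit chunks.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define BITS_PER_LONG 64	/* assumption: 64-bit kernel */

/* Same rounding as __bitmask_size_in_bytes() in the patch */
size_t bitmask_bytes(unsigned int nr_bits)
{
	size_t raw = (nr_bits + 7) / 8;
	size_t longs = (raw + (BITS_PER_LONG / 8) - 1) / (BITS_PER_LONG / 8);

	return longs * (BITS_PER_LONG / 8);
}

/*
 * Print @nwords 32-bit chunks into @buf, most significant first,
 * separated by commas.  Returns the number of characters written.
 */
int format_mask(char *buf, size_t len, const uint32_t *words, size_t nwords)
{
	int pos = 0;
	size_t i;

	if (len)
		buf[0] = '\0';

	for (i = nwords; i-- > 0; ) {
		int n = snprintf(buf + pos, len - pos, "%s%08x",
				 pos ? "," : "", (unsigned int)words[i]);

		if (n < 0 || (size_t)n >= len - pos)
			break;	/* out of space: stop here */
		pos += n;
	}

	return pos;
}
```

For four possible CPUs, bitmask_bytes(4) is 8, so 64 bits are printed
as two chunks, which matches the two-chunk cpumask values shown in the
trace.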

Link: http://lkml.kernel.org/r/1399377998-14870-6-git-send-email-javi.merino@arm.com
Link: http://lkml.kernel.org/r/20140506132238.22e136d1@gandalf.local.home

Suggested-by: Javi Merino <javi.merino@arm.com>
Tested-by: Javi Merino <javi.merino@arm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 include/linux/ftrace_event.h |    3 +++
 include/linux/trace_seq.h    |   10 ++++++++
 include/trace/ftrace.h       |   57 +++++++++++++++++++++++++++++++++++++++++-
 kernel/trace/trace_output.c  |   41 ++++++++++++++++++++++++++++++
 4 files changed, 110 insertions(+), 1 deletion(-)

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index d16da3e53bc7..cff3106ffe2c 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -38,6 +38,9 @@ const char *ftrace_print_symbols_seq_u64(struct trace_seq *p,
 								 *symbol_array);
 #endif
 
+const char *ftrace_print_bitmask_seq(struct trace_seq *p, void *bitmask_ptr,
+				     unsigned int bitmask_size);
+
 const char *ftrace_print_hex_seq(struct trace_seq *p,
 				 const unsigned char *buf, int len);
 
diff --git a/include/linux/trace_seq.h b/include/linux/trace_seq.h
index a32d86ec8bf2..136116924d8d 100644
--- a/include/linux/trace_seq.h
+++ b/include/linux/trace_seq.h
@@ -46,6 +46,9 @@ extern int trace_seq_putmem_hex(struct trace_seq *s, const void *mem,
 extern void *trace_seq_reserve(struct trace_seq *s, size_t len);
 extern int trace_seq_path(struct trace_seq *s, const struct path *path);
 
+extern int trace_seq_bitmask(struct trace_seq *s, const unsigned long *maskp,
+			     int nmaskbits);
+
 #else /* CONFIG_TRACING */
 static inline int trace_seq_printf(struct trace_seq *s, const char *fmt, ...)
 {
@@ -57,6 +60,13 @@ trace_seq_bprintf(struct trace_seq *s, const char *fmt, const u32 *binary)
 	return 0;
 }
 
+static inline int
+trace_seq_bitmask(struct trace_seq *s, const unsigned long *maskp,
+		  int nmaskbits)
+{
+	return 0;
+}
+
 static inline int trace_print_seq(struct seq_file *m, struct trace_seq *s)
 {
 	return 0;
diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
index 0a1a4f7caf09..9b7a989dcbcc 100644
--- a/include/trace/ftrace.h
+++ b/include/trace/ftrace.h
@@ -53,6 +53,9 @@
 #undef __string
 #define __string(item, src) __dynamic_array(char, item, -1)
 
+#undef __bitmask
+#define __bitmask(item, nr_bits) __dynamic_array(char, item, -1)
+
 #undef TP_STRUCT__entry
 #define TP_STRUCT__entry(args...) args
 
@@ -128,6 +131,9 @@
 #undef __string
 #define __string(item, src) __dynamic_array(char, item, -1)
 
+#undef __bitmask
+#define __bitmask(item, nr_bits) __dynamic_array(unsigned long, item, -1)
+
 #undef DECLARE_EVENT_CLASS
 #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
 	struct ftrace_data_offsets_##call {				\
@@ -200,6 +206,15 @@
 #undef __get_str
 #define __get_str(field) (char *)__get_dynamic_array(field)
 
+#undef __get_bitmask
+#define __get_bitmask(field)						\
+	({								\
+		void *__bitmask = __get_dynamic_array(field);		\
+		unsigned int __bitmask_size;				\
+		__bitmask_size = (__entry->__data_loc_##field >> 16) & 0xffff; \
+		ftrace_print_bitmask_seq(p, __bitmask, __bitmask_size);	\
+	})
+
 #undef __print_flags
 #define __print_flags(flag, delim, flag_array...)			\
 	({								\
@@ -322,6 +337,9 @@ static struct trace_event_functions ftrace_event_type_funcs_##call = {	\
 #undef __string
 #define __string(item, src) __dynamic_array(char, item, -1)
 
+#undef __bitmask
+#define __bitmask(item, nr_bits) __dynamic_array(unsigned long, item, -1)
+
 #undef DECLARE_EVENT_CLASS
 #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, func, print)	\
 static int notrace __init						\
@@ -372,6 +390,29 @@ ftrace_define_fields_##call(struct ftrace_event_call *event_call)	\
 #define __string(item, src) __dynamic_array(char, item,			\
 		    strlen((src) ? (const char *)(src) : "(null)") + 1)
 
+/*
+ * __bitmask_size_in_bytes_raw is the number of bytes needed to hold
+ * num_possible_cpus().
+ */
+#define __bitmask_size_in_bytes_raw(nr_bits)	\
+	(((nr_bits) + 7) / 8)
+
+#define __bitmask_size_in_longs(nr_bits)			\
+	((__bitmask_size_in_bytes_raw(nr_bits) +		\
+	  ((BITS_PER_LONG / 8) - 1)) / (BITS_PER_LONG / 8))
+
+/*
+ * __bitmask_size_in_bytes is the number of bytes needed to hold
+ * num_possible_cpus() padded out to the nearest long. This is what
+ * is saved in the buffer, just to be consistent.
+ */
+#define __bitmask_size_in_bytes(nr_bits)				\
+	(__bitmask_size_in_longs(nr_bits) * (BITS_PER_LONG / 8))
+
+#undef __bitmask
+#define __bitmask(item, nr_bits) __dynamic_array(unsigned long, item,	\
+					 __bitmask_size_in_longs(nr_bits))
+
 #undef DECLARE_EVENT_CLASS
 #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
 static inline notrace int ftrace_get_offsets_##call(			\
@@ -513,12 +554,22 @@ static inline notrace int ftrace_get_offsets_##call(			\
 	__entry->__data_loc_##item = __data_offsets.item;
 
 #undef __string
-#define __string(item, src) __dynamic_array(char, item, -1)       	\
+#define __string(item, src) __dynamic_array(char, item, -1)
 
 #undef __assign_str
 #define __assign_str(dst, src)						\
 	strcpy(__get_str(dst), (src) ? (const char *)(src) : "(null)");
 
+#undef __bitmask
+#define __bitmask(item, nr_bits) __dynamic_array(unsigned long, item, -1)
+
+#undef __get_bitmask
+#define __get_bitmask(field) (char *)__get_dynamic_array(field)
+
+#undef __assign_bitmask
+#define __assign_bitmask(dst, src, nr_bits)					\
+	memcpy(__get_bitmask(dst), (src), __bitmask_size_in_bytes(nr_bits))
+
 #undef TP_fast_assign
 #define TP_fast_assign(args...) args
 
@@ -586,6 +637,7 @@ static inline void ftrace_test_probe_##call(void)			\
 #undef __print_hex
 #undef __get_dynamic_array
 #undef __get_str
+#undef __get_bitmask
 
 #undef TP_printk
 #define TP_printk(fmt, args...) "\"" fmt "\", "  __stringify(args)
@@ -651,6 +703,9 @@ __attribute__((section("_ftrace_events"))) *__event_##call = &event_##call
 #undef __get_str
 #define __get_str(field) (char *)__get_dynamic_array(field)
 
+#undef __get_bitmask
+#define __get_bitmask(field) (char *)__get_dynamic_array(field)
+
 #undef __perf_addr
 #define __perf_addr(a)	(__addr = (a))
 
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index a436de18aa99..f3dad80c20b2 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -126,6 +126,34 @@ trace_seq_printf(struct trace_seq *s, const char *fmt, ...)
 EXPORT_SYMBOL_GPL(trace_seq_printf);
 
 /**
+ * trace_seq_bitmask - put a list of longs as a bitmask print output
+ * @s:		trace sequence descriptor
+ * @maskp:	points to an array of unsigned longs that represent a bitmask
+ * @nmaskbits:	The number of bits that are valid in @maskp
+ *
+ * It returns 0 if the trace oversizes the buffer's free
+ * space, 1 otherwise.
+ *
+ * Writes an ASCII representation of a bitmask string into @s.
+ */
+int
+trace_seq_bitmask(struct trace_seq *s, const unsigned long *maskp,
+		  int nmaskbits)
+{
+	int len = (PAGE_SIZE - 1) - s->len;
+	int ret;
+
+	if (s->full || !len)
+		return 0;
+
+	ret = bitmap_scnprintf(s->buffer + s->len, len, maskp, nmaskbits);
+	s->len += ret;
+
+	return 1;
+}
+EXPORT_SYMBOL_GPL(trace_seq_bitmask);
+
+/**
  * trace_seq_vprintf - sequence printing of trace information
  * @s: trace sequence descriptor
  * @fmt: printf format string
@@ -399,6 +427,19 @@ EXPORT_SYMBOL(ftrace_print_symbols_seq_u64);
 #endif
 
 const char *
+ftrace_print_bitmask_seq(struct trace_seq *p, void *bitmask_ptr,
+			 unsigned int bitmask_size)
+{
+	const char *ret = p->buffer + p->len;
+
+	trace_seq_bitmask(p, bitmask_ptr, bitmask_size * 8);
+	trace_seq_putc(p, 0);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(ftrace_print_bitmask_seq);
+
+const char *
 ftrace_print_hex_seq(struct trace_seq *p, const unsigned char *buf, int buf_len)
 {
 	int i;
-- 
1.7.9.5


* [RFC PATCH v2 2/7] thermal: document struct thermal_zone_device and thermal_governor
  2014-05-20 14:10 [RFC PATCH v2 0/7] The power allocator thermal governor Javi Merino
  2014-05-20 14:10 ` [RFC PATCH v2 1/7] tracing: Add __bitmask() macro to trace events to cpumasks and other bitmasks Javi Merino
@ 2014-05-20 14:10 ` Javi Merino
  2014-05-20 14:10 ` [RFC PATCH v2 3/7] thermal: let governors have private data for each thermal zone Javi Merino
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Javi Merino @ 2014-05-20 14:10 UTC
  To: linux-pm, linux-kernel
  Cc: Punit.Agrawal, Javi Merino, Zhang Rui, Eduardo Valentin

Document struct thermal_zone_device and struct thermal_governor fields
and their use by the thermal framework code.

Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Eduardo Valentin <edubezval@gmail.com>
Signed-off-by: Javi Merino <javi.merino@arm.com>
---

Hi linux-pm,

This was sent as a separate patch to linux-pm and can be merged
independently, as it documents the current thermal framework.

 include/linux/thermal.h |   44 ++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 42 insertions(+), 2 deletions(-)

diff --git a/include/linux/thermal.h b/include/linux/thermal.h
index f7e11c7ea7d9..9b7cb804e03f 100644
--- a/include/linux/thermal.h
+++ b/include/linux/thermal.h
@@ -158,6 +158,40 @@ struct thermal_attr {
 	char name[THERMAL_NAME_LENGTH];
 };
 
+/**
+ * struct thermal_zone_device - structure for a thermal zone
+ * @id:		unique id number for each thermal zone
+ * @type:	the thermal zone device type
+ * @device:	struct device for this thermal zone
+ * @trip_temp_attrs:	attributes for trip points for sysfs: trip temperature
+ * @trip_type_attrs:	attributes for trip points for sysfs: trip type
+ * @trip_hyst_attrs:	attributes for trip points for sysfs: trip hysteresis
+ * @devdata:	private pointer for device private data
+ * @trips:	number of trip points the thermal zone supports
+ * @passive_delay:	number of milliseconds to wait between polls when
+ *			performing passive cooling.  Only used by the step-wise
+ *			governor
+ * @polling_delay:	number of milliseconds to wait between polls when
+ *			checking whether trip points have been crossed (0 for
+ *			interrupt driven systems)
+ * @temperature:	current temperature.  This is only for core code,
+ *			drivers should use thermal_zone_get_temp() to get the
+ *			current temperature
+ * @last_temperature:	previous temperature read
+ * @emul_temperature:	emulated temperature when using CONFIG_THERMAL_EMULATION
+ * @passive:	step-wise specific parameter.  1 if you've crossed a passive
+ *		trip point, 0 otherwise
+ * @forced_passive:	step-wise specific parameter.  If > 0, temperature at
+ *			which to switch on all cpufreq cooling devices.
+ * @ops:	operations this thermal_zone_device supports
+ * @tzp:	thermal zone parameters
+ * @governor:	pointer to the governor for this thermal zone
+ * @thermal_instances:	list of struct thermal_instance of this thermal zone
+ * @idr:	struct idr to generate unique id for this zone's cooling devices
+ * @lock:	lock to protect thermal_instances list
+ * @node:	node in thermal_tz_list (in thermal_core.c)
+ * @poll_queue:	delayed work for polling
+ */
 struct thermal_zone_device {
 	int id;
 	char type[THERMAL_NAME_LENGTH];
@@ -179,12 +213,18 @@ struct thermal_zone_device {
 	struct thermal_governor *governor;
 	struct list_head thermal_instances;
 	struct idr idr;
-	struct mutex lock; /* protect thermal_instances list */
+	struct mutex lock;
 	struct list_head node;
 	struct delayed_work poll_queue;
 };
 
-/* Structure that holds thermal governor information */
+/**
+ * struct thermal_governor - structure that holds thermal governor information
+ * @name:	name of the governor
+ * @throttle:	callback called for every trip point even if temperature is
+ *		below the trip point temperature
+ * @governor_list:	node in thermal_governor_list (in thermal_core.c)
+ */
 struct thermal_governor {
 	char name[THERMAL_NAME_LENGTH];
 	int (*throttle)(struct thermal_zone_device *tz, int trip);
-- 
1.7.9.5


* [RFC PATCH v2 3/7] thermal: let governors have private data for each thermal zone
  2014-05-20 14:10 [RFC PATCH v2 0/7] The power allocator thermal governor Javi Merino
  2014-05-20 14:10 ` [RFC PATCH v2 1/7] tracing: Add __bitmask() macro to trace events to cpumasks and other bitmasks Javi Merino
  2014-05-20 14:10 ` [RFC PATCH v2 2/7] thermal: document struct thermal_zone_device and thermal_governor Javi Merino
@ 2014-05-20 14:10 ` Javi Merino
  2014-05-20 14:10 ` [RFC PATCH v2 4/7] thermal: introduce the Power Actor API Javi Merino
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Javi Merino @ 2014-05-20 14:10 UTC
  To: linux-pm, linux-kernel
  Cc: Punit.Agrawal, Javi Merino, Zhang Rui, Eduardo Valentin

A governor may need to store its current state between calls to
throttle().  That state depends on the thermal zone, so store it as
private data in struct thermal_zone_device.

Governors gain two optional ops: bind_to_tz() and unbind_from_tz().
When provided, they let a governor set up and tear down its state when
it is bound to or unbound from a thermal zone, and keep that state in
the governor_data field of struct thermal_zone_device.
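
A minimal user-space sketch of how a governor might use these ops to
manage its per-zone state.  The structures below are cut-down stand-ins
for the kernel ones, and the example_* names and the err_integral field
are hypothetical; a real governor would use kzalloc()/kfree() instead
of calloc()/free().

```c
#include <errno.h>
#include <stdlib.h>

/* Cut-down stand-ins for the kernel structures */
struct thermal_zone_device {
	void *governor_data;
};

struct thermal_governor {
	int (*bind_to_tz)(struct thermal_zone_device *tz);
	void (*unbind_from_tz)(struct thermal_zone_device *tz);
};

/* Hypothetical state a governor keeps between calls to throttle() */
struct example_gov_state {
	int err_integral;
};

int example_bind_to_tz(struct thermal_zone_device *tz)
{
	struct example_gov_state *state = calloc(1, sizeof(*state));

	if (!state)
		return -ENOMEM;	/* non-zero: the bind fails */

	tz->governor_data = state;
	return 0;
}

void example_unbind_from_tz(struct thermal_zone_device *tz)
{
	free(tz->governor_data);
	tz->governor_data = NULL;
}

struct thermal_governor example_governor = {
	.bind_to_tz	= example_bind_to_tz,
	.unbind_from_tz	= example_unbind_from_tz,
};
```

Because the state hangs off the thermal zone rather than the governor,
the same governor can be bound to several zones at once without
sharing state between them.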

Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Eduardo Valentin <edubezval@gmail.com>
Signed-off-by: Javi Merino <javi.merino@arm.com>
---
 drivers/thermal/thermal_core.c |   83 ++++++++++++++++++++++++++++++++++++----
 include/linux/thermal.h        |    9 +++++
 2 files changed, 84 insertions(+), 8 deletions(-)

diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index 71b0ec0c370d..1b13d8e0cfd1 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -72,6 +72,58 @@ static struct thermal_governor *__find_governor(const char *name)
 	return NULL;
 }
 
+/**
+ * bind_previous_governor - bind the previous governor of the thermal zone
+ * @tz:		a valid pointer to a struct thermal_zone_device
+ * @failed_gov_name:	the name of the governor that failed to register
+ *
+ * Register the previous governor of the thermal zone after a new
+ * governor has failed to be bound.
+ */
+static void bind_previous_governor(struct thermal_zone_device *tz,
+				const char *failed_gov_name)
+{
+	if (tz->governor && tz->governor->bind_to_tz) {
+		if (tz->governor->bind_to_tz(tz)) {
+			dev_warn(&tz->device,
+				"governor %s failed to bind and the previous one (%s) failed to register again, thermal zone %s has no governor\n",
+				failed_gov_name, tz->governor->name, tz->type);
+			tz->governor = NULL;
+		}
+	}
+}
+
+/**
+ * thermal_set_governor() - Switch to another governor
+ * @tz:		a valid pointer to a struct thermal_zone_device
+ * @new_gov:	pointer to the new governor
+ *
+ * Change the governor of thermal zone @tz.
+ *
+ * Returns 0 on success, an error if the new governor's bind_to_tz() failed.
+ */
+static int thermal_set_governor(struct thermal_zone_device *tz,
+				struct thermal_governor *new_gov)
+{
+	int ret = 0;
+
+	if (tz->governor && tz->governor->unbind_from_tz)
+		tz->governor->unbind_from_tz(tz);
+
+	if (new_gov && new_gov->bind_to_tz) {
+		ret = new_gov->bind_to_tz(tz);
+		if (ret) {
+			bind_previous_governor(tz, new_gov->name);
+
+			return ret;
+		}
+	}
+
+	tz->governor = new_gov;
+
+	return ret;
+}
+
 int thermal_register_governor(struct thermal_governor *governor)
 {
 	int err;
@@ -104,8 +156,15 @@ int thermal_register_governor(struct thermal_governor *governor)
 
 		name = pos->tzp->governor_name;
 
-		if (!strnicmp(name, governor->name, THERMAL_NAME_LENGTH))
-			pos->governor = governor;
+		if (!strnicmp(name, governor->name, THERMAL_NAME_LENGTH)) {
+			int ret;
+
+			ret = thermal_set_governor(pos, governor);
+			if (ret)
+				dev_warn(&pos->device,
+					"Failed to set governor %s for thermal zone %s: %d\n",
+					governor->name, pos->type, ret);
+		}
 	}
 
 	mutex_unlock(&thermal_list_lock);
@@ -131,7 +190,7 @@ void thermal_unregister_governor(struct thermal_governor *governor)
 	list_for_each_entry(pos, &thermal_tz_list, node) {
 		if (!strnicmp(pos->governor->name, governor->name,
 						THERMAL_NAME_LENGTH))
-			pos->governor = NULL;
+			thermal_set_governor(pos, NULL);
 	}
 
 	mutex_unlock(&thermal_list_lock);
@@ -756,8 +815,9 @@ policy_store(struct device *dev, struct device_attribute *attr,
 	if (!gov)
 		goto exit;
 
-	tz->governor = gov;
-	ret = count;
+	ret = thermal_set_governor(tz, gov);
+	if (!ret)
+		ret = count;
 
 exit:
 	mutex_unlock(&thermal_governor_lock);
@@ -1452,6 +1512,7 @@ struct thermal_zone_device *thermal_zone_device_register(const char *type,
 	int result;
 	int count;
 	int passive = 0;
+	struct thermal_governor *governor;
 
 	if (type && strlen(type) >= THERMAL_NAME_LENGTH)
 		return ERR_PTR(-EINVAL);
@@ -1542,9 +1603,15 @@ struct thermal_zone_device *thermal_zone_device_register(const char *type,
 	mutex_lock(&thermal_governor_lock);
 
 	if (tz->tzp)
-		tz->governor = __find_governor(tz->tzp->governor_name);
+		governor = __find_governor(tz->tzp->governor_name);
 	else
-		tz->governor = def_governor;
+		governor = def_governor;
+
+	result = thermal_set_governor(tz, governor);
+	if (result) {
+		mutex_unlock(&thermal_governor_lock);
+		goto unregister;
+	}
 
 	mutex_unlock(&thermal_governor_lock);
 
@@ -1634,7 +1701,7 @@ void thermal_zone_device_unregister(struct thermal_zone_device *tz)
 		device_remove_file(&tz->device, &dev_attr_mode);
 	device_remove_file(&tz->device, &dev_attr_policy);
 	remove_trip_attrs(tz);
-	tz->governor = NULL;
+	thermal_set_governor(tz, NULL);
 
 	thermal_remove_hwmon_sysfs(tz);
 	release_idr(&thermal_tz_idr, &thermal_idr_lock, tz->id);
diff --git a/include/linux/thermal.h b/include/linux/thermal.h
index 9b7cb804e03f..06971c4779a8 100644
--- a/include/linux/thermal.h
+++ b/include/linux/thermal.h
@@ -186,6 +186,7 @@ struct thermal_attr {
  * @ops:	operations this thermal_zone_device supports
  * @tzp:	thermal zone parameters
  * @governor:	pointer to the governor for this thermal zone
+ * @governor_data:	private pointer for governor data
  * @thermal_instances:	list of struct thermal_instance of this thermal zone
  * @idr:	struct idr to generate unique id for this zone's cooling devices
  * @lock:	lock to protect thermal_instances list
@@ -211,6 +212,7 @@ struct thermal_zone_device {
 	struct thermal_zone_device_ops *ops;
 	const struct thermal_zone_params *tzp;
 	struct thermal_governor *governor;
+	void *governor_data;
 	struct list_head thermal_instances;
 	struct idr idr;
 	struct mutex lock;
@@ -221,12 +223,19 @@ struct thermal_zone_device {
 /**
  * struct thermal_governor - structure that holds thermal governor information
  * @name:	name of the governor
+ * @bind_to_tz: callback called when binding to a thermal zone.  If it
+ *		returns 0, the governor is bound to the thermal zone,
+ *		otherwise it fails.
+ * @unbind_from_tz:	callback called when a governor is unbound from a
+ *			thermal zone.
  * @throttle:	callback called for every trip point even if temperature is
  *		below the trip point temperature
  * @governor_list:	node in thermal_governor_list (in thermal_core.c)
  */
 struct thermal_governor {
 	char name[THERMAL_NAME_LENGTH];
+	int (*bind_to_tz)(struct thermal_zone_device *tz);
+	void (*unbind_from_tz)(struct thermal_zone_device *tz);
 	int (*throttle)(struct thermal_zone_device *tz, int trip);
 	struct list_head	governor_list;
 };
-- 
1.7.9.5


* [RFC PATCH v2 4/7] thermal: introduce the Power Actor API
  2014-05-20 14:10 [RFC PATCH v2 0/7] The power allocator thermal governor Javi Merino
                   ` (2 preceding siblings ...)
  2014-05-20 14:10 ` [RFC PATCH v2 3/7] thermal: let governors have private data for each thermal zone Javi Merino
@ 2014-05-20 14:10 ` Javi Merino
  2014-05-20 14:10 ` [RFC PATCH v2 5/7] thermal: add a basic cpu power actor Javi Merino
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Javi Merino @ 2014-05-20 14:10 UTC
  To: linux-pm, linux-kernel
  Cc: Punit.Agrawal, Javi Merino, Zhang Rui, Eduardo Valentin

This patch introduces the Power Actor API in the thermal framework.
With it, devices that can report their power consumption and control
it can be registered.  This base interface is meant to be used to
derive specific power actors, such as a cpu power actor.
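
As a sketch of what deriving a specific actor could look like, the
user-space mock below implements the two callbacks for a made-up
device.  The callback signatures and the ops, max_power and data fields
follow this patch; struct example_dev, the example_* functions and the
choice to reject requests above max_power are assumptions for
illustration.

```c
#include <errno.h>
#include <stdint.h>

typedef uint32_t u32;	/* stand-in for the kernel type */

struct power_actor;

struct power_actor_ops {
	u32 (*get_req_power)(struct power_actor *);
	int (*set_power)(struct power_actor *, u32);
};

/* Cut-down stand-in: only the fields this sketch needs */
struct power_actor {
	struct power_actor_ops *ops;
	u32 max_power;
	void *data;
};

/* Hypothetical device: tracks requested and granted power in mW */
struct example_dev {
	u32 requested_mw;
	u32 granted_mw;
};

u32 example_get_req_power(struct power_actor *actor)
{
	struct example_dev *dev = actor->data;

	return dev->requested_mw;
}

int example_set_power(struct power_actor *actor, u32 power)
{
	struct example_dev *dev = actor->data;

	/* Assumption: reject anything above the registered maximum */
	if (power > actor->max_power)
		return -EINVAL;

	dev->granted_mw = power;
	return 0;
}

struct power_actor_ops example_ops = {
	.get_req_power	= example_get_req_power,
	.set_power	= example_set_power,
};
```

In the kernel, the ops and the private data would be handed to
power_actor_register() rather than filled into struct power_actor by
hand.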

Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Eduardo Valentin <edubezval@gmail.com>
Signed-off-by: Javi Merino <javi.merino@arm.com>
---
 Documentation/thermal/power_actor.txt     |   29 +++++++++++++
 drivers/thermal/Kconfig                   |    3 ++
 drivers/thermal/Makefile                  |    2 +
 drivers/thermal/power_actor/Makefile      |    5 +++
 drivers/thermal/power_actor/power_actor.c |   66 +++++++++++++++++++++++++++++
 drivers/thermal/power_actor/power_actor.h |   63 +++++++++++++++++++++++++++
 6 files changed, 168 insertions(+)
 create mode 100644 Documentation/thermal/power_actor.txt
 create mode 100644 drivers/thermal/power_actor/Makefile
 create mode 100644 drivers/thermal/power_actor/power_actor.c
 create mode 100644 drivers/thermal/power_actor/power_actor.h

diff --git a/Documentation/thermal/power_actor.txt b/Documentation/thermal/power_actor.txt
new file mode 100644
index 000000000000..a0f06e091907
--- /dev/null
+++ b/Documentation/thermal/power_actor.txt
@@ -0,0 +1,29 @@
+
+Power Actor API
+===============
+
+The base power actor API is meant to be used to derive specific power
+actors, such as a cpu power actor.  When registering, they should call
+`power_actor_register()` with a unique `enum power_actor_types`.  When
+unregistering, the power actor should call `power_actor_unregister()`
+with the `struct power_actor *` received in the call to
+`power_actor_register()`.
+
+Callbacks
+---------
+
+1. u32 get_req_power(struct power_actor *actor)
+@actor: a valid `struct power_actor *` registered with
+        `power_actor_register()`
+
+`get_req_power()` returns the current requested power in milliwatts.
+
+2. int set_power(struct power_actor *actor, u32 power)
+@actor: a valid `struct power_actor *` registered with
+        `power_actor_register()`
+@power: power in milliwatts
+
+`set_power()` should configure the device to consume @power
+milliwatts.
+
+Returns 0 on success, -E* on error.
diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
index 2d51912a6e40..47e2f15537ca 100644
--- a/drivers/thermal/Kconfig
+++ b/drivers/thermal/Kconfig
@@ -89,6 +89,9 @@ config THERMAL_GOV_USER_SPACE
 	help
 	  Enable this to let the user space manage the platform thermals.
 
+config THERMAL_POWER_ACTOR
+	bool
+
 config CPU_THERMAL
 	bool "generic cpu cooling support"
 	depends on CPU_FREQ
diff --git a/drivers/thermal/Makefile b/drivers/thermal/Makefile
index 54e4ec9eb5df..878a02cab7d1 100644
--- a/drivers/thermal/Makefile
+++ b/drivers/thermal/Makefile
@@ -14,6 +14,8 @@ thermal_sys-$(CONFIG_THERMAL_GOV_FAIR_SHARE)	+= fair_share.o
 thermal_sys-$(CONFIG_THERMAL_GOV_STEP_WISE)	+= step_wise.o
 thermal_sys-$(CONFIG_THERMAL_GOV_USER_SPACE)	+= user_space.o
 
+obj-$(CONFIG_THERMAL_POWER_ACTOR) += power_actor/
+
 # cpufreq cooling
 thermal_sys-$(CONFIG_CPU_THERMAL)	+= cpu_cooling.o
 
diff --git a/drivers/thermal/power_actor/Makefile b/drivers/thermal/power_actor/Makefile
new file mode 100644
index 000000000000..46478f4928be
--- /dev/null
+++ b/drivers/thermal/power_actor/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for the power actors
+#
+
+obj-y += power_actor.o
diff --git a/drivers/thermal/power_actor/power_actor.c b/drivers/thermal/power_actor/power_actor.c
new file mode 100644
index 000000000000..3edcb1ab4dff
--- /dev/null
+++ b/drivers/thermal/power_actor/power_actor.c
@@ -0,0 +1,66 @@
+/*
+ * Basic interface for power actors
+ *
+ * Copyright (C) 2014 ARM Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed "as is" WITHOUT ANY WARRANTY of any
+ * kind, whether express or implied; without even the implied warranty
+ * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#define pr_fmt(fmt) "Power actor: " fmt
+
+#include <linux/err.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+
+#include "power_actor.h"
+
+LIST_HEAD(actor_list);
+
+/**
+ * power_actor_register - Register an actor in the power actor API
+ * @type:	actor type
+ * @ops:	struct power_actor_ops for this actor
+ * @max_power:	maximum power that this actor can consume
+ * @privdata:	pointer to private data related to the actor
+ *
+ * Returns the struct power_actor * on success, ERR_PTR() on failure
+ */
+struct power_actor *power_actor_register(enum power_actor_types type,
+					struct power_actor_ops *ops,
+					u32 max_power, void *privdata)
+{
+	struct power_actor *actor;
+
+	if (!ops->get_req_power || !ops->set_power)
+		return ERR_PTR(-EINVAL);
+
+	actor = kzalloc(sizeof(*actor), GFP_KERNEL);
+	if (!actor)
+		return ERR_PTR(-ENOMEM);
+
+	actor->type = type;
+	actor->ops = ops;
+	actor->max_power = max_power;
+	actor->data = privdata;
+
+	list_add(&actor->actor_node, &actor_list);
+
+	return actor;
+}
+
+/**
+ * power_actor_unregister - Unregister an actor
+ * @actor:	the actor to unregister
+ */
+void power_actor_unregister(struct power_actor *actor)
+{
+	list_del(&actor->actor_node);
+	kfree(actor);
+}
diff --git a/drivers/thermal/power_actor/power_actor.h b/drivers/thermal/power_actor/power_actor.h
new file mode 100644
index 000000000000..82be19ce741d
--- /dev/null
+++ b/drivers/thermal/power_actor/power_actor.h
@@ -0,0 +1,63 @@
+/*
+ * Copyright (C) 2014 ARM Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef __POWER_ACTOR_H__
+#define __POWER_ACTOR_H__
+
+#include <linux/list.h>
+
+#define MAX_NUM_ACTORS 8
+
+enum power_actor_types {
+};
+
+struct power_actor;
+
+/**
+ * struct power_actor_ops - callbacks for power actors
+ * @get_req_power:	return the current requested power in milliwatts
+ * @set_power:		configure the device to consume a certain power in
+ *			milliwatts
+ */
+struct power_actor_ops {
+	u32 (*get_req_power)(struct power_actor *);
+	int (*set_power)(struct power_actor *, u32);
+};
+
+/**
+ * struct power_actor - structure for a power actor
+ * @type:	the type of power actor.
+ * @ops:	callbacks for the power actor
+ * @max_power:	the maximum power that this actor can consume, in milliwatts
+ * @data:	a private pointer for type-specific data
+ * @actor_node:	node in actor_list
+ */
+struct power_actor {
+	enum power_actor_types type;
+	struct power_actor_ops *ops;
+	u32 max_power;
+	void *data;
+	struct list_head actor_node;
+};
+
+struct power_actor *power_actor_register(enum power_actor_types type,
+					struct power_actor_ops *ops,
+					u32 max_power, void *privdata);
+void power_actor_unregister(struct power_actor *actor);
+
+extern struct list_head actor_list;
+
+#endif /* __POWER_ACTOR_H__ */
-- 
1.7.9.5




* [RFC PATCH v2 5/7] thermal: add a basic cpu power actor
  2014-05-20 14:10 [RFC PATCH v2 0/7] The power allocator thermal governor Javi Merino
                   ` (3 preceding siblings ...)
  2014-05-20 14:10 ` [RFC PATCH v2 4/7] thermal: introduce the Power Actor API Javi Merino
@ 2014-05-20 14:10 ` Javi Merino
  2014-05-20 14:10 ` [RFC PATCH v2 6/7] thermal: introduce the Power Allocator governor Javi Merino
  2014-05-20 14:10 ` [RFC PATCH v2 7/7] thermal: add trace events to the power allocator governor Javi Merino
  6 siblings, 0 replies; 11+ messages in thread
From: Javi Merino @ 2014-05-20 14:10 UTC (permalink / raw)
  To: linux-pm, linux-kernel
  Cc: Punit.Agrawal, Javi Merino, Zhang Rui, Eduardo Valentin,
	Punit Agrawal

Introduce a power actor for cpus.  It has a basic power model to get
the current power utilization and uses cpufreq cooling devices to set
the desired power.  It uses the current frequency (as reported by
cpufreq) as well as load and OPPs for the power calculations.  The
cpus must have registered their OPPs in the OPP library.
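
As an illustration of the calculation described above, the following
standalone sketch converts one OPP into milliwatts and then scales the
result by the measured load.  It is not the code in this patch:
opp_power_mw() and scaled_power_mw() are hypothetical names, and the
unit scaling (MHz, millivolts, division by 10^9) is assumed to follow
what build_cpu_power_table() does.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Power of one cpu running flat out at a given OPP, in milliwatts.
 * Assumed scaling: coefficient * f[MHz] * V[mV]^2 / 10^9.
 */
static uint32_t opp_power_mw(uint32_t capacitance, uint64_t freq_hz,
			     uint64_t volt_uv)
{
	uint64_t freq_mhz = freq_hz / 1000000;
	uint64_t volt_mv = volt_uv / 1000;

	/* 64-bit arithmetic so the intermediate product cannot overflow */
	return (uint32_t)(capacitance * freq_mhz * volt_mv * volt_mv /
			  1000000000ULL);
}

/* Scale the full-load power by the load percentage of one cpu. */
static uint32_t scaled_power_mw(uint32_t raw_mw, uint32_t load_pct)
{
	assert(load_pct <= 100);

	return (uint32_t)((uint64_t)raw_mw * load_pct / 100);
}
```

With a coefficient of 100, an OPP of 1GHz at 1V gives 100mW at full
load, and a cpu that was busy half of the time is charged 50mW.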

Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Eduardo Valentin <edubezval@gmail.com>
Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Signed-off-by: Javi Merino <javi.merino@arm.com>
---
 Documentation/thermal/power_actor.txt     |   46 ++++
 drivers/thermal/Kconfig                   |    5 +
 drivers/thermal/power_actor/Kconfig       |    9 +
 drivers/thermal/power_actor/Makefile      |    2 +
 drivers/thermal/power_actor/cpu_actor.c   |  419 +++++++++++++++++++++++++++++
 drivers/thermal/power_actor/power_actor.h |   23 ++
 6 files changed, 504 insertions(+)
 create mode 100644 drivers/thermal/power_actor/Kconfig
 create mode 100644 drivers/thermal/power_actor/cpu_actor.c

diff --git a/Documentation/thermal/power_actor.txt b/Documentation/thermal/power_actor.txt
index a0f06e091907..d74909376610 100644
--- a/Documentation/thermal/power_actor.txt
+++ b/Documentation/thermal/power_actor.txt
@@ -27,3 +27,49 @@ Callbacks
 milliwatts.
 
 Returns 0 on success, -E* on error.
+
+CPU Power Actor API
+===================
+A simple power model for CPUs.  The current power is calculated as
+dynamic power.  The dynamic power consumption of a processor depends
+on many factors.  For a given processor implementation the primary
+factors are:
+
+- The time the processor spends running, consuming dynamic power, as
+  compared to the time in idle states where dynamic consumption is
+  negligible.  Herein we refer to this as 'utilisation'.
+- The voltage and frequency levels as a result of DVFS.  The DVFS
+  level is a dominant factor governing power consumption.
+- In running time the 'execution' behaviour (instruction types, memory
+  access patterns and so forth) causes, in most cases, a second order
+  variation.  In pathological cases this variation can be significant,
+  but typically it is of a much lesser impact than the factors above.
+
+A high level dynamic power consumption model may then be represented as:
+
+Pdyn = f(run) * Voltage^2 * Frequency * Utilisation
+
+f(run) here represents the described execution behaviour and its
+result has units of Watts/Hz/Volt^2 (this is often expressed in
+mW/MHz/uVolt^2).
+
+The detailed behaviour for f(run) could be modelled on-line.  However,
+in practice, such an on-line model has dependencies on a number of
+implementation specific processor support and characterisation
+factors.  Therefore, in the initial implementation that contribution is
+represented as a constant coefficient.  This is a simplification
+consistent with the relative contribution to overall power variation.
+
+In this simplified representation our model becomes:
+
+Pdyn = Kd * Voltage^2 * Frequency * Utilisation
+
+Where Kd (capacitance) represents an indicative running time dynamic
+power coefficient in fundamental units of mW/MHz/uVolt^2
+
+This power model requires that the operating-points of the CPUs are
+registered using the kernel's opp library and the
+`cpufreq_frequency_table` is assigned to the `struct device` of the
+cpu.  If you are using the `cpufreq-cpu0.c` driver then the
+`cpufreq_frequency_table` should already be assigned to the cpu
+device.
diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
index 47e2f15537ca..1818c4fa60b8 100644
--- a/drivers/thermal/Kconfig
+++ b/drivers/thermal/Kconfig
@@ -92,6 +92,11 @@ config THERMAL_GOV_USER_SPACE
 config THERMAL_POWER_ACTOR
 	bool
 
+menu "Power actors"
+depends on THERMAL_POWER_ACTOR
+source "drivers/thermal/power_actor/Kconfig"
+endmenu
+
 config CPU_THERMAL
 	bool "generic cpu cooling support"
 	depends on CPU_FREQ
diff --git a/drivers/thermal/power_actor/Kconfig b/drivers/thermal/power_actor/Kconfig
new file mode 100644
index 000000000000..fa542ca99cdb
--- /dev/null
+++ b/drivers/thermal/power_actor/Kconfig
@@ -0,0 +1,9 @@
+#
+# Thermal power actor configuration
+#
+
+config THERMAL_POWER_ACTOR_CPU
+	bool
+	prompt "Simple power model for a CPU"
+	help
+	  A simple CPU power model
diff --git a/drivers/thermal/power_actor/Makefile b/drivers/thermal/power_actor/Makefile
index 46478f4928be..6f04b92997e6 100644
--- a/drivers/thermal/power_actor/Makefile
+++ b/drivers/thermal/power_actor/Makefile
@@ -3,3 +3,5 @@
 #
 
 obj-y += power_actor.o
+
+obj-$(CONFIG_THERMAL_POWER_ACTOR_CPU) += cpu_actor.o
diff --git a/drivers/thermal/power_actor/cpu_actor.c b/drivers/thermal/power_actor/cpu_actor.c
new file mode 100644
index 000000000000..0d76d52609fa
--- /dev/null
+++ b/drivers/thermal/power_actor/cpu_actor.c
@@ -0,0 +1,419 @@
+/*
+ * A basic cpu actor
+ *
+ * Copyright (C) 2014 ARM Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed "as is" WITHOUT ANY WARRANTY of any
+ * kind, whether express or implied; without even the implied warranty
+ * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#define pr_fmt(fmt) "CPU actor: " fmt
+
+#include <linux/cpu.h>
+#include <linux/cpufreq.h>
+#include <linux/cpumask.h>
+#include <linux/cpu_cooling.h>
+#include <linux/device.h>
+#include <linux/err.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/pm_opp.h>
+#include <linux/printk.h>
+#include <linux/slab.h>
+
+#include "power_actor.h"
+
+/**
+ * struct power_table - frequency to power conversion
+ * @frequency:	frequency in KHz
+ * @power:	power in mW
+ *
+ * This structure is built when the cooling device registers and helps
+ * in translating frequency to power and vice versa.
+ */
+struct power_table {
+	u32 frequency;
+	u32 power;
+};
+
+/**
+ * struct cpu_actor - information for each cpu actor
+ * @cpumask:	cpus covered by this actor
+ * @freq:	frequency in KHz of the cpus represented by the cooling device
+ * @last_load: load measured by the latest call to cpu_get_req_power()
+ * @capacitance: the dynamic power coefficient of these cpus
+ * @time_in_idle: previous reading of the absolute time that this cpu was idle
+ * @time_in_idle_timestamp: wall time of the last invocation of
+ *	get_cpu_idle_time()
+ * @power_table: array of struct power_table for frequency to power conversion
+ * @power_table_entries: number of entries in the @power_table array
+ * @cdev:	cpufreq cooling device associated with this actor
+ */
+struct cpu_actor {
+	cpumask_t cpumask;
+	u32 freq;
+	u32 last_load;
+	u32 capacitance;
+	u64 time_in_idle[NR_CPUS];
+	u64 time_in_idle_timestamp[NR_CPUS];
+	struct power_table *power_table;
+	int power_table_entries;
+	struct thermal_cooling_device *cdev;
+};
+
+static DEFINE_MUTEX(cpu_power_actor_lock);
+
+static unsigned int cpu_power_actors_registered;
+
+static u32 cpu_freq_to_power(struct cpu_actor *cpu_actor, u32 freq)
+{
+	int i;
+	struct power_table *pt = cpu_actor->power_table;
+
+	for (i = 0; i < cpu_actor->power_table_entries - 1; i++)
+		if (freq <= pt[i].frequency)
+			break;
+
+	return pt[i].power;
+}
+
+static u32 cpu_power_to_freq(struct cpu_actor *cpu_actor, u32 power)
+{
+	int i;
+	struct power_table *pt = cpu_actor->power_table;
+
+	for (i = 0; i < cpu_actor->power_table_entries - 1; i++)
+		if (power <= pt[i].power)
+			break;
+
+	return pt[i].frequency;
+}
+
+/**
+ * get_load - get load for a cpu since last updated
+ * @cpu_actor: struct cpu_actor for this actor
+ * @cpu: cpu number
+ *
+ * Return the average load of cpu @cpu in percentage since this
+ * function was last called.
+ */
+static u32 get_load(struct cpu_actor *cpu_actor, int cpu)
+{
+	u32 load;
+	u64 now, now_idle, delta_time, delta_idle;
+
+	now_idle = get_cpu_idle_time(cpu, &now, 0);
+	delta_idle = now_idle - cpu_actor->time_in_idle[cpu];
+	delta_time = now - cpu_actor->time_in_idle_timestamp[cpu];
+
+	if (delta_time <= delta_idle)
+		load = 0;
+	else
+		load = div64_u64(100 * (delta_time - delta_idle), delta_time);
+
+	cpu_actor->time_in_idle[cpu] = now_idle;
+	cpu_actor->time_in_idle_timestamp[cpu] = now;
+
+	return load;
+}
+
+/**
+ * cpu_get_req_power - get the current power
+ * @actor: power actor pointer
+ *
+ * Callback for the power actor to return the current power
+ * consumption in milliwatts.
+ */
+static u32 cpu_get_req_power(struct power_actor *actor)
+{
+	int cpu;
+	u32 power = 0, raw_cpu_power, total_load = 0;
+	struct cpu_actor *cpu_actor = actor->data;
+
+	raw_cpu_power = cpu_freq_to_power(cpu_actor, cpu_actor->freq);
+
+	for_each_cpu(cpu, &cpu_actor->cpumask) {
+		u32 load;
+
+		if (!cpu_online(cpu))
+			continue;
+
+		load = get_load(cpu_actor, cpu);
+		power += (raw_cpu_power * load) / 100;
+		total_load += load;
+	}
+
+	cpu_actor->last_load = total_load;
+
+	return power;
+}
+
+/**
+ * cpu_set_power - set cpufreq cooling device to consume a certain power
+ * @actor: power actor pointer
+ * @power: the power in milliwatts that should be set
+ *
+ * Callback for the power actor to configure the power consumption of
+ * the CPU to be @power milliwatts at most.  This function assumes
+ * that the load will remain constant.  The power is translated into a
+ * cooling state that the cpu cooling device then sets.
+ *
+ * Returns 0 on success, -EINVAL if it couldn't convert the frequency
+ * to a cpufreq cooling device state.
+ */
+static int cpu_set_power(struct power_actor *actor, u32 power)
+{
+	unsigned int cpu, freq;
+	unsigned long cdev_state;
+	u32 normalised_power, last_load;
+	struct thermal_cooling_device *cdev;
+	struct cpu_actor *cpu_actor = actor->data;
+
+	cdev = cpu_actor->cdev;
+	cpu = cpumask_any(&cpu_actor->cpumask);
+	last_load = cpu_actor->last_load ? cpu_actor->last_load : 1;
+	normalised_power = (power * 100) / last_load;
+	freq = cpu_power_to_freq(cpu_actor, normalised_power);
+
+	cdev_state = cpufreq_cooling_get_level(cpu, freq);
+	if (cdev_state == THERMAL_CSTATE_INVALID) {
+		pr_err("Failed to convert %dKHz for cpu %d into a cdev state\n",
+			freq, cpu);
+		return -EINVAL;
+	}
+
+	return cdev->ops->set_cur_state(cdev, cdev_state);
+}
+
+static struct power_actor_ops cpu_actor_ops = {
+	.get_req_power = cpu_get_req_power,
+	.set_power = cpu_set_power,
+};
+
+/**
+ * cpufreq_frequency_change - notifier callback for cpufreq frequency changes
+ * @nb:		struct notifier_block * with callback info
+ * @event:	value showing cpufreq event for which this function invoked
+ * @data:	callback-specific data
+ *
+ * Callback to get notifications of frequency changes.  In the
+ * CPUFREQ_POSTCHANGE @event we store the new frequency so that
+ * cpu_get_req_power() knows the current frequency and can convert it
+ * into power.
+ */
+static int cpufreq_frequency_change(struct notifier_block *nb,
+				unsigned long event, void *data)
+{
+	struct power_actor *actor;
+	struct cpufreq_freqs *freqs = data;
+
+	/* Only update frequency on postchange */
+	if (event != CPUFREQ_POSTCHANGE)
+		return NOTIFY_OK;
+
+	list_for_each_entry(actor, &actor_list, actor_node) {
+		struct cpu_actor *cpu_actor;
+
+		if (actor->type != POWER_ACTOR_CPU)
+			continue;
+
+		cpu_actor = actor->data;
+
+		if (cpumask_test_cpu(freqs->cpu, &cpu_actor->cpumask))
+			cpu_actor->freq = freqs->new;
+	}
+
+	return NOTIFY_OK;
+}
+
+struct notifier_block cpufreq_transition_notifier = {
+	.notifier_call = cpufreq_frequency_change,
+};
+
+/**
+ * build_cpu_power_table - create a power to frequency table
+ * @cpu_actor:	the cpu_actor in which to store the table
+ *
+ * Build a power to frequency table for this cpu and store it in
+ * @cpu_actor.  This table will be used in cpu_power_to_freq() and
+ * cpu_freq_to_power() to convert between power and frequency
+ * efficiently.  Power is stored in mW, frequency in KHz.  The
+ * resulting table is in ascending order.
+ *
+ * Returns 0 on success, -E* on error.
+ */
+static int build_cpu_power_table(struct cpu_actor *cpu_actor)
+{
+	struct power_table *power_table;
+	struct dev_pm_opp *opp;
+	struct device *dev = NULL;
+	int num_opps, cpu, i, ret = 0;
+	unsigned long freq;
+
+	num_opps = 0;
+
+	rcu_read_lock();
+
+	for_each_cpu(cpu, &cpu_actor->cpumask) {
+		dev = get_cpu_device(cpu);
+		if (!dev)
+			continue;
+
+		num_opps = dev_pm_opp_get_opp_count(dev);
+		if (num_opps > 0) {
+			break;
+		} else if (num_opps < 0) {
+			ret = num_opps;
+			goto unlock;
+		}
+	}
+
+	if (num_opps == 0) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	power_table = kcalloc(num_opps, sizeof(*power_table), GFP_KERNEL);
+
+	i = 0;
+	for (freq = 0;
+	     opp = dev_pm_opp_find_freq_ceil(dev, &freq), !IS_ERR(opp);
+	     freq++) {
+		u32 freq_mhz, voltage_mv;
+		u64 power;
+
+		freq_mhz = freq / 1000000;
+		voltage_mv = dev_pm_opp_get_voltage(opp) / 1000;
+
+		/*
+		 * Do the multiplication with MHz and millivolt so as
+		 * to not overflow.
+		 */
+		power = (u64)cpu_actor->capacitance * freq_mhz *
+			voltage_mv * voltage_mv;
+		do_div(power, 1000000000);
+
+		/* frequency is stored in power_table in KHz */
+		power_table[i].frequency = freq / 1000;
+		power_table[i].power = power;
+
+		i++;
+	}
+
+	if (i == 0) {
+		ret = PTR_ERR(opp);
+		goto unlock;
+	}
+
+	cpu_actor->power_table = power_table;
+	cpu_actor->power_table_entries = i;
+
+unlock:
+	rcu_read_unlock();
+	return ret;
+}
+
+/**
+ * power_cpu_actor_register - register a cpu_actor within the power actor API
+ * @cpumask:	cpumask of cpus covered by this power_actor
+ * @cdev:	cpufreq cooling device associated with this actor
+ * @capacitance: dynamic power coefficient for these cpus
+ *
+ * Register the cpus in @cpumask with the power actor API using a
+ * simple cpu power model.  The cpus must have registered their OPPs
+ * in the OPP library.
+ *
+ * Return the power_actor created on success or the corresponding
+ * ERR_PTR() on failure.  This actor should be freed with
+ * power_cpu_actor_unregister() when it's no longer needed.
+ */
+struct power_actor *power_cpu_actor_register(cpumask_t *cpumask,
+					struct thermal_cooling_device *cdev,
+					u32 capacitance)
+{
+	int ret;
+	struct power_actor *actor, *err_ret;
+	struct cpu_actor *cpu_actor;
+	u32 cpu_max_power;
+	unsigned int last_entry;
+
+	cpu_actor = kzalloc(sizeof(*cpu_actor), GFP_KERNEL);
+	if (!cpu_actor)
+		return ERR_PTR(-ENOMEM);
+
+	cpumask_copy(&cpu_actor->cpumask, cpumask);
+	cpu_actor->cdev = cdev;
+	cpu_actor->capacitance = capacitance;
+
+	ret = build_cpu_power_table(cpu_actor);
+	if (ret) {
+		err_ret = ERR_PTR(ret);
+		goto kfree;
+	}
+
+	last_entry = cpu_actor->power_table_entries - 1;
+	cpu_max_power = cpu_actor->power_table[last_entry].power;
+	cpu_max_power *= cpumask_weight(cpumask);
+
+	actor = power_actor_register(POWER_ACTOR_CPU, &cpu_actor_ops,
+				cpu_max_power, cpu_actor);
+	if (IS_ERR(actor)) {
+		err_ret = actor;
+		goto kfree;
+	}
+
+	mutex_lock(&cpu_power_actor_lock);
+
+	/*
+	 * You can't register the same notifier_block multiple times.
+	 * The first power actor registered is the only one that
+	 * registers the notifier.
+	 */
+	if (!cpu_power_actors_registered) {
+		ret = cpufreq_register_notifier(&cpufreq_transition_notifier,
+						CPUFREQ_TRANSITION_NOTIFIER);
+		if (ret) {
+			err_ret = ERR_PTR(ret);
+			mutex_unlock(&cpu_power_actor_lock);
+			goto power_actor_unregister;
+		}
+	}
+
+	cpu_power_actors_registered++;
+	mutex_unlock(&cpu_power_actor_lock);
+
+	return actor;
+
+power_actor_unregister:
+	power_actor_unregister(actor);
+kfree:
+	kfree(cpu_actor);
+
+	return err_ret;
+}
+
+void power_cpu_actor_unregister(struct power_actor *actor)
+{
+	struct cpu_actor *cpu_actor = actor->data;
+
+	kfree(cpu_actor->power_table);
+
+	mutex_lock(&cpu_power_actor_lock);
+
+	cpu_power_actors_registered--;
+
+	if (!cpu_power_actors_registered)
+		cpufreq_unregister_notifier(&cpufreq_transition_notifier,
+					CPUFREQ_TRANSITION_NOTIFIER);
+
+	mutex_unlock(&cpu_power_actor_lock);
+
+	kfree(cpu_actor);
+	power_actor_unregister(actor);
+}
diff --git a/drivers/thermal/power_actor/power_actor.h b/drivers/thermal/power_actor/power_actor.h
index 82be19ce741d..fe5c8cc3da3c 100644
--- a/drivers/thermal/power_actor/power_actor.h
+++ b/drivers/thermal/power_actor/power_actor.h
@@ -17,11 +17,16 @@
 #ifndef __POWER_ACTOR_H__
 #define __POWER_ACTOR_H__
 
+#include <linux/cpumask.h>
+#include <linux/device.h>
+#include <linux/err.h>
 #include <linux/list.h>
+#include <linux/thermal.h>
 
 #define MAX_NUM_ACTORS 8
 
 enum power_actor_types {
+	POWER_ACTOR_CPU,
 };
 
 struct power_actor;
@@ -58,6 +63,24 @@ struct power_actor *power_actor_register(enum power_actor_types type,
 					u32 max_power, void *privdata);
 void power_actor_unregister(struct power_actor *actor);
 
+#ifdef CONFIG_THERMAL_POWER_ACTOR_CPU
+struct power_actor *power_cpu_actor_register(cpumask_t *cpumask,
+					struct thermal_cooling_device *cdev,
+					u32 capacitance);
+void power_cpu_actor_unregister(struct power_actor *actor);
+#else
+static inline
+struct power_actor *power_cpu_actor_register(cpumask_t *cpumask,
+					struct thermal_cooling_device *cdev,
+					u32 capacitance)
+{
+	return ERR_PTR(-ENOSYS);
+}
+static inline void power_cpu_actor_unregister(struct power_actor *actor)
+{
+}
+#endif
+
 extern struct list_head actor_list;
 
 #endif /* __POWER_ACTOR_H__ */
-- 
1.7.9.5


* [RFC PATCH v2 6/7] thermal: introduce the Power Allocator governor
  2014-05-20 14:10 [RFC PATCH v2 0/7] The power allocator thermal governor Javi Merino
                   ` (4 preceding siblings ...)
  2014-05-20 14:10 ` [RFC PATCH v2 5/7] thermal: add a basic cpu power actor Javi Merino
@ 2014-05-20 14:10 ` Javi Merino
  2014-05-20 14:10 ` [RFC PATCH v2 7/7] thermal: add trace events to the power allocator governor Javi Merino
  6 siblings, 0 replies; 11+ messages in thread
From: Javi Merino @ 2014-05-20 14:10 UTC (permalink / raw)
  To: linux-pm, linux-kernel
  Cc: Punit.Agrawal, Javi Merino, Zhang Rui, Eduardo Valentin,
	Punit Agrawal

The power allocator governor is a thermal governor that controls system
and device power allocation to control temperature.  Conceptually, the
implementation takes a system view of heat dissipation by managing
multiple heat sources.

This governor relies on power-aware cooling devices (power actors) to
operate.  That is, devices that have been registered with the power
actor API.

It uses a Proportional Integral (PI) controller driven by the
temperature of the thermal zone to compute a power budget.  This
budget is then allocated to each cooling device that can have bearing
on the temperature we are trying to control.  It decides how much
power to give each cooling device based on the performance they are
requesting.  The PI controller sizes the total power budget so that
the temperature does not exceed the control temperature.
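
To make the two steps above concrete, here is a simplified standalone
sketch of a PI budget calculation followed by a proportional split.
It is not the code in this patch: the names pi_state, pi_budget_mw()
and split_budget() are hypothetical, plain integers replace the
fixed-point arithmetic, and both the integral cutoff and the
redistribution of power that an actor cannot use are left out.

```c
#include <assert.h>
#include <stdint.h>

struct pi_state {
	int32_t k_po;		/* P gain when above the target, mW per degree */
	int32_t k_pu;		/* P gain when below the target, mW per degree */
	int32_t k_i;		/* I gain, mW per accumulated degree */
	int32_t err_integral;	/* accumulated error in degrees */
};

/* Power budget in mW for the next period; temperatures in millicelsius. */
static uint32_t pi_budget_mw(struct pi_state *s, int32_t temp_mc,
			     int32_t target_mc, uint32_t sustainable_mw,
			     uint32_t max_mw)
{
	int32_t err = (target_mc - temp_mc) / 1000;
	int64_t p = (int64_t)(err < 0 ? s->k_po : s->k_pu) * err;
	int64_t i, budget;

	/* simplification: always accumulate, no integral cutoff */
	s->err_integral += err;
	i = (int64_t)s->k_i * s->err_integral;

	/* feed forward the sustainable power, then clamp to [0, max_mw] */
	budget = (int64_t)sustainable_mw + p + i;
	if (budget < 0)
		budget = 0;
	if (budget > max_mw)
		budget = max_mw;

	return (uint32_t)budget;
}

/* Share the budget in proportion to each request, capped per actor. */
static void split_budget(const uint32_t *req, const uint32_t *max, int n,
			 uint32_t budget, uint32_t *granted)
{
	uint64_t total = 0;
	int k;

	assert(n > 0);

	for (k = 0; k < n; k++)
		total += req[k];

	for (k = 0; k < n; k++) {
		/* nobody asked for anything: let everybody run at maximum */
		uint64_t share = total ? (uint64_t)req[k] * budget / total
				       : max[k];

		granted[k] = share > max[k] ? max[k] : (uint32_t)share;
	}
}
```

With gains of 100, 50 and 10, a zone at 80C against a 75C target and
a sustainable power of 3500mW gets 3500 - 500 - 50 = 2950mW on the
first call.  Splitting 2000mW between requests of 300mW and 100mW
yields 1500mW and 500mW before the per-actor caps are applied.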

Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Eduardo Valentin <edubezval@gmail.com>
Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Signed-off-by: Javi Merino <javi.merino@arm.com>
---
 Documentation/thermal/power_allocator.txt |   42 +++
 drivers/thermal/Kconfig                   |   15 +
 drivers/thermal/Makefile                  |    1 +
 drivers/thermal/power_allocator.c         |  442 +++++++++++++++++++++++++++++
 drivers/thermal/thermal_core.c            |    7 +-
 drivers/thermal/thermal_core.h            |    8 +
 include/linux/thermal.h                   |    5 +
 7 files changed, 519 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/thermal/power_allocator.txt
 create mode 100644 drivers/thermal/power_allocator.c

diff --git a/Documentation/thermal/power_allocator.txt b/Documentation/thermal/power_allocator.txt
new file mode 100644
index 000000000000..daedf117611a
--- /dev/null
+++ b/Documentation/thermal/power_allocator.txt
@@ -0,0 +1,42 @@
+
+Integration of the power_allocator governor in a platform
+=========================================================
+
+Registering thermal_zone_device
+-------------------------------
+
+An estimate of the sustainable dissipatable power (in mW) should be
+provided while registering the thermal zone.  This is the maximum
+sustained power for allocation at the desired maximum temperature.
+This number can vary with operating conditions, but the closed loop
+of the controller should take care of those variations, so
+`max_dissipatable_power` only needs to be an estimate.  Register your
+thermal zone with `thermal_zone_params` that have a
+`max_dissipatable_power`.  If you weren't passing any
+`thermal_zone_params`, then something like this will do:
+
+	static const struct thermal_zone_params tz_params = {
+		.max_dissipatable_power = 3500,
+	};
+
+and then pass `tz_params` as the 6th parameter to
+`thermal_zone_device_register()`.
+
+Trip points
+-----------
+
+The governor requires the following two trip points:
+
+1.  "switch on" trip point: temperature above which the governor
+    control loop starts operating
+2.  "desired temperature" trip point: it should be higher than the
+    "switch on" trip point. It is the target temperature the governor
+    is controlling for.
+
+The trip points can be either active or passive.
+
+Power actors
+------------
+
+Devices controlled by this governor must be registered with the power
+actor API.  Read `power_actor.txt` for more information about them.
diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
index 1818c4fa60b8..e5b338a7cab9 100644
--- a/drivers/thermal/Kconfig
+++ b/drivers/thermal/Kconfig
@@ -71,6 +71,14 @@ config THERMAL_DEFAULT_GOV_USER_SPACE
 	  Select this if you want to let the user space manage the
 	  platform thermals.
 
+config THERMAL_DEFAULT_GOV_POWER_ALLOCATOR
+	bool "power_allocator"
+	select THERMAL_GOV_POWER_ALLOCATOR
+	help
+	  Select this if you want to control temperature based on
+	  system and device power allocation. This governor relies on
+	  power actors to operate.
+
 endchoice
 
 config THERMAL_GOV_FAIR_SHARE
@@ -89,6 +97,13 @@ config THERMAL_GOV_USER_SPACE
 	help
 	  Enable this to let the user space manage the platform thermals.
 
+config THERMAL_GOV_POWER_ALLOCATOR
+	bool "Power allocator thermal governor"
+	select THERMAL_POWER_ACTOR
+	help
+	  Enable this to manage platform thermals by dynamically
+	  allocating and limiting power to devices.
+
 config THERMAL_POWER_ACTOR
 	bool
 
diff --git a/drivers/thermal/Makefile b/drivers/thermal/Makefile
index 878a02cab7d1..c5b47f058675 100644
--- a/drivers/thermal/Makefile
+++ b/drivers/thermal/Makefile
@@ -13,6 +13,7 @@ thermal_sys-$(CONFIG_THERMAL_OF)		+= of-thermal.o
 thermal_sys-$(CONFIG_THERMAL_GOV_FAIR_SHARE)	+= fair_share.o
 thermal_sys-$(CONFIG_THERMAL_GOV_STEP_WISE)	+= step_wise.o
 thermal_sys-$(CONFIG_THERMAL_GOV_USER_SPACE)	+= user_space.o
+thermal_sys-$(CONFIG_THERMAL_GOV_POWER_ALLOCATOR)	+= power_allocator.o
 
 obj-$(CONFIG_THERMAL_POWER_ACTOR) += power_actor/
 
diff --git a/drivers/thermal/power_allocator.c b/drivers/thermal/power_allocator.c
new file mode 100644
index 000000000000..836c834a898c
--- /dev/null
+++ b/drivers/thermal/power_allocator.c
@@ -0,0 +1,442 @@
+/*
+ * A power allocator to manage temperature
+ *
+ * Copyright (C) 2014 ARM Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed "as is" WITHOUT ANY WARRANTY of any
+ * kind, whether express or implied; without even the implied warranty
+ * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#define pr_fmt(fmt) "Power allocator: " fmt
+
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/thermal.h>
+
+#include "power_actor/power_actor.h"
+#include "thermal_core.h"
+
+#define FRAC_BITS 8
+#define int_to_frac(x) ((x) << FRAC_BITS)
+#define frac_to_int(x) ((x) >> FRAC_BITS)
+
+/**
+ * mul_frac - multiply two fixed-point numbers
+ * @x:	first multiplicand
+ * @y:	second multiplicand
+ *
+ * Returns the result of multiplying two fixed-point numbers.  The
+ * result is also a fixed-point number.
+ */
+static inline s64 mul_frac(s64 x, s64 y)
+{
+	return (x * y) >> FRAC_BITS;
+}
+
+enum power_allocator_trip_levels {
+	TRIP_SWITCH_ON = 0,	/* Switch on PI controller */
+	TRIP_MAX_DESIRED_TEMPERATURE, /* Temperature we are controlling for */
+};
+
+/**
+ * struct power_allocator_params - parameters for the power allocator governor
+ * @k_po:	P parameter of the PI controller when overshooting (i.e., when
+ *		temperature is above the target)
+ * @k_pu:	P parameter of the PI controller when undershooting
+ * @k_i:	I parameter of the PI controller
+ * @integral_cutoff:	threshold below which the error is no longer accumulated
+ *			in the PI controller
+ * @err_integral:	Accumulated error in the PI controller.
+ */
+struct power_allocator_params {
+	s32 k_po;
+	s32 k_pu;
+	s32 k_i;
+	s32 integral_cutoff;
+	s32 err_integral;
+};
+
+/**
+ * pi_controller() - PI controller
+ * @tz:	thermal zone we are operating in
+ * @control_temp:	The target temperature
+ * @max_allocatable_power:	maximum allocatable power for this thermal zone
+ *
+ * This PI controller increases the available power budget so that the
+ * temperature of the thermal zone gets as close as possible to
+ * @control_temp and limits the power if it exceeds it.  k_po is the
+ * proportional term when we are overshooting, k_pu is the
+ * proportional term when we are undershooting.  integral_cutoff is a
+ * threshold below which we stop accumulating the error.  The
+ * accumulated error is only valid if the requested power will make
+ * the system warmer.  If the system is mostly idle, there's no point
+ * in accumulating positive error.
+ *
+ * It returns the power budget for the next period.
+ */
+static u32 pi_controller(struct thermal_zone_device *tz,
+			unsigned long current_temp, unsigned long control_temp,
+			unsigned long max_allocatable_power)
+{
+	s64 p, i, power_range;
+	s32 err;
+	struct power_allocator_params *params = tz->governor_data;
+
+	err = ((s32)control_temp - (s32)current_temp) / 1000;
+	err = int_to_frac(err);
+
+	/* Calculate the proportional term */
+	p = mul_frac(err < 0 ? params->k_po : params->k_pu, err);
+
+	/*
+	 * Calculate the integral term
+	 *
+	 * if the error is less than the cutoff, allow integration (but
+	 * the integral is limited to max power)
+	 */
+	i = mul_frac(params->k_i, params->err_integral);
+
+	if (err < int_to_frac(params->integral_cutoff)) {
+		s64 tmpi = mul_frac(params->k_i, err);
+		tmpi += i;
+		if (tmpi <= int_to_frac(max_allocatable_power)) {
+			i = tmpi;
+			params->err_integral += err;
+		}
+	}
+
+	power_range = p + i;
+
+	/* feed-forward the known maximum dissipatable power */
+	power_range = tz->tzp->max_dissipatable_power +
+		frac_to_int(power_range);
+
+	return clamp(power_range, (s64)0, (s64)max_allocatable_power);
+}
+
+/**
+ * divvy_up_power - divvy the allocated power between the actors
+ * @req_power:	each actor's requested power
+ * @max_power:	each actor's maximum available power
+ * @num_actors:	size of the @req_power, @max_power and @granted_power arrays
+ * @total_req_power: sum of @req_power
+ * @power_range:	total allocated power
+ * @granted_power:	output array: each actor's granted power
+ *
+ * This function divides the total allocated power (@power_range)
+ * fairly between the actors. It first tries to give each actor a
+ * share of the @power_range according to how much power it requested
+ * compared to the rest of the actors.  For example, if only one actor
+ * requests power, then it receives all the @power_range.  If
+ * three actors each request 1mW, each receives a third of the
+ * @power_range.
+ *
+ * If any actor received more than their maximum power, then that
+ * surplus is re-divvied among the actors based on how far they are
+ * from their respective maximums.
+ *
+ * Granted power for each actor is written to @granted_power, which
+ * should've been allocated by the calling function.
+ */
+static void divvy_up_power(unsigned long *req_power,
+			unsigned long *max_power,
+			int num_actors, unsigned long total_req_power,
+			u32 power_range,
+			unsigned long *granted_power)
+{
+	unsigned long extra_power, capped_extra_power;
+	unsigned long extra_actor_power[num_actors];
+	int i;
+
+	if (!total_req_power) {
+		/*
+		 * Nobody requested anything, so just give everybody
+		 * the maximum power
+		 */
+		for (i = 0; i < num_actors; i++)
+			granted_power[i] = max_power[i];
+
+		return;
+	}
+
+	capped_extra_power = 0;
+	extra_power = 0;
+	for (i = 0; i < num_actors; i++) {
+		u64 req_range = (u64)req_power[i] * power_range;
+
+		granted_power[i] = div_u64(req_range, total_req_power);
+
+		if (granted_power[i] > max_power[i]) {
+			extra_power += granted_power[i] - max_power[i];
+			granted_power[i] = max_power[i];
+		}
+
+		extra_actor_power[i] = max_power[i] - granted_power[i];
+		capped_extra_power += extra_actor_power[i];
+	}
+
+	if (!extra_power)
+		return;
+
+	/*
+	 * Re-divvy the reclaimed extra among actors based on
+	 * how far they are from the max
+	 */
+	extra_power = min(extra_power, capped_extra_power);
+	if (capped_extra_power > 0)
+		for (i = 0; i < num_actors; i++)
+			granted_power[i] += (extra_actor_power[i] *
+					extra_power) / capped_extra_power;
+}
+
+static int allocate_power(struct thermal_zone_device *tz,
+			unsigned long current_temp, unsigned long control_temp)
+{
+	struct power_actor *actor;
+	unsigned long *req_power, *max_power, *granted_power;
+	unsigned long total_req_power, max_allocatable_power;
+	u32 power_range;
+	int i, num_actors, ret = 0;
+
+	mutex_lock(&tz->lock);
+
+	num_actors = 0;
+	list_for_each_entry(actor, &actor_list, actor_node)
+		num_actors++;
+
+	req_power = devm_kcalloc(&tz->device, num_actors, sizeof(*req_power),
+				GFP_KERNEL);
+	if (!req_power) {
+		ret = -ENOMEM;
+		goto unlock;
+	}
+
+	max_power = devm_kcalloc(&tz->device, num_actors, sizeof(*max_power),
+				GFP_KERNEL);
+	if (!max_power) {
+		ret = -ENOMEM;
+		goto free_req_power;
+	}
+
+	granted_power = devm_kcalloc(&tz->device, num_actors,
+				sizeof(*granted_power), GFP_KERNEL);
+	if (!granted_power) {
+		ret = -ENOMEM;
+		goto free_max_power;
+	}
+
+	i = 0;
+	total_req_power = 0;
+	max_allocatable_power = 0;
+
+	list_for_each_entry(actor, &actor_list, actor_node) {
+		req_power[i] = actor->ops->get_req_power(actor);
+		total_req_power += req_power[i];
+
+		max_power[i] = actor->max_power;
+		max_allocatable_power += max_power[i];
+
+		i++;
+	}
+
+	power_range = pi_controller(tz, current_temp, control_temp,
+				max_allocatable_power);
+
+	divvy_up_power(req_power, max_power, num_actors, total_req_power,
+		power_range, granted_power);
+
+	i = 0;
+	list_for_each_entry(actor, &actor_list, actor_node) {
+		BUG_ON(granted_power[i] > actor->max_power);
+
+		actor->ops->set_power(actor, granted_power[i]);
+		i++;
+	}
+
+	devm_kfree(&tz->device, granted_power);
+free_max_power:
+	devm_kfree(&tz->device, max_power);
+free_req_power:
+	devm_kfree(&tz->device, req_power);
+unlock:
+	mutex_unlock(&tz->lock);
+
+	return ret;
+}
+
+static int check_trips(struct thermal_zone_device *tz)
+{
+	int ret;
+	enum thermal_trip_type type;
+
+	if (tz->trips < 2)
+		return -EINVAL;
+
+	ret = tz->ops->get_trip_type(tz, TRIP_SWITCH_ON, &type);
+	if (ret)
+		return ret;
+
+	if ((type != THERMAL_TRIP_PASSIVE) && (type != THERMAL_TRIP_ACTIVE))
+		return -EINVAL;
+
+	ret = tz->ops->get_trip_type(tz, TRIP_MAX_DESIRED_TEMPERATURE, &type);
+	if (ret)
+		return ret;
+
+	if ((type != THERMAL_TRIP_PASSIVE) && (type != THERMAL_TRIP_ACTIVE))
+		return -EINVAL;
+
+	return ret;
+}
+
+static void reset_pi_controller(struct power_allocator_params *params)
+{
+	params->err_integral = 0;
+}
+
+static void allow_maximum_power(void)
+{
+	struct power_actor *actor;
+
+	list_for_each_entry(actor, &actor_list, actor_node)
+		actor->ops->set_power(actor, actor->max_power);
+}
+
+/**
+ * power_allocator_bind - bind the power_allocator governor to a thermal zone
+ * @tz:	thermal zone to bind it to
+ *
+ * Check that the thermal zone is valid for this governor: it must have
+ * at least two thermal trips.  If so, initialize the PI controller
+ * parameters and bind it to the thermal zone.
+ *
+ * Returns 0 on success, -EINVAL if the trips were invalid or -ENOMEM
+ * if we ran out of memory.
+ */
+static int power_allocator_bind(struct thermal_zone_device *tz)
+{
+	int ret;
+	struct power_allocator_params *params;
+	unsigned long switch_on_temp, control_temp;
+	u32 temperature_threshold;
+
+	ret = check_trips(tz);
+	if (ret) {
+		dev_err(&tz->device,
+			"thermal zone %s has the wrong number of trips for this governor\n",
+			tz->type);
+		return ret;
+	}
+
+	if (!tz->tzp || !tz->tzp->max_dissipatable_power) {
+		dev_err(&tz->device,
+			"Failed to bind the power_allocator governor: no max_dissipatable_power parameter\n");
+		return -EINVAL;
+	}
+
+	params = devm_kzalloc(&tz->device, sizeof(*params), GFP_KERNEL);
+	if (!params)
+		return -ENOMEM;
+
+	ret = tz->ops->get_trip_temp(tz, TRIP_SWITCH_ON, &switch_on_temp);
+	if (ret)
+		goto free;
+
+	ret = tz->ops->get_trip_temp(tz, TRIP_MAX_DESIRED_TEMPERATURE,
+				&control_temp);
+	if (ret)
+		goto free;
+
+	temperature_threshold = (control_temp - switch_on_temp) / 1000;
+
+	params->k_po = int_to_frac(tz->tzp->max_dissipatable_power) /
+		temperature_threshold;
+	params->k_pu = int_to_frac(2 * tz->tzp->max_dissipatable_power) /
+		temperature_threshold;
+	params->k_i = int_to_frac(10);
+	params->integral_cutoff = 0;
+
+	reset_pi_controller(params);
+
+	tz->governor_data = params;
+
+	return 0;
+
+free:
+	devm_kfree(&tz->device, params);
+	return ret;
+}
+
+static void power_allocator_unbind(struct thermal_zone_device *tz)
+{
+	dev_dbg(&tz->device, "Unbinding from thermal zone %d\n", tz->id);
+	devm_kfree(&tz->device, tz->governor_data);
+	tz->governor_data = NULL;
+}
+
+static int power_allocator_throttle(struct thermal_zone_device *tz, int trip)
+{
+	int ret;
+	unsigned long switch_on_temp, control_temp, current_temp;
+	struct power_allocator_params *params = tz->governor_data;
+
+	/*
+	 * We get called for every trip point but we only need to do
+	 * our calculations once
+	 */
+	if (trip != TRIP_MAX_DESIRED_TEMPERATURE)
+		return 0;
+
+	ret = thermal_zone_get_temp(tz, &current_temp);
+	if (ret) {
+		dev_warn(&tz->device, "Failed to get temperature: %d\n", ret);
+		return ret;
+	}
+
+	ret = tz->ops->get_trip_temp(tz, TRIP_SWITCH_ON, &switch_on_temp);
+	if (ret) {
+		dev_warn(&tz->device,
+			"Failed to get switch on temperature: %d\n", ret);
+		return ret;
+	}
+
+	if (current_temp < switch_on_temp) {
+		reset_pi_controller(params);
+		allow_maximum_power();
+		return 0;
+	}
+
+	ret = tz->ops->get_trip_temp(tz, TRIP_MAX_DESIRED_TEMPERATURE,
+				&control_temp);
+	if (ret) {
+		dev_warn(&tz->device,
+			"Failed to get the maximum desired temperature: %d\n",
+			ret);
+		return ret;
+	}
+
+	return allocate_power(tz, current_temp, control_temp);
+}
+
+static struct thermal_governor thermal_gov_power_allocator = {
+	.name		= "power_allocator",
+	.bind_to_tz	= power_allocator_bind,
+	.unbind_from_tz	= power_allocator_unbind,
+	.throttle	= power_allocator_throttle,
+};
+
+int thermal_gov_power_allocator_register(void)
+{
+	return thermal_register_governor(&thermal_gov_power_allocator);
+}
+
+void thermal_gov_power_allocator_unregister(void)
+{
+	thermal_unregister_governor(&thermal_gov_power_allocator);
+}
diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index 1b13d8e0cfd1..17257376396b 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -1857,7 +1857,11 @@ static int __init thermal_register_governors(void)
 	if (result)
 		return result;
 
-	return thermal_gov_user_space_register();
+	result = thermal_gov_user_space_register();
+	if (result)
+		return result;
+
+	return thermal_gov_power_allocator_register();
 }
 
 static void thermal_unregister_governors(void)
@@ -1865,6 +1869,7 @@ static void thermal_unregister_governors(void)
 	thermal_gov_step_wise_unregister();
 	thermal_gov_fair_share_unregister();
 	thermal_gov_user_space_unregister();
+	thermal_gov_power_allocator_unregister();
 }
 
 static int __init thermal_init(void)
diff --git a/drivers/thermal/thermal_core.h b/drivers/thermal/thermal_core.h
index 3db339fb636f..b24cde2c71cc 100644
--- a/drivers/thermal/thermal_core.h
+++ b/drivers/thermal/thermal_core.h
@@ -77,6 +77,14 @@ static inline int thermal_gov_user_space_register(void) { return 0; }
 static inline void thermal_gov_user_space_unregister(void) {}
 #endif /* CONFIG_THERMAL_GOV_USER_SPACE */
 
+#ifdef CONFIG_THERMAL_GOV_POWER_ALLOCATOR
+int thermal_gov_power_allocator_register(void);
+void thermal_gov_power_allocator_unregister(void);
+#else
+static inline int thermal_gov_power_allocator_register(void) { return 0; }
+static inline void thermal_gov_power_allocator_unregister(void) {}
+#endif /* CONFIG_THERMAL_GOV_POWER_ALLOCATOR */
+
 /* device tree support */
 #ifdef CONFIG_THERMAL_OF
 int of_parse_thermal_zones(void);
diff --git a/include/linux/thermal.h b/include/linux/thermal.h
index 06971c4779a8..1d8810e44190 100644
--- a/include/linux/thermal.h
+++ b/include/linux/thermal.h
@@ -57,6 +57,8 @@
 #define DEFAULT_THERMAL_GOVERNOR       "fair_share"
 #elif defined(CONFIG_THERMAL_DEFAULT_GOV_USER_SPACE)
 #define DEFAULT_THERMAL_GOVERNOR       "user_space"
+#elif defined(CONFIG_THERMAL_DEFAULT_GOV_POWER_ALLOCATOR)
+#define DEFAULT_THERMAL_GOVERNOR       "power_allocator"
 #endif
 
 struct thermal_zone_device;
@@ -285,6 +287,9 @@ struct thermal_zone_params {
 
 	int num_tbps;	/* Number of tbp entries */
 	struct thermal_bind_params *tbp;
+
+	/* Maximum power (heat) that this thermal zone can dissipate in mW */
+	u32 max_dissipatable_power;
 };
 
 struct thermal_genl_event {
-- 
1.7.9.5
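
For readers who want to try the allocation arithmetic outside the
kernel, the two passes described in the divvy_up_power() comment can be
sketched in user space.  This is only an illustration under stated
assumptions, not the code from the patch: divvy_sketch() and MAX_ACTORS
are made-up names, uint64_t stands in for the kernel types, and the
caller is assumed to pass at most MAX_ACTORS actors.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_ACTORS 8	/* hypothetical bound, stands in for the VLA */

/* Hypothetical user-space model of the two passes in divvy_up_power(). */
static void divvy_sketch(const uint64_t *req, const uint64_t *max, int n,
			 uint64_t power_range, uint64_t *granted)
{
	uint64_t total_req = 0, extra = 0, headroom_sum = 0;
	uint64_t headroom[MAX_ACTORS];
	int i;

	assert(n >= 0 && n <= MAX_ACTORS);

	for (i = 0; i < n; i++)
		total_req += req[i];

	if (!total_req) {
		/* Nobody requested anything: everybody gets their maximum */
		for (i = 0; i < n; i++)
			granted[i] = max[i];
		return;
	}

	/* Pass 1: share proportional to the request, capped at the maximum */
	for (i = 0; i < n; i++) {
		granted[i] = req[i] * power_range / total_req;
		if (granted[i] > max[i]) {
			extra += granted[i] - max[i];
			granted[i] = max[i];
		}
		headroom[i] = max[i] - granted[i];
		headroom_sum += headroom[i];
	}

	if (!extra || !headroom_sum)
		return;

	/* Pass 2: hand out the surplus in proportion to remaining headroom */
	if (extra > headroom_sum)
		extra = headroom_sum;
	for (i = 0; i < n; i++)
		granted[i] += headroom[i] * extra / headroom_sum;
}
```

With requests of 3000 and 1000 mW, maximums of 2000 and 4000 mW and a
power_range of 4000 mW, the first pass caps the first actor at 2000 mW
and reclaims 1000 mW, which the second pass gives to the second actor,
so both end up with 2000 mW.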


* [RFC PATCH v2 7/7] thermal: add trace events to the power allocator governor
  2014-05-20 14:10 [RFC PATCH v2 0/7] The power allocator thermal governor Javi Merino
                   ` (5 preceding siblings ...)
  2014-05-20 14:10 ` [RFC PATCH v2 6/7] thermal: introduce the Power Allocator governor Javi Merino
@ 2014-05-20 14:10 ` Javi Merino
  6 siblings, 0 replies; 11+ messages in thread
From: Javi Merino @ 2014-05-20 14:10 UTC (permalink / raw)
  To: linux-pm, linux-kernel
  Cc: Punit.Agrawal, Javi Merino, Zhang Rui, Eduardo Valentin,
	Steven Rostedt, Frederic Weisbecker, Ingo Molnar

Add trace events for the power allocator governor and the power actor
interface of the cpu cooling device.

Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Eduardo Valentin <edubezval@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Javi Merino <javi.merino@arm.com>

---

trace-cmd needs the patch attached in
http://article.gmane.org/gmane.linux.kernel/1704423 for this to work.
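
To help interpret the values reported by thermal_power_allocator_pi
(err, err_integral, p, i, output), here is a user-space sketch of the
computation in pi_controller() from patch 6.  It is an approximation
under assumptions: the names ending in _sketch, to_frac(), from_frac()
and the 10 fractional bits are made up for this example, division
replaces the kernel's shifts (so rounding of negative values may
differ), and the error is passed in directly because its derivation
from the temperatures is not shown here.

```c
#include <assert.h>
#include <stdint.h>

#define FRAC_BITS_SKETCH 10	/* assumed; the series may use another value */
#define FRAC_ONE (1 << FRAC_BITS_SKETCH)

static int64_t to_frac(int64_t x) { return x * FRAC_ONE; }
static int64_t from_frac(int64_t x) { return x / FRAC_ONE; }
static int64_t mul_frac_sketch(int64_t a, int64_t b) { return a * b / FRAC_ONE; }

struct pi_sketch {
	int64_t k_po, k_pu, k_i;	/* gains, fixed point */
	int64_t integral_cutoff;	/* plain integer */
	int64_t err_integral;		/* accumulated error, fixed point */
};

/* One controller step; err_int > 0 means we are below the target. */
static int64_t pi_step_sketch(struct pi_sketch *c, int64_t err_int,
			      int64_t max_dissipatable, int64_t max_allocatable)
{
	int64_t err = to_frac(err_int);
	int64_t p, i, out;

	assert(max_allocatable >= 0);

	/* Proportional term: separate gains for overshoot and undershoot */
	p = mul_frac_sketch(err < 0 ? c->k_po : c->k_pu, err);

	/* Integral term: only accumulate below the cut-off, bounded above */
	i = mul_frac_sketch(c->k_i, c->err_integral);
	if (err < to_frac(c->integral_cutoff)) {
		int64_t tmpi = i + mul_frac_sketch(c->k_i, err);

		if (tmpi <= to_frac(max_allocatable)) {
			i = tmpi;
			c->err_integral += err;
		}
	}

	/* Feed-forward the dissipatable power, then clamp the result */
	out = max_dissipatable + from_frac(p + i);
	if (out < 0)
		out = 0;
	if (out > max_allocatable)
		out = max_allocatable;
	return out;
}
```

For example, with k_po = 100, k_pu = 200, k_i = 10, a cut-off of 0 and
a dissipatable power of 1000, an error of +2 yields 1000 + 400 = 1400
and leaves the integral untouched, while a following error of -3 yields
1000 - 300 - 30 = 670 and accumulates -3 into the integral.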

 drivers/thermal/power_actor/cpu_actor.c |    5 ++++
 drivers/thermal/power_allocator.c       |   12 +++++++++-
 include/trace/events/thermal.h          |   38 +++++++++++++++++++++++++++++++
 include/trace/events/thermal_governor.h |   35 ++++++++++++++++++++++++++++
 4 files changed, 89 insertions(+), 1 deletion(-)
 create mode 100644 include/trace/events/thermal.h
 create mode 100644 include/trace/events/thermal_governor.h

diff --git a/drivers/thermal/power_actor/cpu_actor.c b/drivers/thermal/power_actor/cpu_actor.c
index 0d76d52609fa..61f1edc13ec2 100644
--- a/drivers/thermal/power_actor/cpu_actor.c
+++ b/drivers/thermal/power_actor/cpu_actor.c
@@ -27,6 +27,9 @@
 #include <linux/printk.h>
 #include <linux/slab.h>
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/thermal.h>
+
 #include "power_actor.h"
 
 /**
@@ -188,6 +191,8 @@ static int cpu_set_power(struct power_actor *actor, u32 power)
 		return -EINVAL;
 	}
 
+	trace_thermal_power_limit(&cpu_actor->cpumask, freq, cdev_state, power);
+
 	return cdev->ops->set_cur_state(cdev, cdev_state);
 }
 
diff --git a/drivers/thermal/power_allocator.c b/drivers/thermal/power_allocator.c
index 836c834a898c..b1ebcecb1c15 100644
--- a/drivers/thermal/power_allocator.c
+++ b/drivers/thermal/power_allocator.c
@@ -19,6 +19,9 @@
 #include <linux/slab.h>
 #include <linux/thermal.h>
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/thermal_governor.h>
+
 #include "power_actor/power_actor.h"
 #include "thermal_core.h"
 
@@ -117,7 +120,14 @@ static u32 pi_controller(struct thermal_zone_device *tz,
 	power_range = tz->tzp->max_dissipatable_power +
 		frac_to_int(power_range);
 
-	return clamp(power_range, (s64)0, (s64)max_allocatable_power);
+	power_range = clamp(power_range, (s64)0, (s64)max_allocatable_power);
+
+	trace_thermal_power_allocator_pi(frac_to_int(err),
+					frac_to_int(params->err_integral),
+					frac_to_int(p), frac_to_int(i),
+					power_range);
+
+	return power_range;
 }
 
 /**
diff --git a/include/trace/events/thermal.h b/include/trace/events/thermal.h
new file mode 100644
index 000000000000..6496da62276d
--- /dev/null
+++ b/include/trace/events/thermal.h
@@ -0,0 +1,38 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM thermal
+
+#if !defined(_TRACE_THERMAL_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_THERMAL_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(thermal_power_limit,
+	TP_PROTO(const struct cpumask *cpus, unsigned int freq,
+		unsigned long cdev_state, unsigned long power),
+
+	TP_ARGS(cpus, freq, cdev_state, power),
+
+	TP_STRUCT__entry(
+		__bitmask(cpumask, num_possible_cpus())
+		__field(unsigned int,  freq      )
+		__field(unsigned long, cdev_state)
+		__field(unsigned long, power     )
+	),
+
+	TP_fast_assign(
+		__assign_bitmask(cpumask, cpumask_bits(cpus),
+				num_possible_cpus());
+		__entry->freq = freq;
+		__entry->cdev_state = cdev_state;
+		__entry->power = power;
+	),
+
+	TP_printk("cpus=%s freq=%u cdev_state=%lu power=%lu",
+		__get_bitmask(cpumask), __entry->freq, __entry->cdev_state,
+		__entry->power)
+);
+
+#endif /* _TRACE_THERMAL_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/include/trace/events/thermal_governor.h b/include/trace/events/thermal_governor.h
new file mode 100644
index 000000000000..1fbf5c91f659
--- /dev/null
+++ b/include/trace/events/thermal_governor.h
@@ -0,0 +1,35 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM thermal_governor
+
+#if !defined(_TRACE_THERMAL_GOVERNOR_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_THERMAL_GOVERNOR_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(thermal_power_allocator_pi,
+	TP_PROTO(s32 err, s32 err_integral, s64 p, s64 i, s32 output),
+	TP_ARGS(err, err_integral, p, i, output),
+	TP_STRUCT__entry(
+		__field(s32, err         )
+		__field(s32, err_integral)
+		__field(s64, p           )
+		__field(s64, i           )
+		__field(s32, output      )
+	),
+	TP_fast_assign(
+		__entry->err = err;
+		__entry->err_integral = err_integral;
+		__entry->p = p;
+		__entry->i = i;
+		__entry->output = output;
+	),
+
+	TP_printk("err=%d err_integral=%d p=%lld i=%lld output=%d",
+		__entry->err, __entry->err_integral,
+		__entry->p, __entry->i, __entry->output)
+);
+
+#endif /* _TRACE_THERMAL_GOVERNOR_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
-- 
1.7.9.5


* Re: [RFC PATCH v2 1/7] tracing: Add __bitmask() macro to trace events to cpumasks and other bitmasks
  2014-05-20 14:10 ` [RFC PATCH v2 1/7] tracing: Add __bitmask() macro to trace events to cpumasks and other bitmasks Javi Merino
@ 2014-05-21  2:07   ` Steven Rostedt
  2014-05-21  8:26     ` Javi Merino
  0 siblings, 1 reply; 11+ messages in thread
From: Steven Rostedt @ 2014-05-21  2:07 UTC (permalink / raw)
  To: Javi Merino; +Cc: linux-pm, linux-kernel, Punit.Agrawal, Linus Torvalds

Hmm, I didn't think about cross tree dependencies. I already pushed this
patch to my for-next branch which is already in linux-next, and I do not
rebase this branch unless there's a really good need to.

I guess I needed to make a separate branch that you could have pulled
separately. I'm not sure how we want to proceed, unless you wait till
Linus pulls my branch before you add this to your tree.

Maybe it would be OK to cherry pick it? I'm not sure Linus would want
that.

Maybe I can make a separate branch that only has this patch and merge it
into my tree, where git will handle the duplicate. But then we have a
strange history.

How urgent is your change? Can it wait till my stuff makes it into
Linus's tree in the 3.16 merge window?


-- Steve


On Tue, 2014-05-20 at 15:10 +0100, Javi Merino wrote:
> From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>
> 
> Being able to show a cpumask of events can be useful as some events
> may affect only some CPUs. There is no standard way to record the
> cpumask and converting it to a string is rather expensive during
> the trace as traces happen in hotpaths. It would be better to record
> the raw event mask and be able to parse it at print time.
> 
> The following macros were added for use with the TRACE_EVENT() macro:
> 
>   __bitmask()
>   __assign_bitmask()
>   __get_bitmask()
> 
> To test this, I added this to the sched_migrate_task event, which
> looked like this:
> 
> TRACE_EVENT(sched_migrate_task,
> 
> 	TP_PROTO(struct task_struct *p, int dest_cpu, const struct cpumask *cpus),
> 
> 	TP_ARGS(p, dest_cpu, cpus),
> 
> 	TP_STRUCT__entry(
> 		__array(	char,	comm,	TASK_COMM_LEN	)
> 		__field(	pid_t,	pid			)
> 		__field(	int,	prio			)
> 		__field(	int,	orig_cpu		)
> 		__field(	int,	dest_cpu		)
> 		__bitmask(	cpumask, num_possible_cpus()	)
> 	),
> 
> 	TP_fast_assign(
> 		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
> 		__entry->pid		= p->pid;
> 		__entry->prio		= p->prio;
> 		__entry->orig_cpu	= task_cpu(p);
> 		__entry->dest_cpu	= dest_cpu;
> 		__assign_bitmask(cpumask, cpumask_bits(cpus), num_possible_cpus());
> 	),
> 
> 	TP_printk("comm=%s pid=%d prio=%d orig_cpu=%d dest_cpu=%d cpumask=%s",
> 		  __entry->comm, __entry->pid, __entry->prio,
> 		  __entry->orig_cpu, __entry->dest_cpu,
> 		  __get_bitmask(cpumask))
> );
> 
> With the output of:
> 
>         ksmtuned-3613  [003] d..2   485.220508: sched_migrate_task: comm=ksmtuned pid=3615 prio=120 orig_cpu=3 dest_cpu=2 cpumask=00000000,0000000f
>      migration/1-13    [001] d..5   485.221202: sched_migrate_task: comm=ksmtuned pid=3614 prio=120 orig_cpu=1 dest_cpu=0 cpumask=00000000,0000000f
>              awk-3615  [002] d.H5   485.221747: sched_migrate_task: comm=rcu_preempt pid=7 prio=120 orig_cpu=0 dest_cpu=1 cpumask=00000000,000000ff
>      migration/2-18    [002] d..5   485.222062: sched_migrate_task: comm=ksmtuned pid=3615 prio=120 orig_cpu=2 dest_cpu=3 cpumask=00000000,0000000f
> 
> Link: http://lkml.kernel.org/r/1399377998-14870-6-git-send-email-javi.merino@arm.com
> Link: http://lkml.kernel.org/r/20140506132238.22e136d1@gandalf.local.home
> 
> Suggested-by: Javi Merino <javi.merino@arm.com>
> Tested-by: Javi Merino <javi.merino@arm.com>
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
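
As a rough illustration of the print-time formatting described in the
quoted commit message, a raw mask can be rendered as comma-separated
32-bit hex words, most significant word first, which is the shape of
the cpumask= values in the trace output above.  This is a user-space
sketch and not the kernel's implementation; format_bitmask_sketch()
and its word-array argument are made up for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: print @nwords 32-bit words of a mask, most
 * significant word first, as "xxxxxxxx,xxxxxxxx".  words[0] holds
 * bits 0-31.  Returns the number of characters written, or -1 if
 * @buf is too small.
 */
static int format_bitmask_sketch(const uint32_t *words, int nwords,
				 char *buf, size_t len)
{
	size_t pos = 0;
	int i;

	assert(nwords > 0);

	for (i = nwords - 1; i >= 0; i--) {
		int n = snprintf(buf + pos, len - pos, "%s%08x",
				 i == nwords - 1 ? "" : ",",
				 (unsigned int)words[i]);

		/* Stop on error or if the output would be truncated */
		if (n < 0 || (size_t)n >= len - pos)
			return -1;
		pos += (size_t)n;
	}

	return (int)pos;
}
```

A 64-bit mask with the low four bits set, passed as the two words
{ 0xf, 0 }, comes out as 00000000,0000000f.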


* Re: [RFC PATCH v2 1/7] tracing: Add __bitmask() macro to trace events to cpumasks and other bitmasks
  2014-05-21  2:07   ` Steven Rostedt
@ 2014-05-21  8:26     ` Javi Merino
  2014-05-28 14:25       ` Steven Rostedt
  0 siblings, 1 reply; 11+ messages in thread
From: Javi Merino @ 2014-05-21  8:26 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org,
	Punit Agrawal, Linus Torvalds

Hi Steve,

On Wed, May 21, 2014 at 03:07:23AM +0100, Steven Rostedt wrote:
> Hmm, I didn't think about cross tree dependencies. I already pushed this
> patch to my for-next branch which is already in linux-next, and I do not
> rebase this branch unless there's a really good need to.
> 
> I guess I needed to make a separate branch that you could have pulled
> separately. I'm not sure how we want to proceed, unless you wait till
> Linus pulls my branch before you add this to your tree.
> 
> Maybe it would be OK to cherry pick it? I'm not sure Linus would want
> that.
> 
> Maybe I can make a separate branch that only has this patch and merge it
> into my tree, where git will handle the duplicate. But then we have a
> strange history.
> 
> How urgent is your change? Can it wait till my stuff makes it into
> Linus's tree in the 3.16 merge window?

Sorry for the confusion.  I was expecting you to merge this patch via
your tree.  I said that in the cover letter for the series, but I
should've repeated it in this patch.  I included it in the series for
completeness and I'll remove it once it reaches mainline.

Cheers,
Javi

> On Tue, 2014-05-20 at 15:10 +0100, Javi Merino wrote:
> > From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>
> > 
> > Being able to show a cpumask of events can be useful as some events
> > may affect only some CPUs. There is no standard way to record the
> > cpumask and converting it to a string is rather expensive during
> > the trace as traces happen in hotpaths. It would be better to record
> > the raw event mask and be able to parse it at print time.
> > 
> > The following macros were added for use with the TRACE_EVENT() macro:
> > 
> >   __bitmask()
> >   __assign_bitmask()
> >   __get_bitmask()
> > 
> > To test this, I added this to the sched_migrate_task event, which
> > looked like this:
> > 
> > TRACE_EVENT(sched_migrate_task,
> > 
> > 	TP_PROTO(struct task_struct *p, int dest_cpu, const struct cpumask *cpus),
> > 
> > 	TP_ARGS(p, dest_cpu, cpus),
> > 
> > 	TP_STRUCT__entry(
> > 		__array(	char,	comm,	TASK_COMM_LEN	)
> > 		__field(	pid_t,	pid			)
> > 		__field(	int,	prio			)
> > 		__field(	int,	orig_cpu		)
> > 		__field(	int,	dest_cpu		)
> > 		__bitmask(	cpumask, num_possible_cpus()	)
> > 	),
> > 
> > 	TP_fast_assign(
> > 		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
> > 		__entry->pid		= p->pid;
> > 		__entry->prio		= p->prio;
> > 		__entry->orig_cpu	= task_cpu(p);
> > 		__entry->dest_cpu	= dest_cpu;
> > 		__assign_bitmask(cpumask, cpumask_bits(cpus), num_possible_cpus());
> > 	),
> > 
> > 	TP_printk("comm=%s pid=%d prio=%d orig_cpu=%d dest_cpu=%d cpumask=%s",
> > 		  __entry->comm, __entry->pid, __entry->prio,
> > 		  __entry->orig_cpu, __entry->dest_cpu,
> > 		  __get_bitmask(cpumask))
> > );
> > 
> > With the output of:
> > 
> >         ksmtuned-3613  [003] d..2   485.220508: sched_migrate_task: comm=ksmtuned pid=3615 prio=120 orig_cpu=3 dest_cpu=2 cpumask=00000000,0000000f
> >      migration/1-13    [001] d..5   485.221202: sched_migrate_task: comm=ksmtuned pid=3614 prio=120 orig_cpu=1 dest_cpu=0 cpumask=00000000,0000000f
> >              awk-3615  [002] d.H5   485.221747: sched_migrate_task: comm=rcu_preempt pid=7 prio=120 orig_cpu=0 dest_cpu=1 cpumask=00000000,000000ff
> >      migration/2-18    [002] d..5   485.222062: sched_migrate_task: comm=ksmtuned pid=3615 prio=120 orig_cpu=2 dest_cpu=3 cpumask=00000000,0000000f
> > 
> > Link: http://lkml.kernel.org/r/1399377998-14870-6-git-send-email-javi.merino@arm.com
> > Link: http://lkml.kernel.org/r/20140506132238.22e136d1@gandalf.local.home
> > 
> > Suggested-by: Javi Merino <javi.merino@arm.com>
> > Tested-by: Javi Merino <javi.merino@arm.com>
> > Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> 
> 



* Re: [RFC PATCH v2 1/7] tracing: Add __bitmask() macro to trace events to cpumasks and other bitmasks
  2014-05-21  8:26     ` Javi Merino
@ 2014-05-28 14:25       ` Steven Rostedt
  0 siblings, 0 replies; 11+ messages in thread
From: Steven Rostedt @ 2014-05-28 14:25 UTC (permalink / raw)
  To: Javi Merino
  Cc: linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org,
	Punit Agrawal, Linus Torvalds

On Wed, 21 May 2014 09:26:41 +0100
Javi Merino <javi.merino@arm.com> wrote:

> Hi Steve,
> 
> On Wed, May 21, 2014 at 03:07:23AM +0100, Steven Rostedt wrote:
> > Hmm, I didn't think about cross tree dependencies. I already pushed this
> > patch to my for-next branch which is already in linux-next, and I do not
> > rebase this branch unless there's a really good need to.
> > 
> > I guess I needed to make a separate branch that you could have pulled
> > separately. I'm not sure how we want to proceed, unless you wait till
> > Linus pulls my branch before you add this to your tree.
> > 
> > Maybe it would be OK to cherry pick it? I'm not sure Linus would want
> > that.
> > 
> > Maybe I can make a separate branch that only has this patch and merge it
> > into my tree, where git will handle the duplicate. But then we have a
> > strange history.
> > 
> > How urgent is your change? Can it wait till my stuff makes it into
> > Linus's tree in the 3.16 merge window?
> 
> Sorry for the confusion.  I was expecting you to merge this patch via
> your tree.  I said that in the cover letter for the series, but I
> should've repeated it in this patch.  I included it in the series for
> completeness and I'll remove it once it reaches mainline.
> 

At LinuxCon Japan I talked with Linus about how to handle this.

If you look in my for-next branch on my git repo that's here:

git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git

You can cherry pick my commit:

https://git.kernel.org/cgit/linux/kernel/git/rostedt/linux-trace.git/commit/?h=for-next&id=4449bf927b61bdb4389393c6fea6837214d1ace7

commit 4449bf927b61bdb4389393c6fea6837214d1ace7

Since I don't have anything else built on top of it, and it is rather
independent (I just tried cherry-picking it to Linus's tree and it
comes in with no conflicts), he said it's fine if you just cherry pick
the commit into your tree and base the rest of your patches on top of
it.

Just state that you did so in the pull commit to Linus.

Now, I would use git to do this. Don't make a patch for it. Whoever you
send patches to should cherry pick my change and then apply the rest of
your patches.

Thanks,

-- Steve


Thread overview: 11+ messages
2014-05-20 14:10 [RFC PATCH v2 0/7] The power allocator thermal governor Javi Merino
2014-05-20 14:10 ` [RFC PATCH v2 1/7] tracing: Add __bitmask() macro to trace events to cpumasks and other bitmasks Javi Merino
2014-05-21  2:07   ` Steven Rostedt
2014-05-21  8:26     ` Javi Merino
2014-05-28 14:25       ` Steven Rostedt
2014-05-20 14:10 ` [RFC PATCH v2 2/7] thermal: document struct thermal_zone_device and thermal_governor Javi Merino
2014-05-20 14:10 ` [RFC PATCH v2 3/7] thermal: let governors have private data for each thermal zone Javi Merino
2014-05-20 14:10 ` [RFC PATCH v2 4/7] thermal: introduce the Power Actor API Javi Merino
2014-05-20 14:10 ` [RFC PATCH v2 5/7] thermal: add a basic cpu power actor Javi Merino
2014-05-20 14:10 ` [RFC PATCH v2 6/7] thermal: introduce the Power Allocator governor Javi Merino
2014-05-20 14:10 ` [RFC PATCH v2 7/7] thermal: add trace events to the power allocator governor Javi Merino
