[PATCH 2/3] PERF(kernel): Cleanup power events

public inbox for linux-omap@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 2/3] PERF(kernel): Cleanup power events
       [not found] <1287488171-25303-1-git-send-email-trenn@suse.de>
@ 2010-10-19 11:36 ` Thomas Renninger
  2010-10-25  6:54   ` Arjan van de Ven
                     ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Thomas Renninger @ 2010-10-19 11:36 UTC (permalink / raw)
  To: trenn
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Masami Hiramatsu,
	Frank Eigler, Steven Rostedt, Kevin Hilman, Peter Zijlstra,
	linux-omap, rjw, linux-pm, linux-trace-users, Jean Pihet,
	Pierre Tardy, Frederic Weisbecker, Tejun Heo, Mathieu Desnoyers,
	Arjan van de Ven, Ingo Molnar

New power trace events:
power:processor_idle
power:processor_frequency
power:machine_suspend


C-state/idle accounting events:
  power:power_start
  power:power_end
are replaced with:
  power:processor_idle

and
  power:power_frequency
is replaced with:
  power:processor_frequency

power:machine_suspend
is newly introduced, a first implementation
comes from the ARM side, but it's easy to add these events
in X86 as well if needed.

the type= field got removed from both, it was never
used and the type is differed by the event type itself.

perf timechart
userspace tool gets adjusted in a separate patch.

Signed-off-by: Thomas Renninger <trenn@suse.de>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
CC: Frank Eigler <fche@redhat.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Kevin Hilman <khilman@deeprootsystems.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: linux-omap@vger.kernel.org
CC: rjw@sisk.pl
CC: linux-pm@lists.linux-foundation.org
CC: linux-trace-users@vger.kernel.org
CC: Jean Pihet <jean.pihet@newoldbits.com>
CC: Pierre Tardy <tardyp@gmail.com>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Tejun Heo <tj@kernel.org>
CC: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Arjan van de Ven <arjan@linux.intel.com>
CC: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/kernel/process.c    |    5 ++-
 arch/x86/kernel/process_64.c |    1 +
 drivers/cpufreq/cpufreq.c    |    1 +
 drivers/cpuidle/cpuidle.c    |    1 +
 drivers/idle/intel_idle.c    |    1 +
 include/trace/events/power.h |   80 +++++++++++++++++++++++++++++++++++++++++-
 kernel/trace/Kconfig         |   14 +++++++
 kernel/trace/power-traces.c  |    3 ++
 8 files changed, 103 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 57d1868..b6b1578 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -374,6 +374,7 @@ void default_idle(void)
 {
 	if (hlt_use_halt()) {
 		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+		trace_processor_idle(1, smp_processor_id());
 		current_thread_info()->status &= ~TS_POLLING;
 		/*
 		 * TS_POLLING-cleared state must be visible before we
@@ -444,6 +445,7 @@ EXPORT_SYMBOL_GPL(cpu_idle_wait);
 void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
 {
 	trace_power_start(POWER_CSTATE, (ax>>4)+1, smp_processor_id());
+	trace_processor_idle((ax>>4)+1, smp_processor_id());
 	if (!need_resched()) {
 		if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
 			clflush((void *)&current_thread_info()->flags);
@@ -460,6 +462,7 @@ static void mwait_idle(void)
 {
 	if (!need_resched()) {
 		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+		trace_processor_idle(1, smp_processor_id());
 		if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
 			clflush((void *)&current_thread_info()->flags);
 
@@ -480,11 +483,9 @@ static void mwait_idle(void)
  */
 static void poll_idle(void)
 {
-	trace_power_start(POWER_CSTATE, 0, smp_processor_id());
 	local_irq_enable();
 	while (!need_resched())
 		cpu_relax();
-	trace_power_end(0);
 }
 
 /*
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 3d9ea53..2c3254c 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -142,6 +142,7 @@ void cpu_idle(void)
 			start_critical_timings();
 
 			trace_power_end(smp_processor_id());
+			trace_processor_idle(0, smp_processor_id());
 
 			/* In many cases the interrupt that ended idle
 			   has already called exit_idle. But some idle
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 199dcb9..33bdc41 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -355,6 +355,7 @@ void cpufreq_notify_transition(struct cpufreq_freqs *freqs, unsigned int state)
 		dprintk("FREQ: %lu - CPU: %lu", (unsigned long)freqs->new,
 			(unsigned long)freqs->cpu);
 		trace_power_frequency(POWER_PSTATE, freqs->new, freqs->cpu);
+		trace_processor_frequency(freqs->new, freqs->cpu);
 		srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
 				CPUFREQ_POSTCHANGE, freqs);
 		if (likely(policy) && likely(policy->cpu == freqs->cpu))
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index a507108..f79de04 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -107,6 +107,7 @@ static void cpuidle_idle_call(void)
 	if (cpuidle_curr_governor->reflect)
 		cpuidle_curr_governor->reflect(dev);
 	trace_power_end(smp_processor_id());
+	trace_processor_idle(0, smp_processor_id());
 }
 
 /**
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 21ac077..c78e496 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -202,6 +202,7 @@ static int intel_idle(struct cpuidle_device *dev, struct cpuidle_state *state)
 
 	stop_critical_timings();
 	trace_power_start(POWER_CSTATE, (eax >> 4) + 1, cpu);
+	trace_processor_idle((eax >> 4) + 1, smp_processor_id());
 	if (!need_resched()) {
 
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
diff --git a/include/trace/events/power.h b/include/trace/events/power.h
index 35a2a6e..d5cecd9 100644
--- a/include/trace/events/power.h
+++ b/include/trace/events/power.h
@@ -7,6 +7,60 @@
 #include <linux/ktime.h>
 #include <linux/tracepoint.h>
 
+DECLARE_EVENT_CLASS(processor,
+
+	TP_PROTO(unsigned int state, unsigned int cpu_id),
+
+	TP_ARGS(state, cpu_id),
+
+	TP_STRUCT__entry(
+		__field(	u64,		state		)
+		__field(	u64,		cpu_id		)
+	),
+
+	TP_fast_assign(
+		__entry->state = state;
+		__entry->cpu_id = cpu_id;
+	),
+
+	TP_printk("state=%lu cpu_id=%lu", (unsigned long)__entry->state,
+		  (unsigned long)__entry->cpu_id)
+);
+
+DEFINE_EVENT(processor, processor_idle,
+
+	TP_PROTO(unsigned int state, unsigned int cpu_id),
+
+	     TP_ARGS(state, cpu_id)
+);
+
+DEFINE_EVENT(processor, processor_frequency,
+
+	TP_PROTO(unsigned int frequency, unsigned int cpu_id),
+
+	TP_ARGS(frequency, cpu_id)
+);
+
+TRACE_EVENT(machine_suspend,
+
+	TP_PROTO(unsigned int state),
+
+	TP_ARGS(state),
+
+	TP_STRUCT__entry(
+		__field(	u64,		state		)
+	),
+
+	TP_fast_assign(
+		__entry->state = state;
+	),
+
+	TP_printk("state=%lu", (unsigned long)__entry->state)
+
+);
+
+#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
+
 #ifndef _TRACE_POWER_ENUM_
 #define _TRACE_POWER_ENUM_
 enum {
@@ -69,8 +123,32 @@ TRACE_EVENT(power_end,
 	TP_printk("cpu_id=%lu", (unsigned long)__entry->cpu_id)
 
 );
-
+#endif /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
 #endif /* _TRACE_POWER_H */
 
+/* Deprecated dummy functions must be protected against multi-declartion */
+#ifndef EVENT_POWER_TRACING_DEPRECATED_PART_H
+#define EVENT_POWER_TRACING_DEPRECATED_PART_H
+
+#ifndef CONFIG_EVENT_POWER_TRACING_DEPRECATED
+
+#ifndef _TRACE_POWER_ENUM_
+#define _TRACE_POWER_ENUM_
+enum {
+	POWER_NONE = 0,
+	POWER_CSTATE = 1,
+	POWER_PSTATE = 2,
+};
+#endif
+
+static inline void trace_power_start(u64 type, u64 state, u64 cpuid) {};
+static inline void trace_power_end(u64 cpuid) {};
+static inline void trace_power_frequency(u64 type, u64 state, u64 cpuid) {};
+#endif /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
+
+#endif /* EVENT_POWER_TRACING_DEPRECATED_PART_H */
+
+
+
 /* This part must be outside protection */
 #include <trace/define_trace.h>
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 538501c..0b5c841 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -64,6 +64,20 @@ config EVENT_TRACING
 	select CONTEXT_SWITCH_TRACER
 	bool
 
+config EVENT_POWER_TRACING_DEPRECATED
+	depends on EVENT_TRACING
+	bool
+	help
+	  Provides old power event types:
+	  C-state/idle accounting events:
+	  power:power_start
+	  power:power_end
+	  and old cpufreq accounting event:
+	  power:power_frequency
+	  This is for userspace compatibility
+	  and will vanish after 5 kernel iterations,
+	  namely 2.6.41.
+
 config CONTEXT_SWITCH_TRACER
 	bool
 
diff --git a/kernel/trace/power-traces.c b/kernel/trace/power-traces.c
index 0e0497d..6b6da42 100644
--- a/kernel/trace/power-traces.c
+++ b/kernel/trace/power-traces.c
@@ -13,5 +13,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/power.h>
 
+#ifdef EVENT_POWER_TRACING_DEPRECATED
 EXPORT_TRACEPOINT_SYMBOL_GPL(power_start);
+#endif
+EXPORT_TRACEPOINT_SYMBOL_GPL(processor_idle);
 
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-19 11:36 ` [PATCH 2/3] PERF(kernel): Cleanup " Thomas Renninger
@ 2010-10-25  6:54   ` Arjan van de Ven
  2010-10-25  9:41     ` Thomas Renninger
  2010-10-25  6:58   ` Arjan van de Ven
  2010-10-25 10:04   ` Ingo Molnar
  2 siblings, 1 reply; 29+ messages in thread
From: Arjan van de Ven @ 2010-10-25  6:54 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Masami Hiramatsu,
	Frank Eigler, Steven Rostedt, Kevin Hilman, Peter Zijlstra,
	linux-omap, rjw, linux-pm, linux-trace-users, Jean Pihet,
	Pierre Tardy, Frederic Weisbecker, Tejun Heo, Mathieu Desnoyers,
	Ingo Molnar

On 10/19/2010 4:36 AM, Thomas Renninger wrote:
>   static void poll_idle(void)
>   {
> -	trace_power_start(POWER_CSTATE, 0, smp_processor_id());
>   	local_irq_enable();
>   	while (!need_resched())
>   		cpu_relax();
> -	trace_power_end(0);
>   }

why did you remove the idle tracepoints from this one ???


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-19 11:36 ` [PATCH 2/3] PERF(kernel): Cleanup " Thomas Renninger
  2010-10-25  6:54   ` Arjan van de Ven
@ 2010-10-25  6:58   ` Arjan van de Ven
  2010-10-25 10:04   ` Ingo Molnar
  2 siblings, 0 replies; 29+ messages in thread
From: Arjan van de Ven @ 2010-10-25  6:58 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Masami Hiramatsu,
	Frank Eigler, Steven Rostedt, Kevin Hilman, Peter Zijlstra,
	linux-omap, rjw, linux-pm, linux-trace-users, Jean Pihet,
	Pierre Tardy, Frederic Weisbecker, Tejun Heo, Mathieu Desnoyers,
	Ingo Molnar

On 10/19/2010 4:36 AM, Thomas Renninger wrote:
> New power trace events:
> power:processor_idle
> power:processor_frequency
> power:machine_suspend
>
>
> C-state/idle accounting events:
>    power:power_start
>    power:power_end
> are replaced with:
>    power:processor_idle
>
I think you need two trace points for this
one to enter idle
one to exit

because using magic encoding games to encode "exit"is a mistake; as can 
be seen in this patch.
You're currently trying to use "0" to signal "end of idle", but "0" is 
also a valid idle state (namely that of polling)


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25  6:54   ` Arjan van de Ven
@ 2010-10-25  9:41     ` Thomas Renninger
  2010-10-25 13:55       ` Arjan van de Ven
  0 siblings, 1 reply; 29+ messages in thread
From: Thomas Renninger @ 2010-10-25  9:41 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Masami Hiramatsu,
	Frank Eigler, Steven Rostedt, Kevin Hilman, Peter Zijlstra,
	linux-omap, rjw, linux-pm, linux-trace-users, Jean Pihet,
	Pierre Tardy, Frederic Weisbecker, Tejun Heo, Mathieu Desnoyers,
	Ingo Molnar

On Monday 25 October 2010 08:54:34 Arjan van de Ven wrote:
> On 10/19/2010 4:36 AM, Thomas Renninger wrote:
> >   static void poll_idle(void)
> >   {
> > -	trace_power_start(POWER_CSTATE, 0, smp_processor_id());
> >   	local_irq_enable();
> >   	while (!need_resched())
> >   		cpu_relax();
> > -	trace_power_end(0);
> >   }
> 
> why did you remove the idle tracepoints from this one ???
Because no idle/sleep state is entered here.
State 0 does not exist or say, it means the machine is not idle.
The new event uses idle state 0 spec conform as "exit sleep state".

If this should still be trackable some kind of dummy sleep state:
#define IDLE_BUSY_LOOP 0xFE
(or similar) must get defined and passed like this:
trace_processor_idle(IDLE_BUSY_LOOP, smp_processor_id());
    cpu_relax()
trace_processor_idle(0, smp_processor_id());

I could imagine this is somewhat worth it to compare idle results
to "no idle state at all" is used.
But nobody should ever use idle=poll, comparing deep sleep states
with C1 with (idle=halt) should be sufficient?

    Thomas

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-19 11:36 ` [PATCH 2/3] PERF(kernel): Cleanup " Thomas Renninger
  2010-10-25  6:54   ` Arjan van de Ven
  2010-10-25  6:58   ` Arjan van de Ven
@ 2010-10-25 10:04   ` Ingo Molnar
  2010-10-25 11:03     ` Thomas Renninger
  2 siblings, 1 reply; 29+ messages in thread
From: Ingo Molnar @ 2010-10-25 10:04 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Masami Hiramatsu,
	Frank Eigler, Steven Rostedt, Kevin Hilman, Peter Zijlstra,
	linux-omap, rjw, linux-pm, linux-trace-users, Jean Pihet,
	Pierre Tardy, Frederic Weisbecker, Tejun Heo, Mathieu Desnoyers,
	Arjan van de Ven


* Thomas Renninger <trenn@suse.de> wrote:

> New power trace events:
> power:processor_idle
> power:processor_frequency
> power:machine_suspend
> 
> 
> C-state/idle accounting events:
>   power:power_start
>   power:power_end
> are replaced with:
>   power:processor_idle

Well, most power saving hw models (and the code implementing them) have this kind of 
model:

 enter power saving mode X
 exit power saving mode

Where X is some sort of 'power saving deepness' attribute, right?

 	Ingo

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 10:04   ` Ingo Molnar
@ 2010-10-25 11:03     ` Thomas Renninger
  2010-10-25 11:55       ` Ingo Molnar
  2010-10-25 13:58       ` Arjan van de Ven
  0 siblings, 2 replies; 29+ messages in thread
From: Thomas Renninger @ 2010-10-25 11:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Masami Hiramatsu,
	Frank Eigler, Steven Rostedt, Kevin Hilman, Peter Zijlstra,
	linux-omap, rjw, linux-pm, linux-trace-users, Jean Pihet,
	Pierre Tardy, Frederic Weisbecker, Tejun Heo, Mathieu Desnoyers,
	Arjan van de Ven

On Monday 25 October 2010 12:04:28 Ingo Molnar wrote:
> 
> * Thomas Renninger <trenn@suse.de> wrote:
> 
> > New power trace events:
> > power:processor_idle
> > power:processor_frequency
> > power:machine_suspend
> > 
> > 
> > C-state/idle accounting events:
> >   power:power_start
> >   power:power_end
> > are replaced with:
> >   power:processor_idle
> 
> Well, most power saving hw models (and the code implementing them) have this kind of 
> model:
> 
>  enter power saving mode X
>  exit power saving mode
> 
> Where X is some sort of 'power saving deepness' attribute, right?
Sure.
But ACPI and afaik this model got picked up for PCI and other (sub-)archs
as well, defines state 0 as the non-power saving mode.
Same as done here with machine suspend state (S0 is back from suspend) and
this model should get picked up when device sleep states get tracked at
some time.
It's consistent and applies to some well known specifications.

Also tracking processor_idle_{start,end} as a separate event
makes no sense and there is no need to introduce:
processor_idle_start/processor_idle_end
machine_suspend_start/machine_suspend_end
device_power_mode_start/device_power_mode_end
events.
Using state 0 as "exit/end", is much nicer for kernel/
userspace implementations/code and the user.

     Thomas

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 11:03     ` Thomas Renninger
@ 2010-10-25 11:55       ` Ingo Molnar
  2010-10-25 12:55         ` Thomas Renninger
  2010-10-25 12:58         ` Mathieu Desnoyers
  2010-10-25 13:58       ` Arjan van de Ven
  1 sibling, 2 replies; 29+ messages in thread
From: Ingo Molnar @ 2010-10-25 11:55 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Masami Hiramatsu,
	Frank Eigler, Steven Rostedt, Kevin Hilman, Peter Zijlstra,
	linux-omap, rjw, linux-pm, linux-trace-users, Jean Pihet,
	Pierre Tardy, Frederic Weisbecker, Tejun Heo, Mathieu Desnoyers,
	Arjan van de Ven


* Thomas Renninger <trenn@suse.de> wrote:

> On Monday 25 October 2010 12:04:28 Ingo Molnar wrote:
> > 
> > * Thomas Renninger <trenn@suse.de> wrote:
> > 
> > > New power trace events:
> > > power:processor_idle
> > > power:processor_frequency
> > > power:machine_suspend
> > > 
> > > 
> > > C-state/idle accounting events:
> > >   power:power_start
> > >   power:power_end
> > > are replaced with:
> > >   power:processor_idle
> > 
> > Well, most power saving hw models (and the code implementing them) have this kind of 
> > model:
> > 
> >  enter power saving mode X
> >  exit power saving mode
> > 
> > Where X is some sort of 'power saving deepness' attribute, right?
>
> Sure.

Which is is the 'saner' model?

> But ACPI and afaik this model got picked up for PCI and other (sub-)archs as well, 
> defines state 0 as the non-power saving mode.

But the actual code does not actually deal with any 'state 0', does it? It enters an 
idle function and then exits it, right?

'power state' might be what is used for devices - but even there, we have:

  - enter power state X
  - exit power state

right?

> Same as done here with machine suspend state (S0 is back from suspend) and
> this model should get picked up when device sleep states get tracked at
> some time.
>
> It's consistent and applies to some well known specifications.

What we want it to be is for it to be the nicest, most understandable, most logical 
model - not one matching random hardware specifications.

( Hardware specifications only matter in so far that it should be possible to 
  express all the known hardware state transitions via these events efficiently. )

> Also tracking processor_idle_{start,end} as a separate event makes no sense and 
> there is no need to introduce: processor_idle_start/processor_idle_end 
> machine_suspend_start/machine_suspend_end 
> device_power_mode_start/device_power_mode_end events.

What do you mean by "makes no sense"?

Are they superfluous? Inefficient? Illogical?

> Using state 0 as "exit/end", is much nicer for kernel/ userspace 
> implementations/code and the user.

By that argument we should not have separate fork() and exit() syscalls either, but 
a set_process_state(1) and set_process_state(0) interface?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 11:55       ` Ingo Molnar
@ 2010-10-25 12:55         ` Thomas Renninger
  2010-10-25 14:11           ` Arjan van de Ven
  2010-10-25 12:58         ` Mathieu Desnoyers
  1 sibling, 1 reply; 29+ messages in thread
From: Thomas Renninger @ 2010-10-25 12:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Masami Hiramatsu,
	Frank Eigler, Steven Rostedt, Kevin Hilman, Peter Zijlstra,
	linux-omap, rjw, linux-pm, linux-trace-users, Jean Pihet,
	Pierre Tardy, Frederic Weisbecker, Tejun Heo, Mathieu Desnoyers,
	Arjan van de Ven

On Monday 25 October 2010 13:55:25 Ingo Molnar wrote:
> 
> * Thomas Renninger <trenn@suse.de> wrote:
> 
> > On Monday 25 October 2010 12:04:28 Ingo Molnar wrote:
> > > 
> > > * Thomas Renninger <trenn@suse.de> wrote:
> > > 
> > > > New power trace events:
> > > > power:processor_idle
> > > > power:processor_frequency
> > > > power:machine_suspend
> > > > 
> > > > 
> > > > C-state/idle accounting events:
> > > >   power:power_start
> > > >   power:power_end
> > > > are replaced with:
> > > >   power:processor_idle
> > > 
> > > Well, most power saving hw models (and the code implementing them) have this kind of 
> > > model:
> > > 
> > >  enter power saving mode X
> > >  exit power saving mode
> > > 
> > > Where X is some sort of 'power saving deepness' attribute, right?
> >
> > Sure.
> 
> Which is is the 'saner' model?
> 
> > But ACPI and afaik this model got picked up for PCI and other (sub-)archs as well, 
> > defines state 0 as the non-power saving mode.
> 
> But the actual code does not actually deal with any 'state 0', does it?
It does. Not being idle is tracked by cpuidle driver as state 0
(arch independent):
/sys/devices/system/cpu/cpu0/cpuidle/state0/
halt/C1 on X86 is:
/sys/devices/system/cpu/cpu0/cpuidle/state1/
...

> It enters an idle function and then exits it, right?
> 'power state' might be what is used for devices - but even there, we have:
> 
>   - enter power state X
>   - exit power state
> 
> right?
That is not true for PCI, probably others as well.
There you have D0 (being the maximum powered state) up to D3.
Same for PCI Bus Power States (B0, B1, B2, and B3).

Look at drivers/pci/pci.c:pci_raw_set_power_state()
To "exit" a power state you call:
pci_raw_set_power_state(dev, PCI_D0);

Same for suspend. "Exit" suspend is:
#define PM_SUSPEND_ON           ((__force suspend_state_t) 0)
so on resume we enter suspend_state_t 0.

> > Same as done here with machine suspend state (S0 is back from suspend) and
> > this model should get picked up when device sleep states get tracked at
> > some time.
> >
> > It's consistent and applies to some well known specifications.
> 
> What we want it to be is for it to be the nicest, most understandable, most logical 
> model - not one matching random hardware specifications.
> 
> ( Hardware specifications only matter in so far that it should be possible to 
>   express all the known hardware state transitions via these events efficiently. )
> 
> > Also tracking processor_idle_{start,end} as a separate event makes no sense and 
> > there is no need to introduce: processor_idle_start/processor_idle_end 
> > machine_suspend_start/machine_suspend_end 
> > device_power_mode_start/device_power_mode_end events.
> 
> What do you mean by "makes no sense"?
> 
> Are they superfluous?
Yes, you do not need two different events to track one thing.

> Illogical?
Yes, A user who wants to enable processor idle tracking does
want to enable it via:
echo power:processor_idle >/sys/kernel/debug/tracing/events/enable
what do you intend to track with a:
power:power_start
event?

    Thomas

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 11:55       ` Ingo Molnar
  2010-10-25 12:55         ` Thomas Renninger
@ 2010-10-25 12:58         ` Mathieu Desnoyers
  2010-10-25 20:29           ` Rafael J. Wysocki
  1 sibling, 1 reply; 29+ messages in thread
From: Mathieu Desnoyers @ 2010-10-25 12:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Renninger, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Masami Hiramatsu, Frank Eigler, Steven Rostedt, Kevin Hilman,
	Peter Zijlstra, linux-omap, rjw, linux-pm, linux-trace-users,
	Jean Pihet, Pierre Tardy, Frederic Weisbecker, Tejun Heo,
	Arjan van de Ven

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Thomas Renninger <trenn@suse.de> wrote:
> 
> > On Monday 25 October 2010 12:04:28 Ingo Molnar wrote:
> > > 
> > > * Thomas Renninger <trenn@suse.de> wrote:
> > > 
> > > > New power trace events:
> > > > power:processor_idle
> > > > power:processor_frequency
> > > > power:machine_suspend
> > > > 
> > > > 
> > > > C-state/idle accounting events:
> > > >   power:power_start
> > > >   power:power_end
> > > > are replaced with:
> > > >   power:processor_idle
> > > 
> > > Well, most power saving hw models (and the code implementing them) have this kind of 
> > > model:
> > > 
> > >  enter power saving mode X
> > >  exit power saving mode
> > > 
> > > Where X is some sort of 'power saving deepness' attribute, right?
> >
> > Sure.
> 
> Which is is the 'saner' model?
> 
> > But ACPI and afaik this model got picked up for PCI and other (sub-)archs as well, 
> > defines state 0 as the non-power saving mode.
> 
> But the actual code does not actually deal with any 'state 0', does it? It enters an 
> idle function and then exits it, right?
> 
> 'power state' might be what is used for devices - but even there, we have:
> 
>   - enter power state X
>   - exit power state
> 
> right?
> 
> > Same as done here with machine suspend state (S0 is back from suspend) and
> > this model should get picked up when device sleep states get tracked at
> > some time.
> >
> > It's consistent and applies to some well known specifications.
> 
> What we want it to be is for it to be the nicest, most understandable, most logical 
> model - not one matching random hardware specifications.
> 
> ( Hardware specifications only matter in so far that it should be possible to 
>   express all the known hardware state transitions via these events efficiently. )
> 
> > Also tracking processor_idle_{start,end} as a separate event makes no sense and 
> > there is no need to introduce: processor_idle_start/processor_idle_end 
> > machine_suspend_start/machine_suspend_end 
> > device_power_mode_start/device_power_mode_end events.
> 
> What do you mean by "makes no sense"?
> 
> Are they superfluous? Inefficient? Illogical?

I think it would require deep understanding of specific power modes of each
architecture to split into this topology. On the bright side, it would bring
clear understanding of which HW resource is being put to sleep, which would make
automated analysis much easier to do. But maybe it's too much pain compared to
the benefit. The related question is also: where is it best to put this logic ?
In the kernel code ? In per-arch TRACE_EVENT() handlers or in external trace
analysis plugins ?

> 
> > Using state 0 as "exit/end", is much nicer for kernel/ userspace 
> > implementations/code and the user.
> 
> By that argument we should not have separate fork() and exit() syscalls either, but 
> a set_process_state(1) and set_process_state(0) interface?

I'm by no mean expert on power saving hardware specs, but if it is possible for
hardware to switch between two power saving states without passing through power
state 0, then using a "set state" rather than an enter/exit would be more
appropriate; even if we go for a scheme introducing

processor_idle_start/processor_idle_end,
machine_suspend_start/machine_suspend_end,
device_power_mode_start/device_power_mode_end.

I must defer to you guys to figure out if some hardware actually do that for
either of CPU idle, suspend or device power modes.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25  9:41     ` Thomas Renninger
@ 2010-10-25 13:55       ` Arjan van de Ven
  2010-10-25 14:36         ` Thomas Renninger
  0 siblings, 1 reply; 29+ messages in thread
From: Arjan van de Ven @ 2010-10-25 13:55 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Masami Hiramatsu,
	Frank Eigler, Steven Rostedt, Kevin Hilman, Peter Zijlstra,
	linux-omap, rjw, linux-pm, linux-trace-users, Jean Pihet,
	Pierre Tardy, Frederic Weisbecker, Tejun Heo, Mathieu Desnoyers,
	Ingo Molnar

On 10/25/2010 2:41 AM, Thomas Renninger wrote:
> On Monday 25 October 2010 08:54:34 Arjan van de Ven wrote:
>> On 10/19/2010 4:36 AM, Thomas Renninger wrote:
>>>    static void poll_idle(void)
>>>    {
>>> -	trace_power_start(POWER_CSTATE, 0, smp_processor_id());
>>>    	local_irq_enable();
>>>    	while (!need_resched())
>>>    		cpu_relax();
>>> -	trace_power_end(0);
>>>    }
>> why did you remove the idle tracepoints from this one ???
> Because no idle/sleep state is entered here.
> State 0 does not exist or say, it means the machine is not idle.
> The new event uses idle state 0 spec conform as "exit sleep state".
>
> If this should still be trackable some kind of dummy sleep state:
> #define IDLE_BUSY_LOOP 0xFE
> (or similar) must get defined and passed like this:
> trace_processor_idle(IDLE_BUSY_LOOP, smp_processor_id());
>      cpu_relax()
> trace_processor_idle(0, smp_processor_id());
>
> I could imagine this is somewhat worth it to compare idle results
> to "no idle state at all" is used.
> But nobody should ever use idle=poll, comparing deep sleep states
> with C1 with (idle=halt) should be sufficient?

this is not idle=poll on the command line only.
this also gets used normally, in two cases
1) during real time operations, for some short periods of time
     (think wallstreet trading)
2) by the menu governor when the next event is less than a few 
microseconds away, so short that even C1 is too much

I know that your new API tries to use "0" as exit, but 0 is already 
taken (in all power terminology at least on x86 it is) for this.

why isn't your "exit" a special define?


also, if you look at many other similar perf events, they ever separate 
entry/exit points:

process/do_process.cpp:         
perf_events->add_event("irq:irq_handler_entry");
process/do_process.cpp:         
perf_events->add_event("irq:irq_handler_exit");
process/do_process.cpp:         perf_events->add_event("irq:softirq_entry");
process/do_process.cpp:         perf_events->add_event("irq:softirq_exit");
process/do_process.cpp:         
perf_events->add_event("timer:timer_expire_entry");
process/do_process.cpp:         
perf_events->add_event("timer:timer_expire_exit");
process/do_process.cpp:         
perf_events->add_event("timer:hrtimer_expire_entry");
process/do_process.cpp:         
perf_events->add_event("timer:hrtimer_expire_exit");
process/do_process.cpp:         perf_events->add_event("power:power_start");
process/do_process.cpp:         perf_events->add_event("power:power_end");
process/do_process.cpp:         
perf_events->add_event("workqueue:workqueue_execute_start");
process/do_process.cpp:         
perf_events->add_event("workqueue:workqueue_execute_end");

so there is already an API consistency precedent
(and frankly, trying to multiplex in "exit" via a magic value is asking 
for trouble API wise)


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 11:03     ` Thomas Renninger
  2010-10-25 11:55       ` Ingo Molnar
@ 2010-10-25 13:58       ` Arjan van de Ven
  2010-10-25 20:33         ` Rafael J. Wysocki
  1 sibling, 1 reply; 29+ messages in thread
From: Arjan van de Ven @ 2010-10-25 13:58 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Masami Hiramatsu, Frank Eigler, Steven Rostedt, Kevin Hilman,
	Peter Zijlstra, linux-omap, rjw, linux-pm, linux-trace-users,
	Jean Pihet, Pierre Tardy, Frederic Weisbecker, Tejun Heo,
	Mathieu Desnoyers

On 10/25/2010 4:03 AM, Thomas Renninger wrote:
> On Monday 25 October 2010 12:04:28 Ingo Molnar wrote:
>> * Thomas Renninger<trenn@suse.de>  wrote:
>>
>>> New power trace events:
>>> power:processor_idle
>>> power:processor_frequency
>>> power:machine_suspend
>>>
>>>
>>> C-state/idle accounting events:
>>>    power:power_start
>>>    power:power_end
>>> are replaced with:
>>>    power:processor_idle
>> Well, most power saving hw models (and the code implementing them) have this kind of
>> model:
>>
>>   enter power saving mode X
>>   exit power saving mode
>>
>> Where X is some sort of 'power saving deepness' attribute, right?
> Sure.
> But ACPI and afaik this model got picked up for PCI and other (sub-)archs
> as well, defines state 0 as the non-power saving mode.

correct ,... "C0" is not power efficient... but it's still a valid OS 
idle state!
Also tracking processor_idle_{start,end} as a separate event!

same for "S0"... S0 as standby state is still valid... sure it doesn't 
save you much power... but that does not mean it's not valid.
(as indication, the Intel Moorestown platform, which is currently in 
production and available to OEMs, has such a S0 standby state)


> makes no sense and there is no need to introduce:
> processor_idle_start/processor_idle_end
> machine_suspend_start/machine_suspend_end
> device_power_mode_start/device_power_mode_end
> events.
> Using state 0 as "exit/end", is much nicer for kernel/
> userspace implementations/code and the user.
actually no; having written a few of these in userspace so far, having a 
separate end event is easier to deal with;
the actions you take on entry and exit are complete separate code paths.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 12:55         ` Thomas Renninger
@ 2010-10-25 14:11           ` Arjan van de Ven
  2010-10-25 14:51             ` Thomas Renninger
  0 siblings, 1 reply; 29+ messages in thread
From: Arjan van de Ven @ 2010-10-25 14:11 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Masami Hiramatsu, Frank Eigler, Steven Rostedt, Kevin Hilman,
	Peter Zijlstra, linux-omap, rjw, linux-pm, linux-trace-users,
	Jean Pihet, Pierre Tardy, Frederic Weisbecker, Tejun Heo,
	Mathieu Desnoyers

On 10/25/2010 5:55 AM, Thomas Renninger wrote:


>> But the actual code does not actually deal with any 'state 0', does it?
> It does. Not being idle is tracked by cpuidle driver as state 0
> (arch independent):
> /sys/devices/system/cpu/cpu0/cpuidle/state0/
> halt/C1 on X86 is:
> /sys/devices/system/cpu/cpu0/cpuidle/state1/
> ...
state0 is still OS idle!


the API is just weird for this, from a userspace perspective

if the kernel picks this state 0 for the idle handler, the userspace app 
gets
two events

one for going to state 0 to enter the idle state
one for going to state 0 to exit idle

but they're the exact same event in your API.

rather unpleasant from a userspace program perspective....
now I need to start tracking even more state on top in powertop to be 
able to make a guess at which of the two meanings a state 0 entry has.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 13:55       ` Arjan van de Ven
@ 2010-10-25 14:36         ` Thomas Renninger
  2010-10-25 14:45           ` Arjan van de Ven
  0 siblings, 1 reply; 29+ messages in thread
From: Thomas Renninger @ 2010-10-25 14:36 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Masami Hiramatsu,
	Frank Eigler, Steven Rostedt, Kevin Hilman, Peter Zijlstra,
	linux-omap, rjw, linux-pm, linux-trace-users, Jean Pihet,
	Pierre Tardy, Frederic Weisbecker, Tejun Heo, Mathieu Desnoyers,
	Ingo Molnar

On Monday 25 October 2010 15:55:08 Arjan van de Ven wrote:
> On 10/25/2010 2:41 AM, Thomas Renninger wrote:
> > On Monday 25 October 2010 08:54:34 Arjan van de Ven wrote:
> >> On 10/19/2010 4:36 AM, Thomas Renninger wrote:
> >>>    static void poll_idle(void)
> >>>    {
> >>> -	trace_power_start(POWER_CSTATE, 0, smp_processor_id());
> >>>    	local_irq_enable();
> >>>    	while (!need_resched())
> >>>    		cpu_relax();
> >>> -	trace_power_end(0);
> >>>    }
> >> why did you remove the idle tracepoints from this one ???
> > Because no idle/sleep state is entered here.
> > State 0 does not exist or say, it means the machine is not idle.
> > The new event uses idle state 0 spec conform as "exit sleep state".
> >
> > If this should still be trackable some kind of dummy sleep state:
> > #define IDLE_BUSY_LOOP 0xFE
> > (or similar) must get defined and passed like this:
> > trace_processor_idle(IDLE_BUSY_LOOP, smp_processor_id());
> >      cpu_relax()
> > trace_processor_idle(0, smp_processor_id());
> >
> > I could imagine this is somewhat worth it to compare idle results
> > to "no idle state at all" is used.
> > But nobody should ever use idle=poll, comparing deep sleep states
> > with C1 with (idle=halt) should be sufficient?
> 
> this is not idle=poll on the command line only.
> this also gets used normally, in two cases
> 1) during real time operations, for some short periods of time
>      (think wallstreet trading)
> 2) by the menu governor when the next event is less than a few 
> microseconds away, so short that even C1 is too much
> 
> I know that your new API tries to use "0" as exit, but 0 is already 
> taken (in all power terminology at least on x86 it is) for this.
cpuidle indeed misuses C0 as "poll idle" state.
That's really bad/misleading, but nothing that can be changed easily.

I agree shifting C0 (cpuidle) <-> POLL_IDLE event
and              "not idle"   <-> real C0 (executing instructions)
or however this gets mapped makes things even worse.

Damn, it could be that easy and straight forward, but I agree that
this kills the approach to trigger state 0 event if C0 is entered
(C0 as defined as operational mode executing instructions).

     Thomas

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 14:36         ` Thomas Renninger
@ 2010-10-25 14:45           ` Arjan van de Ven
  2010-10-25 14:56             ` Ingo Molnar
  0 siblings, 1 reply; 29+ messages in thread
From: Arjan van de Ven @ 2010-10-25 14:45 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Linus Torvalds, Andrew Morton, Thomas Gleixner, Masami Hiramatsu,
	Frank Eigler, Steven Rostedt, Kevin Hilman, Peter Zijlstra,
	linux-omap, rjw, linux-pm, linux-trace-users, Jean Pihet,
	Pierre Tardy, Frederic Weisbecker, Tejun Heo, Mathieu Desnoyers,
	Ingo Molnar

On 10/25/2010 7:36 AM, Thomas Renninger wrote:
>> I know that your new API tries to use "0" as exit, but 0 is already
>> taken (in all power terminology at least on x86 it is) for this.
> cpuidle indeed misuses C0 as "poll idle" state.
> That's really bad/misleading, but nothing that can be changed easily.
>
> I agree shifting C0 (cpuidle)<->  POLL_IDLE event
> and              "not idle"<->  real C0 (executing instructions)
> or however this gets mapped makes things even worse.
>
> Damn, it could be that easy and straight forward, but I agree that
> this kills the approach to trigger state 0 event if C0 is entered
> (C0 as defined as operational mode executing instructions).

ok so we have

"C0 idle"
and
"C0 no longer idle"

I'd propose using the number 0 for the first one (it makes the most 
logical sense, it's the least deep idle state etc etc)

we could use "-1" or "INT_MAX" for the later

but as a user of the API I rather like a separate "we're no longer idle" 
event... but if not, as long as things aren't ambigious I'll find a way 
to code around it.
basically with a separate event, I demultiplex based on event number 
between entry and exit.... with a special exit value I would just need a 
double demultiplex,
one on "idle" and then a second one on the state number to split between 
entry/exit.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 14:11           ` Arjan van de Ven
@ 2010-10-25 14:51             ` Thomas Renninger
  0 siblings, 0 replies; 29+ messages in thread
From: Thomas Renninger @ 2010-10-25 14:51 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Masami Hiramatsu, Frank Eigler, Steven Rostedt, Kevin Hilman,
	Peter Zijlstra, linux-omap, rjw, linux-pm, linux-trace-users,
	Jean Pihet, Pierre Tardy, Frederic Weisbecker, Tejun Heo,
	Mathieu Desnoyers

On Monday 25 October 2010 16:11:10 Arjan van de Ven wrote:
> On 10/25/2010 5:55 AM, Thomas Renninger wrote:
> 
> 
> >> But the actual code does not actually deal with any 'state 0', does it?
> > It does. Not being idle is tracked by cpuidle driver as state 0
> > (arch independent):
> > /sys/devices/system/cpu/cpu0/cpuidle/state0/
> > halt/C1 on X86 is:
> > /sys/devices/system/cpu/cpu0/cpuidle/state1/
> > ...
> state0 is still OS idle!
Yes, I just realized that.
Which is very unfortunate.
The whole cpuidle stuff is based on ACPI C-states and
cat /sys/devices/system/cpu/cpu0/cpuidle/state0/name
C0
is plain wrong if it's used as "poll idle" time.
C0 is defined as (in the ACPI spec):
----------
2.5 Processor Power State Definitions
C0 Processor Power State
While the processor is in this state, it executes instructions.
----------

> the API is just weird for this, from a userspace perspective
> 
> if the kernel picks this state 0 for the idle handler, the userspace app 
> gets
> two events
> 
> one for going to state 0 to enter the idle state
> one for going to state 0 to exit idle
> 
> but they're the exact same event in your API.
> 
> rather unpleasant from a userspace program perspective....
Yeah. But the re-definition of C0 being "Linux poll idle"
will confuse users as well. Not sure whether this should get
touched, though.

Thanks for clarification, I wasn't aware of that...

    Thomas

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 14:45           ` Arjan van de Ven
@ 2010-10-25 14:56             ` Ingo Molnar
  2010-10-25 15:48               ` Thomas Renninger
  0 siblings, 1 reply; 29+ messages in thread
From: Ingo Molnar @ 2010-10-25 14:56 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Thomas Renninger, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Masami Hiramatsu, Frank Eigler, Steven Rostedt, Kevin Hilman,
	Peter Zijlstra, linux-omap, rjw, linux-pm, linux-trace-users,
	Jean Pihet, Pierre Tardy, Frederic Weisbecker, Tejun Heo,
	Mathieu Desnoyers


* Arjan van de Ven <arjan@linux.intel.com> wrote:

> On 10/25/2010 7:36 AM, Thomas Renninger wrote:
> >>I know that your new API tries to use "0" as exit, but 0 is already
> >>taken (in all power terminology at least on x86 it is) for this.
> >cpuidle indeed misuses C0 as "poll idle" state.
> >That's really bad/misleading, but nothing that can be changed easily.
> >
> >I agree shifting C0 (cpuidle)<->  POLL_IDLE event
> >and              "not idle"<->  real C0 (executing instructions)
> >or however this gets mapped makes things even worse.
> >
> >Damn, it could be that easy and straight forward, but I agree that
> >this kills the approach to trigger state 0 event if C0 is entered
> >(C0 as defined as operational mode executing instructions).
> 
> ok so we have
> 
> "C0 idle"
> and
> "C0 no longer idle"
> 
> I'd propose using the number 0 for the first one (it makes the most
> logical sense, it's the least deep idle state etc etc)
> 
> we could use "-1" or "INT_MAX" for the later
> 
> but as a user of the API I rather like a separate "we're no longer idle" event... 
> but if not, as long as things aren't ambigious I'll find a way to code around it.
>
> basically with a separate event, I demultiplex based on event number between entry 
> and exit.... with a special exit value I would just need a double demultiplex,

Hm, does not sound particularly smart.

> one on "idle" and then a second one on the state number to split between 
> entry/exit.

The thing is, in terms of CPU idle state, if the old tracepoints give us all the 
information that the new tracepoints, why dont we simply add the tracepoints to ARM 
and be done with it? No app needs to be changed in that case, etc.

Plus, lets express the suspend/resume tracepoints as suspend_enter(X)/suspend_exit() 
events as well, to keep it symmetric and consistent with the other enter/exit 
events.

The rename alone isnt a strong enough reason really. 'entering idle state X' and 
'exiting idle' is pretty much synonymous to 'enter idle state X'.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 14:56             ` Ingo Molnar
@ 2010-10-25 15:48               ` Thomas Renninger
  2010-10-25 16:00                 ` Arjan van de Ven
  0 siblings, 1 reply; 29+ messages in thread
From: Thomas Renninger @ 2010-10-25 15:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Masami Hiramatsu, Frank Eigler, Steven Rostedt, Kevin Hilman,
	Peter Zijlstra, linux-omap, rjw, linux-pm, linux-trace-users,
	Jean Pihet, Pierre Tardy, Frederic Weisbecker, Tejun Heo,
	Mathieu Desnoyers

On Monday 25 October 2010 16:56:04 Ingo Molnar wrote:
> 
> * Arjan van de Ven <arjan@linux.intel.com> wrote:
> > On 10/25/2010 7:36 AM, Thomas Renninger wrote:
> > ok so we have
> > 
> > "C0 idle"
Ideally this should not be called C0, but expressed
as (#define) POLL_IDLE wherever possible.

In all documentations/specs/white papers about other OSes
C0 is refered to as not being idle.
Linux mis-uses it as a self-defined idle state which
is really confusing.

> > and
> > "C0 no longer idle"
> > 
> > I'd propose using the number 0 for the first one (it makes the most
> > logical sense, it's the least deep idle state etc etc)
I would use a special number for the "Linux only" state.

> > we could use "-1" or "INT_MAX" for the later
> > but as a user of the API I rather like a separate "we're no longer idle" event... 
> > but if not, as long as things aren't ambigious I'll find a way to code around it.
> >
> > basically with a separate event, I demultiplex based on event number between entry 
> > and exit.... with a special exit value I would just need a double demultiplex,
> 
> Hm, does not sound particularly smart.
> 
> > one on "idle" and then a second one on the state number to split between 
> > entry/exit.
> 
> The thing is, in terms of CPU idle state, if the old tracepoints give us all the 
> information that the new tracepoints, why dont we simply add the tracepoints to ARM 
> and be done with it? No app needs to be changed in that case, etc.
> 
> Plus, lets express the suspend/resume tracepoints as suspend_enter(X)/suspend_exit() 
> events as well, to keep it symmetric and consistent with the other enter/exit 
> events.
> 
> The rename alone isnt a strong enough reason really. 'entering idle state X' and 
> 'exiting idle' is pretty much synonymous to 'enter idle state X'.
It's not only that, my patch also:
  - eleminates the never ever used type= field
  - uses a better name, currently it's power:power_{start,end}
    How would you name another power event...

Altogether, it should justify the proposed cleanup(s).
But with this C0 clash, I am not sure whether:
  1) as Ingo said any clean up
  2) a minimal cleanup:
       - rename power:power_{start,end} to power:processor_idle{start,end}
       - get rid of type= field
  3) or a maximum cleanup:
       - plus not use start/end events, but use one state transition
         event.
should be done.
I think best is Jean goes with current definitions.
2. is far less intrusive and if you like to have it, I can
still send another patch.

     Thomas

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 15:48               ` Thomas Renninger
@ 2010-10-25 16:00                 ` Arjan van de Ven
  2010-10-25 23:32                   ` Thomas Renninger
  0 siblings, 1 reply; 29+ messages in thread
From: Arjan van de Ven @ 2010-10-25 16:00 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Masami Hiramatsu, Frank Eigler, Steven Rostedt, Kevin Hilman,
	Peter Zijlstra, linux-omap, rjw, linux-pm, linux-trace-users,
	Jean Pihet, Pierre Tardy, Frederic Weisbecker, Tejun Heo,
	Mathieu Desnoyers

On 10/25/2010 8:48 AM, Thomas Renninger wrote:
> On Monday 25 October 2010 16:56:04 Ingo Molnar wrote:
>> * Arjan van de Ven<arjan@linux.intel.com>  wrote:
>>> On 10/25/2010 7:36 AM, Thomas Renninger wrote:
>>> ok so we have
>>>
>>> "C0 idle"
> Ideally this should not be called C0, but expressed
> as (#define) POLL_IDLE wherever possible.
>
> In all documentations/specs/white papers about other OSes
> C0 is refered to as not being idle.
> Linux mis-uses it as a self-defined idle state which
> is really confusing.

sure naming is one thing
>>> and
>>> "C0 no longer idle"
>>>
>>> I'd propose using the number 0 for the first one (it makes the most
>>> logical sense, it's the least deep idle state etc etc)
> I would use a special number for the "Linux only" state.

that special number is 0 though..
it makes sense in ordering, 0 < 1, 1 < 2 etc



0 makes for a really bad special number for the exit marker; not just here,
but also for your suspend hook, that one definitely needs to change
(since current commercially available SOCs already reuse 0 for this for 
standby level states)



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 12:58         ` Mathieu Desnoyers
@ 2010-10-25 20:29           ` Rafael J. Wysocki
  0 siblings, 0 replies; 29+ messages in thread
From: Rafael J. Wysocki @ 2010-10-25 20:29 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Ingo Molnar, Thomas Renninger, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Masami Hiramatsu, Frank Eigler, Steven Rostedt,
	Kevin Hilman, Peter Zijlstra, linux-omap, linux-pm,
	linux-trace-users, Jean Pihet, Pierre Tardy, Frederic Weisbecker,
	Tejun Heo, Arjan van de Ven

On Monday, October 25, 2010, Mathieu Desnoyers wrote:
> * Ingo Molnar (mingo@elte.hu) wrote:
> > 
> > * Thomas Renninger <trenn@suse.de> wrote:
> > 
> > > On Monday 25 October 2010 12:04:28 Ingo Molnar wrote:
> > > > 
> > > > * Thomas Renninger <trenn@suse.de> wrote:
> > > > 
> > > > > New power trace events:
> > > > > power:processor_idle
> > > > > power:processor_frequency
> > > > > power:machine_suspend
> > > > > 
> > > > > 
> > > > > C-state/idle accounting events:
> > > > >   power:power_start
> > > > >   power:power_end
> > > > > are replaced with:
> > > > >   power:processor_idle
> > > > 
> > > > Well, most power saving hw models (and the code implementing them) have this kind of 
> > > > model:
> > > > 
> > > >  enter power saving mode X
> > > >  exit power saving mode
> > > > 
> > > > Where X is some sort of 'power saving deepness' attribute, right?
> > >
> > > Sure.
> > 
> > Which is is the 'saner' model?
> > 
> > > But ACPI and afaik this model got picked up for PCI and other (sub-)archs as well, 
> > > defines state 0 as the non-power saving mode.
> > 
> > But the actual code does not actually deal with any 'state 0', does it? It enters an 
> > idle function and then exits it, right?
> > 
> > 'power state' might be what is used for devices - but even there, we have:
> > 
> >   - enter power state X
> >   - exit power state
> > 
> > right?
> > 
> > > Same as done here with machine suspend state (S0 is back from suspend) and
> > > this model should get picked up when device sleep states get tracked at
> > > some time.
> > >
> > > It's consistent and applies to some well known specifications.
> > 
> > What we want it to be is for it to be the nicest, most understandable, most logical 
> > model - not one matching random hardware specifications.
> > 
> > ( Hardware specifications only matter in so far that it should be possible to 
> >   express all the known hardware state transitions via these events efficiently. )
> > 
> > > Also tracking processor_idle_{start,end} as a separate event makes no sense and 
> > > there is no need to introduce: processor_idle_start/processor_idle_end 
> > > machine_suspend_start/machine_suspend_end 
> > > device_power_mode_start/device_power_mode_end events.
> > 
> > What do you mean by "makes no sense"?
> > 
> > Are they superfluous? Inefficient? Illogical?
> 
> I think it would require deep understanding of specific power modes of each
> architecture to split into this topology. On the bright side, it would bring
> clear understanding of which HW resource is being put to sleep, which would make
> automated analysis much easier to do. But maybe it's too much pain compared to
> the benefit. The related question is also: where is it best to put this logic ?
> In the kernel code ? In per-arch TRACE_EVENT() handlers or in external trace
> analysis plugins ?
> 
> > 
> > > Using state 0 as "exit/end", is much nicer for kernel/ userspace 
> > > implementations/code and the user.
> > 
> > By that argument we should not have separate fork() and exit() syscalls either, but 
> > a set_process_state(1) and set_process_state(0) interface?
> 
> I'm by no mean expert on power saving hardware specs, but if it is possible for
> hardware to switch between two power saving states without passing through power
> state 0, then using a "set state" rather than an enter/exit would be more
> appropriate; even if we go for a scheme introducing
> 
> processor_idle_start/processor_idle_end,
> machine_suspend_start/machine_suspend_end,
> device_power_mode_start/device_power_mode_end.
> 
> I must defer to you guys to figure out if some hardware actually do that for
> either of CPU idle, suspend or device power modes.

Yes, you can go directly from PCI_D1 to PCI_D2, for one example.

Apart from this, attempting to put system suspend to the same bag as cpuidle
is not going to work in the long run.  They are _fundamentally_ different things
event though the power state we get into as a result of suspend is approximately
the same as we can get into via cpuidle (even in that case the energy savings
will generally be different in both cases due to wakeup events).

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 13:58       ` Arjan van de Ven
@ 2010-10-25 20:33         ` Rafael J. Wysocki
  0 siblings, 0 replies; 29+ messages in thread
From: Rafael J. Wysocki @ 2010-10-25 20:33 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Thomas Renninger, Ingo Molnar, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Masami Hiramatsu, Frank Eigler, Steven Rostedt,
	Kevin Hilman, Peter Zijlstra, linux-omap, linux-pm,
	linux-trace-users, Jean Pihet, Pierre Tardy, Frederic Weisbecker,
	Tejun Heo, Mathieu Desnoyers

On Monday, October 25, 2010, Arjan van de Ven wrote:
> On 10/25/2010 4:03 AM, Thomas Renninger wrote:
> > On Monday 25 October 2010 12:04:28 Ingo Molnar wrote:
> >> * Thomas Renninger<trenn@suse.de>  wrote:
> >>
> >>> New power trace events:
> >>> power:processor_idle
> >>> power:processor_frequency
> >>> power:machine_suspend
> >>>
> >>>
> >>> C-state/idle accounting events:
> >>>    power:power_start
> >>>    power:power_end
> >>> are replaced with:
> >>>    power:processor_idle
> >> Well, most power saving hw models (and the code implementing them) have this kind of
> >> model:
> >>
> >>   enter power saving mode X
> >>   exit power saving mode
> >>
> >> Where X is some sort of 'power saving deepness' attribute, right?
> > Sure.
> > But ACPI and afaik this model got picked up for PCI and other (sub-)archs
> > as well, defines state 0 as the non-power saving mode.
> 
> correct ,... "C0" is not power efficient... but it's still a valid OS 
> idle state!
> Also tracking processor_idle_{start,end} as a separate event!
> 
> same for "S0"... S0 as standby state is still valid... sure it doesn't 
> save you much power... but that does not mean it's not valid.

If you mean ACPI S0, it is not a standby state.  It actually is the full-power
state.

> (as indication, the Intel Moorestown platform, which is currently in 
> production and available to OEMs, has such a S0 standby state)

Another naming confusion.  How smart.

> > makes no sense and there is no need to introduce:
> > processor_idle_start/processor_idle_end
> > machine_suspend_start/machine_suspend_end
> > device_power_mode_start/device_power_mode_end
> > events.
> > Using state 0 as "exit/end", is much nicer for kernel/
> > userspace implementations/code and the user.
> actually no; having written a few of these in userspace so far, having a 
> separate end event is easier to deal with;
> the actions you take on entry and exit are complete separate code paths.

That's correct, unless you go directly from one low-power state to another
(which is possible for example for PCI).  We don't do that at the moment,
but it's possible in principle and we may want to start doing that at one
point.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-25 16:00                 ` Arjan van de Ven
@ 2010-10-25 23:32                   ` Thomas Renninger
  0 siblings, 0 replies; 29+ messages in thread
From: Thomas Renninger @ 2010-10-25 23:32 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Masami Hiramatsu, Frank Eigler, Steven Rostedt, Kevin Hilman,
	Peter Zijlstra, linux-omap, rjw, linux-pm, linux-trace-users,
	Jean Pihet, Pierre Tardy, Frederic Weisbecker, Tejun Heo,
	Mathieu Desnoyers

@Ingo: Can you queue up 1/3, it's an independent fix.

On Monday 25 October 2010 06:00:17 pm Arjan van de Ven wrote:
> On 10/25/2010 8:48 AM, Thomas Renninger wrote:
> 
> sure naming is one thing
Yes it should get renamed to not show:
cat /sys/devices/system/cpu/cpu0/cpuidle/state0/name
C0
This is wrong and confusing

> >>> and
> >>> "C0 no longer idle"
> >>>
> >>> I'd propose using the number 0 for the first one (it makes the most
> >>> logical sense, it's the least deep idle state etc etc)
> > I would use a special number for the "Linux only" state.
> 
> that special number is 0 though..
> it makes sense in ordering, 0 < 1, 1 < 2 etc
As long as it stays a kernel and perf processor_idle internal number
it does not hurt.
But userspace tools catching the perf idle event of state 0 should never
refer to it as processor idle state 0 (or even worse C0).
Instead they should try to get the name/description of:
/sys/../state0/name
or directly refer to it as "poll idle" state.

Processor idle state C0 is not only defined as "not being idle" in the
specs, also turbostat and cpufreq-aperf use it correctly and refer to C0 when 
they show accounted "not idle" time.

Encouraged by your suggestions I send another version.
It's not a big deal to send 0xFFFFFFFF instead of 0 as "non power saving" 
state. If you can handle compatibility with it in powertop, it doesn't make 
things more complicated in kernel and perf timechart as I first thought it 
does.

      Thomas

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Cleanup and enhance power trace events
@ 2010-10-28  9:02 Thomas Renninger
  2010-10-28  9:02 ` [PATCH 1/3] PERF: Do not export power_frequency, but power_start event Thomas Renninger
                   ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Thomas Renninger @ 2010-10-28  9:02 UTC (permalink / raw)
  To: trenn, linux-omap, linux-pm, linux-trace-users, jean.pihet, arjan,
	mingo, rjw

Tested with:
acpi_idle and intel_idle as cpuidle drivers.
also tested perf timechart userspace tool without the new interface
and it still works fine with the new kernel changes.

There are quite some issues with the balance of cpu_idle
enter/leave a sleep state, but this was the case already.
perf timechart handles double "leave idle" events gracefully (should
still get fixed up at some time).

The first patch makes intel_idle cpuidle driver tracable, even
if compiled as a module.

Ingo: Can you please push these into an approprate git tree/branch.

Thanks,

    Thomas

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 1/3] PERF: Do not export power_frequency, but power_start event
  2010-10-28  9:02 Cleanup and enhance power trace events Thomas Renninger
@ 2010-10-28  9:02 ` Thomas Renninger
  2010-10-28  9:02 ` [PATCH 2/3] PERF(kernel): Cleanup power events Thomas Renninger
  2010-10-28  9:02 ` [PATCH 3/3] PERF(userspace): Adjust perf timechart to the new " Thomas Renninger
  2 siblings, 0 replies; 29+ messages in thread
From: Thomas Renninger @ 2010-10-28  9:02 UTC (permalink / raw)
  To: trenn, linux-omap, linux-pm, linux-trace-users, jean.pihet, arjan,
	mingo, rjw

power_frequency moved to drivers/cpufreq/cpufreq.c which has
to be compiled in, no need to export it.

intel_idle can a be module though...

Signed-off-by: Thomas Renninger <trenn@suse.de>
CC: linux-omap@vger.kernel.org
CC: linux-pm@lists.linux-foundation.org
CC: linux-trace-users@vger.kernel.org
CC: Jean Pihet <jean.pihet@newoldbits.com>
CC: Arjan van de Ven <arjan@linux.intel.com>
CC: Ingo Molnar <mingo@elte.hu>
CC: rjw@sisk.pl
---
 drivers/idle/intel_idle.c   |    2 --
 kernel/trace/power-traces.c |    2 +-
 2 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index c37ef64..21ac077 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -201,9 +201,7 @@ static int intel_idle(struct cpuidle_device *dev, struct cpuidle_state *state)
 	kt_before = ktime_get_real();
 
 	stop_critical_timings();
-#ifndef MODULE
 	trace_power_start(POWER_CSTATE, (eax >> 4) + 1, cpu);
-#endif
 	if (!need_resched()) {
 
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
diff --git a/kernel/trace/power-traces.c b/kernel/trace/power-traces.c
index a22582a..0e0497d 100644
--- a/kernel/trace/power-traces.c
+++ b/kernel/trace/power-traces.c
@@ -13,5 +13,5 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/power.h>
 
-EXPORT_TRACEPOINT_SYMBOL_GPL(power_frequency);
+EXPORT_TRACEPOINT_SYMBOL_GPL(power_start);
 
-- 
1.6.3


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-28  9:02 Cleanup and enhance power trace events Thomas Renninger
  2010-10-28  9:02 ` [PATCH 1/3] PERF: Do not export power_frequency, but power_start event Thomas Renninger
@ 2010-10-28  9:02 ` Thomas Renninger
  2010-10-28 11:17   ` Rafael J. Wysocki
  2010-10-28  9:02 ` [PATCH 3/3] PERF(userspace): Adjust perf timechart to the new " Thomas Renninger
  2 siblings, 1 reply; 29+ messages in thread
From: Thomas Renninger @ 2010-10-28  9:02 UTC (permalink / raw)
  To: trenn, linux-omap, linux-pm, linux-trace-users, jean.pihet, arjan,
	mingo, rjw

Recent changes:
  - Enable EVENT_POWER_TRACING_DEPRECATED by default

New power trace events:
power:cpu_idle
power:cpu_frequency
power:machine_suspend


C-state/idle accounting events:
  power:power_start
  power:power_end
are replaced with:
  power:cpu_idle

and
  power:power_frequency
is replaced with:
  power:cpu_frequency

power:machine_suspend
is newly introduced, a first implementation
comes from the ARM side, but it's easy to add these events
in X86 as well if needed.

the type= field got removed from both, it was never
used and the type is differed by the event type itself.

perf timechart
userspace tool gets adjusted in a separate patch.

Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
CC: linux-omap@vger.kernel.org
CC: linux-pm@lists.linux-foundation.org
CC: linux-trace-users@vger.kernel.org
CC: Jean Pihet <jean.pihet@newoldbits.com>
CC: Arjan van de Ven <arjan@linux.intel.com>
CC: Ingo Molnar <mingo@elte.hu>
CC: rjw@sisk.pl
---
 arch/x86/kernel/process.c    |    7 +++-
 arch/x86/kernel/process_32.c |    2 +-
 arch/x86/kernel/process_64.c |    2 +
 drivers/cpufreq/cpufreq.c    |    1 +
 drivers/cpuidle/cpuidle.c    |    1 +
 drivers/idle/intel_idle.c    |    1 +
 include/trace/events/power.h |   87 +++++++++++++++++++++++++++++++++++++++++-
 kernel/trace/Kconfig         |   15 +++++++
 kernel/trace/power-traces.c  |    3 +
 9 files changed, 116 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 57d1868..155d975 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -374,6 +374,7 @@ void default_idle(void)
 {
 	if (hlt_use_halt()) {
 		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+		trace_cpu_idle(1, smp_processor_id());
 		current_thread_info()->status &= ~TS_POLLING;
 		/*
 		 * TS_POLLING-cleared state must be visible before we
@@ -444,6 +445,7 @@ EXPORT_SYMBOL_GPL(cpu_idle_wait);
 void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
 {
 	trace_power_start(POWER_CSTATE, (ax>>4)+1, smp_processor_id());
+	trace_cpu_idle((ax>>4)+1, smp_processor_id());
 	if (!need_resched()) {
 		if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
 			clflush((void *)&current_thread_info()->flags);
@@ -460,6 +462,7 @@ static void mwait_idle(void)
 {
 	if (!need_resched()) {
 		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+		trace_cpu_idle(1, smp_processor_id());
 		if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
 			clflush((void *)&current_thread_info()->flags);
 
@@ -481,10 +484,12 @@ static void mwait_idle(void)
 static void poll_idle(void)
 {
 	trace_power_start(POWER_CSTATE, 0, smp_processor_id());
+	trace_cpu_idle(0, smp_processor_id());
 	local_irq_enable();
 	while (!need_resched())
 		cpu_relax();
-	trace_power_end(0);
+	trace_power_end(smp_processor_id());
+	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
 }
 
 /*
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 96586c3..4b9befa 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -113,8 +113,8 @@ void cpu_idle(void)
 			stop_critical_timings();
 			pm_idle();
 			start_critical_timings();
-
 			trace_power_end(smp_processor_id());
+			trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
 		}
 		tick_nohz_restart_sched_tick();
 		preempt_enable_no_resched();
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 3d9ea53..28153a9 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -142,6 +142,8 @@ void cpu_idle(void)
 			start_critical_timings();
 
 			trace_power_end(smp_processor_id());
+			trace_cpu_idle(PWR_EVENT_EXIT,
+				       smp_processor_id());
 
 			/* In many cases the interrupt that ended idle
 			   has already called exit_idle. But some idle
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 199dcb9..ed4919e 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -355,6 +355,7 @@ void cpufreq_notify_transition(struct cpufreq_freqs *freqs, unsigned int state)
 		dprintk("FREQ: %lu - CPU: %lu", (unsigned long)freqs->new,
 			(unsigned long)freqs->cpu);
 		trace_power_frequency(POWER_PSTATE, freqs->new, freqs->cpu);
+		trace_cpu_frequency(freqs->new, freqs->cpu);
 		srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
 				CPUFREQ_POSTCHANGE, freqs);
 		if (likely(policy) && likely(policy->cpu == freqs->cpu))
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index a507108..08d5f05 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -107,6 +107,7 @@ static void cpuidle_idle_call(void)
 	if (cpuidle_curr_governor->reflect)
 		cpuidle_curr_governor->reflect(dev);
 	trace_power_end(smp_processor_id());
+	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
 }
 
 /**
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 21ac077..d3701bf 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -202,6 +202,7 @@ static int intel_idle(struct cpuidle_device *dev, struct cpuidle_state *state)
 
 	stop_critical_timings();
 	trace_power_start(POWER_CSTATE, (eax >> 4) + 1, cpu);
+	trace_cpu_idle((eax >> 4) + 1, cpu);
 	if (!need_resched()) {
 
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
diff --git a/include/trace/events/power.h b/include/trace/events/power.h
index 35a2a6e..f10de41 100644
--- a/include/trace/events/power.h
+++ b/include/trace/events/power.h
@@ -7,6 +7,67 @@
 #include <linux/ktime.h>
 #include <linux/tracepoint.h>
 
+DECLARE_EVENT_CLASS(cpu,
+
+	TP_PROTO(unsigned int state, unsigned int cpu_id),
+
+	TP_ARGS(state, cpu_id),
+
+	TP_STRUCT__entry(
+		__field(	u32,		state		)
+		__field(	u32,		cpu_id		)
+	),
+
+	TP_fast_assign(
+		__entry->state = state;
+		__entry->cpu_id = cpu_id;
+	),
+
+	TP_printk("state=%lu cpu_id=%lu", (unsigned long)__entry->state,
+		  (unsigned long)__entry->cpu_id)
+);
+
+DEFINE_EVENT(cpu, cpu_idle,
+
+	TP_PROTO(unsigned int state, unsigned int cpu_id),
+
+	TP_ARGS(state, cpu_id)
+);
+
+/* This file can get included multiple times, TRACE_HEADER_MULTI_READ at top */
+#ifndef _PWR_EVENT_AVOID_DOUBLE_DEFINING
+#define _PWR_EVENT_AVOID_DOUBLE_DEFINING
+
+#define PWR_EVENT_EXIT -1
+
+#endif
+
+DEFINE_EVENT(cpu, cpu_frequency,
+
+	TP_PROTO(unsigned int frequency, unsigned int cpu_id),
+
+	TP_ARGS(frequency, cpu_id)
+);
+
+TRACE_EVENT(machine_suspend,
+
+	TP_PROTO(unsigned int state),
+
+	TP_ARGS(state),
+
+	TP_STRUCT__entry(
+		__field(	u32,		state		)
+	),
+
+	TP_fast_assign(
+		__entry->state = state;
+	),
+
+	TP_printk("state=%lu", (unsigned long)__entry->state)
+);
+
+#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
+
 #ifndef _TRACE_POWER_ENUM_
 #define _TRACE_POWER_ENUM_
 enum {
@@ -69,8 +130,32 @@ TRACE_EVENT(power_end,
 	TP_printk("cpu_id=%lu", (unsigned long)__entry->cpu_id)
 
 );
-
+#endif /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
 #endif /* _TRACE_POWER_H */
 
+/* Deprecated dummy functions must be protected against multi-declartion */
+#ifndef EVENT_POWER_TRACING_DEPRECATED_PART_H
+#define EVENT_POWER_TRACING_DEPRECATED_PART_H
+
+#ifndef CONFIG_EVENT_POWER_TRACING_DEPRECATED
+
+#ifndef _TRACE_POWER_ENUM_
+#define _TRACE_POWER_ENUM_
+enum {
+	POWER_NONE = 0,
+	POWER_CSTATE = 1,
+	POWER_PSTATE = 2,
+};
+#endif
+
+static inline void trace_power_start(u64 type, u64 state, u64 cpuid) {};
+static inline void trace_power_end(u64 cpuid) {};
+static inline void trace_power_frequency(u64 type, u64 state, u64 cpuid) {};
+#endif /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
+
+#endif /* EVENT_POWER_TRACING_DEPRECATED_PART_H */
+
+
+
 /* This part must be outside protection */
 #include <trace/define_trace.h>
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 538501c..8ccbedd 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -64,6 +64,21 @@ config EVENT_TRACING
 	select CONTEXT_SWITCH_TRACER
 	bool
 
+config EVENT_POWER_TRACING_DEPRECATED
+	depends on EVENT_TRACING
+	bool
+	default y
+	help
+	  Provides old power event types:
+	  C-state/idle accounting events:
+	  power:power_start
+	  power:power_end
+	  and old cpufreq accounting event:
+	  power:power_frequency
+	  This is for userspace compatibility
+	  and will vanish after 5 kernel iterations,
+	  namely 2.6.41.
+
 config CONTEXT_SWITCH_TRACER
 	bool
 
diff --git a/kernel/trace/power-traces.c b/kernel/trace/power-traces.c
index 0e0497d..f55fcf6 100644
--- a/kernel/trace/power-traces.c
+++ b/kernel/trace/power-traces.c
@@ -13,5 +13,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/power.h>
 
+#ifdef EVENT_POWER_TRACING_DEPRECATED
 EXPORT_TRACEPOINT_SYMBOL_GPL(power_start);
+#endif
+EXPORT_TRACEPOINT_SYMBOL_GPL(cpu_idle);
 
-- 
1.6.3


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 3/3] PERF(userspace): Adjust perf timechart to the new power events
  2010-10-28  9:02 Cleanup and enhance power trace events Thomas Renninger
  2010-10-28  9:02 ` [PATCH 1/3] PERF: Do not export power_frequency, but power_start event Thomas Renninger
  2010-10-28  9:02 ` [PATCH 2/3] PERF(kernel): Cleanup power events Thomas Renninger
@ 2010-10-28  9:02 ` Thomas Renninger
  2010-10-28  9:19   ` Thomas Renninger
  2 siblings, 1 reply; 29+ messages in thread
From: Thomas Renninger @ 2010-10-28  9:02 UTC (permalink / raw)
  To: trenn, linux-omap, linux-pm, linux-trace-users, jean.pihet, arjan,
	mingo, rjw

Recent changes:
   - Adjust state/cpuid to u32 as done in the kernel

The transition was rather smooth, only part I had to fiddle
some time was the check whether a tracepoint/event is
supported by the running kernel.

builtin-timechart must only pass -e power:xy events which
are supported by the running kernel.
For this I added the tiny helper function:
int is_valid_tracepoint(const char *event_string)
to parse-events.[hc]
which could be more generic as an interface and support
hardware/software/... events, not only tracepoints, but someone
else could extend that if needed...

Signed-off-by: Thomas Renninger <trenn@suse.de>
CC: linux-omap@vger.kernel.org
CC: linux-pm@lists.linux-foundation.org
CC: linux-trace-users@vger.kernel.org
CC: Jean Pihet <jean.pihet@newoldbits.com>
CC: Arjan van de Ven <arjan@linux.intel.com>
CC: Ingo Molnar <mingo@elte.hu>
CC: rjw@sisk.pl
---
 tools/perf/builtin-timechart.c |   91 ++++++++++++++++++++++++++++++++-------
 tools/perf/util/parse-events.c |   41 ++++++++++++++++++
 tools/perf/util/parse-events.h |    1 +
 3 files changed, 116 insertions(+), 17 deletions(-)

diff --git a/tools/perf/builtin-timechart.c b/tools/perf/builtin-timechart.c
index 9bcc38f..1d15228 100644
--- a/tools/perf/builtin-timechart.c
+++ b/tools/perf/builtin-timechart.c
@@ -32,6 +32,10 @@
 #include "util/session.h"
 #include "util/svghelper.h"
 
+#define SUPPORT_OLD_POWER_EVENTS 1
+#define PWR_EVENT_EXIT -1
+
+
 static char		const *input_name = "perf.data";
 static char		const *output_name = "output.svg";
 
@@ -298,12 +302,21 @@ struct trace_entry {
 	int			lock_depth;
 };
 
-struct power_entry {
+#if defined(SUPPORT_OLD_POWER_EVENTS)
+static int use_old_power_events;
+struct power_entry_old {
 	struct trace_entry te;
 	u64	type;
 	u64	value;
 	u64	cpu_id;
 };
+#endif
+
+struct power_processor_entry {
+	struct trace_entry te;
+	u32	state;
+	u32	cpu_id;
+};
 
 #define TASK_COMM_LEN 16
 struct wakeup_entry {
@@ -489,29 +502,48 @@ static int process_sample_event(event_t *event, struct perf_session *session)
 	te = (void *)data.raw_data;
 	if (session->sample_type & PERF_SAMPLE_RAW && data.raw_size > 0) {
 		char *event_str;
-		struct power_entry *pe;
-
-		pe = (void *)te;
+#if defined(SUPPORT_OLD_POWER_EVENTS)
+		struct power_entry_old *peo;
+		peo = (void *)te;
+#endif
 
 		event_str = perf_header__find_event(te->type);
 
 		if (!event_str)
 			return 0;
 
-		if (strcmp(event_str, "power:power_start") == 0)
-			c_state_start(pe->cpu_id, data.time, pe->value);
-
-		if (strcmp(event_str, "power:power_end") == 0)
-			c_state_end(pe->cpu_id, data.time);
+		if (strcmp(event_str, "power:cpu_idle") == 0) {
+			struct power_processor_entry *ppe = (void *)te;
+			if (ppe->state == (u32)PWR_EVENT_EXIT)
+				c_state_end(ppe->cpu_id, data.time);
+			else
+				c_state_start(ppe->cpu_id, data.time,
+					      ppe->state);
+		}
 
-		if (strcmp(event_str, "power:power_frequency") == 0)
-			p_state_change(pe->cpu_id, data.time, pe->value);
+		else if (strcmp(event_str, "power:cpu_frequency") == 0) {
+			struct power_processor_entry *ppe = (void *)te;
+			p_state_change(ppe->cpu_id, data.time, ppe->state);
+		}
 
-		if (strcmp(event_str, "sched:sched_wakeup") == 0)
+		else if (strcmp(event_str, "sched:sched_wakeup") == 0)
 			sched_wakeup(data.cpu, data.time, data.pid, te);
 
-		if (strcmp(event_str, "sched:sched_switch") == 0)
+		else if (strcmp(event_str, "sched:sched_switch") == 0)
 			sched_switch(data.cpu, data.time, te);
+
+#if defined(SUPPORT_OLD_POWER_EVENTS)
+		if (use_old_power_events) {
+			if (strcmp(event_str, "power:power_start") == 0)
+				c_state_start(peo->cpu_id, data.time, peo->value);
+
+			else if (strcmp(event_str, "power:power_end") == 0)
+				c_state_end(peo->cpu_id, data.time);
+
+			else if (strcmp(event_str, "power:power_frequency") == 0)
+				p_state_change(peo->cpu_id, data.time, peo->value);
+		}
+#endif
 	}
 	return 0;
 }
@@ -968,7 +1000,8 @@ static const char * const timechart_usage[] = {
 	NULL
 };
 
-static const char *record_args[] = {
+#if defined(SUPPORT_OLD_POWER_EVENTS)
+static const char *record_old_args[] = {
 	"record",
 	"-a",
 	"-R",
@@ -980,16 +1013,40 @@ static const char *record_args[] = {
 	"-e", "sched:sched_wakeup",
 	"-e", "sched:sched_switch",
 };
+#endif
+
+static const char *record_new_args[] = {
+	"record",
+	"-a",
+	"-R",
+	"-f",
+	"-c", "1",
+	"-e", "power:cpu_frequency",
+	"-e", "power:cpu_idle",
+	"-e", "sched:sched_wakeup",
+	"-e", "sched:sched_switch",
+};
 
 static int __cmd_record(int argc, const char **argv)
 {
 	unsigned int rec_argc, i, j;
 	const char **rec_argv;
-
-	rec_argc = ARRAY_SIZE(record_args) + argc - 1;
+	const char **record_args = record_new_args;
+	unsigned int record_elems = ARRAY_SIZE(record_new_args);
+
+#if defined(SUPPORT_OLD_POWER_EVENTS)
+	if (!is_valid_tracepoint("power:cpu_idle") &&
+	    is_valid_tracepoint("power:power_start")) {
+		use_old_power_events = 1;
+		record_args = record_old_args;
+		record_elems = ARRAY_SIZE(record_old_args);
+	}
+#endif
+	
+	rec_argc = record_elems + argc - 1;
 	rec_argv = calloc(rec_argc + 1, sizeof(char *));
 
-	for (i = 0; i < ARRAY_SIZE(record_args); i++)
+	for (i = 0; i < record_elems; i++)
 		rec_argv[i] = strdup(record_args[i]);
 
 	for (j = 1; j < (unsigned int)argc; j++, i++)
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 4af5bd5..35e3dea 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -906,6 +906,47 @@ static void print_tracepoint_events(void)
 }
 
 /*
+ * Check whether event is in <debugfs_mount_point>/tracing/events
+ */
+
+int is_valid_tracepoint(const char *event_string)
+{
+	DIR *sys_dir, *evt_dir;
+	struct dirent *sys_next, *evt_next, sys_dirent, evt_dirent;
+	char evt_path[MAXPATHLEN];
+	char dir_path[MAXPATHLEN];
+
+	if (debugfs_valid_mountpoint(debugfs_path))
+		return 0;
+
+	sys_dir = opendir(debugfs_path);
+	if (!sys_dir)
+		return 0;
+
+	for_each_subsystem(sys_dir, sys_dirent, sys_next) {
+
+		snprintf(dir_path, MAXPATHLEN, "%s/%s", debugfs_path,
+			 sys_dirent.d_name);
+		evt_dir = opendir(dir_path);
+		if (!evt_dir)
+			continue;
+
+		for_each_event(sys_dirent, evt_dir, evt_dirent, evt_next) {
+			snprintf(evt_path, MAXPATHLEN, "%s:%s",
+				 sys_dirent.d_name, evt_dirent.d_name);
+			if (!strcmp(evt_path, event_string)) {
+				closedir(evt_dir);
+				closedir(sys_dir);
+				return 1;
+			}
+		}
+		closedir(evt_dir);
+	}
+	closedir(sys_dir);
+	return 0;
+}
+
+/*
  * Print the help text for the event symbols:
  */
 void print_events(void)
diff --git a/tools/perf/util/parse-events.h b/tools/perf/util/parse-events.h
index fc4ab3f..7ab4685 100644
--- a/tools/perf/util/parse-events.h
+++ b/tools/perf/util/parse-events.h
@@ -29,6 +29,7 @@ extern int parse_filter(const struct option *opt, const char *str, int unset);
 #define EVENTS_HELP_MAX (128*1024)
 
 extern void print_events(void);
+extern int is_valid_tracepoint(const char *event_string);
 
 extern char debugfs_path[];
 extern int valid_debugfs_mount(const char *debugfs);
-- 
1.6.3


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/3] PERF(userspace): Adjust perf timechart to the new power events
  2010-10-28  9:02 ` [PATCH 3/3] PERF(userspace): Adjust perf timechart to the new " Thomas Renninger
@ 2010-10-28  9:19   ` Thomas Renninger
  0 siblings, 0 replies; 29+ messages in thread
From: Thomas Renninger @ 2010-10-28  9:19 UTC (permalink / raw)
  To: linux-omap; +Cc: linux-pm, linux-trace-users, jean.pihet, arjan, mingo, rjw

On Thursday 28 October 2010 11:02:59 Thomas Renninger wrote:
> Recent changes:
>    - Adjust state/cpuid to u32 as done in the kernel
> 
Argh, I forgot a guilt refresh...
Below final :) patch also fixes a bug that got introduced
in 2.6.36 already. Will also send a version for stable
inclusion.

    Thomas

----
PERF(userspace): Adjust perf timechart to the new power events

Recent changes:
   - Adjust state/cpuid to u32 as done in the kernel

The transition was rather smooth, only part I had to fiddle
some time was the check whether a tracepoint/event is
supported by the running kernel.

builtin-timechart must only pass -e power:xy events which
are supported by the running kernel.
For this I added the tiny helper function:
int is_valid_tracepoint(const char *event_string)
to parse-events.[hc]
which could be more generic as an interface and support
hardware/software/... events, not only tracepoints, but someone
else could extend that if needed...

Signed-off-by: Thomas Renninger <trenn@suse.de>
CC: linux-omap@vger.kernel.org
CC: linux-pm@lists.linux-foundation.org
CC: linux-trace-users@vger.kernel.org
CC: Jean Pihet <jean.pihet@newoldbits.com>
CC: Arjan van de Ven <arjan@linux.intel.com>
CC: Ingo Molnar <mingo@elte.hu>
CC: rjw@sisk.pl

diff --git a/tools/perf/builtin-timechart.c b/tools/perf/builtin-timechart.c
index 9bcc38f..299e488 100644
--- a/tools/perf/builtin-timechart.c
+++ b/tools/perf/builtin-timechart.c
@@ -32,6 +32,10 @@
 #include "util/session.h"
 #include "util/svghelper.h"
 
+#define SUPPORT_OLD_POWER_EVENTS 1
+#define PWR_EVENT_EXIT -1
+
+
 static char		const *input_name = "perf.data";
 static char		const *output_name = "output.svg";
 
@@ -298,12 +302,21 @@ struct trace_entry {
 	int			lock_depth;
 };
 
-struct power_entry {
+#if defined(SUPPORT_OLD_POWER_EVENTS)
+static int use_old_power_events;
+struct power_entry_old {
 	struct trace_entry te;
 	u64	type;
 	u64	value;
 	u64	cpu_id;
 };
+#endif
+
+struct power_processor_entry {
+	struct trace_entry te;
+	u32	state;
+	u32	cpu_id;
+};
 
 #define TASK_COMM_LEN 16
 struct wakeup_entry {
@@ -489,29 +502,48 @@ static int process_sample_event(event_t *event, struct perf_session *session)
 	te = (void *)data.raw_data;
 	if (session->sample_type & PERF_SAMPLE_RAW && data.raw_size > 0) {
 		char *event_str;
-		struct power_entry *pe;
-
-		pe = (void *)te;
+#if defined(SUPPORT_OLD_POWER_EVENTS)
+		struct power_entry_old *peo;
+		peo = (void *)te;
+#endif
 
 		event_str = perf_header__find_event(te->type);
 
 		if (!event_str)
 			return 0;
 
-		if (strcmp(event_str, "power:power_start") == 0)
-			c_state_start(pe->cpu_id, data.time, pe->value);
-
-		if (strcmp(event_str, "power:power_end") == 0)
-			c_state_end(pe->cpu_id, data.time);
+		if (strcmp(event_str, "power:cpu_idle") == 0) {
+			struct power_processor_entry *ppe = (void *)te;
+			if (ppe->state == (u32)PWR_EVENT_EXIT)
+				c_state_end(ppe->cpu_id, data.time);
+			else
+				c_state_start(ppe->cpu_id, data.time,
+					      ppe->state);
+		}
 
-		if (strcmp(event_str, "power:power_frequency") == 0)
-			p_state_change(pe->cpu_id, data.time, pe->value);
+		else if (strcmp(event_str, "power:cpu_frequency") == 0) {
+			struct power_processor_entry *ppe = (void *)te;
+			p_state_change(ppe->cpu_id, data.time, ppe->state);
+		}
 
-		if (strcmp(event_str, "sched:sched_wakeup") == 0)
+		else if (strcmp(event_str, "sched:sched_wakeup") == 0)
 			sched_wakeup(data.cpu, data.time, data.pid, te);
 
-		if (strcmp(event_str, "sched:sched_switch") == 0)
+		else if (strcmp(event_str, "sched:sched_switch") == 0)
 			sched_switch(data.cpu, data.time, te);
+
+#if defined(SUPPORT_OLD_POWER_EVENTS)
+		if (use_old_power_events) {
+			if (strcmp(event_str, "power:power_start") == 0)
+				c_state_start(peo->cpu_id, data.time, peo->value);
+
+			else if (strcmp(event_str, "power:power_end") == 0)
+				c_state_end(data.cpu, data.time);
+
+			else if (strcmp(event_str, "power:power_frequency") == 0)
+				p_state_change(peo->cpu_id, data.time, peo->value);
+		}
+#endif
 	}
 	return 0;
 }
@@ -968,7 +1000,8 @@ static const char * const timechart_usage[] = {
 	NULL
 };
 
-static const char *record_args[] = {
+#if defined(SUPPORT_OLD_POWER_EVENTS)
+static const char *record_old_args[] = {
 	"record",
 	"-a",
 	"-R",
@@ -980,16 +1013,40 @@ static const char *record_args[] = {
 	"-e", "sched:sched_wakeup",
 	"-e", "sched:sched_switch",
 };
+#endif
+
+static const char *record_new_args[] = {
+	"record",
+	"-a",
+	"-R",
+	"-f",
+	"-c", "1",
+	"-e", "power:cpu_frequency",
+	"-e", "power:cpu_idle",
+	"-e", "sched:sched_wakeup",
+	"-e", "sched:sched_switch",
+};
 
 static int __cmd_record(int argc, const char **argv)
 {
 	unsigned int rec_argc, i, j;
 	const char **rec_argv;
-
-	rec_argc = ARRAY_SIZE(record_args) + argc - 1;
+	const char **record_args = record_new_args;
+	unsigned int record_elems = ARRAY_SIZE(record_new_args);
+
+#if defined(SUPPORT_OLD_POWER_EVENTS)
+	if (!is_valid_tracepoint("power:cpu_idle") &&
+	    is_valid_tracepoint("power:power_start")) {
+		use_old_power_events = 1;
+		record_args = record_old_args;
+		record_elems = ARRAY_SIZE(record_old_args);
+	}
+#endif
+	
+	rec_argc = record_elems + argc - 1;
 	rec_argv = calloc(rec_argc + 1, sizeof(char *));
 
-	for (i = 0; i < ARRAY_SIZE(record_args); i++)
+	for (i = 0; i < record_elems; i++)
 		rec_argv[i] = strdup(record_args[i]);
 
 	for (j = 1; j < (unsigned int)argc; j++, i++)
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 4af5bd5..35e3dea 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -906,6 +906,47 @@ static void print_tracepoint_events(void)
 }
 
 /*
+ * Check whether event is in <debugfs_mount_point>/tracing/events
+ */
+
+int is_valid_tracepoint(const char *event_string)
+{
+	DIR *sys_dir, *evt_dir;
+	struct dirent *sys_next, *evt_next, sys_dirent, evt_dirent;
+	char evt_path[MAXPATHLEN];
+	char dir_path[MAXPATHLEN];
+
+	if (debugfs_valid_mountpoint(debugfs_path))
+		return 0;
+
+	sys_dir = opendir(debugfs_path);
+	if (!sys_dir)
+		return 0;
+
+	for_each_subsystem(sys_dir, sys_dirent, sys_next) {
+
+		snprintf(dir_path, MAXPATHLEN, "%s/%s", debugfs_path,
+			 sys_dirent.d_name);
+		evt_dir = opendir(dir_path);
+		if (!evt_dir)
+			continue;
+
+		for_each_event(sys_dirent, evt_dir, evt_dirent, evt_next) {
+			snprintf(evt_path, MAXPATHLEN, "%s:%s",
+				 sys_dirent.d_name, evt_dirent.d_name);
+			if (!strcmp(evt_path, event_string)) {
+				closedir(evt_dir);
+				closedir(sys_dir);
+				return 1;
+			}
+		}
+		closedir(evt_dir);
+	}
+	closedir(sys_dir);
+	return 0;
+}
+
+/*
  * Print the help text for the event symbols:
  */
 void print_events(void)
diff --git a/tools/perf/util/parse-events.h b/tools/perf/util/parse-events.h
index fc4ab3f..7ab4685 100644
--- a/tools/perf/util/parse-events.h
+++ b/tools/perf/util/parse-events.h
@@ -29,6 +29,7 @@ extern int parse_filter(const struct option *opt, const char *str, int unset);
 #define EVENTS_HELP_MAX (128*1024)
 
 extern void print_events(void);
+extern int is_valid_tracepoint(const char *event_string);
 
 extern char debugfs_path[];
 extern int valid_debugfs_mount(const char *debugfs);

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-28  9:02 ` [PATCH 2/3] PERF(kernel): Cleanup power events Thomas Renninger
@ 2010-10-28 11:17   ` Rafael J. Wysocki
  2010-10-28 11:31     ` [linux-pm] " Rafael J. Wysocki
  0 siblings, 1 reply; 29+ messages in thread
From: Rafael J. Wysocki @ 2010-10-28 11:17 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: linux-omap, linux-pm, linux-trace-users, jean.pihet, arjan, mingo

On Thursday, October 28, 2010, Thomas Renninger wrote:
> Recent changes:
>   - Enable EVENT_POWER_TRACING_DEPRECATED by default
> 
> New power trace events:
> power:cpu_idle
> power:cpu_frequency
> power:machine_suspend
> 
> 
> C-state/idle accounting events:
>   power:power_start
>   power:power_end
> are replaced with:
>   power:cpu_idle
> 
> and
>   power:power_frequency
> is replaced with:
>   power:cpu_frequency
> 
> power:machine_suspend
> is newly introduced, a first implementation
> comes from the ARM side, but it's easy to add these events
> in X86 as well if needed.

Can you please check that changelog, please?

I've asked you for that already once.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [linux-pm] [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-28 11:17   ` Rafael J. Wysocki
@ 2010-10-28 11:31     ` Rafael J. Wysocki
  2010-10-28 11:37       ` Thomas Renninger
  0 siblings, 1 reply; 29+ messages in thread
From: Rafael J. Wysocki @ 2010-10-28 11:31 UTC (permalink / raw)
  To: linux-pm
  Cc: Thomas Renninger, jean.pihet, linux-trace-users, linux-omap,
	arjan, mingo

On Thursday, October 28, 2010, Rafael J. Wysocki wrote:
> On Thursday, October 28, 2010, Thomas Renninger wrote:
> > Recent changes:
> >   - Enable EVENT_POWER_TRACING_DEPRECATED by default
> > 
> > New power trace events:
> > power:cpu_idle
> > power:cpu_frequency
> > power:machine_suspend
> > 
> > 
> > C-state/idle accounting events:
> >   power:power_start
> >   power:power_end
> > are replaced with:
> >   power:cpu_idle
> > 
> > and
> >   power:power_frequency
> > is replaced with:
> >   power:cpu_frequency
> > 
> > power:machine_suspend
> > is newly introduced, a first implementation
> > comes from the ARM side, but it's easy to add these events
> > in X86 as well if needed.
> 
> Can you please check that changelog, please?

Sorry s/check/modify/

In fact, there won't be any ARM implementation, because it's going to be
added at the core level.

> I've asked you for that already once.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 2/3] PERF(kernel): Cleanup power events
  2010-10-28 11:31     ` [linux-pm] " Rafael J. Wysocki
@ 2010-10-28 11:37       ` Thomas Renninger
  0 siblings, 0 replies; 29+ messages in thread
From: Thomas Renninger @ 2010-10-28 11:37 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-pm, jean.pihet, linux-trace-users, linux-omap, arjan, mingo

Recent changes:
  - Enable EVENT_POWER_TRACING_DEPRECATED by default

New power trace events:
power:cpu_idle
power:cpu_frequency
power:machine_suspend


C-state/idle accounting events:
  power:power_start
  power:power_end
are replaced with:
  power:cpu_idle

and
  power:power_frequency
is replaced with:
  power:cpu_frequency

power:machine_suspend
is newly introduced.
Jean Pihet has a patch integrated into the generic layer
(kernel/power/suspend.c) which will make use of it.

the type= field got removed from both, it was never
used and the type is differed by the event type itself.

perf timechart
userspace tool gets adjusted in a separate patch.

Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
CC: linux-omap@vger.kernel.org
CC: linux-pm@lists.linux-foundation.org
CC: linux-trace-users@vger.kernel.org
CC: Jean Pihet <jean.pihet@newoldbits.com>
CC: Arjan van de Ven <arjan@linux.intel.com>
CC: Ingo Molnar <mingo@elte.hu>
CC: rjw@sisk.pl

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 57d1868..155d975 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -374,6 +374,7 @@ void default_idle(void)
 {
 	if (hlt_use_halt()) {
 		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+		trace_cpu_idle(1, smp_processor_id());
 		current_thread_info()->status &= ~TS_POLLING;
 		/*
 		 * TS_POLLING-cleared state must be visible before we
@@ -444,6 +445,7 @@ EXPORT_SYMBOL_GPL(cpu_idle_wait);
 void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
 {
 	trace_power_start(POWER_CSTATE, (ax>>4)+1, smp_processor_id());
+	trace_cpu_idle((ax>>4)+1, smp_processor_id());
 	if (!need_resched()) {
 		if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
 			clflush((void *)&current_thread_info()->flags);
@@ -460,6 +462,7 @@ static void mwait_idle(void)
 {
 	if (!need_resched()) {
 		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+		trace_cpu_idle(1, smp_processor_id());
 		if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
 			clflush((void *)&current_thread_info()->flags);
 
@@ -481,10 +484,12 @@ static void mwait_idle(void)
 static void poll_idle(void)
 {
 	trace_power_start(POWER_CSTATE, 0, smp_processor_id());
+	trace_cpu_idle(0, smp_processor_id());
 	local_irq_enable();
 	while (!need_resched())
 		cpu_relax();
-	trace_power_end(0);
+	trace_power_end(smp_processor_id());
+	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
 }
 
 /*
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 96586c3..4b9befa 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -113,8 +113,8 @@ void cpu_idle(void)
 			stop_critical_timings();
 			pm_idle();
 			start_critical_timings();
-
 			trace_power_end(smp_processor_id());
+			trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
 		}
 		tick_nohz_restart_sched_tick();
 		preempt_enable_no_resched();
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 3d9ea53..28153a9 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -142,6 +142,8 @@ void cpu_idle(void)
 			start_critical_timings();
 
 			trace_power_end(smp_processor_id());
+			trace_cpu_idle(PWR_EVENT_EXIT,
+				       smp_processor_id());
 
 			/* In many cases the interrupt that ended idle
 			   has already called exit_idle. But some idle
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 199dcb9..ed4919e 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -355,6 +355,7 @@ void cpufreq_notify_transition(struct cpufreq_freqs 
*freqs, unsigned int state)
 		dprintk("FREQ: %lu - CPU: %lu", (unsigned long)freqs->new,
 			(unsigned long)freqs->cpu);
 		trace_power_frequency(POWER_PSTATE, freqs->new, freqs->cpu);
+		trace_cpu_frequency(freqs->new, freqs->cpu);
 		srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
 				CPUFREQ_POSTCHANGE, freqs);
 		if (likely(policy) && likely(policy->cpu == freqs->cpu))
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index a507108..08d5f05 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -107,6 +107,7 @@ static void cpuidle_idle_call(void)
 	if (cpuidle_curr_governor->reflect)
 		cpuidle_curr_governor->reflect(dev);
 	trace_power_end(smp_processor_id());
+	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
 }
 
 /**
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 21ac077..d3701bf 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -202,6 +202,7 @@ static int intel_idle(struct cpuidle_device *dev, 
struct cpuidle_state *state)
 
 	stop_critical_timings();
 	trace_power_start(POWER_CSTATE, (eax >> 4) + 1, cpu);
+	trace_cpu_idle((eax >> 4) + 1, cpu);
 	if (!need_resched()) {
 
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
diff --git a/include/trace/events/power.h b/include/trace/events/power.h
index 35a2a6e..f10de41 100644
--- a/include/trace/events/power.h
+++ b/include/trace/events/power.h
@@ -7,6 +7,67 @@
 #include <linux/ktime.h>
 #include <linux/tracepoint.h>
 
+DECLARE_EVENT_CLASS(cpu,
+
+	TP_PROTO(unsigned int state, unsigned int cpu_id),
+
+	TP_ARGS(state, cpu_id),
+
+	TP_STRUCT__entry(
+		__field(	u32,		state		)
+		__field(	u32,		cpu_id		)
+	),
+
+	TP_fast_assign(
+		__entry->state = state;
+		__entry->cpu_id = cpu_id;
+	),
+
+	TP_printk("state=%lu cpu_id=%lu", (unsigned long)__entry->state,
+		  (unsigned long)__entry->cpu_id)
+);
+
+DEFINE_EVENT(cpu, cpu_idle,
+
+	TP_PROTO(unsigned int state, unsigned int cpu_id),
+
+	TP_ARGS(state, cpu_id)
+);
+
+/* This file can get included multiple times, TRACE_HEADER_MULTI_READ 
at top */
+#ifndef _PWR_EVENT_AVOID_DOUBLE_DEFINING
+#define _PWR_EVENT_AVOID_DOUBLE_DEFINING
+
+#define PWR_EVENT_EXIT -1
+
+#endif
+
+DEFINE_EVENT(cpu, cpu_frequency,
+
+	TP_PROTO(unsigned int frequency, unsigned int cpu_id),
+
+	TP_ARGS(frequency, cpu_id)
+);
+
+TRACE_EVENT(machine_suspend,
+
+	TP_PROTO(unsigned int state),
+
+	TP_ARGS(state),
+
+	TP_STRUCT__entry(
+		__field(	u32,		state		)
+	),
+
+	TP_fast_assign(
+		__entry->state = state;
+	),
+
+	TP_printk("state=%lu", (unsigned long)__entry->state)
+);
+
+#ifdef CONFIG_EVENT_POWER_TRACING_DEPRECATED
+
 #ifndef _TRACE_POWER_ENUM_
 #define _TRACE_POWER_ENUM_
 enum {
@@ -69,8 +130,32 @@ TRACE_EVENT(power_end,
 	TP_printk("cpu_id=%lu", (unsigned long)__entry->cpu_id)
 
 );
-
+#endif /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
 #endif /* _TRACE_POWER_H */
 
+/* Deprecated dummy functions must be protected against multi-
declartion */
+#ifndef EVENT_POWER_TRACING_DEPRECATED_PART_H
+#define EVENT_POWER_TRACING_DEPRECATED_PART_H
+
+#ifndef CONFIG_EVENT_POWER_TRACING_DEPRECATED
+
+#ifndef _TRACE_POWER_ENUM_
+#define _TRACE_POWER_ENUM_
+enum {
+	POWER_NONE = 0,
+	POWER_CSTATE = 1,
+	POWER_PSTATE = 2,
+};
+#endif
+
+static inline void trace_power_start(u64 type, u64 state, u64 cpuid) 
{};
+static inline void trace_power_end(u64 cpuid) {};
+static inline void trace_power_frequency(u64 type, u64 state, u64 
cpuid) {};
+#endif /* CONFIG_EVENT_POWER_TRACING_DEPRECATED */
+
+#endif /* EVENT_POWER_TRACING_DEPRECATED_PART_H */
+
+
+
 /* This part must be outside protection */
 #include <trace/define_trace.h>
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 538501c..8ccbedd 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -64,6 +64,21 @@ config EVENT_TRACING
 	select CONTEXT_SWITCH_TRACER
 	bool
 
+config EVENT_POWER_TRACING_DEPRECATED
+	depends on EVENT_TRACING
+	bool
+	default y
+	help
+	  Provides old power event types:
+	  C-state/idle accounting events:
+	  power:power_start
+	  power:power_end
+	  and old cpufreq accounting event:
+	  power:power_frequency
+	  This is for userspace compatibility
+	  and will vanish after 5 kernel iterations,
+	  namely 2.6.41.
+
 config CONTEXT_SWITCH_TRACER
 	bool
 
diff --git a/kernel/trace/power-traces.c b/kernel/trace/power-traces.c
index 0e0497d..f55fcf6 100644
--- a/kernel/trace/power-traces.c
+++ b/kernel/trace/power-traces.c
@@ -13,5 +13,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/power.h>
 
+#ifdef EVENT_POWER_TRACING_DEPRECATED
 EXPORT_TRACEPOINT_SYMBOL_GPL(power_start);
+#endif
+EXPORT_TRACEPOINT_SYMBOL_GPL(cpu_idle);
 

^ permalink raw reply related	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2010-10-28 11:37 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-10-28  9:02 Cleanup and enhance power trace events Thomas Renninger
2010-10-28  9:02 ` [PATCH 1/3] PERF: Do not export power_frequency, but power_start event Thomas Renninger
2010-10-28  9:02 ` [PATCH 2/3] PERF(kernel): Cleanup power events Thomas Renninger
2010-10-28 11:17   ` Rafael J. Wysocki
2010-10-28 11:31     ` [linux-pm] " Rafael J. Wysocki
2010-10-28 11:37       ` Thomas Renninger
2010-10-28  9:02 ` [PATCH 3/3] PERF(userspace): Adjust perf timechart to the new " Thomas Renninger
2010-10-28  9:19   ` Thomas Renninger
     [not found] <1287488171-25303-1-git-send-email-trenn@suse.de>
2010-10-19 11:36 ` [PATCH 2/3] PERF(kernel): Cleanup " Thomas Renninger
2010-10-25  6:54   ` Arjan van de Ven
2010-10-25  9:41     ` Thomas Renninger
2010-10-25 13:55       ` Arjan van de Ven
2010-10-25 14:36         ` Thomas Renninger
2010-10-25 14:45           ` Arjan van de Ven
2010-10-25 14:56             ` Ingo Molnar
2010-10-25 15:48               ` Thomas Renninger
2010-10-25 16:00                 ` Arjan van de Ven
2010-10-25 23:32                   ` Thomas Renninger
2010-10-25  6:58   ` Arjan van de Ven
2010-10-25 10:04   ` Ingo Molnar
2010-10-25 11:03     ` Thomas Renninger
2010-10-25 11:55       ` Ingo Molnar
2010-10-25 12:55         ` Thomas Renninger
2010-10-25 14:11           ` Arjan van de Ven
2010-10-25 14:51             ` Thomas Renninger
2010-10-25 12:58         ` Mathieu Desnoyers
2010-10-25 20:29           ` Rafael J. Wysocki
2010-10-25 13:58       ` Arjan van de Ven
2010-10-25 20:33         ` Rafael J. Wysocki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox