linux-perf-users.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 0/2] Replace trace_cpu_frequency with trace_policy_frequency
@ 2025-12-01 20:24 Samuel Wu
  2025-12-01 20:24 ` [PATCH v3 1/2] cpufreq: " Samuel Wu
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Samuel Wu @ 2025-12-01 20:24 UTC (permalink / raw)
  To: Huang Rui, Gautham R. Shenoy, Mario Limonciello, Perry Yuan,
	Jonathan Corbet, Rafael J. Wysocki, Viresh Kumar, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Srinivas Pandruvada,
	Len Brown, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Ian Rogers,
	Adrian Hunter, James Clark
  Cc: christian.loehle, Samuel Wu, kernel-team, linux-pm, linux-doc,
	linux-kernel, linux-trace-kernel, bpf, linux-perf-users

This series replaces the cpu_frequency trace event with a new trace event,
policy_frequency. Since by definition all CPUs in a policy are of the same
frequency, we can emit a frequency change per policy instead of per CPU.
This saves some compute and memory from the kernel side, while simplifying
analysis from the post-processing of the trace log side.

Any process that relied on cpu_frequency trace event needs to switch to the
new policy_frequency trace event in order to maintain functionality. The
decision of replacing instead of adding the trace event is intentional. Since
emitting once per policy instead of once per CPU is anyways a semantics change
that would require a tooling update, the trace event was also appropriately
renamed. The presence of the policy_frequency event in a trace log is a clear
and obvious signal for tooling to determine kernel version and which trace
event to parse.

1/2: Replaces trace_cpu_frequency with trace_policy_frequency
2/2: Corresponding documentation patch that updates references to
     cpu_frequency with policy_frequency

Changes in v3:
- Resending v2 properly (accidentally ommited cover letter in v2)

Changes in v2:
- Replaced trace_cpu_frequency with trace_policy_frequency (per Christian
  and Viresh)
- Updated references to cpu_frequency in documentation with
  policy_frequency
- v1 link: https://lore.kernel.org/all/20251112235154.2974902-1-wusamuel@google.com

Samuel Wu (2):
  cpufreq: Replace trace_cpu_frequency with trace_policy_frequency
  cpufreq: Documentation update for trace_policy_frequency

 Documentation/admin-guide/pm/amd-pstate.rst   | 10 ++++----
 Documentation/admin-guide/pm/intel_pstate.rst | 14 +++++------
 Documentation/trace/events-power.rst          |  2 +-
 drivers/cpufreq/cpufreq.c                     | 14 ++---------
 drivers/cpufreq/intel_pstate.c                |  6 +++--
 include/trace/events/power.h                  | 24 ++++++++++++++++---
 kernel/trace/power-traces.c                   |  2 +-
 samples/bpf/cpustat_kern.c                    |  8 +++----
 samples/bpf/cpustat_user.c                    |  6 ++---
 tools/perf/builtin-timechart.c                | 12 +++++-----
 10 files changed, 54 insertions(+), 44 deletions(-)

-- 
2.52.0.107.ga0afd4fd5b-goog


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v3 1/2] cpufreq: Replace trace_cpu_frequency with trace_policy_frequency
  2025-12-01 20:24 [PATCH v3 0/2] Replace trace_cpu_frequency with trace_policy_frequency Samuel Wu
@ 2025-12-01 20:24 ` Samuel Wu
  2025-12-02 12:03   ` Douglas Raillard
  2025-12-04 12:48   ` Christian Loehle
  2025-12-01 20:24 ` [PATCH v3 2/2] cpufreq: Documentation update for trace_policy_frequency Samuel Wu
  2025-12-02  6:43 ` [PATCH v3 0/2] Replace trace_cpu_frequency with trace_policy_frequency Viresh Kumar
  2 siblings, 2 replies; 14+ messages in thread
From: Samuel Wu @ 2025-12-01 20:24 UTC (permalink / raw)
  To: Huang Rui, Gautham R. Shenoy, Mario Limonciello, Perry Yuan,
	Jonathan Corbet, Rafael J. Wysocki, Viresh Kumar, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Srinivas Pandruvada,
	Len Brown, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Ian Rogers,
	Adrian Hunter, James Clark
  Cc: christian.loehle, Samuel Wu, kernel-team, linux-pm, linux-doc,
	linux-kernel, linux-trace-kernel, bpf, linux-perf-users

The existing cpu_frequency trace_event can be verbose, emitting a nearly
identical trace event for every CPU in the policy even when their
frequencies are identical.

This patch replaces the cpu_frequency trace event with policy_frequency
trace event, a more efficient alternative. From the kernel's
perspective, emitting a trace event once per policy instead of once per
cpu saves some memory and is less overhead. From the post-processing
perspective, analysis of the trace log is simplified without any loss of
information.

Signed-off-by: Samuel Wu <wusamuel@google.com>
---
 drivers/cpufreq/cpufreq.c      | 14 ++------------
 drivers/cpufreq/intel_pstate.c |  6 ++++--
 include/trace/events/power.h   | 24 +++++++++++++++++++++---
 kernel/trace/power-traces.c    |  2 +-
 samples/bpf/cpustat_kern.c     |  8 ++++----
 samples/bpf/cpustat_user.c     |  6 +++---
 tools/perf/builtin-timechart.c | 12 ++++++------
 7 files changed, 41 insertions(+), 31 deletions(-)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 4472bb1ec83c..dd3f08f3b958 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -309,8 +309,6 @@ static void cpufreq_notify_transition(struct cpufreq_policy *policy,
 				      struct cpufreq_freqs *freqs,
 				      unsigned int state)
 {
-	int cpu;
-
 	BUG_ON(irqs_disabled());
 
 	if (cpufreq_disabled())
@@ -344,10 +342,7 @@ static void cpufreq_notify_transition(struct cpufreq_policy *policy,
 		adjust_jiffies(CPUFREQ_POSTCHANGE, freqs);
 		pr_debug("FREQ: %u - CPUs: %*pbl\n", freqs->new,
 			 cpumask_pr_args(policy->cpus));
-
-		for_each_cpu(cpu, policy->cpus)
-			trace_cpu_frequency(freqs->new, cpu);
-
+		trace_policy_frequency(freqs->new, policy->cpu, policy->cpus);
 		srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
 					 CPUFREQ_POSTCHANGE, freqs);
 
@@ -2201,7 +2196,6 @@ unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
 					unsigned int target_freq)
 {
 	unsigned int freq;
-	int cpu;
 
 	target_freq = clamp_val(target_freq, policy->min, policy->max);
 	freq = cpufreq_driver->fast_switch(policy, target_freq);
@@ -2213,11 +2207,7 @@ unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
 	arch_set_freq_scale(policy->related_cpus, freq,
 			    arch_scale_freq_ref(policy->cpu));
 	cpufreq_stats_record_transition(policy, freq);
-
-	if (trace_cpu_frequency_enabled()) {
-		for_each_cpu(cpu, policy->cpus)
-			trace_cpu_frequency(freq, cpu);
-	}
+	trace_policy_frequency(freq, policy->cpu, policy->cpus);
 
 	return freq;
 }
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index ec4abe374573..9724b5d19d83 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -2297,7 +2297,8 @@ static int hwp_get_cpu_scaling(int cpu)
 
 static void intel_pstate_set_pstate(struct cpudata *cpu, int pstate)
 {
-	trace_cpu_frequency(pstate * cpu->pstate.scaling, cpu->cpu);
+	trace_policy_frequency(pstate * cpu->pstate.scaling, cpu->cpu,
+			       cpumask_of(cpu->cpu));
 	cpu->pstate.current_pstate = pstate;
 	/*
 	 * Generally, there is no guarantee that this code will always run on
@@ -2587,7 +2588,8 @@ static void intel_pstate_adjust_pstate(struct cpudata *cpu)
 
 	target_pstate = get_target_pstate(cpu);
 	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
-	trace_cpu_frequency(target_pstate * cpu->pstate.scaling, cpu->cpu);
+	trace_policy_frequency(target_pstate * cpu->pstate.scaling, cpu->cpu,
+			       cpumask_of(cpu->cpu));
 	intel_pstate_update_pstate(cpu, target_pstate);
 
 	sample = &cpu->sample;
diff --git a/include/trace/events/power.h b/include/trace/events/power.h
index 370f8df2fdb4..317098ffdd5f 100644
--- a/include/trace/events/power.h
+++ b/include/trace/events/power.h
@@ -182,11 +182,29 @@ TRACE_EVENT(pstate_sample,
 		{ PM_EVENT_RECOVER, "recover" }, \
 		{ PM_EVENT_POWEROFF, "poweroff" })
 
-DEFINE_EVENT(cpu, cpu_frequency,
+TRACE_EVENT(policy_frequency,
 
-	TP_PROTO(unsigned int frequency, unsigned int cpu_id),
+	TP_PROTO(unsigned int frequency, unsigned int cpu_id,
+		 const struct cpumask *policy_cpus),
 
-	TP_ARGS(frequency, cpu_id)
+	TP_ARGS(frequency, cpu_id, policy_cpus),
+
+	TP_STRUCT__entry(
+		__field(u32, state)
+		__field(u32, cpu_id)
+		__cpumask(cpumask)
+	),
+
+	TP_fast_assign(
+		__entry->state = frequency;
+		__entry->cpu_id = cpu_id;
+		__assign_cpumask(cpumask, policy_cpus);
+	),
+
+	TP_printk("state=%lu cpu_id=%lu policy_cpus=%*pb",
+		  (unsigned long)__entry->state,
+		  (unsigned long)__entry->cpu_id,
+		  cpumask_pr_args((struct cpumask *)__get_dynamic_array(cpumask)))
 );
 
 TRACE_EVENT(cpu_frequency_limits,
diff --git a/kernel/trace/power-traces.c b/kernel/trace/power-traces.c
index f2fe33573e54..a537e68a6878 100644
--- a/kernel/trace/power-traces.c
+++ b/kernel/trace/power-traces.c
@@ -16,5 +16,5 @@
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(suspend_resume);
 EXPORT_TRACEPOINT_SYMBOL_GPL(cpu_idle);
-EXPORT_TRACEPOINT_SYMBOL_GPL(cpu_frequency);
+EXPORT_TRACEPOINT_SYMBOL_GPL(policy_frequency);
 
diff --git a/samples/bpf/cpustat_kern.c b/samples/bpf/cpustat_kern.c
index 7ec7143e2757..f485de0f89b2 100644
--- a/samples/bpf/cpustat_kern.c
+++ b/samples/bpf/cpustat_kern.c
@@ -75,9 +75,9 @@ struct {
 } pstate_duration SEC(".maps");
 
 /*
- * The trace events for cpu_idle and cpu_frequency are taken from:
+ * The trace events for cpu_idle and policy_frequency are taken from:
  * /sys/kernel/tracing/events/power/cpu_idle/format
- * /sys/kernel/tracing/events/power/cpu_frequency/format
+ * /sys/kernel/tracing/events/power/policy_frequency/format
  *
  * These two events have same format, so define one common structure.
  */
@@ -162,7 +162,7 @@ int bpf_prog1(struct cpu_args *ctx)
 	 */
 	if (ctx->state != (u32)-1) {
 
-		/* record pstate after have first cpu_frequency event */
+		/* record pstate after have first policy_frequency event */
 		if (!*pts)
 			return 0;
 
@@ -208,7 +208,7 @@ int bpf_prog1(struct cpu_args *ctx)
 	return 0;
 }
 
-SEC("tracepoint/power/cpu_frequency")
+SEC("tracepoint/power/policy_frequency")
 int bpf_prog2(struct cpu_args *ctx)
 {
 	u64 *pts, *cstate, *pstate, cur_ts, delta;
diff --git a/samples/bpf/cpustat_user.c b/samples/bpf/cpustat_user.c
index 356f756cba0d..f7e81f702358 100644
--- a/samples/bpf/cpustat_user.c
+++ b/samples/bpf/cpustat_user.c
@@ -143,12 +143,12 @@ static int cpu_stat_inject_cpu_idle_event(void)
 
 /*
  * It's possible to have no any frequency change for long time and cannot
- * get ftrace event 'trace_cpu_frequency' for long period, this introduces
+ * get ftrace event 'trace_policy_frequency' for long period, this introduces
  * big deviation for pstate statistics.
  *
  * To solve this issue, below code forces to set 'scaling_max_freq' to 208MHz
- * for triggering ftrace event 'trace_cpu_frequency' and then recovery back to
- * the maximum frequency value 1.2GHz.
+ * for triggering ftrace event 'trace_policy_frequency' and then recovery back
+ * to the maximum frequency value 1.2GHz.
  */
 static int cpu_stat_inject_cpu_frequency_event(void)
 {
diff --git a/tools/perf/builtin-timechart.c b/tools/perf/builtin-timechart.c
index 22050c640dfa..3ef1a2fd0493 100644
--- a/tools/perf/builtin-timechart.c
+++ b/tools/perf/builtin-timechart.c
@@ -612,10 +612,10 @@ process_sample_cpu_idle(struct timechart *tchart __maybe_unused,
 }
 
 static int
-process_sample_cpu_frequency(struct timechart *tchart,
-			     struct evsel *evsel,
-			     struct perf_sample *sample,
-			     const char *backtrace __maybe_unused)
+process_sample_policy_frequency(struct timechart *tchart,
+				struct evsel *evsel,
+				struct perf_sample *sample,
+				const char *backtrace __maybe_unused)
 {
 	u32 state  = evsel__intval(evsel, sample, "state");
 	u32 cpu_id = evsel__intval(evsel, sample, "cpu_id");
@@ -1541,7 +1541,7 @@ static int __cmd_timechart(struct timechart *tchart, const char *output_name)
 {
 	const struct evsel_str_handler power_tracepoints[] = {
 		{ "power:cpu_idle",		process_sample_cpu_idle },
-		{ "power:cpu_frequency",	process_sample_cpu_frequency },
+		{ "power:policy_frequency",	process_sample_policy_frequency },
 		{ "sched:sched_wakeup",		process_sample_sched_wakeup },
 		{ "sched:sched_switch",		process_sample_sched_switch },
 #ifdef SUPPORT_OLD_POWER_EVENTS
@@ -1804,7 +1804,7 @@ static int timechart__record(struct timechart *tchart, int argc, const char **ar
 	unsigned int backtrace_args_no = ARRAY_SIZE(backtrace_args);
 
 	const char * const power_args[] = {
-		"-e", "power:cpu_frequency",
+		"-e", "power:policy_frequency",
 		"-e", "power:cpu_idle",
 	};
 	unsigned int power_args_nr = ARRAY_SIZE(power_args);
-- 
2.52.0.107.ga0afd4fd5b-goog


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v3 2/2] cpufreq: Documentation update for trace_policy_frequency
  2025-12-01 20:24 [PATCH v3 0/2] Replace trace_cpu_frequency with trace_policy_frequency Samuel Wu
  2025-12-01 20:24 ` [PATCH v3 1/2] cpufreq: " Samuel Wu
@ 2025-12-01 20:24 ` Samuel Wu
  2025-12-01 23:11   ` Tiffany Yang
  2025-12-02  6:43 ` [PATCH v3 0/2] Replace trace_cpu_frequency with trace_policy_frequency Viresh Kumar
  2 siblings, 1 reply; 14+ messages in thread
From: Samuel Wu @ 2025-12-01 20:24 UTC (permalink / raw)
  To: Huang Rui, Gautham R. Shenoy, Mario Limonciello, Perry Yuan,
	Jonathan Corbet, Rafael J. Wysocki, Viresh Kumar, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Srinivas Pandruvada,
	Len Brown, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Ian Rogers,
	Adrian Hunter, James Clark
  Cc: christian.loehle, Samuel Wu, kernel-team, linux-pm, linux-doc,
	linux-kernel, linux-trace-kernel, bpf, linux-perf-users

Documentation update corresponding to replace the cpu_frequency trace
event with the policy_frequency trace event.

Signed-off-by: Samuel Wu <wusamuel@google.com>
---
 Documentation/admin-guide/pm/amd-pstate.rst   | 10 +++++-----
 Documentation/admin-guide/pm/intel_pstate.rst | 14 +++++++-------
 Documentation/trace/events-power.rst          |  2 +-
 3 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/Documentation/admin-guide/pm/amd-pstate.rst b/Documentation/admin-guide/pm/amd-pstate.rst
index e1771f2225d5..e110854ece88 100644
--- a/Documentation/admin-guide/pm/amd-pstate.rst
+++ b/Documentation/admin-guide/pm/amd-pstate.rst
@@ -503,8 +503,8 @@ Trace Events
 --------------
 
 There are two static trace events that can be used for ``amd-pstate``
-diagnostics. One of them is the ``cpu_frequency`` trace event generally used
-by ``CPUFreq``, and the other one is the ``amd_pstate_perf`` trace event
+diagnostics. One of them is the ``policy_frequency`` trace event generally
+used by ``CPUFreq``, and the other one is the ``amd_pstate_perf`` trace event
 specific to ``amd-pstate``.  The following sequence of shell commands can
 be used to enable them and see their output (if the kernel is
 configured to support event tracing). ::
@@ -531,9 +531,9 @@ configured to support event tracing). ::
           <idle>-0       [003] d.s..  4995.980971: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=3 changed=false fast_switch=true
           <idle>-0       [011] d.s..  4995.980996: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=11 changed=false fast_switch=true
 
-The ``cpu_frequency`` trace event will be triggered either by the ``schedutil`` scaling
-governor (for the policies it is attached to), or by the ``CPUFreq`` core (for the
-policies with other scaling governors).
+The ``policy_frequency`` trace event will be triggered either by the
+``schedutil`` scaling governor (for the policies it is attached to), or by the
+``CPUFreq`` core (for the policies with other scaling governors).
 
 
 Tracer Tool
diff --git a/Documentation/admin-guide/pm/intel_pstate.rst b/Documentation/admin-guide/pm/intel_pstate.rst
index fde967b0c2e0..274c9208f342 100644
--- a/Documentation/admin-guide/pm/intel_pstate.rst
+++ b/Documentation/admin-guide/pm/intel_pstate.rst
@@ -822,23 +822,23 @@ Trace Events
 ------------
 
 There are two static trace events that can be used for ``intel_pstate``
-diagnostics.  One of them is the ``cpu_frequency`` trace event generally used
-by ``CPUFreq``, and the other one is the ``pstate_sample`` trace event specific
-to ``intel_pstate``.  Both of them are triggered by ``intel_pstate`` only if
-it works in the :ref:`active mode <active_mode>`.
+diagnostics.  One of them is the ``policy_frequency`` trace event generally
+used by ``CPUFreq``, and the other one is the ``pstate_sample`` trace event
+specific to ``intel_pstate``.  Both of them are triggered by ``intel_pstate``
+only if it works in the :ref:`active mode <active_mode>`.
 
 The following sequence of shell commands can be used to enable them and see
 their output (if the kernel is generally configured to support event tracing)::
 
  # cd /sys/kernel/tracing/
  # echo 1 > events/power/pstate_sample/enable
- # echo 1 > events/power/cpu_frequency/enable
+ # echo 1 > events/power/policy_frequency/enable
  # cat trace
  gnome-terminal--4510  [001] ..s.  1177.680733: pstate_sample: core_busy=107 scaled=94 from=26 to=26 mperf=1143818 aperf=1230607 tsc=29838618 freq=2474476
- cat-5235  [002] ..s.  1177.681723: cpu_frequency: state=2900000 cpu_id=2
+ cat-5235  [002] ..s.  1177.681723: policy_frequency: state=2900000 cpu_id=2 policy_cpus=04
 
 If ``intel_pstate`` works in the :ref:`passive mode <passive_mode>`, the
-``cpu_frequency`` trace event will be triggered either by the ``schedutil``
+``policy_frequency`` trace event will be triggered either by the ``schedutil``
 scaling governor (for the policies it is attached to), or by the ``CPUFreq``
 core (for the policies with other scaling governors).
 
diff --git a/Documentation/trace/events-power.rst b/Documentation/trace/events-power.rst
index f45bf11fa88d..f013c74b932f 100644
--- a/Documentation/trace/events-power.rst
+++ b/Documentation/trace/events-power.rst
@@ -26,8 +26,8 @@ cpufreq.
 ::
 
   cpu_idle		"state=%lu cpu_id=%lu"
-  cpu_frequency		"state=%lu cpu_id=%lu"
   cpu_frequency_limits	"min=%lu max=%lu cpu_id=%lu"
+  policy_frequency	"state=%lu cpu_id=%lu policy_cpus=%*pb"
 
 A suspend event is used to indicate the system going in and out of the
 suspend mode:
-- 
2.52.0.107.ga0afd4fd5b-goog


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 2/2] cpufreq: Documentation update for trace_policy_frequency
  2025-12-01 20:24 ` [PATCH v3 2/2] cpufreq: Documentation update for trace_policy_frequency Samuel Wu
@ 2025-12-01 23:11   ` Tiffany Yang
  0 siblings, 0 replies; 14+ messages in thread
From: Tiffany Yang @ 2025-12-01 23:11 UTC (permalink / raw)
  To: 'Samuel Wu' via kernel-team
  Cc: Huang Rui, Gautham R. Shenoy, Mario Limonciello, Perry Yuan,
	Jonathan Corbet, Rafael J. Wysocki, Viresh Kumar, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Srinivas Pandruvada,
	Len Brown, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Ian Rogers,
	Adrian Hunter, James Clark, Samuel Wu, christian.loehle, linux-pm,
	linux-doc, linux-kernel, linux-trace-kernel, bpf,
	linux-perf-users

Hi Sam,

IMO this type of documentation should be in the same patch as the
related change because having a single commit makes it easier to track
(especially if/when these changes are cherry-picked to other
trees). There may be reasons to keep them separate that I'm not thinking
of, so if others disagree, defer to them!


"'Samuel Wu' via kernel-team" <kernel-team@android.com> writes:

> Documentation update corresponding to replace the cpu_frequency trace
> event with the policy_frequency trace event.

> Signed-off-by: Samuel Wu <wusamuel@google.com>
> ---
>   Documentation/admin-guide/pm/amd-pstate.rst   | 10 +++++-----
>   Documentation/admin-guide/pm/intel_pstate.rst | 14 +++++++-------
>   Documentation/trace/events-power.rst          |  2 +-
>   3 files changed, 13 insertions(+), 13 deletions(-)
<snip>

>   A suspend event is used to indicate the system going in and out of the
>   suspend mode:

Otherwise, this change lgtm.

-- 
Tiffany Y. Yang

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 0/2] Replace trace_cpu_frequency with trace_policy_frequency
  2025-12-01 20:24 [PATCH v3 0/2] Replace trace_cpu_frequency with trace_policy_frequency Samuel Wu
  2025-12-01 20:24 ` [PATCH v3 1/2] cpufreq: " Samuel Wu
  2025-12-01 20:24 ` [PATCH v3 2/2] cpufreq: Documentation update for trace_policy_frequency Samuel Wu
@ 2025-12-02  6:43 ` Viresh Kumar
  2 siblings, 0 replies; 14+ messages in thread
From: Viresh Kumar @ 2025-12-02  6:43 UTC (permalink / raw)
  To: Samuel Wu
  Cc: Huang Rui, Gautham R. Shenoy, Mario Limonciello, Perry Yuan,
	Jonathan Corbet, Rafael J. Wysocki, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Srinivas Pandruvada,
	Len Brown, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Ian Rogers,
	Adrian Hunter, James Clark, christian.loehle, kernel-team,
	linux-pm, linux-doc, linux-kernel, linux-trace-kernel, bpf,
	linux-perf-users

On 01-12-25, 12:24, Samuel Wu wrote:
> This series replaces the cpu_frequency trace event with a new trace event,
> policy_frequency. Since by definition all CPUs in a policy are of the same
> frequency, we can emit a frequency change per policy instead of per CPU.
> This saves some compute and memory from the kernel side, while simplifying
> analysis from the post-processing of the trace log side.
> 
> Any process that relied on cpu_frequency trace event needs to switch to the
> new policy_frequency trace event in order to maintain functionality. The
> decision of replacing instead of adding the trace event is intentional. Since
> emitting once per policy instead of once per CPU is anyways a semantics change
> that would require a tooling update, the trace event was also appropriately
> renamed. The presence of the policy_frequency event in a trace log is a clear
> and obvious signal for tooling to determine kernel version and which trace
> event to parse.
> 
> 1/2: Replaces trace_cpu_frequency with trace_policy_frequency
> 2/2: Corresponding documentation patch that updates references to
>      cpu_frequency with policy_frequency
> 
> Changes in v3:
> - Resending v2 properly (accidentally ommited cover letter in v2)
> 
> Changes in v2:
> - Replaced trace_cpu_frequency with trace_policy_frequency (per Christian
>   and Viresh)
> - Updated references to cpu_frequency in documentation with
>   policy_frequency
> - v1 link: https://lore.kernel.org/all/20251112235154.2974902-1-wusamuel@google.com
> 
> Samuel Wu (2):
>   cpufreq: Replace trace_cpu_frequency with trace_policy_frequency
>   cpufreq: Documentation update for trace_policy_frequency
> 
>  Documentation/admin-guide/pm/amd-pstate.rst   | 10 ++++----
>  Documentation/admin-guide/pm/intel_pstate.rst | 14 +++++------
>  Documentation/trace/events-power.rst          |  2 +-
>  drivers/cpufreq/cpufreq.c                     | 14 ++---------
>  drivers/cpufreq/intel_pstate.c                |  6 +++--
>  include/trace/events/power.h                  | 24 ++++++++++++++++---
>  kernel/trace/power-traces.c                   |  2 +-
>  samples/bpf/cpustat_kern.c                    |  8 +++----
>  samples/bpf/cpustat_user.c                    |  6 ++---
>  tools/perf/builtin-timechart.c                | 12 +++++-----
>  10 files changed, 54 insertions(+), 44 deletions(-)

Acked-by: Viresh Kumar <viresh.kumar@linaro.org>

-- 
viresh

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 1/2] cpufreq: Replace trace_cpu_frequency with trace_policy_frequency
  2025-12-01 20:24 ` [PATCH v3 1/2] cpufreq: " Samuel Wu
@ 2025-12-02 12:03   ` Douglas Raillard
  2025-12-04  0:33     ` Samuel Wu
  2025-12-04 12:48   ` Christian Loehle
  1 sibling, 1 reply; 14+ messages in thread
From: Douglas Raillard @ 2025-12-02 12:03 UTC (permalink / raw)
  To: Samuel Wu, Huang Rui, Gautham R. Shenoy, Mario Limonciello,
	Perry Yuan, Jonathan Corbet, Rafael J. Wysocki, Viresh Kumar,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Srinivas Pandruvada, Len Brown, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Ian Rogers, Adrian Hunter, James Clark
  Cc: christian.loehle, kernel-team, linux-pm, linux-doc, linux-kernel,
	linux-trace-kernel, bpf, linux-perf-users

Hi Samuel,

On 01-12-2025 20:24, Samuel Wu wrote:
> The existing cpu_frequency trace_event can be verbose, emitting a nearly
> identical trace event for every CPU in the policy even when their
> frequencies are identical.
> 
> This patch replaces the cpu_frequency trace event with policy_frequency
> trace event, a more efficient alternative. From the kernel's
> perspective, emitting a trace event once per policy instead of once per
> cpu saves some memory and is less overhead.

I'd be fully behind that as a general guideline.

> From the post-processing
> perspective, analysis of the trace log is simplified without any loss of
> information.

Unfortunately I'm not so sure about the "simplified" part (as of today),
more on that below.

> 
> Signed-off-by: Samuel Wu <wusamuel@google.com>
> ---
>   drivers/cpufreq/cpufreq.c      | 14 ++------------
>   drivers/cpufreq/intel_pstate.c |  6 ++++--
>   include/trace/events/power.h   | 24 +++++++++++++++++++++---
>   kernel/trace/power-traces.c    |  2 +-
>   samples/bpf/cpustat_kern.c     |  8 ++++----
>   samples/bpf/cpustat_user.c     |  6 +++---
>   tools/perf/builtin-timechart.c | 12 ++++++------
>   7 files changed, 41 insertions(+), 31 deletions(-)
> 
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 4472bb1ec83c..dd3f08f3b958 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -309,8 +309,6 @@ static void cpufreq_notify_transition(struct cpufreq_policy *policy,
>   				      struct cpufreq_freqs *freqs,
>   				      unsigned int state)
>   {
> -	int cpu;
> -
>   	BUG_ON(irqs_disabled());
>   
>   	if (cpufreq_disabled())
> @@ -344,10 +342,7 @@ static void cpufreq_notify_transition(struct cpufreq_policy *policy,
>   		adjust_jiffies(CPUFREQ_POSTCHANGE, freqs);
>   		pr_debug("FREQ: %u - CPUs: %*pbl\n", freqs->new,
>   			 cpumask_pr_args(policy->cpus));
> -
> -		for_each_cpu(cpu, policy->cpus)
> -			trace_cpu_frequency(freqs->new, cpu);
> -
> +		trace_policy_frequency(freqs->new, policy->cpu, policy->cpus);
>   		srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
>   					 CPUFREQ_POSTCHANGE, freqs);
>   
> @@ -2201,7 +2196,6 @@ unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
>   					unsigned int target_freq)
>   {
>   	unsigned int freq;
> -	int cpu;
>   
>   	target_freq = clamp_val(target_freq, policy->min, policy->max);
>   	freq = cpufreq_driver->fast_switch(policy, target_freq);
> @@ -2213,11 +2207,7 @@ unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
>   	arch_set_freq_scale(policy->related_cpus, freq,
>   			    arch_scale_freq_ref(policy->cpu));
>   	cpufreq_stats_record_transition(policy, freq);
> -
> -	if (trace_cpu_frequency_enabled()) {
> -		for_each_cpu(cpu, policy->cpus)
> -			trace_cpu_frequency(freq, cpu);
> -	}
> +	trace_policy_frequency(freq, policy->cpu, policy->cpus);
>   
>   	return freq;
>   }
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index ec4abe374573..9724b5d19d83 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -2297,7 +2297,8 @@ static int hwp_get_cpu_scaling(int cpu)
>   
>   static void intel_pstate_set_pstate(struct cpudata *cpu, int pstate)
>   {
> -	trace_cpu_frequency(pstate * cpu->pstate.scaling, cpu->cpu);
> +	trace_policy_frequency(pstate * cpu->pstate.scaling, cpu->cpu,
> +			       cpumask_of(cpu->cpu));
>   	cpu->pstate.current_pstate = pstate;
>   	/*
>   	 * Generally, there is no guarantee that this code will always run on
> @@ -2587,7 +2588,8 @@ static void intel_pstate_adjust_pstate(struct cpudata *cpu)
>   
>   	target_pstate = get_target_pstate(cpu);
>   	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
> -	trace_cpu_frequency(target_pstate * cpu->pstate.scaling, cpu->cpu);
> +	trace_policy_frequency(target_pstate * cpu->pstate.scaling, cpu->cpu,
> +			       cpumask_of(cpu->cpu));
>   	intel_pstate_update_pstate(cpu, target_pstate);
>   
>   	sample = &cpu->sample;
> diff --git a/include/trace/events/power.h b/include/trace/events/power.h
> index 370f8df2fdb4..317098ffdd5f 100644
> --- a/include/trace/events/power.h
> +++ b/include/trace/events/power.h
> @@ -182,11 +182,29 @@ TRACE_EVENT(pstate_sample,
>   		{ PM_EVENT_RECOVER, "recover" }, \
>   		{ PM_EVENT_POWEROFF, "poweroff" })
>   
> -DEFINE_EVENT(cpu, cpu_frequency,
> +TRACE_EVENT(policy_frequency,
>   
> -	TP_PROTO(unsigned int frequency, unsigned int cpu_id),
> +	TP_PROTO(unsigned int frequency, unsigned int cpu_id,
> +		 const struct cpumask *policy_cpus),
>   
> -	TP_ARGS(frequency, cpu_id)
> +	TP_ARGS(frequency, cpu_id, policy_cpus),
> +
> +	TP_STRUCT__entry(
> +		__field(u32, state)
> +		__field(u32, cpu_id)
> +		__cpumask(cpumask)

Using a cpumask is the most technically correct option here, but it also carries a big issue.
Userspace tooling will have a very hard time doing anything with it. A lot of that is down
to having no appropriate counterpart in "table libraries" in general (e.g. what would you
map that to in pandas, polars or SQL ?). Some of the lack of support is probably also down to how
infrequently used it is. For example I don't think Perfetto would be able to handle that
in the ftrace_event and args table, as the documented supported value types are:

args table:
value_type 	STRING 	The type of the value of the arg. Will be one of 'int', 'uint', 'string', 'real', 'pointer', 'bool' or 'json'.
https://perfetto.dev/docs/analysis/stdlib-docs

So while I definitely support improving the situation around cpumasks (I lobbied a bit for that),
I don't think the ecosystem is ready for it yet and having such a core event switched to using it
is going to cause a lot of pain.

Some alternatives for tooling could be:
1. Record the policy cpumasks in a tool-friendly format in the trace header, but no current format I know
    of provides that, and ftrace does not provide a "JSON blob to be passed through" we could easily append to.
    Any such addition will therefore require libraries update which will take time.

3. Doing without the data in the trace. That means collecting and bundling another sidecar file, which
    is really not convenient and still requires 3rd party tool modifications for end users.

2. Add policy_frequency event, but not remove cpu_frequency yet. Possibly with a deprecation warning
    when enabling the event.

> +	),
> +
> +	TP_fast_assign(
> +		__entry->state = frequency;
> +		__entry->cpu_id = cpu_id;
> +		__assign_cpumask(cpumask, policy_cpus);

ipi_send_cpumask uses cpumask_bits():

		__assign_cpumask(cpumask, cpumask_bits(cpumask));

It's not clear what is best practice, as struct cpumask contains a single member anyway and
__assign_cpumask() expands to a memcpy() so they are functionally identical.

> +	),
> +
> +	TP_printk("state=%lu cpu_id=%lu policy_cpus=%*pb",
> +		  (unsigned long)__entry->state,
> +		  (unsigned long)__entry->cpu_id,
> +		  cpumask_pr_args((struct cpumask *)__get_dynamic_array(cpumask)))

Looking at ipi_send_cpumask, this should be:

   __get_cpumask(cpumask)

The cast and cpumask_pr_args() may look like it's working, but there is only a very slim
chance any downstream tool will know what to do with this. Looking at libtraceevent
(which trace-cmd is based on):

./utest/traceevent-utest.c:116: "print fmt: \"cpumask=%s\", __get_cpumask(cpumask)\n";
./src/event-parse.c:3674:       if (strcmp(token, "__get_cpumask") == 0 ||
./src/event-parse.c:7568:               printf("__get_cpumask(%s)", args->bitmask.bitmask);

But there is no match for "cpumask_pr_args".

Considering the gap between what works when using the in-kernel text rendering and what can be
reasonably expected to work in any other userspace tool, it's a good idea to try
as many as possible unfortunately.

>   );
>   
>   TRACE_EVENT(cpu_frequency_limits,
> diff --git a/kernel/trace/power-traces.c b/kernel/trace/power-traces.c
> index f2fe33573e54..a537e68a6878 100644
> --- a/kernel/trace/power-traces.c
> +++ b/kernel/trace/power-traces.c
> @@ -16,5 +16,5 @@
>   
>   EXPORT_TRACEPOINT_SYMBOL_GPL(suspend_resume);
>   EXPORT_TRACEPOINT_SYMBOL_GPL(cpu_idle);
> -EXPORT_TRACEPOINT_SYMBOL_GPL(cpu_frequency);
> +EXPORT_TRACEPOINT_SYMBOL_GPL(policy_frequency);
>   
> diff --git a/samples/bpf/cpustat_kern.c b/samples/bpf/cpustat_kern.c
> index 7ec7143e2757..f485de0f89b2 100644
> --- a/samples/bpf/cpustat_kern.c
> +++ b/samples/bpf/cpustat_kern.c
> @@ -75,9 +75,9 @@ struct {
>   } pstate_duration SEC(".maps");
>   
>   /*
> - * The trace events for cpu_idle and cpu_frequency are taken from:
> + * The trace events for cpu_idle and policy_frequency are taken from:
>    * /sys/kernel/tracing/events/power/cpu_idle/format
> - * /sys/kernel/tracing/events/power/cpu_frequency/format
> + * /sys/kernel/tracing/events/power/policy_frequency/format
>    *
>    * These two events have same format, so define one common structure.
>    */
> @@ -162,7 +162,7 @@ int bpf_prog1(struct cpu_args *ctx)
>   	 */
>   	if (ctx->state != (u32)-1) {
>   
> -		/* record pstate after have first cpu_frequency event */
> +		/* record pstate after have first policy_frequency event */
>   		if (!*pts)
>   			return 0;
>   
> @@ -208,7 +208,7 @@ int bpf_prog1(struct cpu_args *ctx)
>   	return 0;
>   }
>   
> -SEC("tracepoint/power/cpu_frequency")
> +SEC("tracepoint/power/policy_frequency")
>   int bpf_prog2(struct cpu_args *ctx)
>   {
>   	u64 *pts, *cstate, *pstate, cur_ts, delta;
> diff --git a/samples/bpf/cpustat_user.c b/samples/bpf/cpustat_user.c
> index 356f756cba0d..f7e81f702358 100644
> --- a/samples/bpf/cpustat_user.c
> +++ b/samples/bpf/cpustat_user.c
> @@ -143,12 +143,12 @@ static int cpu_stat_inject_cpu_idle_event(void)
>   
>   /*
>    * It's possible to have no any frequency change for long time and cannot
> - * get ftrace event 'trace_cpu_frequency' for long period, this introduces
> + * get ftrace event 'trace_policy_frequency' for long period, this introduces
>    * big deviation for pstate statistics.
>    *
>    * To solve this issue, below code forces to set 'scaling_max_freq' to 208MHz
> - * for triggering ftrace event 'trace_cpu_frequency' and then recovery back to
> - * the maximum frequency value 1.2GHz.
> + * for triggering ftrace event 'trace_policy_frequency' and then recovery back
> + * to the maximum frequency value 1.2GHz.
>    */
>   static int cpu_stat_inject_cpu_frequency_event(void)
>   {
> diff --git a/tools/perf/builtin-timechart.c b/tools/perf/builtin-timechart.c
> index 22050c640dfa..3ef1a2fd0493 100644
> --- a/tools/perf/builtin-timechart.c
> +++ b/tools/perf/builtin-timechart.c
> @@ -612,10 +612,10 @@ process_sample_cpu_idle(struct timechart *tchart __maybe_unused,
>   }
>   
>   static int
> -process_sample_cpu_frequency(struct timechart *tchart,
> -			     struct evsel *evsel,
> -			     struct perf_sample *sample,
> -			     const char *backtrace __maybe_unused)
> +process_sample_policy_frequency(struct timechart *tchart,
> +				struct evsel *evsel,
> +				struct perf_sample *sample,
> +				const char *backtrace __maybe_unused)
>   {
>   	u32 state  = evsel__intval(evsel, sample, "state");
>   	u32 cpu_id = evsel__intval(evsel, sample, "cpu_id");
> @@ -1541,7 +1541,7 @@ static int __cmd_timechart(struct timechart *tchart, const char *output_name)
>   {
>   	const struct evsel_str_handler power_tracepoints[] = {
>   		{ "power:cpu_idle",		process_sample_cpu_idle },
> -		{ "power:cpu_frequency",	process_sample_cpu_frequency },
> +		{ "power:policy_frequency",	process_sample_policy_frequency },
>   		{ "sched:sched_wakeup",		process_sample_sched_wakeup },
>   		{ "sched:sched_switch",		process_sample_sched_switch },
>   #ifdef SUPPORT_OLD_POWER_EVENTS
> @@ -1804,7 +1804,7 @@ static int timechart__record(struct timechart *tchart, int argc, const char **ar
>   	unsigned int backtrace_args_no = ARRAY_SIZE(backtrace_args);
>   
>   	const char * const power_args[] = {
> -		"-e", "power:cpu_frequency",
> +		"-e", "power:policy_frequency",
>   		"-e", "power:cpu_idle",
>   	};
>   	unsigned int power_args_nr = ARRAY_SIZE(power_args);

--

Douglas


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 1/2] cpufreq: Replace trace_cpu_frequency with trace_policy_frequency
  2025-12-02 12:03   ` Douglas Raillard
@ 2025-12-04  0:33     ` Samuel Wu
  0 siblings, 0 replies; 14+ messages in thread
From: Samuel Wu @ 2025-12-04  0:33 UTC (permalink / raw)
  To: Douglas Raillard
  Cc: Huang Rui, Gautham R. Shenoy, Mario Limonciello, Perry Yuan,
	Jonathan Corbet, Rafael J. Wysocki, Viresh Kumar, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Srinivas Pandruvada,
	Len Brown, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Ian Rogers,
	Adrian Hunter, James Clark, christian.loehle, kernel-team,
	linux-pm, linux-doc, linux-kernel, linux-trace-kernel, bpf,
	linux-perf-users, Lalit Maganti

On Tue, Dec 2, 2025 at 4:03 AM Douglas Raillard
<douglas.raillard@arm.com> wrote:
>
> Hi Samuel,
>
> On 01-12-2025 20:24, Samuel Wu wrote:
> > The existing cpu_frequency trace_event can be verbose, emitting a nearly
> > identical trace event for every CPU in the policy even when their
> > frequencies are identical.
> >
> > This patch replaces the cpu_frequency trace event with policy_frequency
> > trace event, a more efficient alternative. From the kernel's
> > perspective, emitting a trace event once per policy instead of once per
> > cpu saves some memory and is less overhead.
>
> I'd be fully behind that as a general guideline.
>
> > From the post-processing
> > perspective, analysis of the trace log is simplified without any loss of
> > information.
>
> Unfortunately I'm not so sure about the "simplified" part (as of today),
> more on that below.
>
> >
> > Signed-off-by: Samuel Wu <wusamuel@google.com>
> > ---
> >   drivers/cpufreq/cpufreq.c      | 14 ++------------
> >   drivers/cpufreq/intel_pstate.c |  6 ++++--
> >   include/trace/events/power.h   | 24 +++++++++++++++++++++---
> >   kernel/trace/power-traces.c    |  2 +-
> >   samples/bpf/cpustat_kern.c     |  8 ++++----
> >   samples/bpf/cpustat_user.c     |  6 +++---
> >   tools/perf/builtin-timechart.c | 12 ++++++------
> >   7 files changed, 41 insertions(+), 31 deletions(-)
> >
> > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> > index 4472bb1ec83c..dd3f08f3b958 100644
> > --- a/drivers/cpufreq/cpufreq.c
> > +++ b/drivers/cpufreq/cpufreq.c
> > @@ -309,8 +309,6 @@ static void cpufreq_notify_transition(struct cpufreq_policy *policy,
> >                                     struct cpufreq_freqs *freqs,
> >                                     unsigned int state)
> >   {
> > -     int cpu;
> > -
> >       BUG_ON(irqs_disabled());
> >
> >       if (cpufreq_disabled())
> > @@ -344,10 +342,7 @@ static void cpufreq_notify_transition(struct cpufreq_policy *policy,
> >               adjust_jiffies(CPUFREQ_POSTCHANGE, freqs);
> >               pr_debug("FREQ: %u - CPUs: %*pbl\n", freqs->new,
> >                        cpumask_pr_args(policy->cpus));
> > -
> > -             for_each_cpu(cpu, policy->cpus)
> > -                     trace_cpu_frequency(freqs->new, cpu);
> > -
> > +             trace_policy_frequency(freqs->new, policy->cpu, policy->cpus);
> >               srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
> >                                        CPUFREQ_POSTCHANGE, freqs);
> >
> > @@ -2201,7 +2196,6 @@ unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
> >                                       unsigned int target_freq)
> >   {
> >       unsigned int freq;
> > -     int cpu;
> >
> >       target_freq = clamp_val(target_freq, policy->min, policy->max);
> >       freq = cpufreq_driver->fast_switch(policy, target_freq);
> > @@ -2213,11 +2207,7 @@ unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
> >       arch_set_freq_scale(policy->related_cpus, freq,
> >                           arch_scale_freq_ref(policy->cpu));
> >       cpufreq_stats_record_transition(policy, freq);
> > -
> > -     if (trace_cpu_frequency_enabled()) {
> > -             for_each_cpu(cpu, policy->cpus)
> > -                     trace_cpu_frequency(freq, cpu);
> > -     }
> > +     trace_policy_frequency(freq, policy->cpu, policy->cpus);
> >
> >       return freq;
> >   }
> > diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> > index ec4abe374573..9724b5d19d83 100644
> > --- a/drivers/cpufreq/intel_pstate.c
> > +++ b/drivers/cpufreq/intel_pstate.c
> > @@ -2297,7 +2297,8 @@ static int hwp_get_cpu_scaling(int cpu)
> >
> >   static void intel_pstate_set_pstate(struct cpudata *cpu, int pstate)
> >   {
> > -     trace_cpu_frequency(pstate * cpu->pstate.scaling, cpu->cpu);
> > +     trace_policy_frequency(pstate * cpu->pstate.scaling, cpu->cpu,
> > +                            cpumask_of(cpu->cpu));
> >       cpu->pstate.current_pstate = pstate;
> >       /*
> >        * Generally, there is no guarantee that this code will always run on
> > @@ -2587,7 +2588,8 @@ static void intel_pstate_adjust_pstate(struct cpudata *cpu)
> >
> >       target_pstate = get_target_pstate(cpu);
> >       target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
> > -     trace_cpu_frequency(target_pstate * cpu->pstate.scaling, cpu->cpu);
> > +     trace_policy_frequency(target_pstate * cpu->pstate.scaling, cpu->cpu,
> > +                            cpumask_of(cpu->cpu));
> >       intel_pstate_update_pstate(cpu, target_pstate);
> >
> >       sample = &cpu->sample;
> > diff --git a/include/trace/events/power.h b/include/trace/events/power.h
> > index 370f8df2fdb4..317098ffdd5f 100644
> > --- a/include/trace/events/power.h
> > +++ b/include/trace/events/power.h
> > @@ -182,11 +182,29 @@ TRACE_EVENT(pstate_sample,
> >               { PM_EVENT_RECOVER, "recover" }, \
> >               { PM_EVENT_POWEROFF, "poweroff" })
> >
> > -DEFINE_EVENT(cpu, cpu_frequency,
> > +TRACE_EVENT(policy_frequency,
> >
> > -     TP_PROTO(unsigned int frequency, unsigned int cpu_id),
> > +     TP_PROTO(unsigned int frequency, unsigned int cpu_id,
> > +              const struct cpumask *policy_cpus),
> >
> > -     TP_ARGS(frequency, cpu_id)
> > +     TP_ARGS(frequency, cpu_id, policy_cpus),
> > +
> > +     TP_STRUCT__entry(
> > +             __field(u32, state)
> > +             __field(u32, cpu_id)
> > +             __cpumask(cpumask)
>
> Using a cpumask is the most technically correct option here, but it also carries a big issue.
> Userspace tooling will have a very hard time doing anything with it. A lot of that is down
> to having no appropriate counterpart in "table libraries" in general (e.g. what would you
> map that to in pandas, polars or SQL ?). Some of the lack of support is probably also down to how
> infrequently used it is. For example I don't think Perfetto would be able to handle that
> in the ftrace_event and args table, as the documented supported value types are:

I've been in touch with the Perfetto team through this process, and it
is an easy update for them to handle. If there are other libraries
using this event, I think their approach would be similar to how
Perfetto's SQL tables handle this new trace event.

> args table:
> value_type      STRING  The type of the value of the arg. Will be one of 'int', 'uint', 'string', 'real', 'pointer', 'bool' or 'json'.
> https://perfetto.dev/docs/analysis/stdlib-docs

In the same URL as above, there is a 'cpu_frequency_counters' SQL
table documentation. The new tracepoint would be parsed by Perfetto's
C++ parser before being inserted into the aforementioned SQL table as
four separate rows, given a policy with 4 CPUs. This effectively
creates the same table with the same data (just slightly different
timestamps) as prior to this patch.

>
> So while I definitely support improving the situation around cpumasks (I lobbied a bit for that),
> I don't think the ecosystem is ready for it yet and having such a core event switched to using it
> is going to cause a lot of pain.
>
> Some alternatives for tooling could be:
> 1. Record the policy cpumasks in a tool-friendly format in the trace header, but no current format I know
>     of provides that, and ftrace does not provide a "JSON blob to be passed through" we could easily append to.
>     Any such addition will therefore require libraries update which will take time.
>
> 3. Doing without the data in the trace. That means collecting and bundling another sidecar file, which
>     is really not convenient and still requires 3rd party tool modifications for end users.
>
> 2. Add policy_frequency event, but not remove cpu_frequency yet. Possibly with a deprecation warning
>     when enabling the event.
>

These alternatives should work, but I feel are probably too complex to
be worth the effort.

> > +     ),
> > +
> > +     TP_fast_assign(
> > +             __entry->state = frequency;
> > +             __entry->cpu_id = cpu_id;
> > +             __assign_cpumask(cpumask, policy_cpus);
>
> ipi_send_cpumask uses cpumask_bits():
>
>                 __assign_cpumask(cpumask, cpumask_bits(cpumask));
>
> It's not clear what is best practice, as struct cpumask contains a single member anyway and
> __assign_cpumask() expands to a memcpy() so they are functionally identical.
>
> > +     ),
> > +
> > +     TP_printk("state=%lu cpu_id=%lu policy_cpus=%*pb",
> > +               (unsigned long)__entry->state,
> > +               (unsigned long)__entry->cpu_id,
> > +               cpumask_pr_args((struct cpumask *)__get_dynamic_array(cpumask)))
>
> Looking at ipi_send_cpumask, this should be:
>
>    __get_cpumask(cpumask)
>
> The cast and cpumask_pr_args() may look like it's working, but there is only a very slim
> chance any downstream tool will know what to do with this. Looking at libtraceevent
> (which trace-cmd is based on):
>
> ./utest/traceevent-utest.c:116: "print fmt: \"cpumask=%s\", __get_cpumask(cpumask)\n";
> ./src/event-parse.c:3674:       if (strcmp(token, "__get_cpumask") == 0 ||
> ./src/event-parse.c:7568:               printf("__get_cpumask(%s)", args->bitmask.bitmask);
>
> But there is no match for "cpumask_pr_args".

Thanks for pointing out libtraceevent- I didn't know about this tool,
but I'll take a look at compatibility and adjust appropriately. The
macros are a little tricky, but I agree that it seems best to follow
the template set out by the pre-existing ipi_send_cpumask.

>
> Considering the gap between what works when using the in-kernel text rendering and what can be
> reasonably expected to work in any other userspace tool, it's a good idea to try
> as many as possible unfortunately.
>

Overall, I appreciate the thorough and insightful feedback Douglas! If
there are any other tools consuming cpu_frequency, I can help update
them appropriately. AFAICT Perfetto is by far, the most widespread
tool consuming cpu_frequency.

> >   );
> >
> >   TRACE_EVENT(cpu_frequency_limits,
> > diff --git a/kernel/trace/power-traces.c b/kernel/trace/power-traces.c
> > index f2fe33573e54..a537e68a6878 100644
> > --- a/kernel/trace/power-traces.c
> > +++ b/kernel/trace/power-traces.c
> > @@ -16,5 +16,5 @@
> >
> >   EXPORT_TRACEPOINT_SYMBOL_GPL(suspend_resume);
> >   EXPORT_TRACEPOINT_SYMBOL_GPL(cpu_idle);
> > -EXPORT_TRACEPOINT_SYMBOL_GPL(cpu_frequency);
> > +EXPORT_TRACEPOINT_SYMBOL_GPL(policy_frequency);
> >
> > diff --git a/samples/bpf/cpustat_kern.c b/samples/bpf/cpustat_kern.c
> > index 7ec7143e2757..f485de0f89b2 100644
> > --- a/samples/bpf/cpustat_kern.c
> > +++ b/samples/bpf/cpustat_kern.c
> > @@ -75,9 +75,9 @@ struct {
> >   } pstate_duration SEC(".maps");
> >
> >   /*
> > - * The trace events for cpu_idle and cpu_frequency are taken from:
> > + * The trace events for cpu_idle and policy_frequency are taken from:
> >    * /sys/kernel/tracing/events/power/cpu_idle/format
> > - * /sys/kernel/tracing/events/power/cpu_frequency/format
> > + * /sys/kernel/tracing/events/power/policy_frequency/format
> >    *
> >    * These two events have same format, so define one common structure.
> >    */
> > @@ -162,7 +162,7 @@ int bpf_prog1(struct cpu_args *ctx)
> >        */
> >       if (ctx->state != (u32)-1) {
> >
> > -             /* record pstate after have first cpu_frequency event */
> > +             /* record pstate after have first policy_frequency event */
> >               if (!*pts)
> >                       return 0;
> >
> > @@ -208,7 +208,7 @@ int bpf_prog1(struct cpu_args *ctx)
> >       return 0;
> >   }
> >
> > -SEC("tracepoint/power/cpu_frequency")
> > +SEC("tracepoint/power/policy_frequency")
> >   int bpf_prog2(struct cpu_args *ctx)
> >   {
> >       u64 *pts, *cstate, *pstate, cur_ts, delta;
> > diff --git a/samples/bpf/cpustat_user.c b/samples/bpf/cpustat_user.c
> > index 356f756cba0d..f7e81f702358 100644
> > --- a/samples/bpf/cpustat_user.c
> > +++ b/samples/bpf/cpustat_user.c
> > @@ -143,12 +143,12 @@ static int cpu_stat_inject_cpu_idle_event(void)
> >
> >   /*
> >    * It's possible to have no any frequency change for long time and cannot
> > - * get ftrace event 'trace_cpu_frequency' for long period, this introduces
> > + * get ftrace event 'trace_policy_frequency' for long period, this introduces
> >    * big deviation for pstate statistics.
> >    *
> >    * To solve this issue, below code forces to set 'scaling_max_freq' to 208MHz
> > - * for triggering ftrace event 'trace_cpu_frequency' and then recovery back to
> > - * the maximum frequency value 1.2GHz.
> > + * for triggering ftrace event 'trace_policy_frequency' and then recovery back
> > + * to the maximum frequency value 1.2GHz.
> >    */
> >   static int cpu_stat_inject_cpu_frequency_event(void)
> >   {
> > diff --git a/tools/perf/builtin-timechart.c b/tools/perf/builtin-timechart.c
> > index 22050c640dfa..3ef1a2fd0493 100644
> > --- a/tools/perf/builtin-timechart.c
> > +++ b/tools/perf/builtin-timechart.c
> > @@ -612,10 +612,10 @@ process_sample_cpu_idle(struct timechart *tchart __maybe_unused,
> >   }
> >
> >   static int
> > -process_sample_cpu_frequency(struct timechart *tchart,
> > -                          struct evsel *evsel,
> > -                          struct perf_sample *sample,
> > -                          const char *backtrace __maybe_unused)
> > +process_sample_policy_frequency(struct timechart *tchart,
> > +                             struct evsel *evsel,
> > +                             struct perf_sample *sample,
> > +                             const char *backtrace __maybe_unused)
> >   {
> >       u32 state  = evsel__intval(evsel, sample, "state");
> >       u32 cpu_id = evsel__intval(evsel, sample, "cpu_id");
> > @@ -1541,7 +1541,7 @@ static int __cmd_timechart(struct timechart *tchart, const char *output_name)
> >   {
> >       const struct evsel_str_handler power_tracepoints[] = {
> >               { "power:cpu_idle",             process_sample_cpu_idle },
> > -             { "power:cpu_frequency",        process_sample_cpu_frequency },
> > +             { "power:policy_frequency",     process_sample_policy_frequency },
> >               { "sched:sched_wakeup",         process_sample_sched_wakeup },
> >               { "sched:sched_switch",         process_sample_sched_switch },
> >   #ifdef SUPPORT_OLD_POWER_EVENTS
> > @@ -1804,7 +1804,7 @@ static int timechart__record(struct timechart *tchart, int argc, const char **ar
> >       unsigned int backtrace_args_no = ARRAY_SIZE(backtrace_args);
> >
> >       const char * const power_args[] = {
> > -             "-e", "power:cpu_frequency",
> > +             "-e", "power:policy_frequency",
> >               "-e", "power:cpu_idle",
> >       };
> >       unsigned int power_args_nr = ARRAY_SIZE(power_args);
>
> --
>
> Douglas
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 1/2] cpufreq: Replace trace_cpu_frequency with trace_policy_frequency
  2025-12-01 20:24 ` [PATCH v3 1/2] cpufreq: " Samuel Wu
  2025-12-02 12:03   ` Douglas Raillard
@ 2025-12-04 12:48   ` Christian Loehle
  2025-12-04 14:57     ` Rafael J. Wysocki
  1 sibling, 1 reply; 14+ messages in thread
From: Christian Loehle @ 2025-12-04 12:48 UTC (permalink / raw)
  To: Samuel Wu, Huang Rui, Gautham R. Shenoy, Mario Limonciello,
	Perry Yuan, Jonathan Corbet, Rafael J. Wysocki, Viresh Kumar,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Srinivas Pandruvada, Len Brown, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Ian Rogers, Adrian Hunter, James Clark
  Cc: kernel-team, linux-pm, linux-doc, linux-kernel,
	linux-trace-kernel, bpf, linux-perf-users

On 12/1/25 20:24, Samuel Wu wrote:
> The existing cpu_frequency trace_event can be verbose, emitting a nearly
> identical trace event for every CPU in the policy even when their
> frequencies are identical.
> 
> This patch replaces the cpu_frequency trace event with policy_frequency
> trace event, a more efficient alternative. From the kernel's
> perspective, emitting a trace event once per policy instead of once per
> cpu saves some memory and is less overhead. From the post-processing
> perspective, analysis of the trace log is simplified without any loss of
> information.
> 
> Signed-off-by: Samuel Wu <wusamuel@google.com>
> ---
>  drivers/cpufreq/cpufreq.c      | 14 ++------------
>  drivers/cpufreq/intel_pstate.c |  6 ++++--
>  include/trace/events/power.h   | 24 +++++++++++++++++++++---
>  kernel/trace/power-traces.c    |  2 +-
>  samples/bpf/cpustat_kern.c     |  8 ++++----
>  samples/bpf/cpustat_user.c     |  6 +++---
>  tools/perf/builtin-timechart.c | 12 ++++++------
>  7 files changed, 41 insertions(+), 31 deletions(-)
> 
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 4472bb1ec83c..dd3f08f3b958 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -309,8 +309,6 @@ static void cpufreq_notify_transition(struct cpufreq_policy *policy,
>  				      struct cpufreq_freqs *freqs,
>  				      unsigned int state)
>  {
> -	int cpu;
> -
>  	BUG_ON(irqs_disabled());
>  
>  	if (cpufreq_disabled())
> @@ -344,10 +342,7 @@ static void cpufreq_notify_transition(struct cpufreq_policy *policy,
>  		adjust_jiffies(CPUFREQ_POSTCHANGE, freqs);
>  		pr_debug("FREQ: %u - CPUs: %*pbl\n", freqs->new,
>  			 cpumask_pr_args(policy->cpus));
> -
> -		for_each_cpu(cpu, policy->cpus)
> -			trace_cpu_frequency(freqs->new, cpu);
> -
> +		trace_policy_frequency(freqs->new, policy->cpu, policy->cpus);
>  		srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
>  					 CPUFREQ_POSTCHANGE, freqs);
>  
> @@ -2201,7 +2196,6 @@ unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
>  					unsigned int target_freq)
>  {
>  	unsigned int freq;
> -	int cpu;
>  
>  	target_freq = clamp_val(target_freq, policy->min, policy->max);
>  	freq = cpufreq_driver->fast_switch(policy, target_freq);
> @@ -2213,11 +2207,7 @@ unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
>  	arch_set_freq_scale(policy->related_cpus, freq,
>  			    arch_scale_freq_ref(policy->cpu));
>  	cpufreq_stats_record_transition(policy, freq);
> -
> -	if (trace_cpu_frequency_enabled()) {
> -		for_each_cpu(cpu, policy->cpus)
> -			trace_cpu_frequency(freq, cpu);
> -	}
> +	trace_policy_frequency(freq, policy->cpu, policy->cpus);
>  
>  	return freq;
>  }
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index ec4abe374573..9724b5d19d83 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -2297,7 +2297,8 @@ static int hwp_get_cpu_scaling(int cpu)
>  
>  static void intel_pstate_set_pstate(struct cpudata *cpu, int pstate)
>  {
> -	trace_cpu_frequency(pstate * cpu->pstate.scaling, cpu->cpu);
> +	trace_policy_frequency(pstate * cpu->pstate.scaling, cpu->cpu,
> +			       cpumask_of(cpu->cpu));
>  	cpu->pstate.current_pstate = pstate;
>  	/*
>  	 * Generally, there is no guarantee that this code will always run on
> @@ -2587,7 +2588,8 @@ static void intel_pstate_adjust_pstate(struct cpudata *cpu)
>  
>  	target_pstate = get_target_pstate(cpu);
>  	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
> -	trace_cpu_frequency(target_pstate * cpu->pstate.scaling, cpu->cpu);
> +	trace_policy_frequency(target_pstate * cpu->pstate.scaling, cpu->cpu,
> +			       cpumask_of(cpu->cpu));
>  	intel_pstate_update_pstate(cpu, target_pstate);
>  
>  	sample = &cpu->sample;
> diff --git a/include/trace/events/power.h b/include/trace/events/power.h
> index 370f8df2fdb4..317098ffdd5f 100644
> --- a/include/trace/events/power.h
> +++ b/include/trace/events/power.h
> @@ -182,11 +182,29 @@ TRACE_EVENT(pstate_sample,
>  		{ PM_EVENT_RECOVER, "recover" }, \
>  		{ PM_EVENT_POWEROFF, "poweroff" })
>  
> -DEFINE_EVENT(cpu, cpu_frequency,
> +TRACE_EVENT(policy_frequency,
>  
> -	TP_PROTO(unsigned int frequency, unsigned int cpu_id),
> +	TP_PROTO(unsigned int frequency, unsigned int cpu_id,
> +		 const struct cpumask *policy_cpus),
>  
> -	TP_ARGS(frequency, cpu_id)
> +	TP_ARGS(frequency, cpu_id, policy_cpus),
> +
> +	TP_STRUCT__entry(
> +		__field(u32, state)
> +		__field(u32, cpu_id)
> +		__cpumask(cpumask)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->state = frequency;
> +		__entry->cpu_id = cpu_id;
> +		__assign_cpumask(cpumask, policy_cpus);
> +	),
> +
> +	TP_printk("state=%lu cpu_id=%lu policy_cpus=%*pb",
> +		  (unsigned long)__entry->state,
> +		  (unsigned long)__entry->cpu_id,
> +		  cpumask_pr_args((struct cpumask *)__get_dynamic_array(cpumask)))
>  );
>  
>  TRACE_EVENT(cpu_frequency_limits,
> diff --git a/kernel/trace/power-traces.c b/kernel/trace/power-traces.c
> index f2fe33573e54..a537e68a6878 100644
> --- a/kernel/trace/power-traces.c
> +++ b/kernel/trace/power-traces.c
> @@ -16,5 +16,5 @@
>  
>  EXPORT_TRACEPOINT_SYMBOL_GPL(suspend_resume);
>  EXPORT_TRACEPOINT_SYMBOL_GPL(cpu_idle);
> -EXPORT_TRACEPOINT_SYMBOL_GPL(cpu_frequency);
> +EXPORT_TRACEPOINT_SYMBOL_GPL(policy_frequency);
>  
> diff --git a/samples/bpf/cpustat_kern.c b/samples/bpf/cpustat_kern.c
> index 7ec7143e2757..f485de0f89b2 100644
> --- a/samples/bpf/cpustat_kern.c
> +++ b/samples/bpf/cpustat_kern.c
> @@ -75,9 +75,9 @@ struct {
>  } pstate_duration SEC(".maps");
>  
>  /*
> - * The trace events for cpu_idle and cpu_frequency are taken from:
> + * The trace events for cpu_idle and policy_frequency are taken from:
>   * /sys/kernel/tracing/events/power/cpu_idle/format
> - * /sys/kernel/tracing/events/power/cpu_frequency/format
> + * /sys/kernel/tracing/events/power/policy_frequency/format
>   *
>   * These two events have same format, so define one common structure.
>   */
> @@ -162,7 +162,7 @@ int bpf_prog1(struct cpu_args *ctx)
>  	 */
>  	if (ctx->state != (u32)-1) {
>  
> -		/* record pstate after have first cpu_frequency event */
> +		/* record pstate after have first policy_frequency event */
>  		if (!*pts)
>  			return 0;
>  
> @@ -208,7 +208,7 @@ int bpf_prog1(struct cpu_args *ctx)
>  	return 0;
>  }
>  
> -SEC("tracepoint/power/cpu_frequency")
> +SEC("tracepoint/power/policy_frequency")
>  int bpf_prog2(struct cpu_args *ctx)
>  {
>  	u64 *pts, *cstate, *pstate, cur_ts, delta;
> diff --git a/samples/bpf/cpustat_user.c b/samples/bpf/cpustat_user.c
> index 356f756cba0d..f7e81f702358 100644
> --- a/samples/bpf/cpustat_user.c
> +++ b/samples/bpf/cpustat_user.c
> @@ -143,12 +143,12 @@ static int cpu_stat_inject_cpu_idle_event(void)
>  
>  /*
>   * It's possible to have no any frequency change for long time and cannot
> - * get ftrace event 'trace_cpu_frequency' for long period, this introduces
> + * get ftrace event 'trace_policy_frequency' for long period, this introduces
>   * big deviation for pstate statistics.
>   *
>   * To solve this issue, below code forces to set 'scaling_max_freq' to 208MHz
> - * for triggering ftrace event 'trace_cpu_frequency' and then recovery back to
> - * the maximum frequency value 1.2GHz.
> + * for triggering ftrace event 'trace_policy_frequency' and then recovery back
> + * to the maximum frequency value 1.2GHz.
>   */
>  static int cpu_stat_inject_cpu_frequency_event(void)
>  {
> diff --git a/tools/perf/builtin-timechart.c b/tools/perf/builtin-timechart.c
> index 22050c640dfa..3ef1a2fd0493 100644
> --- a/tools/perf/builtin-timechart.c
> +++ b/tools/perf/builtin-timechart.c
> @@ -612,10 +612,10 @@ process_sample_cpu_idle(struct timechart *tchart __maybe_unused,
>  }
>  
>  static int
> -process_sample_cpu_frequency(struct timechart *tchart,
> -			     struct evsel *evsel,
> -			     struct perf_sample *sample,
> -			     const char *backtrace __maybe_unused)
> +process_sample_policy_frequency(struct timechart *tchart,
> +				struct evsel *evsel,
> +				struct perf_sample *sample,
> +				const char *backtrace __maybe_unused)
>  {
>  	u32 state  = evsel__intval(evsel, sample, "state");
>  	u32 cpu_id = evsel__intval(evsel, sample, "cpu_id");
> @@ -1541,7 +1541,7 @@ static int __cmd_timechart(struct timechart *tchart, const char *output_name)
>  {
>  	const struct evsel_str_handler power_tracepoints[] = {
>  		{ "power:cpu_idle",		process_sample_cpu_idle },
> -		{ "power:cpu_frequency",	process_sample_cpu_frequency },
> +		{ "power:policy_frequency",	process_sample_policy_frequency },
>  		{ "sched:sched_wakeup",		process_sample_sched_wakeup },
>  		{ "sched:sched_switch",		process_sample_sched_switch },
>  #ifdef SUPPORT_OLD_POWER_EVENTS
> @@ -1804,7 +1804,7 @@ static int timechart__record(struct timechart *tchart, int argc, const char **ar
>  	unsigned int backtrace_args_no = ARRAY_SIZE(backtrace_args);
>  
>  	const char * const power_args[] = {
> -		"-e", "power:cpu_frequency",
> +		"-e", "power:policy_frequency",
>  		"-e", "power:cpu_idle",
>  	};
>  	unsigned int power_args_nr = ARRAY_SIZE(power_args);

perf timechart seem to do per-CPU reporting though?
So this is broken by not emitting an event per-CPU? At least with a simple s/cpu_frequency/policy_frequency/
like here.
Similar for the bpf samples technically...

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 1/2] cpufreq: Replace trace_cpu_frequency with trace_policy_frequency
  2025-12-04 12:48   ` Christian Loehle
@ 2025-12-04 14:57     ` Rafael J. Wysocki
  2025-12-04 16:48       ` Steven Rostedt
  2025-12-04 17:21       ` srinivas pandruvada
  0 siblings, 2 replies; 14+ messages in thread
From: Rafael J. Wysocki @ 2025-12-04 14:57 UTC (permalink / raw)
  To: Christian Loehle
  Cc: Samuel Wu, Huang Rui, Gautham R. Shenoy, Mario Limonciello,
	Perry Yuan, Jonathan Corbet, Rafael J. Wysocki, Viresh Kumar,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Srinivas Pandruvada, Len Brown, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Ian Rogers, Adrian Hunter, James Clark,
	kernel-team, linux-pm, linux-doc, linux-kernel,
	linux-trace-kernel, bpf, linux-perf-users

On Thu, Dec 4, 2025 at 1:49 PM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 12/1/25 20:24, Samuel Wu wrote:
> > The existing cpu_frequency trace_event can be verbose, emitting a nearly
> > identical trace event for every CPU in the policy even when their
> > frequencies are identical.
> >
> > This patch replaces the cpu_frequency trace event with policy_frequency
> > trace event, a more efficient alternative. From the kernel's
> > perspective, emitting a trace event once per policy instead of once per
> > cpu saves some memory and is less overhead. From the post-processing
> > perspective, analysis of the trace log is simplified without any loss of
> > information.
> >
> > Signed-off-by: Samuel Wu <wusamuel@google.com>
> > ---
> >  drivers/cpufreq/cpufreq.c      | 14 ++------------
> >  drivers/cpufreq/intel_pstate.c |  6 ++++--
> >  include/trace/events/power.h   | 24 +++++++++++++++++++++---
> >  kernel/trace/power-traces.c    |  2 +-
> >  samples/bpf/cpustat_kern.c     |  8 ++++----
> >  samples/bpf/cpustat_user.c     |  6 +++---
> >  tools/perf/builtin-timechart.c | 12 ++++++------
> >  7 files changed, 41 insertions(+), 31 deletions(-)
> >
> > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> > index 4472bb1ec83c..dd3f08f3b958 100644
> > --- a/drivers/cpufreq/cpufreq.c
> > +++ b/drivers/cpufreq/cpufreq.c
> > @@ -309,8 +309,6 @@ static void cpufreq_notify_transition(struct cpufreq_policy *policy,
> >                                     struct cpufreq_freqs *freqs,
> >                                     unsigned int state)
> >  {
> > -     int cpu;
> > -
> >       BUG_ON(irqs_disabled());
> >
> >       if (cpufreq_disabled())
> > @@ -344,10 +342,7 @@ static void cpufreq_notify_transition(struct cpufreq_policy *policy,
> >               adjust_jiffies(CPUFREQ_POSTCHANGE, freqs);
> >               pr_debug("FREQ: %u - CPUs: %*pbl\n", freqs->new,
> >                        cpumask_pr_args(policy->cpus));
> > -
> > -             for_each_cpu(cpu, policy->cpus)
> > -                     trace_cpu_frequency(freqs->new, cpu);
> > -
> > +             trace_policy_frequency(freqs->new, policy->cpu, policy->cpus);
> >               srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
> >                                        CPUFREQ_POSTCHANGE, freqs);
> >
> > @@ -2201,7 +2196,6 @@ unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
> >                                       unsigned int target_freq)
> >  {
> >       unsigned int freq;
> > -     int cpu;
> >
> >       target_freq = clamp_val(target_freq, policy->min, policy->max);
> >       freq = cpufreq_driver->fast_switch(policy, target_freq);
> > @@ -2213,11 +2207,7 @@ unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
> >       arch_set_freq_scale(policy->related_cpus, freq,
> >                           arch_scale_freq_ref(policy->cpu));
> >       cpufreq_stats_record_transition(policy, freq);
> > -
> > -     if (trace_cpu_frequency_enabled()) {
> > -             for_each_cpu(cpu, policy->cpus)
> > -                     trace_cpu_frequency(freq, cpu);
> > -     }
> > +     trace_policy_frequency(freq, policy->cpu, policy->cpus);
> >
> >       return freq;
> >  }
> > diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> > index ec4abe374573..9724b5d19d83 100644
> > --- a/drivers/cpufreq/intel_pstate.c
> > +++ b/drivers/cpufreq/intel_pstate.c
> > @@ -2297,7 +2297,8 @@ static int hwp_get_cpu_scaling(int cpu)
> >
> >  static void intel_pstate_set_pstate(struct cpudata *cpu, int pstate)
> >  {
> > -     trace_cpu_frequency(pstate * cpu->pstate.scaling, cpu->cpu);
> > +     trace_policy_frequency(pstate * cpu->pstate.scaling, cpu->cpu,
> > +                            cpumask_of(cpu->cpu));
> >       cpu->pstate.current_pstate = pstate;
> >       /*
> >        * Generally, there is no guarantee that this code will always run on
> > @@ -2587,7 +2588,8 @@ static void intel_pstate_adjust_pstate(struct cpudata *cpu)
> >
> >       target_pstate = get_target_pstate(cpu);
> >       target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
> > -     trace_cpu_frequency(target_pstate * cpu->pstate.scaling, cpu->cpu);
> > +     trace_policy_frequency(target_pstate * cpu->pstate.scaling, cpu->cpu,
> > +                            cpumask_of(cpu->cpu));
> >       intel_pstate_update_pstate(cpu, target_pstate);
> >
> >       sample = &cpu->sample;
> > diff --git a/include/trace/events/power.h b/include/trace/events/power.h
> > index 370f8df2fdb4..317098ffdd5f 100644
> > --- a/include/trace/events/power.h
> > +++ b/include/trace/events/power.h
> > @@ -182,11 +182,29 @@ TRACE_EVENT(pstate_sample,
> >               { PM_EVENT_RECOVER, "recover" }, \
> >               { PM_EVENT_POWEROFF, "poweroff" })
> >
> > -DEFINE_EVENT(cpu, cpu_frequency,
> > +TRACE_EVENT(policy_frequency,
> >
> > -     TP_PROTO(unsigned int frequency, unsigned int cpu_id),
> > +     TP_PROTO(unsigned int frequency, unsigned int cpu_id,
> > +              const struct cpumask *policy_cpus),
> >
> > -     TP_ARGS(frequency, cpu_id)
> > +     TP_ARGS(frequency, cpu_id, policy_cpus),
> > +
> > +     TP_STRUCT__entry(
> > +             __field(u32, state)
> > +             __field(u32, cpu_id)
> > +             __cpumask(cpumask)
> > +     ),
> > +
> > +     TP_fast_assign(
> > +             __entry->state = frequency;
> > +             __entry->cpu_id = cpu_id;
> > +             __assign_cpumask(cpumask, policy_cpus);
> > +     ),
> > +
> > +     TP_printk("state=%lu cpu_id=%lu policy_cpus=%*pb",
> > +               (unsigned long)__entry->state,
> > +               (unsigned long)__entry->cpu_id,
> > +               cpumask_pr_args((struct cpumask *)__get_dynamic_array(cpumask)))
> >  );
> >
> >  TRACE_EVENT(cpu_frequency_limits,
> > diff --git a/kernel/trace/power-traces.c b/kernel/trace/power-traces.c
> > index f2fe33573e54..a537e68a6878 100644
> > --- a/kernel/trace/power-traces.c
> > +++ b/kernel/trace/power-traces.c
> > @@ -16,5 +16,5 @@
> >
> >  EXPORT_TRACEPOINT_SYMBOL_GPL(suspend_resume);
> >  EXPORT_TRACEPOINT_SYMBOL_GPL(cpu_idle);
> > -EXPORT_TRACEPOINT_SYMBOL_GPL(cpu_frequency);
> > +EXPORT_TRACEPOINT_SYMBOL_GPL(policy_frequency);
> >
> > diff --git a/samples/bpf/cpustat_kern.c b/samples/bpf/cpustat_kern.c
> > index 7ec7143e2757..f485de0f89b2 100644
> > --- a/samples/bpf/cpustat_kern.c
> > +++ b/samples/bpf/cpustat_kern.c
> > @@ -75,9 +75,9 @@ struct {
> >  } pstate_duration SEC(".maps");
> >
> >  /*
> > - * The trace events for cpu_idle and cpu_frequency are taken from:
> > + * The trace events for cpu_idle and policy_frequency are taken from:
> >   * /sys/kernel/tracing/events/power/cpu_idle/format
> > - * /sys/kernel/tracing/events/power/cpu_frequency/format
> > + * /sys/kernel/tracing/events/power/policy_frequency/format
> >   *
> >   * These two events have same format, so define one common structure.
> >   */
> > @@ -162,7 +162,7 @@ int bpf_prog1(struct cpu_args *ctx)
> >        */
> >       if (ctx->state != (u32)-1) {
> >
> > -             /* record pstate after have first cpu_frequency event */
> > +             /* record pstate after have first policy_frequency event */
> >               if (!*pts)
> >                       return 0;
> >
> > @@ -208,7 +208,7 @@ int bpf_prog1(struct cpu_args *ctx)
> >       return 0;
> >  }
> >
> > -SEC("tracepoint/power/cpu_frequency")
> > +SEC("tracepoint/power/policy_frequency")
> >  int bpf_prog2(struct cpu_args *ctx)
> >  {
> >       u64 *pts, *cstate, *pstate, cur_ts, delta;
> > diff --git a/samples/bpf/cpustat_user.c b/samples/bpf/cpustat_user.c
> > index 356f756cba0d..f7e81f702358 100644
> > --- a/samples/bpf/cpustat_user.c
> > +++ b/samples/bpf/cpustat_user.c
> > @@ -143,12 +143,12 @@ static int cpu_stat_inject_cpu_idle_event(void)
> >
> >  /*
> >   * It's possible to have no any frequency change for long time and cannot
> > - * get ftrace event 'trace_cpu_frequency' for long period, this introduces
> > + * get ftrace event 'trace_policy_frequency' for long period, this introduces
> >   * big deviation for pstate statistics.
> >   *
> >   * To solve this issue, below code forces to set 'scaling_max_freq' to 208MHz
> > - * for triggering ftrace event 'trace_cpu_frequency' and then recovery back to
> > - * the maximum frequency value 1.2GHz.
> > + * for triggering ftrace event 'trace_policy_frequency' and then recovery back
> > + * to the maximum frequency value 1.2GHz.
> >   */
> >  static int cpu_stat_inject_cpu_frequency_event(void)
> >  {
> > diff --git a/tools/perf/builtin-timechart.c b/tools/perf/builtin-timechart.c
> > index 22050c640dfa..3ef1a2fd0493 100644
> > --- a/tools/perf/builtin-timechart.c
> > +++ b/tools/perf/builtin-timechart.c
> > @@ -612,10 +612,10 @@ process_sample_cpu_idle(struct timechart *tchart __maybe_unused,
> >  }
> >
> >  static int
> > -process_sample_cpu_frequency(struct timechart *tchart,
> > -                          struct evsel *evsel,
> > -                          struct perf_sample *sample,
> > -                          const char *backtrace __maybe_unused)
> > +process_sample_policy_frequency(struct timechart *tchart,
> > +                             struct evsel *evsel,
> > +                             struct perf_sample *sample,
> > +                             const char *backtrace __maybe_unused)
> >  {
> >       u32 state  = evsel__intval(evsel, sample, "state");
> >       u32 cpu_id = evsel__intval(evsel, sample, "cpu_id");
> > @@ -1541,7 +1541,7 @@ static int __cmd_timechart(struct timechart *tchart, const char *output_name)
> >  {
> >       const struct evsel_str_handler power_tracepoints[] = {
> >               { "power:cpu_idle",             process_sample_cpu_idle },
> > -             { "power:cpu_frequency",        process_sample_cpu_frequency },
> > +             { "power:policy_frequency",     process_sample_policy_frequency },
> >               { "sched:sched_wakeup",         process_sample_sched_wakeup },
> >               { "sched:sched_switch",         process_sample_sched_switch },
> >  #ifdef SUPPORT_OLD_POWER_EVENTS
> > @@ -1804,7 +1804,7 @@ static int timechart__record(struct timechart *tchart, int argc, const char **ar
> >       unsigned int backtrace_args_no = ARRAY_SIZE(backtrace_args);
> >
> >       const char * const power_args[] = {
> > -             "-e", "power:cpu_frequency",
> > +             "-e", "power:policy_frequency",
> >               "-e", "power:cpu_idle",
> >       };
> >       unsigned int power_args_nr = ARRAY_SIZE(power_args);
>
> perf timechart seem to do per-CPU reporting though?
> So this is broken by not emitting an event per-CPU? At least with a simple s/cpu_frequency/policy_frequency/
> like here.
> Similar for the bpf samples technically...

This kind of boils down to whether or not tracepoints can be regarded
as ABI and to what extent.

In this particular case, I'm not sure I agree with the stated motivation.

First of all, on systems with one CPU per cpufreq policy (the vast
majority of x86, including AMD, and the ARM systems using the CPPC
driver AFAICS), the "issue" at hand is actually a non-issue and
changing the name of the tracepoint alone would confuse things in user
space IIUC.  Those need to work the way they do today.

On systems with multiple CPUs per cpufreq policy there is some extra
overhead related to the cpu_frequency tracepoint, but the if someone
is only interested in the "policy" frequency, they can filter out all
CPUs belonging to the same policy except for one from the traces,
don't they?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 1/2] cpufreq: Replace trace_cpu_frequency with trace_policy_frequency
  2025-12-04 14:57     ` Rafael J. Wysocki
@ 2025-12-04 16:48       ` Steven Rostedt
  2025-12-04 17:24         ` Rafael J. Wysocki
  2025-12-04 17:21       ` srinivas pandruvada
  1 sibling, 1 reply; 14+ messages in thread
From: Steven Rostedt @ 2025-12-04 16:48 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Christian Loehle, Samuel Wu, Huang Rui, Gautham R. Shenoy,
	Mario Limonciello, Perry Yuan, Jonathan Corbet, Viresh Kumar,
	Masami Hiramatsu, Mathieu Desnoyers, Srinivas Pandruvada,
	Len Brown, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Ian Rogers,
	Adrian Hunter, James Clark, kernel-team, linux-pm, linux-doc,
	linux-kernel, linux-trace-kernel, bpf, linux-perf-users

On Thu, 4 Dec 2025 15:57:41 +0100
"Rafael J. Wysocki" <rafael@kernel.org> wrote:

> > perf timechart seem to do per-CPU reporting though?
> > So this is broken by not emitting an event per-CPU? At least with a simple s/cpu_frequency/policy_frequency/
> > like here.
> > Similar for the bpf samples technically...  
> 
> This kind of boils down to whether or not tracepoints can be regarded
> as ABI and to what extent.

They are an ABI and they are not an ABI. It really boils down to "if you
break the ABI but no user space notices, did you really break the ABI?" the
answer is "no". But if user space notices, then yes you did. But it is
possible to still fix user space (I did this with powertop).

> 
> In this particular case, I'm not sure I agree with the stated motivation.
> 
> First of all, on systems with one CPU per cpufreq policy (the vast
> majority of x86, including AMD, and the ARM systems using the CPPC
> driver AFAICS), the "issue" at hand is actually a non-issue and
> changing the name of the tracepoint alone would confuse things in user
> space IIUC.  Those need to work the way they do today.

If the way the tracepoint changes, it's best to change the name too.
Tooling can check to see which name is available to determine how to
process the traces.

> 
> On systems with multiple CPUs per cpufreq policy there is some extra
> overhead related to the cpu_frequency tracepoint, but the if someone
> is only interested in the "policy" frequency, they can filter out all
> CPUs belonging to the same policy except for one from the traces,
> don't they?

I'm not exactly sure what you mean here. There is an "onchange" trigger you
can use to trigger a synthetic event whenever a change happens. But I think
the data here wants to know which CPU had its policy change. Hence the CPU
mask.

-- Steve

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 1/2] cpufreq: Replace trace_cpu_frequency with trace_policy_frequency
  2025-12-04 14:57     ` Rafael J. Wysocki
  2025-12-04 16:48       ` Steven Rostedt
@ 2025-12-04 17:21       ` srinivas pandruvada
  2025-12-04 18:47         ` Steven Rostedt
  1 sibling, 1 reply; 14+ messages in thread
From: srinivas pandruvada @ 2025-12-04 17:21 UTC (permalink / raw)
  To: Rafael J. Wysocki, Christian Loehle
  Cc: Samuel Wu, Huang Rui, Gautham R. Shenoy, Mario Limonciello,
	Perry Yuan, Jonathan Corbet, Viresh Kumar, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Len Brown,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Ian Rogers,
	Adrian Hunter, James Clark, kernel-team, linux-pm, linux-doc,
	linux-kernel, linux-trace-kernel, bpf, linux-perf-users

On Thu, 2025-12-04 at 15:57 +0100, Rafael J. Wysocki wrote:
> On Thu, Dec 4, 2025 at 1:49 PM Christian Loehle
> <christian.loehle@arm.com> wrote:
> > 
> > On 12/1/25 20:24, Samuel Wu wrote:
> > > The existing cpu_frequency trace_event can be verbose, emitting a
> > > nearly
> > > identical trace event for every CPU in the policy even when their
> > > frequencies are identical.
> > > 
> > > This patch replaces the cpu_frequency trace event with
> > > policy_frequency
> > > trace event, a more efficient alternative. From the kernel's
> > > perspective, emitting a trace event once per policy instead of
> > > once per
> > > cpu saves some memory and is less overhead. From the post-
> > > processing
> > > perspective, analysis of the trace log is simplified without any
> > > loss of
> > > information.
> > > 
> > > Signed-off-by: Samuel Wu <wusamuel@google.com>
> > > ---
> > >  drivers/cpufreq/cpufreq.c      | 14 ++------------
> > >  drivers/cpufreq/intel_pstate.c |  6 ++++--
> > >  include/trace/events/power.h   | 24 +++++++++++++++++++++---
> > >  kernel/trace/power-traces.c    |  2 +-
> > >  samples/bpf/cpustat_kern.c     |  8 ++++----
> > >  samples/bpf/cpustat_user.c     |  6 +++---
> > >  tools/perf/builtin-timechart.c | 12 ++++++------
> > >  7 files changed, 41 insertions(+), 31 deletions(-)
> > > 
> > > diff --git a/drivers/cpufreq/cpufreq.c
> > > b/drivers/cpufreq/cpufreq.c
> > > index 4472bb1ec83c..dd3f08f3b958 100644
> > > --- a/drivers/cpufreq/cpufreq.c
> > > +++ b/drivers/cpufreq/cpufreq.c
> > > @@ -309,8 +309,6 @@ static void cpufreq_notify_transition(struct
> > > cpufreq_policy *policy,
> > >                                     struct cpufreq_freqs *freqs,
> > >                                     unsigned int state)
> > >  {
> > > -     int cpu;
> > > -
> > >       BUG_ON(irqs_disabled());
> > > 
> > >       if (cpufreq_disabled())
> > > @@ -344,10 +342,7 @@ static void cpufreq_notify_transition(struct
> > > cpufreq_policy *policy,
> > >               adjust_jiffies(CPUFREQ_POSTCHANGE, freqs);
> > >               pr_debug("FREQ: %u - CPUs: %*pbl\n", freqs->new,
> > >                        cpumask_pr_args(policy->cpus));
> > > -
> > > -             for_each_cpu(cpu, policy->cpus)
> > > -                     trace_cpu_frequency(freqs->new, cpu);
> > > -
> > > +             trace_policy_frequency(freqs->new, policy->cpu,
> > > policy->cpus);
> > >              
> > > srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
> > >                                        CPUFREQ_POSTCHANGE,
> > > freqs);
> > > 
> > > @@ -2201,7 +2196,6 @@ unsigned int
> > > cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
> > >                                       unsigned int target_freq)
> > >  {
> > >       unsigned int freq;
> > > -     int cpu;
> > > 
> > >       target_freq = clamp_val(target_freq, policy->min, policy-
> > > >max);
> > >       freq = cpufreq_driver->fast_switch(policy, target_freq);
> > > @@ -2213,11 +2207,7 @@ unsigned int
> > > cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
> > >       arch_set_freq_scale(policy->related_cpus, freq,
> > >                           arch_scale_freq_ref(policy->cpu));
> > >       cpufreq_stats_record_transition(policy, freq);
> > > -
> > > -     if (trace_cpu_frequency_enabled()) {
> > > -             for_each_cpu(cpu, policy->cpus)
> > > -                     trace_cpu_frequency(freq, cpu);
> > > -     }
> > > +     trace_policy_frequency(freq, policy->cpu, policy->cpus);
> > > 
> > >       return freq;
> > >  }
> > > diff --git a/drivers/cpufreq/intel_pstate.c
> > > b/drivers/cpufreq/intel_pstate.c
> > > index ec4abe374573..9724b5d19d83 100644
> > > --- a/drivers/cpufreq/intel_pstate.c
> > > +++ b/drivers/cpufreq/intel_pstate.c
> > > @@ -2297,7 +2297,8 @@ static int hwp_get_cpu_scaling(int cpu)
> > > 
> > >  static void intel_pstate_set_pstate(struct cpudata *cpu, int
> > > pstate)
> > >  {
> > > -     trace_cpu_frequency(pstate * cpu->pstate.scaling, cpu-
> > > >cpu);
> > > +     trace_policy_frequency(pstate * cpu->pstate.scaling, cpu-
> > > >cpu,
> > > +                            cpumask_of(cpu->cpu));
> > >       cpu->pstate.current_pstate = pstate;
> > >       /*
> > >        * Generally, there is no guarantee that this code will
> > > always run on
> > > @@ -2587,7 +2588,8 @@ static void
> > > intel_pstate_adjust_pstate(struct cpudata *cpu)
> > > 
> > >       target_pstate = get_target_pstate(cpu);
> > >       target_pstate = intel_pstate_prepare_request(cpu,
> > > target_pstate);
> > > -     trace_cpu_frequency(target_pstate * cpu->pstate.scaling,
> > > cpu->cpu);
> > > +     trace_policy_frequency(target_pstate * cpu->pstate.scaling,
> > > cpu->cpu,
> > > +                            cpumask_of(cpu->cpu));
> > >       intel_pstate_update_pstate(cpu, target_pstate);
> > > 
> > >       sample = &cpu->sample;
> > > diff --git a/include/trace/events/power.h
> > > b/include/trace/events/power.h
> > > index 370f8df2fdb4..317098ffdd5f 100644
> > > --- a/include/trace/events/power.h
> > > +++ b/include/trace/events/power.h
> > > @@ -182,11 +182,29 @@ TRACE_EVENT(pstate_sample,
> > >               { PM_EVENT_RECOVER, "recover" }, \
> > >               { PM_EVENT_POWEROFF, "poweroff" })
> > > 
> > > -DEFINE_EVENT(cpu, cpu_frequency,
> > > +TRACE_EVENT(policy_frequency,
> > > 
> > > -     TP_PROTO(unsigned int frequency, unsigned int cpu_id),
> > > +     TP_PROTO(unsigned int frequency, unsigned int cpu_id,
> > > +              const struct cpumask *policy_cpus),
> > > 
> > > -     TP_ARGS(frequency, cpu_id)
> > > +     TP_ARGS(frequency, cpu_id, policy_cpus),
> > > +
> > > +     TP_STRUCT__entry(
> > > +             __field(u32, state)
> > > +             __field(u32, cpu_id)
> > > +             __cpumask(cpumask)
> > > +     ),
> > > +
> > > +     TP_fast_assign(
> > > +             __entry->state = frequency;
> > > +             __entry->cpu_id = cpu_id;
> > > +             __assign_cpumask(cpumask, policy_cpus);
> > > +     ),
> > > +
> > > +     TP_printk("state=%lu cpu_id=%lu policy_cpus=%*pb",
> > > +               (unsigned long)__entry->state,
> > > +               (unsigned long)__entry->cpu_id,
> > > +               cpumask_pr_args((struct cpumask
> > > *)__get_dynamic_array(cpumask)))
> > >  );
> > > 
> > >  TRACE_EVENT(cpu_frequency_limits,
> > > diff --git a/kernel/trace/power-traces.c b/kernel/trace/power-
> > > traces.c
> > > index f2fe33573e54..a537e68a6878 100644
> > > --- a/kernel/trace/power-traces.c
> > > +++ b/kernel/trace/power-traces.c
> > > @@ -16,5 +16,5 @@
> > > 
> > >  EXPORT_TRACEPOINT_SYMBOL_GPL(suspend_resume);
> > >  EXPORT_TRACEPOINT_SYMBOL_GPL(cpu_idle);
> > > -EXPORT_TRACEPOINT_SYMBOL_GPL(cpu_frequency);
> > > +EXPORT_TRACEPOINT_SYMBOL_GPL(policy_frequency);
> > > 
> > > diff --git a/samples/bpf/cpustat_kern.c
> > > b/samples/bpf/cpustat_kern.c
> > > index 7ec7143e2757..f485de0f89b2 100644
> > > --- a/samples/bpf/cpustat_kern.c
> > > +++ b/samples/bpf/cpustat_kern.c
> > > @@ -75,9 +75,9 @@ struct {
> > >  } pstate_duration SEC(".maps");
> > > 
> > >  /*
> > > - * The trace events for cpu_idle and cpu_frequency are taken
> > > from:
> > > + * The trace events for cpu_idle and policy_frequency are taken
> > > from:
> > >   * /sys/kernel/tracing/events/power/cpu_idle/format
> > > - * /sys/kernel/tracing/events/power/cpu_frequency/format
> > > + * /sys/kernel/tracing/events/power/policy_frequency/format
> > >   *
> > >   * These two events have same format, so define one common
> > > structure.
> > >   */
> > > @@ -162,7 +162,7 @@ int bpf_prog1(struct cpu_args *ctx)
> > >        */
> > >       if (ctx->state != (u32)-1) {
> > > 
> > > -             /* record pstate after have first cpu_frequency
> > > event */
> > > +             /* record pstate after have first policy_frequency
> > > event */
> > >               if (!*pts)
> > >                       return 0;
> > > 
> > > @@ -208,7 +208,7 @@ int bpf_prog1(struct cpu_args *ctx)
> > >       return 0;
> > >  }
> > > 
> > > -SEC("tracepoint/power/cpu_frequency")
> > > +SEC("tracepoint/power/policy_frequency")
> > >  int bpf_prog2(struct cpu_args *ctx)
> > >  {
> > >       u64 *pts, *cstate, *pstate, cur_ts, delta;
> > > diff --git a/samples/bpf/cpustat_user.c
> > > b/samples/bpf/cpustat_user.c
> > > index 356f756cba0d..f7e81f702358 100644
> > > --- a/samples/bpf/cpustat_user.c
> > > +++ b/samples/bpf/cpustat_user.c
> > > @@ -143,12 +143,12 @@ static int
> > > cpu_stat_inject_cpu_idle_event(void)
> > > 
> > >  /*
> > >   * It's possible to have no any frequency change for long time
> > > and cannot
> > > - * get ftrace event 'trace_cpu_frequency' for long period, this
> > > introduces
> > > + * get ftrace event 'trace_policy_frequency' for long period,
> > > this introduces
> > >   * big deviation for pstate statistics.
> > >   *
> > >   * To solve this issue, below code forces to set
> > > 'scaling_max_freq' to 208MHz
> > > - * for triggering ftrace event 'trace_cpu_frequency' and then
> > > recovery back to
> > > - * the maximum frequency value 1.2GHz.
> > > + * for triggering ftrace event 'trace_policy_frequency' and then
> > > recovery back
> > > + * to the maximum frequency value 1.2GHz.
> > >   */
> > >  static int cpu_stat_inject_cpu_frequency_event(void)
> > >  {
> > > diff --git a/tools/perf/builtin-timechart.c b/tools/perf/builtin-
> > > timechart.c
> > > index 22050c640dfa..3ef1a2fd0493 100644
> > > --- a/tools/perf/builtin-timechart.c
> > > +++ b/tools/perf/builtin-timechart.c
> > > @@ -612,10 +612,10 @@ process_sample_cpu_idle(struct timechart
> > > *tchart __maybe_unused,
> > >  }
> > > 
> > >  static int
> > > -process_sample_cpu_frequency(struct timechart *tchart,
> > > -                          struct evsel *evsel,
> > > -                          struct perf_sample *sample,
> > > -                          const char *backtrace __maybe_unused)
> > > +process_sample_policy_frequency(struct timechart *tchart,
> > > +                             struct evsel *evsel,
> > > +                             struct perf_sample *sample,
> > > +                             const char *backtrace
> > > __maybe_unused)
> > >  {
> > >       u32 state  = evsel__intval(evsel, sample, "state");
> > >       u32 cpu_id = evsel__intval(evsel, sample, "cpu_id");
> > > @@ -1541,7 +1541,7 @@ static int __cmd_timechart(struct timechart
> > > *tchart, const char *output_name)
> > >  {
> > >       const struct evsel_str_handler power_tracepoints[] = {
> > >               { "power:cpu_idle",            
> > > process_sample_cpu_idle },
> > > -             { "power:cpu_frequency",       
> > > process_sample_cpu_frequency },
> > > +             { "power:policy_frequency",    
> > > process_sample_policy_frequency },
> > >               { "sched:sched_wakeup",        
> > > process_sample_sched_wakeup },
> > >               { "sched:sched_switch",        
> > > process_sample_sched_switch },
> > >  #ifdef SUPPORT_OLD_POWER_EVENTS
> > > @@ -1804,7 +1804,7 @@ static int timechart__record(struct
> > > timechart *tchart, int argc, const char **ar
> > >       unsigned int backtrace_args_no =
> > > ARRAY_SIZE(backtrace_args);
> > > 
> > >       const char * const power_args[] = {
> > > -             "-e", "power:cpu_frequency",
> > > +             "-e", "power:policy_frequency",
> > >               "-e", "power:cpu_idle",
> > >       };
> > >       unsigned int power_args_nr = ARRAY_SIZE(power_args);
> > 
> > perf timechart seem to do per-CPU reporting though?
> > So this is broken by not emitting an event per-CPU? At least with a
> > simple s/cpu_frequency/policy_frequency/
> > like here.
> > Similar for the bpf samples technically...
> 
> This kind of boils down to whether or not tracepoints can be regarded
> as ABI and to what extent.
> 

We have tools using tracing. We may need to check those tools.

https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/tools/power/x86/intel_pstate_tracer/intel_pstate_tracer.py?h=next-20251204
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/tools/power/x86/amd_pstate_tracer/amd_pstate_trace.py?h=next-20251204

Thanks,
Srinivas


> In this particular case, I'm not sure I agree with the stated
> motivation.
> 
> First of all, on systems with one CPU per cpufreq policy (the vast
> majority of x86, including AMD, and the ARM systems using the CPPC
> driver AFAICS), the "issue" at hand is actually a non-issue and
> changing the name of the tracepoint alone would confuse things in
> user
> space IIUC.  Those need to work the way they do today.
> 
> On systems with multiple CPUs per cpufreq policy there is some extra
> overhead related to the cpu_frequency tracepoint, but the if someone
> is only interested in the "policy" frequency, they can filter out all
> CPUs belonging to the same policy except for one from the traces,
> don't they?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 1/2] cpufreq: Replace trace_cpu_frequency with trace_policy_frequency
  2025-12-04 16:48       ` Steven Rostedt
@ 2025-12-04 17:24         ` Rafael J. Wysocki
  2025-12-04 18:46           ` Steven Rostedt
  0 siblings, 1 reply; 14+ messages in thread
From: Rafael J. Wysocki @ 2025-12-04 17:24 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Rafael J. Wysocki, Christian Loehle, Samuel Wu, Huang Rui,
	Gautham R. Shenoy, Mario Limonciello, Perry Yuan, Jonathan Corbet,
	Viresh Kumar, Masami Hiramatsu, Mathieu Desnoyers,
	Srinivas Pandruvada, Len Brown, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Ian Rogers, Adrian Hunter, James Clark,
	kernel-team, linux-pm, linux-doc, linux-kernel,
	linux-trace-kernel, bpf, linux-perf-users

On Thu, Dec 4, 2025 at 5:48 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Thu, 4 Dec 2025 15:57:41 +0100
> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>
> > > perf timechart seem to do per-CPU reporting though?
> > > So this is broken by not emitting an event per-CPU? At least with a simple s/cpu_frequency/policy_frequency/
> > > like here.
> > > Similar for the bpf samples technically...
> >
> > This kind of boils down to whether or not tracepoints can be regarded
> > as ABI and to what extent.
>
> They are an ABI and they are not an ABI. It really boils down to "if you
> break the ABI but no user space notices, did you really break the ABI?" the
> answer is "no". But if user space notices, then yes you did. But it is
> possible to still fix user space (I did this with powertop).

My concern is that the patch effectively removes one trace point
(cpu_frequency) and adds another one with a different format
(policy_frequency), updates one utility in the kernel tree and expects
everyone else to somehow know that they should switch over.

I know about at least several people who have their own scripts using
this tracepoint though.

> >
> > In this particular case, I'm not sure I agree with the stated motivation.
> >
> > First of all, on systems with one CPU per cpufreq policy (the vast
> > majority of x86, including AMD, and the ARM systems using the CPPC
> > driver AFAICS), the "issue" at hand is actually a non-issue and
> > changing the name of the tracepoint alone would confuse things in user
> > space IIUC.  Those need to work the way they do today.
>
> If the way the tracepoint changes, it's best to change the name too.
> Tooling can check to see which name is available to determine how to
> process the traces.

If it is updated to do so, yes, but in the meantime?

> >
> > On systems with multiple CPUs per cpufreq policy there is some extra
> > overhead related to the cpu_frequency tracepoint, but the if someone
> > is only interested in the "policy" frequency, they can filter out all
> > CPUs belonging to the same policy except for one from the traces,
> > don't they?
>
> I'm not exactly sure what you mean here. There is an "onchange" trigger you
> can use to trigger a synthetic event whenever a change happens. But I think
> the data here wants to know which CPU had its policy change. Hence the CPU
> mask.

IIUC he wants to trace frequency changes per policy, not per CPU
(because there are cases in which multiple CPUs belong to one policy
and arguably the frequency doesn't need to be traced for all of them),
but tooling should know which CPUs belong to the same policy, so it
should be straightforward to use that knowledge when processing the
traces.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 1/2] cpufreq: Replace trace_cpu_frequency with trace_policy_frequency
  2025-12-04 17:24         ` Rafael J. Wysocki
@ 2025-12-04 18:46           ` Steven Rostedt
  0 siblings, 0 replies; 14+ messages in thread
From: Steven Rostedt @ 2025-12-04 18:46 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Christian Loehle, Samuel Wu, Huang Rui, Gautham R. Shenoy,
	Mario Limonciello, Perry Yuan, Jonathan Corbet, Viresh Kumar,
	Masami Hiramatsu, Mathieu Desnoyers, Srinivas Pandruvada,
	Len Brown, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Ian Rogers,
	Adrian Hunter, James Clark, kernel-team, linux-pm, linux-doc,
	linux-kernel, linux-trace-kernel, bpf, linux-perf-users

On Thu, 4 Dec 2025 18:24:57 +0100
"Rafael J. Wysocki" <rafael@kernel.org> wrote:

> > I'm not exactly sure what you mean here. There is an "onchange" trigger you
> > can use to trigger a synthetic event whenever a change happens. But I think
> > the data here wants to know which CPU had its policy change. Hence the CPU
> > mask.  
> 
> IIUC he wants to trace frequency changes per policy, not per CPU
> (because there are cases in which multiple CPUs belong to one policy
> and arguably the frequency doesn't need to be traced for all of them),
> but tooling should know which CPUs belong to the same policy, so it
> should be straightforward to use that knowledge when processing the
> traces.

In case you only care about frequency changes, you could do this:

 # echo 'freq_change u32 state;' > /sys/kernel/tracing/synthetic_events 
 # echo 'hist:keys=common_type:s=state:onchange($s).trace(freq_change,$s)' > /sys/kernel/tracing/events/power/cpu_frequency/trigger 
 # echo 1 > /sys/kernel/tracing/events/synthetic/freq_change/enable 
 # cat /sys/kernel/tracing/trace
# tracer: nop
#
# entries-in-buffer/entries-written: 2833/2833   #P:56
#
#                                _-----=> irqs-off/BH-disabled
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| / _-=> migrate-disable
#                              |||| /     delay
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
             sed-596089  [034] d..5. 2687140.288806: freq_change: state=2000000
            bash-596090  [020] d.s4. 2687140.290407: freq_change: state=1900000
          <idle>-0       [028] d.s7. 2687140.290425: freq_change: state=3000000
            bash-596090  [020] d..5. 2687140.291152: freq_change: state=1900000
          <idle>-0       [000] dNs5. 2687140.326526: freq_change: state=1200000
       CPU 3/KVM-10724   [019] d.s5. 2687140.358418: freq_change: state=2100000
       CPU 6/KVM-10727   [021] d.h5. 2687140.394403: freq_change: state=1300000
       CPU 6/KVM-10727   [021] d.h5. 2687140.398403: freq_change: state=1400000
       CPU 6/KVM-10727   [021] d.h5. 2687140.402402: freq_change: state=1500000
       CPU 6/KVM-10727   [021] d.h5. 2687140.406400: freq_change: state=1600000
       CPU 6/KVM-10727   [021] d.h5. 2687140.410404: freq_change: state=1700000
[..]

Which BTW, I'll be giving a talk about synthetic events at OSS in Tokyo ;-)
   (cheap plug!)

https://ossjapan2025.sched.com/event/29FoB/synthetic-events-and-tracing-latency-within-the-kernel-steven-rostedt-google?iframe=yes&w=100%&sidebar=yes&bg=no

-- Steve

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 1/2] cpufreq: Replace trace_cpu_frequency with trace_policy_frequency
  2025-12-04 17:21       ` srinivas pandruvada
@ 2025-12-04 18:47         ` Steven Rostedt
  0 siblings, 0 replies; 14+ messages in thread
From: Steven Rostedt @ 2025-12-04 18:47 UTC (permalink / raw)
  To: srinivas pandruvada
  Cc: Rafael J. Wysocki, Christian Loehle, Samuel Wu, Huang Rui,
	Gautham R. Shenoy, Mario Limonciello, Perry Yuan, Jonathan Corbet,
	Viresh Kumar, Masami Hiramatsu, Mathieu Desnoyers, Len Brown,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Ian Rogers,
	Adrian Hunter, James Clark, kernel-team, linux-pm, linux-doc,
	linux-kernel, linux-trace-kernel, bpf, linux-perf-users

On Thu, 04 Dec 2025 09:21:13 -0800
srinivas pandruvada <srinivas.pandruvada@linux.intel.com> wrote:

> We have tools using tracing. We may need to check those tools.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/tools/power/x86/intel_pstate_tracer/intel_pstate_tracer.py?h=next-20251204

I only see this using pstate_sample event.

> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/tools/power/x86/amd_pstate_tracer/amd_pstate_trace.py?h=next-20251204

I only see this using amd_cpu event.

-- Steve

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2025-12-04 18:47 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-12-01 20:24 [PATCH v3 0/2] Replace trace_cpu_frequency with trace_policy_frequency Samuel Wu
2025-12-01 20:24 ` [PATCH v3 1/2] cpufreq: " Samuel Wu
2025-12-02 12:03   ` Douglas Raillard
2025-12-04  0:33     ` Samuel Wu
2025-12-04 12:48   ` Christian Loehle
2025-12-04 14:57     ` Rafael J. Wysocki
2025-12-04 16:48       ` Steven Rostedt
2025-12-04 17:24         ` Rafael J. Wysocki
2025-12-04 18:46           ` Steven Rostedt
2025-12-04 17:21       ` srinivas pandruvada
2025-12-04 18:47         ` Steven Rostedt
2025-12-01 20:24 ` [PATCH v3 2/2] cpufreq: Documentation update for trace_policy_frequency Samuel Wu
2025-12-01 23:11   ` Tiffany Yang
2025-12-02  6:43 ` [PATCH v3 0/2] Replace trace_cpu_frequency with trace_policy_frequency Viresh Kumar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).