* [PATCH v4] perf/core: Fix sampling period inconsistency across CPU migration
From: Minwoo Ahn @ 2026-04-29  9:51 UTC
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, James Clark, Jinkyu Jeong, Minwoo Ahn,
	linux-perf-users, linux-kernel


When per-task software events are sampled, period_left is not
managed consistently when task migration happens. The perf_event
may observe a different hw_perf_event::period_left on the new CPU,
breaking the sampling periodicity. Even if a task was near its
sampling point, it would use a stale period_left after migration.

Introduce struct perf_task_context as a per-task container to
preserve period_left across CPU migrations. A separate structure
is used rather than adding fields to hw_perf_event, because
hw_perf_event is a general-purpose structure shared by all event
types (hardware, software, tracepoint, breakpoint, etc.) and
embedding per-task sampling state there would bloat it for the
majority of events that do not need it. perf_task_context is only
allocated for per-task software sampling events.

Multiple per-CPU perf_event instances originating from the same
perf_event_open caller share a single perf_task_context via
refcounting. The perf_event owner field is used to distinguish events
from different perf_event_open callers, preventing unrelated sampling
sessions from interfering with each other. For inherited events
(where owner is NULL), child events are matched by verifying that
their parent events share the same perf_task_ctxp, ensuring only
events from the same profiling session share context. The
allocation condition for inherited events checks that the parent
event actually has a perf_task_ctxp, ensuring only genuine
software events propagate the context. The perf_task_context
lookup uses perf_lock_task_context() to safely access the task's
event context under proper RCU and IRQ protection.
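
Restated outside the kernel for readability (simplified types and
names; the authoritative version is perf_get_task_ctxp() in the
patch below), the sharing rule amounts to:

  /* Simplified restatement of the sharing rule; not kernel code. */
  struct ev {
          void *owner;             /* opener task, NULL when inherited */
          void *task_ctxp;         /* this event's perf_task_context   */
          void *parent_task_ctxp;  /* parent event's perf_task_context,
                                      NULL if the event has no parent  */
          unsigned long long config, sample_period;
  };

  static int shares_task_ctx(const struct ev *iter, const struct ev *nev,
                             const void *cur, const struct ev *parent)
  {
          int same_opener  = iter->owner == cur;
          int same_lineage = !iter->owner && parent &&
                             iter->parent_task_ctxp &&
                             iter->parent_task_ctxp == parent->task_ctxp;

          return iter->task_ctxp && (same_opener || same_lineage) &&
                 iter->config == nev->config &&
                 iter->sample_period == nev->sample_period;
  }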

perf_task_context serves purely as a transport for period_left
across CPU migrations. On event removal (swevent_del for non-clock
events, cancel_hrtimer for clock events), hw_perf_event::period_left
is backed up to perf_task_context::period_left. On event addition
(swevent_add for non-clock events, start_hrtimer for clock events),
perf_task_context::period_left is restored to hw_perf_event::period_left.
During normal operation between migrations, hw_perf_event::period_left
remains the sole working copy, keeping existing code paths unaffected.
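
As a self-contained userspace model of this pairing (illustrative
only; plain variables stand in for the kernel structures), the toy
program below migrates a task 0.6s into a 1s period. It prints a
first sample at t=1.6s without the transfer and at t=1.0s with it:

  #include <stdio.h>

  #define SAMPLE_PERIOD 1000              /* ms */

  static long per_cpu_left[2];    /* hw_perf_event::period_left role     */
  static long task_left;          /* perf_task_context::period_left role */

  /* Run the task on @cpu for @ms; return the sample offset, or -1. */
  static long run(int cpu, long ms, int transfer)
  {
          long fired_at = -1;

          if (transfer)                   /* event add: restore */
                  per_cpu_left[cpu] = task_left;

          per_cpu_left[cpu] -= ms;
          if (per_cpu_left[cpu] <= 0) {
                  fired_at = ms + per_cpu_left[cpu];
                  per_cpu_left[cpu] += SAMPLE_PERIOD;
          }

          if (transfer)                   /* event del: back up */
                  task_left = per_cpu_left[cpu];
          return fired_at;
  }

  int main(void)
  {
          int transfer;

          for (transfer = 0; transfer <= 1; transfer++) {
                  long at;

                  per_cpu_left[0] = per_cpu_left[1] = SAMPLE_PERIOD;
                  task_left = SAMPLE_PERIOD;

                  run(0, 600, transfer);  /* 0.6s on CPU0, then migrate */
                  at = run(1, 1000, transfer);
                  printf("%s transfer: first sample at t=%.1fs\n",
                         transfer ? "with" : "without",
                         (600 + at) / 1000.0);
          }
          return 0;
  }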

To reproduce, force CPU migration during task-clock sampling:

  $ sysbench cpu --threads=1 --time=60 run &
  $ sleep 0.1
  $ TID=$(ls /proc/$!/task/ | grep -v "^$!$")
  $ perf record -e task-clock -c 1000000000 -t $TID &

  # Force migration across CPUs every 1.2 seconds
  $ while kill -0 $TID 2>/dev/null; do
        taskset -p -c 0 $TID; sleep 1.2
        taskset -p -c 1 $TID; sleep 1.2
        taskset -p -c 2 $TID; sleep 1.2
    done

  # Check sample intervals (expected: ~1.000s each)
  $ perf script -F time | \
        awk 'NR==1 {prev=$1; next} {print $1-prev; prev=$1}'

Without this patch, sample intervals show significant deviation
from the expected 1-second period after each migration. With this
patch, intervals remain consistent.

Co-developed-by: Jinkyu Jeong <jinkyu@yonsei.ac.kr>
Signed-off-by: Jinkyu Jeong <jinkyu@yonsei.ac.kr>
Signed-off-by: Minwoo Ahn <mwahn402@gmail.com>
---
Changes in v4:
- Fix perf_event_context reference leak by adding put_ctx() after
  perf_lock_task_context()
- Use parent event lineage (parent->perf_task_ctxp) to match inherited
  events, preventing cross-session interference when multiple profiling
  sessions monitor the same task

Changes in v3:
- Move struct perf_task_context and perf_event_equal_task_ctx macro
  inside #ifdef CONFIG_PERF_EVENTS guard to fix build error on
  configs where CONFIG_PERF_EVENTS is disabled (local64_t undefined)

Changes in v2:
- Use perf_lock_task_context() to safely access the task's event context,
  avoiding a potential use-after-free and IRQ inversion deadlock
- Tighten allocation condition for inherited events by checking
  parent_event->perf_task_ctxp instead of just parent_event

 include/linux/perf_event.h | 18 +++++++++++
 kernel/events/core.c       | 81 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 99 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 48d851fbd8ea..f8c973325c76 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -829,6 +829,9 @@ struct perf_event {
 	u16				read_size;
 	struct hw_perf_event		hw;
 
+	/* Per-task sampling state for sw events, survives CPU migration */
+	struct perf_task_context	*perf_task_ctxp;
+
 	struct perf_event_context	*ctx;
 	/*
 	 * event->pmu_ctx points to perf_event_pmu_context in which the event
@@ -1207,6 +1210,21 @@ perf_cgroup_from_task(struct task_struct *task, struct perf_event_context *ctx)
 
 #ifdef CONFIG_PERF_EVENTS
 
+#define perf_event_equal_task_ctx(a1, a2)	\
+	((a1)->config == (a2)->config &&	\
+	 (a1)->sample_period == (a2)->sample_period)
+
+/**
+ * struct perf_task_context - per-task software event context
+ *
+ * Shared across per-CPU perf_event instances of the same task to
+ * preserve period_left across CPU migrations.
+ */
+struct perf_task_context {
+	refcount_t			refcount;
+	local64_t			period_left;
+};
+
 extern struct perf_event_context *perf_cpu_task_ctx(void);
 
 extern void *perf_aux_output_begin(struct perf_output_handle *handle,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 6d1f8bad7e1c..eec0e822ef6e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5740,6 +5740,13 @@ static bool exclusive_event_installable(struct perf_event *event,
 
 static void perf_free_addr_filters(struct perf_event *event);
 
+static void perf_put_task_ctxp(struct perf_event *event)
+{
+	if (event->perf_task_ctxp &&
+	    refcount_dec_and_test(&event->perf_task_ctxp->refcount))
+		kfree(event->perf_task_ctxp);
+}
+
 /* vs perf_event_alloc() error */
 static void __free_event(struct perf_event *event)
 {
@@ -5761,6 +5768,9 @@ static void __free_event(struct perf_event *event)
 	if (event->attach_state & PERF_ATTACH_TASK_DATA)
 		detach_perf_ctx_data(event);
 
+	if (event->perf_task_ctxp)
+		perf_put_task_ctxp(event);
+
 	if (event->destroy)
 		event->destroy(event);
 
@@ -11054,9 +11064,14 @@ static void perf_swevent_read(struct perf_event *event)
 static int perf_swevent_add(struct perf_event *event, int flags)
 {
 	struct swevent_htable *swhash = this_cpu_ptr(&swevent_htable);
+	struct perf_task_context *ctxp = event->perf_task_ctxp;
 	struct hw_perf_event *hwc = &event->hw;
 	struct hlist_head *head;
 
+	if (ctxp)
+		local64_set(&hwc->period_left,
+			    local64_read(&ctxp->period_left));
+
 	if (is_sampling_event(event)) {
 		hwc->last_period = hwc->sample_period;
 		perf_swevent_set_period(event);
@@ -11076,7 +11091,13 @@ static int perf_swevent_add(struct perf_event *event, int flags)
 
 static void perf_swevent_del(struct perf_event *event, int flags)
 {
+	struct perf_task_context *ctxp = event->perf_task_ctxp;
+
 	hlist_del_rcu(&event->hlist_entry);
+
+	if (ctxp)
+		local64_set(&ctxp->period_left,
+			    local64_read(&event->hw.period_left));
 }
 
 static void perf_swevent_start(struct perf_event *event, int flags)
@@ -12203,12 +12224,17 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
 
 static void perf_swevent_start_hrtimer(struct perf_event *event)
 {
+	struct perf_task_context *ctxp = event->perf_task_ctxp;
 	struct hw_perf_event *hwc = &event->hw;
 	s64 period;
 
 	if (!is_sampling_event(event))
 		return;
 
+	if (ctxp)
+		local64_set(&hwc->period_left,
+			    local64_read(&ctxp->period_left));
+
 	period = local64_read(&hwc->period_left);
 	if (period) {
 		if (period < 0)
@@ -12224,6 +12250,7 @@ static void perf_swevent_start_hrtimer(struct perf_event *event)
 
 static void perf_swevent_cancel_hrtimer(struct perf_event *event)
 {
+	struct perf_task_context *ctxp = event->perf_task_ctxp;
 	struct hw_perf_event *hwc = &event->hw;
 
 	/*
@@ -12238,8 +12265,13 @@ static void perf_swevent_cancel_hrtimer(struct perf_event *event)
 	 */
 	if (is_sampling_event(event) && (hwc->interrupts != MAX_INTERRUPTS)) {
 		ktime_t remaining = hrtimer_get_remaining(&hwc->hrtimer);
+
 		local64_set(&hwc->period_left, ktime_to_ns(remaining));
 
+		if (ctxp)
+			local64_set(&ctxp->period_left,
+				    ktime_to_ns(remaining));
+
 		hrtimer_try_to_cancel(&hwc->hrtimer);
 	}
 }
@@ -13259,6 +13291,45 @@ static void account_event(struct perf_event *event)
 	account_pmu_sb_event(event);
 }
 
+static struct perf_task_context *
+perf_get_task_ctxp(struct perf_event *event, struct task_struct *task,
+		   struct perf_event *parent_event)
+{
+	struct perf_task_context *ctxp = NULL;
+	struct perf_event_context *ctx;
+	struct perf_event *iter;
+	unsigned long flags;
+
+	ctx = perf_lock_task_context(task, &flags);
+	if (ctx) {
+		list_for_each_entry(iter, &ctx->event_list, event_entry) {
+			if (iter->perf_task_ctxp &&
+			    (iter->owner == current ||
+			     (parent_event && !iter->owner &&
+			      iter->parent &&
+			      iter->parent->perf_task_ctxp ==
+			      parent_event->perf_task_ctxp)) &&
+			    perf_event_equal_task_ctx(&iter->attr,
+						     &event->attr)) {
+				ctxp = iter->perf_task_ctxp;
+				refcount_inc(&ctxp->refcount);
+				break;
+			}
+		}
+		raw_spin_unlock_irqrestore(&ctx->lock, flags);
+		put_ctx(ctx);
+	}
+
+	if (!ctxp) {
+		ctxp = kzalloc(sizeof(*ctxp), GFP_KERNEL);
+		if (!ctxp)
+			return NULL;
+		refcount_set(&ctxp->refcount, 1);
+	}
+
+	return ctxp;
+}
+
 /*
  * Allocate and initialize an event structure
  */
@@ -13344,6 +13415,16 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 		 * pmu before we get a ctx.
 		 */
 		event->hw.target = get_task_struct(task);
+
+		if (attr->sample_period &&
+		    attr->config < PERF_COUNT_SW_MAX &&
+		    (attr->type == PERF_TYPE_SOFTWARE ||
+		     (parent_event && parent_event->perf_task_ctxp))) {
+			event->perf_task_ctxp = perf_get_task_ctxp(event, task,
+							parent_event);
+			if (!event->perf_task_ctxp)
+				return ERR_PTR(-ENOMEM);
+		}
 	}
 
 	event->clock = &local_clock;
-- 
2.49.0



* Re: [PATCH v4] perf/core: Fix sampling period inconsistency across CPU migration
From: Peter Zijlstra @ 2026-05-04  8:08 UTC
  To: Minwoo Ahn
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, Jinkyu Jeong, linux-perf-users, linux-kernel

On Wed, Apr 29, 2026 at 09:51:34AM +0000, Minwoo Ahn wrote:
> 
> When per-task software events are sampled, period_left is not
> managed consistently when task migration happens. The perf_event
> may observe a different hw_perf_event::period_left on the new CPU,
> breaking the sampling periodicity. Even if a task was near its
> sampling point, it would use a stale period_left after migration.

How? This is just vague words, not actually saying anything of
substance.

> Introduce struct perf_task_context as a per-task container to

How can you propose a solution to a non-defined problem?


* Re: [PATCH v4] perf/core: Fix sampling period inconsistency across CPU migration
From: Minwoo Ahn @ 2026-05-04 13:52 UTC
  To: peterz
  Cc: acme, adrian.hunter, alexander.shishkin, irogers, james.clark,
	jinkyu, jolsa, linux-kernel, linux-perf-users, mark.rutland,
	mingo, mwahn402, namhyung

On Mon, May 04, 2026 at 10:08:32AM +0200, Peter Zijlstra wrote:
> On Wed, Apr 29, 2026 at 09:51:34AM +0000, Minwoo Ahn wrote:
> > 
> > When per-task software events are sampled, period_left is not
> > managed consistently when task migration happens. The perf_event
> > may observe a different hw_perf_event::period_left on the new CPU,
> > breaking the sampling periodicity. Even if a task was near its
> > sampling point, it would use a stale period_left after migration.
> 
> How? This is just vague words, not actually saying anything of
> substance.
> 
> > Introduce struct perf_task_context as a per-task container to
> 
> How can you propose a solution to a non-defined problem?

Let me describe the problem more concretely.

A common per-task sampling invocation such as
`perf record -e task-clock -c 1000000 ./a.out` or
`perf record -t TID` opens one perf_event for each online CPU on the
system (unless the user constrains it with `-C`). So a single target
task ends up with N distinct perf_event objects, each carrying its
own hw_perf_event::period_left that only counts down while the task
is running on the CPU that object is bound to.
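
For concreteness, a minimal userspace sketch of what this mode
amounts to (illustrative code, not perf's actual source; error
handling elided):

  #include <linux/perf_event.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static int open_task_clock(pid_t tid, int cpu)
  {
          struct perf_event_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.size = sizeof(attr);
          attr.type = PERF_TYPE_SOFTWARE;
          attr.config = PERF_COUNT_SW_TASK_CLOCK;
          attr.sample_period = 1000000000ULL;     /* 1s of task time */

          /*
           * cpu >= 0: this event counts tid only while it runs on
           * that CPU.  cpu == -1 would be the single-event
           * --per-thread mode mentioned below.
           */
          return syscall(__NR_perf_event_open, &attr, tid, cpu, -1, 0);
  }

  int main(int argc, char **argv)
  {
          pid_t tid;
          int cpu, ncpus = sysconf(_SC_NPROCESSORS_ONLN);

          if (argc < 2)
                  return 1;
          tid = atoi(argv[1]);

          /* one event per CPU, each with its own hw.period_left */
          for (cpu = 0; cpu < ncpus; cpu++)
                  printf("cpu%d: fd=%d\n", cpu,
                         open_task_clock(tid, cpu));
          return 0;
  }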

When the task migrates from CPU_X to CPU_Y, the event on CPU_X is
scheduled out and the event on CPU_Y is scheduled in. They are
distinct objects, so the partial period accumulated on CPU_X stays
behind in that object and is effectively lost, while CPU_Y's event
resumes from whatever period_left it last held -- typically a full
sample_period left over from when its hrtimer was last armed.

Timeline (sample_period = 1.0s, migrate at t=0.6s):

  t=0.0s   on CPU0; event_CPU0->period_left = 1.0s
  t=0.6s   migrate CPU0 -> CPU1
           event_CPU0 sched_out, period_left = 0.4s
           event_CPU1 sched_in,  period_left = 1.0s   (never decremented)
  t=1.6s   first sample fires on CPU1   (task expected one at t=1.0s)

So sample intervals seen by the task no longer correspond to
sample_period.

The example above uses task-clock for clarity (sample_period is in
nanoseconds and period_left counts down with on-CPU time), but the
same mechanism applies to any per-task software sampling event: each
per-CPU object holds its own period_left, and the partial progress
made on the source CPU is not transferred to the destination CPU on
migration. The progress lost is measured in whatever unit the event
counts -- occurrences for non-clock software events, elapsed task
time for clock events.

`perf record --per-thread` avoids this by opening a single
perf_event with cpu=-1, but it also disables inheritance, so it
cannot sample threads spawned after the run starts. The default
per-CPU mode is the one that supports inheritance, and it is the
mode where the inconsistency above is observed.
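
Concretely, with the flags from the reproduction steps above:

  # single event with cpu == -1: unaffected, but no inheritance
  $ perf record --per-thread -e task-clock -c 1000000000 -t $TID

  # default per-CPU mode: inheritance works, inconsistency applies
  $ perf record -e task-clock -c 1000000000 -t $TID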

The per-task period_left should remain consistent from the task's
point of view across migrations, and that is the motivation behind
this patch.

Thanks,
Minwoo

