linux-perf-users.vger.kernel.org archive mirror
From: "Liang, Kan" <kan.liang@linux.intel.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Mingwei Zhang <mizhang@google.com>,
	Sean Christopherson <seanjc@google.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Xiong Zhang <xiong.y.zhang@intel.com>,
	Dapeng Mi <dapeng1.mi@linux.intel.com>,
	Kan Liang <kan.liang@intel.com>,
	Zhenyu Wang <zhenyuw@linux.intel.com>,
	Manali Shukla <manali.shukla@amd.com>,
	Sandipan Das <sandipan.das@amd.com>,
	Jim Mattson <jmattson@google.com>,
	Stephane Eranian <eranian@google.com>,
	Ian Rogers <irogers@google.com>,
	Namhyung Kim <namhyung@kernel.org>,
	gce-passthrou-pmu-dev@google.com,
	Samantha Alt <samantha.alt@intel.com>,
	Zhiyuan Lv <zhiyuan.lv@intel.com>,
	Yanfei Xu <yanfei.xu@intel.com>, maobibo <maobibo@loongson.cn>,
	Like Xu <like.xu.linux@gmail.com>,
	kvm@vger.kernel.org, linux-perf-users@vger.kernel.org
Subject: Re: [PATCH v2 07/54] perf: Add generic exclude_guest support
Date: Tue, 11 Jun 2024 09:27:46 -0400	[thread overview]
Message-ID: <0a403a6c-8d55-42cb-a90c-c13e1458b45e@linux.intel.com> (raw)
In-Reply-To: <20240611120641.GF8774@noisy.programming.kicks-ass.net>



On 2024-06-11 8:06 a.m., Peter Zijlstra wrote:
> On Mon, Jun 10, 2024 at 01:23:04PM -0400, Liang, Kan wrote:
>> On 2024-05-07 4:58 a.m., Peter Zijlstra wrote:
> 
>>> Bah, this is a ton of copy-paste from the normal scheduling code with
>>> random changes. Why ?
>>>
>>> Why can't this use ctx_sched_{in,out}() ? Surely the whole
>>> CAP_PASSTHROUGHT thing is but a flag away.
>>>
>>
>> Not just a flag. The time has to be updated as well, since the ctx->time
>> is shared among PMUs. Perf shouldn't stop it while other PMUs are still
>> running.
> 
> Obviously the original changelog didn't mention any of that.... :/

Yes, the time issue is a bug we found only recently, while testing the
uncore PMUs.

> 
>> A timeguest will be introduced to track the start time of a guest.
>> The event->tstamp of an exclude_guest event should always keep
>> ctx->timeguest while a guest is running.
>> When a guest is leaving, update the event->tstamp to now, so the guest
>> time can be deducted.
>>
>> The patch below demonstrates how the timeguest works.
>> (It's an incomplete patch. Just to show the idea.)
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index 22d3e56682c9..2134e6886e22 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -953,6 +953,7 @@ struct perf_event_context {
>>  	u64				time;
>>  	u64				timestamp;
>>  	u64				timeoffset;
>> +	u64				timeguest;
>>
>>  	/*
>>  	 * These fields let us detect when two contexts have both
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 14fd881e3e1d..2aed56671a24 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -690,12 +690,31 @@ __perf_update_times(struct perf_event *event, u64 now, u64 *enabled, u64 *running)
>>  		*running += delta;
>>  }
>>
>> +static void perf_event_update_time_guest(struct perf_event *event)
>> +{
>> +	/*
>> +	 * If a guest is running, use the timestamp while entering the guest.
>> +	 * If the guest is leaving, reset the event timestamp.
>> +	 */
>> +	if (!__this_cpu_read(perf_in_guest))
>> +		event->tstamp = event->ctx->time;
>> +	else
>> +		event->tstamp = event->ctx->timeguest;
>> +}
> 
> This conditional seems inverted, without a good reason. Also, in another
> thread you talk about some PMUs stopping time in a guest, while other
> PMUs would keep ticking. I don't think the above captures that.
> 
>>  static void perf_event_update_time(struct perf_event *event)
>>  {
>> -	u64 now = perf_event_time(event);
>> +	u64 now;
>> +
>> +	/* Never count the time of an active guest into an exclude_guest event. */
>> +	if (event->ctx->timeguest &&
>> +	    event->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU)
>> +		return perf_event_update_time_guest(event);
> 
> Urgh, weird split. The PMU check is here. Please just inline the above
> here, this seems to be the sole caller anyway.
>

Sure

>>
>> +	now = perf_event_time(event);
>>  	__perf_update_times(event, now, &event->total_time_enabled,
>>  					&event->total_time_running);
>> +
>>  	event->tstamp = now;
>>  }
>>
>> @@ -3398,7 +3417,14 @@ ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type)
>>  			cpuctx->task_ctx = NULL;
>>  	}
>>
>> -	is_active ^= ctx->is_active; /* changed bits */
>> +	if (event_type & EVENT_GUEST) {
> 
> Patch doesn't introduce EVENT_GUEST, lost a hunk somewhere?

Sorry, there is a prerequisite patch that factors out the EVENT_CGROUP
handling. I thought it would be complex and confusing to paste both, so
some details were lost. I will post both patches at the end.

> 
>> +		/*
>> +		 * Schedule out all !exclude_guest events of PMU
>> +		 * with PERF_PMU_CAP_PASSTHROUGH_VPMU.
>> +		 */
>> +		is_active = EVENT_ALL;
>> +	} else
>> +		is_active ^= ctx->is_active; /* changed bits */
>>
>>  	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
>>  		if (perf_skip_pmu_ctx(pmu_ctx, event_type))
> 
>> @@ -5894,14 +5933,18 @@ void perf_guest_enter(u32 guest_lvtpc)
>>  	}
>>
>>  	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
>> -	ctx_sched_out(&cpuctx->ctx, EVENT_ALL | EVENT_GUEST);
>> +	ctx_sched_out(&cpuctx->ctx, EVENT_GUEST);
>> +	/* Set the guest start time */
>> +	cpuctx->ctx.timeguest = cpuctx->ctx.time;
>>  	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
>>  	if (cpuctx->task_ctx) {
>>  		perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
>> -		task_ctx_sched_out(cpuctx->task_ctx, EVENT_ALL | EVENT_GUEST);
>> +		task_ctx_sched_out(cpuctx->task_ctx, EVENT_GUEST);
>> +		cpuctx->task_ctx->timeguest = cpuctx->task_ctx->time;
>>  		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
>>  	}
>>
>>  	__this_cpu_write(perf_in_guest, true);
>> @@ -5925,14 +5968,17 @@ void perf_guest_exit(void)
>>
>>  	__this_cpu_write(perf_in_guest, false);
>>
>>  	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
>> -	ctx_sched_in(&cpuctx->ctx, EVENT_ALL | EVENT_GUEST);
>> +	ctx_sched_in(&cpuctx->ctx, EVENT_GUEST);
>> +	cpuctx->ctx.timeguest = 0;
>>  	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
>>  	if (cpuctx->task_ctx) {
>>  		perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
>> -		ctx_sched_in(cpuctx->task_ctx, EVENT_ALL | EVENT_GUEST);
>> +		ctx_sched_in(cpuctx->task_ctx, EVENT_GUEST);
>> +		cpuctx->task_ctx->timeguest = 0;
>>  		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
>>  	}
> 
> I'm thinking EVENT_GUEST should cause the ->timeguest updates, no point
> in having them explicitly duplicated here, hmm?

We would have to add an EVENT_GUEST check and update ->timeguest at the
end of ctx_sched_out/in(), after the pmu_ctx_sched_out/in(), because
->timeguest is also used in perf_event_update_time() to tell that we are
leaving the guest.

Since EVENT_GUEST is only used by perf_guest_enter/exit(), I thought it
would be better to keep the ->timeguest update where it is used rather
than in the generic ctx_sched_out/in(). That minimizes the impact on
non-virtualization users.
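
As an illustration of the intended accounting (made-up numbers, as I
read the patches below): an exclude_guest event on a passthrough PMU
starts counting at ctx->time = 0. perf_guest_enter() runs at
ctx->time = 100, so the event is scheduled out with its enabled/running
time accumulated up to 100 and ctx->timeguest set to 100. While the
guest runs, any update only rewrites event->tstamp to timeguest and
accumulates nothing. perf_guest_exit() at ctx->time = 130 schedules the
event back in and moves event->tstamp to 130 before clearing timeguest,
so a read at ctx->time = 150 reports 100 + 20 = 120, i.e. the 30 units
of guest time are deducted.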

Here are the two complete patches.

The first one factors out the EVENT_CGROUP handling.


From c508f2b0e11a2eea71fe3ff16d1d848359ede535 Mon Sep 17 00:00:00 2001
From: Kan Liang <kan.liang@linux.intel.com>
Date: Mon, 27 May 2024 06:58:29 -0700
Subject: [PATCH 1/2] perf: Skip pmu_ctx based on event_type

To optimize the cgroup context switch, the perf_event_pmu_context
iteration skips the PMUs without cgroup events. A bool cgroup argument
was introduced to indicate that case. It works, but it is hard to extend
to other cases, e.g. skipping non-passthrough PMUs, and it doesn't make
sense to keep adding bool arguments.

Pass the event_type instead of a dedicated bool argument. Check both the
event_type and the related pmu_ctx fields to decide whether to skip a
PMU.

Event flags, e.g. EVENT_CGROUP, should be cleared from ctx->is_active.
Add EVENT_FLAGS to cover such event flags.

No functional change.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 kernel/events/core.c | 70 +++++++++++++++++++++++---------------------
 1 file changed, 37 insertions(+), 33 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index abd4027e3859..95d1d5a5addc 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -376,6 +376,7 @@ enum event_type_t {
 	/* see ctx_resched() for details */
 	EVENT_CPU = 0x8,
 	EVENT_CGROUP = 0x10,
+	EVENT_FLAGS = EVENT_CGROUP,
 	EVENT_ALL = EVENT_FLEXIBLE | EVENT_PINNED,
 };

@@ -699,23 +700,32 @@ do {									\
 	___p;								\
 })

-static void perf_ctx_disable(struct perf_event_context *ctx, bool cgroup)
+static bool perf_skip_pmu_ctx(struct perf_event_pmu_context *pmu_ctx,
+			      enum event_type_t event_type)
+{
+	if ((event_type & EVENT_CGROUP) && !pmu_ctx->nr_cgroups)
+		return true;
+
+	return false;
+}
+
+static void perf_ctx_disable(struct perf_event_context *ctx, enum event_type_t event_type)
 {
 	struct perf_event_pmu_context *pmu_ctx;

 	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
-		if (cgroup && !pmu_ctx->nr_cgroups)
+		if (perf_skip_pmu_ctx(pmu_ctx, event_type))
 			continue;
 		perf_pmu_disable(pmu_ctx->pmu);
 	}
 }

-static void perf_ctx_enable(struct perf_event_context *ctx, bool cgroup)
+static void perf_ctx_enable(struct perf_event_context *ctx, enum event_type_t event_type)
 {
 	struct perf_event_pmu_context *pmu_ctx;

 	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
-		if (cgroup && !pmu_ctx->nr_cgroups)
+		if (perf_skip_pmu_ctx(pmu_ctx, event_type))
 			continue;
 		perf_pmu_enable(pmu_ctx->pmu);
 	}
@@ -877,7 +887,7 @@ static void perf_cgroup_switch(struct task_struct *task)
 		return;

 	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
-	perf_ctx_disable(&cpuctx->ctx, true);
+	perf_ctx_disable(&cpuctx->ctx, EVENT_CGROUP);

 	ctx_sched_out(&cpuctx->ctx, EVENT_ALL|EVENT_CGROUP);
 	/*
@@ -893,7 +903,7 @@ static void perf_cgroup_switch(struct task_struct *task)
 	 */
 	ctx_sched_in(&cpuctx->ctx, EVENT_ALL|EVENT_CGROUP);

-	perf_ctx_enable(&cpuctx->ctx, true);
+	perf_ctx_enable(&cpuctx->ctx, EVENT_CGROUP);
 	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
 }

@@ -2729,9 +2739,9 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,

 	event_type &= EVENT_ALL;

-	perf_ctx_disable(&cpuctx->ctx, false);
+	perf_ctx_disable(&cpuctx->ctx, 0);
 	if (task_ctx) {
-		perf_ctx_disable(task_ctx, false);
+		perf_ctx_disable(task_ctx, 0);
 		task_ctx_sched_out(task_ctx, event_type);
 	}

@@ -2749,9 +2759,9 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,

 	perf_event_sched_in(cpuctx, task_ctx);

-	perf_ctx_enable(&cpuctx->ctx, false);
+	perf_ctx_enable(&cpuctx->ctx, 0);
 	if (task_ctx)
-		perf_ctx_enable(task_ctx, false);
+		perf_ctx_enable(task_ctx, 0);
 }

 void perf_pmu_resched(struct pmu *pmu)
@@ -3296,9 +3306,6 @@ ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type)
 	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
 	struct perf_event_pmu_context *pmu_ctx;
 	int is_active = ctx->is_active;
-	bool cgroup = event_type & EVENT_CGROUP;
-
-	event_type &= ~EVENT_CGROUP;

 	lockdep_assert_held(&ctx->lock);

@@ -3333,7 +3340,7 @@ ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type)
 		barrier();
 	}

-	ctx->is_active &= ~event_type;
+	ctx->is_active &= ~(event_type & ~EVENT_FLAGS);
 	if (!(ctx->is_active & EVENT_ALL))
 		ctx->is_active = 0;

@@ -3346,7 +3353,7 @@ ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type)
 	is_active ^= ctx->is_active; /* changed bits */

 	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
-		if (cgroup && !pmu_ctx->nr_cgroups)
+		if (perf_skip_pmu_ctx(pmu_ctx, event_type))
 			continue;
 		__pmu_ctx_sched_out(pmu_ctx, is_active);
 	}
@@ -3540,7 +3547,7 @@ perf_event_context_sched_out(struct task_struct *task, struct task_struct *next)
 		raw_spin_lock_nested(&next_ctx->lock, SINGLE_DEPTH_NESTING);
 		if (context_equiv(ctx, next_ctx)) {

-			perf_ctx_disable(ctx, false);
+			perf_ctx_disable(ctx, 0);

 			/* PMIs are disabled; ctx->nr_pending is stable. */
 			if (local_read(&ctx->nr_pending) ||
@@ -3560,7 +3567,7 @@ perf_event_context_sched_out(struct task_struct *task, struct task_struct *next)
 			perf_ctx_sched_task_cb(ctx, false);
 			perf_event_swap_task_ctx_data(ctx, next_ctx);

-			perf_ctx_enable(ctx, false);
+			perf_ctx_enable(ctx, 0);

 			/*
 			 * RCU_INIT_POINTER here is safe because we've not
@@ -3584,13 +3591,13 @@ perf_event_context_sched_out(struct task_struct *task, struct task_struct *next)

 	if (do_switch) {
 		raw_spin_lock(&ctx->lock);
-		perf_ctx_disable(ctx, false);
+		perf_ctx_disable(ctx, 0);

 inside_switch:
 		perf_ctx_sched_task_cb(ctx, false);
 		task_ctx_sched_out(ctx, EVENT_ALL);

-		perf_ctx_enable(ctx, false);
+		perf_ctx_enable(ctx, 0);
 		raw_spin_unlock(&ctx->lock);
 	}
 }
@@ -3887,12 +3894,12 @@ static void pmu_groups_sched_in(struct perf_event_context *ctx,

 static void ctx_groups_sched_in(struct perf_event_context *ctx,
 				struct perf_event_groups *groups,
-				bool cgroup)
+				enum event_type_t event_type)
 {
 	struct perf_event_pmu_context *pmu_ctx;

 	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
-		if (cgroup && !pmu_ctx->nr_cgroups)
+		if (perf_skip_pmu_ctx(pmu_ctx, event_type))
 			continue;
 		pmu_groups_sched_in(ctx, groups, pmu_ctx->pmu);
 	}
@@ -3909,9 +3916,6 @@ ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type)
 {
 	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
 	int is_active = ctx->is_active;
-	bool cgroup = event_type & EVENT_CGROUP;
-
-	event_type &= ~EVENT_CGROUP;

 	lockdep_assert_held(&ctx->lock);

@@ -3929,7 +3933,7 @@ ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type)
 		barrier();
 	}

-	ctx->is_active |= (event_type | EVENT_TIME);
+	ctx->is_active |= ((event_type & ~EVENT_FLAGS) | EVENT_TIME);
 	if (ctx->task) {
 		if (!is_active)
 			cpuctx->task_ctx = ctx;
@@ -3944,11 +3948,11 @@ ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type)
 	 * in order to give them the best chance of going on.
 	 */
 	if (is_active & EVENT_PINNED)
-		ctx_groups_sched_in(ctx, &ctx->pinned_groups, cgroup);
+		ctx_groups_sched_in(ctx, &ctx->pinned_groups, event_type);

 	/* Then walk through the lower prio flexible groups */
 	if (is_active & EVENT_FLEXIBLE)
-		ctx_groups_sched_in(ctx, &ctx->flexible_groups, cgroup);
+		ctx_groups_sched_in(ctx, &ctx->flexible_groups, event_type);
 }

 static void perf_event_context_sched_in(struct task_struct *task)
@@ -3963,11 +3967,11 @@ static void perf_event_context_sched_in(struct task_struct *task)

 	if (cpuctx->task_ctx == ctx) {
 		perf_ctx_lock(cpuctx, ctx);
-		perf_ctx_disable(ctx, false);
+		perf_ctx_disable(ctx, 0);

 		perf_ctx_sched_task_cb(ctx, true);

-		perf_ctx_enable(ctx, false);
+		perf_ctx_enable(ctx, 0);
 		perf_ctx_unlock(cpuctx, ctx);
 		goto rcu_unlock;
 	}
@@ -3980,7 +3984,7 @@ static void perf_event_context_sched_in(struct task_struct *task)
 	if (!ctx->nr_events)
 		goto unlock;

-	perf_ctx_disable(ctx, false);
+	perf_ctx_disable(ctx, 0);
 	/*
 	 * We want to keep the following priority order:
 	 * cpu pinned (that don't need to move), task pinned,
@@ -3990,7 +3994,7 @@ static void perf_event_context_sched_in(struct task_struct *task)
 	 * events, no need to flip the cpuctx's events around.
 	 */
 	if (!RB_EMPTY_ROOT(&ctx->pinned_groups.tree)) {
-		perf_ctx_disable(&cpuctx->ctx, false);
+		perf_ctx_disable(&cpuctx->ctx, 0);
 		ctx_sched_out(&cpuctx->ctx, EVENT_FLEXIBLE);
 	}

@@ -3999,9 +4003,9 @@ static void perf_event_context_sched_in(struct task_struct *task)
 	perf_ctx_sched_task_cb(cpuctx->task_ctx, true);

 	if (!RB_EMPTY_ROOT(&ctx->pinned_groups.tree))
-		perf_ctx_enable(&cpuctx->ctx, false);
+		perf_ctx_enable(&cpuctx->ctx, 0);

-	perf_ctx_enable(ctx, false);
+	perf_ctx_enable(ctx, 0);

 unlock:
 	perf_ctx_unlock(cpuctx, ctx);


Here is the second patch, which introduces EVENT_GUEST and
perf_guest_enter/exit().


From c052e673dc3e0e7ebacdce23f2b0d50ec98401b3 Mon Sep 17 00:00:00 2001
From: Kan Liang <kan.liang@linux.intel.com>
Date: Tue, 11 Jun 2024 06:04:20 -0700
Subject: [PATCH 2/2] perf: Add generic exclude_guest support

Currently, perf doesn't explicitly schedule out all exclude_guest events
while a guest is running. That is not a problem for the existing
emulated vPMU, because perf owns all the PMU counters: it can mask the
counter which is assigned to an exclude_guest event when a guest is
running (Intel way), or set the corresponding HostOnly bit in the event
selector (AMD way). Either way, the counter doesn't count while the
guest is running.

However, neither way works with the introduced passthrough vPMU. The
guest owns all the PMU counters while it's running, so the host must not
mask any counter: a counter may be in use by the guest, and its event
selector may be overwritten.

Perf should explicitly schedule out all exclude_guest events to release
the PMU resources when entering a guest, and resume counting when
exiting the guest.

Expose two interfaces to KVM. KVM should notify perf when
entering/exiting a guest.

Introduce a new event type, EVENT_GUEST, to indicate that perf should
check and skip the PMUs which don't support the passthrough mode.

It's possible that an exclude_guest event is created while a guest is
running. Such a new event must not be scheduled in either.

The ctx->time, which is shared among PMUs, is used to calculate the
running/enabled time of an event. ctx_sched_in/out() with EVENT_GUEST
doesn't stop ctx->time. A timeguest field is introduced to track the
start time of a guest; for an exclude_guest event, the time spent in
guest mode is deducted.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 include/linux/perf_event.h |   5 ++
 kernel/events/core.c       | 119 +++++++++++++++++++++++++++++++++++--
 2 files changed, 120 insertions(+), 4 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index dd4920bf3d1b..68c8b93c4e5c 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -945,6 +945,7 @@ struct perf_event_context {
 	u64				time;
 	u64				timestamp;
 	u64				timeoffset;
+	u64				timeguest;

 	/*
 	 * These fields let us detect when two contexts have both
@@ -1734,6 +1735,8 @@ extern int perf_event_period(struct perf_event *event, u64 value);
 extern u64 perf_event_pause(struct perf_event *event, bool reset);
 extern int perf_get_mediated_pmu(void);
 extern void perf_put_mediated_pmu(void);
+void perf_guest_enter(void);
+void perf_guest_exit(void);
 #else /* !CONFIG_PERF_EVENTS: */
 static inline void *
 perf_aux_output_begin(struct perf_output_handle *handle,
@@ -1826,6 +1829,8 @@ static inline int perf_get_mediated_pmu(void)
 }

 static inline void perf_put_mediated_pmu(void)			{ }
+static inline void perf_guest_enter(void)			{ }
+static inline void perf_guest_exit(void)			{ }
 #endif

 #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 95d1d5a5addc..cd3a89672b14 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -376,7 +376,8 @@ enum event_type_t {
 	/* see ctx_resched() for details */
 	EVENT_CPU = 0x8,
 	EVENT_CGROUP = 0x10,
-	EVENT_FLAGS = EVENT_CGROUP,
+	EVENT_GUEST = 0x20,
+	EVENT_FLAGS = EVENT_CGROUP | EVENT_GUEST,
 	EVENT_ALL = EVENT_FLEXIBLE | EVENT_PINNED,
 };

@@ -407,6 +408,7 @@ static atomic_t nr_include_guest_events __read_mostly;

 static atomic_t nr_mediated_pmu_vms;
 static DEFINE_MUTEX(perf_mediated_pmu_mutex);
+static DEFINE_PER_CPU(bool, perf_in_guest);

 /* !exclude_guest event of PMU with PERF_PMU_CAP_PASSTHROUGH_VPMU */
 static inline bool is_include_guest_event(struct perf_event *event)
@@ -651,10 +653,26 @@ __perf_update_times(struct perf_event *event, u64 now, u64 *enabled, u64 *running)

 static void perf_event_update_time(struct perf_event *event)
 {
-	u64 now = perf_event_time(event);
+	u64 now;
+
+	/* Never count the time of an active guest into an exclude_guest event. */
+	if (event->ctx->timeguest &&
+	    event->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU) {
+		/*
+		 * If a guest is running, use the timestamp while entering the guest.
+		 * If the guest is leaving, reset the event timestamp.
+		 */
+		if (__this_cpu_read(perf_in_guest))
+			event->tstamp = event->ctx->timeguest;
+		else
+			event->tstamp = event->ctx->time;
+		return;
+	}

+	now = perf_event_time(event);
 	__perf_update_times(event, now, &event->total_time_enabled,
 					&event->total_time_running);
+
 	event->tstamp = now;
 }

@@ -706,6 +724,10 @@ static bool perf_skip_pmu_ctx(struct perf_event_pmu_context *pmu_ctx,
 	if ((event_type & EVENT_CGROUP) && !pmu_ctx->nr_cgroups)
 		return true;

+	if ((event_type & EVENT_GUEST) &&
+	    !(pmu_ctx->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU))
+		return true;
+
 	return false;
 }

@@ -3350,7 +3372,14 @@ ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type)
 			cpuctx->task_ctx = NULL;
 	}

-	is_active ^= ctx->is_active; /* changed bits */
+	if (event_type & EVENT_GUEST) {
+		/*
+		 * Schedule out all exclude_guest events of PMU
+		 * with PERF_PMU_CAP_PASSTHROUGH_VPMU.
+		 */
+		is_active = EVENT_ALL;
+	} else
+		is_active ^= ctx->is_active; /* changed bits */

 	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
 		if (perf_skip_pmu_ctx(pmu_ctx, event_type))
@@ -3860,6 +3889,15 @@ static int merge_sched_in(struct perf_event *event, void *data)
 	if (!event_filter_match(event))
 		return 0;

+	/*
+	 * Don't schedule in any exclude_guest events of PMU with
+	 * PERF_PMU_CAP_PASSTHROUGH_VPMU, while a guest is running.
+	 */
+	if (__this_cpu_read(perf_in_guest) &&
+	    event->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU &&
+	    event->attr.exclude_guest)
+		return 0;
+
 	if (group_can_go_on(event, *can_add_hw)) {
 		if (!group_sched_in(event, ctx))
 			list_add_tail(&event->active_list, get_event_list(event));
@@ -3941,7 +3979,20 @@ ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type)
 			WARN_ON_ONCE(cpuctx->task_ctx != ctx);
 	}

-	is_active ^= ctx->is_active; /* changed bits */
+	if (event_type & EVENT_GUEST) {
+		/*
+		 * Schedule in all exclude_guest events of PMU
+		 * with PERF_PMU_CAP_PASSTHROUGH_VPMU.
+		 */
+		is_active = EVENT_ALL;
+
+		/*
+		 * Update ctx time to set the new start time for
+		 * the exclude_guest events.
+		 */
+		update_context_time(ctx);
+	} else
+		is_active ^= ctx->is_active; /* changed bits */

 	/*
 	 * First go through the list and put on any pinned groups
@@ -5788,6 +5839,66 @@ void perf_put_mediated_pmu(void)
 }
 EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);

+/* When entering a guest, schedule out all exclude_guest events. */
+void perf_guest_enter(void)
+{
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
+
+	lockdep_assert_irqs_disabled();
+
+	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
+
+	if (WARN_ON_ONCE(__this_cpu_read(perf_in_guest))) {
+		perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+		return;
+	}
+
+	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
+	ctx_sched_out(&cpuctx->ctx, EVENT_GUEST);
+	/* Set the guest start time */
+	cpuctx->ctx.timeguest = cpuctx->ctx.time;
+	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
+	if (cpuctx->task_ctx) {
+		perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
+		task_ctx_sched_out(cpuctx->task_ctx, EVENT_GUEST);
+		cpuctx->task_ctx->timeguest = cpuctx->task_ctx->time;
+		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
+	}
+
+	__this_cpu_write(perf_in_guest, true);
+
+	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+}
+
+void perf_guest_exit(void)
+{
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
+
+	lockdep_assert_irqs_disabled();
+
+	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
+
+	if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest))) {
+		perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+		return;
+	}
+
+	__this_cpu_write(perf_in_guest, false);
+
+	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
+	ctx_sched_in(&cpuctx->ctx, EVENT_GUEST);
+	cpuctx->ctx.timeguest = 0;
+	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
+	if (cpuctx->task_ctx) {
+		perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
+		ctx_sched_in(cpuctx->task_ctx, EVENT_GUEST);
+		cpuctx->task_ctx->timeguest = 0;
+		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
+	}
+
+	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+}
+
 /*
  * Holding the top-level event's child_mutex means that any
  * descendant process that has inherited this event will block


Thanks,
Kan



Thread overview: 116+ messages
2024-05-06  5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
2024-05-06  5:29 ` [PATCH v2 01/54] KVM: x86/pmu: Set enable bits for GP counters in PERF_GLOBAL_CTRL at "RESET" Mingwei Zhang
2024-05-06  5:29 ` [PATCH v2 02/54] KVM: x86: Snapshot if a vCPU's vendor model is AMD vs. Intel compatible Mingwei Zhang
2024-05-06  5:29 ` [PATCH v2 03/54] KVM: x86/pmu: Do not mask LVTPC when handling a PMI on AMD platforms Mingwei Zhang
2024-05-06  5:29 ` [PATCH v2 04/54] x86/msr: Define PerfCntrGlobalStatusSet register Mingwei Zhang
2024-05-06  5:29 ` [PATCH v2 05/54] x86/msr: Introduce MSR_CORE_PERF_GLOBAL_STATUS_SET Mingwei Zhang
2024-05-06  5:29 ` [PATCH v2 06/54] perf: Support get/put passthrough PMU interfaces Mingwei Zhang
2024-05-07  8:31   ` Peter Zijlstra
2024-05-08  4:13     ` Zhang, Xiong Y
2024-05-07  8:41   ` Peter Zijlstra
2024-05-08  4:54     ` Zhang, Xiong Y
2024-05-08  8:32       ` Peter Zijlstra
2024-05-06  5:29 ` [PATCH v2 07/54] perf: Add generic exclude_guest support Mingwei Zhang
2024-05-07  8:58   ` Peter Zijlstra
2024-06-10 17:23     ` Liang, Kan
2024-06-11 12:06       ` Peter Zijlstra
2024-06-11 13:27         ` Liang, Kan [this message]
2024-06-12 11:17           ` Peter Zijlstra
2024-06-12 13:38             ` Liang, Kan
2024-06-13  9:15               ` Peter Zijlstra
2024-06-13 13:37                 ` Liang, Kan
2024-06-13 18:04                   ` Liang, Kan
2024-06-17  7:51                     ` Peter Zijlstra
2024-06-17 13:34                       ` Liang, Kan
2024-06-17 15:00                         ` Peter Zijlstra
2024-06-17 15:45                           ` Liang, Kan
2024-05-06  5:29 ` [PATCH v2 08/54] perf/x86/intel: Support PERF_PMU_CAP_PASSTHROUGH_VPMU Mingwei Zhang
2024-05-06  5:29 ` [PATCH v2 09/54] perf: core/x86: Register a new vector for KVM GUEST PMI Mingwei Zhang
2024-05-07  9:12   ` Peter Zijlstra
2024-05-08 10:06     ` Yanfei Xu
2024-05-06  5:29 ` [PATCH v2 10/54] KVM: x86: Extract x86_set_kvm_irq_handler() function Mingwei Zhang
2024-05-07  9:18   ` Peter Zijlstra
2024-05-08  8:57     ` Zhang, Xiong Y
2024-05-06  5:29 ` [PATCH v2 11/54] KVM: x86/pmu: Register guest pmi handler for emulated PMU Mingwei Zhang
2024-05-06  5:29 ` [PATCH v2 12/54] perf: x86: Add x86 function to switch PMI handler Mingwei Zhang
2024-05-07  9:22   ` Peter Zijlstra
2024-05-08  6:58     ` Zhang, Xiong Y
2024-05-08  8:37       ` Peter Zijlstra
2024-05-09  7:30         ` Zhang, Xiong Y
2024-05-07 21:40   ` Chen, Zide
2024-05-08  3:44     ` Mi, Dapeng
2024-05-30  5:12     ` Mingwei Zhang
2024-05-06  5:29 ` [PATCH v2 13/54] perf: core/x86: Forbid PMI handler when guest own PMU Mingwei Zhang
2024-05-07  9:33   ` Peter Zijlstra
2024-05-09  7:39     ` Zhang, Xiong Y
2024-05-06  5:29 ` [PATCH v2 14/54] perf: core/x86: Plumb passthrough PMU capability from x86_pmu to x86_pmu_cap Mingwei Zhang
2024-05-06  5:29 ` [PATCH v2 15/54] KVM: x86/pmu: Introduce enable_passthrough_pmu module parameter Mingwei Zhang
2024-05-06  5:29 ` [PATCH v2 16/54] KVM: x86/pmu: Plumb through pass-through PMU to vcpu for Intel CPUs Mingwei Zhang
2024-05-06  5:29 ` [PATCH v2 17/54] KVM: x86/pmu: Always set global enable bits in passthrough mode Mingwei Zhang
2024-05-08  4:18   ` Mi, Dapeng
2024-05-08  4:36     ` Mingwei Zhang
2024-05-08  6:27       ` Mi, Dapeng
2024-05-08 14:13         ` Sean Christopherson
2024-05-09  0:13           ` Mingwei Zhang
2024-05-09  0:30             ` Mi, Dapeng
2024-05-09  0:38           ` Mi, Dapeng
2024-05-06  5:29 ` [PATCH v2 18/54] KVM: x86/pmu: Add a helper to check if passthrough PMU is enabled Mingwei Zhang
2024-05-06  5:29 ` [PATCH v2 19/54] KVM: x86/pmu: Add host_perf_cap and initialize it in kvm_x86_vendor_init() Mingwei Zhang
2024-05-06  5:29 ` [PATCH v2 20/54] KVM: x86/pmu: Allow RDPMC pass through when all counters exposed to guest Mingwei Zhang
2024-05-08 21:55   ` Chen, Zide
2024-05-30  5:20     ` Mingwei Zhang
2024-05-06  5:29 ` [PATCH v2 21/54] KVM: x86/pmu: Introduce macro PMU_CAP_PERF_METRICS Mingwei Zhang
2024-05-06  5:29 ` [PATCH v2 22/54] KVM: x86/pmu: Introduce PMU operator to check if rdpmc passthrough allowed Mingwei Zhang
2024-05-06  5:29 ` [PATCH v2 23/54] KVM: x86/pmu: Manage MSR interception for IA32_PERF_GLOBAL_CTRL Mingwei Zhang
2024-05-06  5:29 ` [PATCH v2 24/54] KVM: x86/pmu: Create a function prototype to disable MSR interception Mingwei Zhang
2024-05-08 22:03   ` Chen, Zide
2024-05-30  5:24     ` Mingwei Zhang
2024-05-06  5:29 ` [PATCH v2 25/54] KVM: x86/pmu: Add intel_passthrough_pmu_msrs() to pass-through PMU MSRs Mingwei Zhang
2024-05-06  5:29 ` [PATCH v2 26/54] KVM: x86/pmu: Avoid legacy vPMU code when accessing global_ctrl in passthrough vPMU Mingwei Zhang
2024-05-08 21:48   ` Chen, Zide
2024-05-09  0:43     ` Mi, Dapeng
2024-05-09  1:29       ` Chen, Zide
2024-05-09  2:58         ` Mi, Dapeng
2024-05-30  5:28           ` Mingwei Zhang
2024-05-06  5:29 ` [PATCH v2 27/54] KVM: x86/pmu: Exclude PMU MSRs in vmx_get_passthrough_msr_slot() Mingwei Zhang
2024-05-14  7:33   ` Mi, Dapeng
2024-05-06  5:29 ` [PATCH v2 28/54] KVM: x86/pmu: Add counter MSR and selector MSR index into struct kvm_pmc Mingwei Zhang
2024-05-06  5:29 ` [PATCH v2 29/54] KVM: x86/pmu: Introduce PMU operation prototypes for save/restore PMU context Mingwei Zhang
2024-05-06  5:29 ` [PATCH v2 30/54] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU Mingwei Zhang
2024-05-14  8:08   ` Mi, Dapeng
2024-05-30  5:34     ` Mingwei Zhang
2024-05-06  5:29 ` [PATCH v2 31/54] KVM: x86/pmu: Make check_pmu_event_filter() an exported function Mingwei Zhang
2024-05-06  5:29 ` [PATCH v2 32/54] KVM: x86/pmu: Allow writing to event selector for GP counters if event is allowed Mingwei Zhang
2024-05-06  5:29 ` [PATCH v2 33/54] KVM: x86/pmu: Allow writing to fixed counter selector if counter is exposed Mingwei Zhang
2024-05-06  5:29 ` [PATCH v2 34/54] KVM: x86/pmu: Switch IA32_PERF_GLOBAL_CTRL at VM boundary Mingwei Zhang
2024-05-06  5:30 ` [PATCH v2 35/54] KVM: x86/pmu: Exclude existing vLBR logic from the passthrough PMU Mingwei Zhang
2024-05-06  5:30 ` [PATCH v2 36/54] KVM: x86/pmu: Switch PMI handler at KVM context switch boundary Mingwei Zhang
2024-07-10  8:37   ` Sandipan Das
2024-07-10 10:01     ` Zhang, Xiong Y
2024-07-10 12:30       ` Sandipan Das
2024-05-06  5:30 ` [PATCH v2 37/54] KVM: x86/pmu: Grab x86 core PMU for passthrough PMU VM Mingwei Zhang
2024-05-06  5:30 ` [PATCH v2 38/54] KVM: x86/pmu: Call perf_guest_enter() at PMU context switch Mingwei Zhang
2024-05-07  9:39   ` Peter Zijlstra
2024-05-08  4:22     ` Mi, Dapeng
2024-05-30  4:34     ` Mingwei Zhang
2024-05-06  5:30 ` [PATCH v2 39/54] KVM: x86/pmu: Add support for PMU context switch at VM-exit/enter Mingwei Zhang
2024-05-06  5:30 ` [PATCH v2 40/54] KVM: x86/pmu: Introduce PMU operator to increment counter Mingwei Zhang
2024-05-06  5:30 ` [PATCH v2 41/54] KVM: x86/pmu: Introduce PMU operator for setting counter overflow Mingwei Zhang
2024-05-06  5:30 ` [PATCH v2 42/54] KVM: x86/pmu: Implement emulated counter increment for passthrough PMU Mingwei Zhang
2024-05-08 18:28   ` Chen, Zide
2024-05-09  1:11     ` Mi, Dapeng
2024-05-30  4:20       ` Mingwei Zhang
2024-05-06  5:30 ` [PATCH v2 43/54] KVM: x86/pmu: Update pmc_{read,write}_counter() to disconnect perf API Mingwei Zhang
2024-05-06  5:30 ` [PATCH v2 44/54] KVM: x86/pmu: Disconnect counter reprogram logic from passthrough PMU Mingwei Zhang
2024-05-06  5:30 ` [PATCH v2 45/54] KVM: nVMX: Add nested virtualization support for " Mingwei Zhang
2024-05-06  5:30 ` [PATCH v2 46/54] perf/x86/amd/core: Set passthrough capability for host Mingwei Zhang
2024-05-06  5:30 ` [PATCH v2 47/54] KVM: x86/pmu/svm: Set passthrough capability for vcpus Mingwei Zhang
2024-05-06  5:30 ` [PATCH v2 48/54] KVM: x86/pmu/svm: Set enable_passthrough_pmu module parameter Mingwei Zhang
2024-05-06  5:30 ` [PATCH v2 49/54] KVM: x86/pmu/svm: Allow RDPMC pass through when all counters exposed to guest Mingwei Zhang
2024-05-06  5:30 ` [PATCH v2 50/54] KVM: x86/pmu/svm: Implement callback to disable MSR interception Mingwei Zhang
2024-05-06  5:30 ` [PATCH v2 51/54] KVM: x86/pmu/svm: Set GuestOnly bit and clear HostOnly bit when guest write to event selectors Mingwei Zhang
2024-05-06  5:30 ` [PATCH v2 52/54] KVM: x86/pmu/svm: Add registers to direct access list Mingwei Zhang
2024-05-06  5:30 ` [PATCH v2 53/54] KVM: x86/pmu/svm: Implement handlers to save and restore context Mingwei Zhang
2024-05-06  5:30 ` [PATCH v2 54/54] KVM: x86/pmu/svm: Wire up PMU filtering functionality for passthrough PMU Mingwei Zhang
2024-05-28  2:35 ` [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Ma, Yongwei
2024-05-30  4:28   ` Mingwei Zhang
