From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E1DF11DB366; Wed, 21 May 2025 19:47:01 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1747856822; cv=none; b=gsNzh7n0wmM/bxEK/pmUuwPiQ3XeqPqI0/yQG9ct6e8QfSbpISDqcGbgr3SGLNRUj3tYcq/74G/6AVPr2yj2ciBrcVERAMdkDlEM5z76Lg0GusjfSjnWgpvbrNI+CNhso1Mj2VX4/b1Xl7yMBqURiwfpMxqOCDW2BUPp4Vx+a/U= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1747856822; c=relaxed/simple; bh=KxpWvHTalFPpk1hq7k7Wbo/HtUjiTpk6l+sdAkRIWhU=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=lj1O85uRVOvg8zJ/hMq14A60M0UzhAatHnRhxy8XS5sc3HIGeFO7Fa5AG5e2TbwaJ2+1N3VinPRswr41Rh3/ko9GCvaSKZcu5kEwNBNgt/xVnXnqUC5ZQdt+FQZRJbrhZkJSScVTm9bVWxAT+AvS7mQfKugAPtI0Jn/I2gEgqGo= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=JpZgLqQh; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="JpZgLqQh" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 8E4B9C4CEE4; Wed, 21 May 2025 19:47:00 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1747856821; bh=KxpWvHTalFPpk1hq7k7Wbo/HtUjiTpk6l+sdAkRIWhU=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=JpZgLqQhf9Gd3fgj4H32L1DV7Ehmo+s9ISIGTeT8abVqeACe68Tb8B+Fh3OkLwG4Y OAuU3Y/kqJVEJAfMR3uSEMSLYo+CHg0Io2c7j3ZJE3ag+N8+UbJ988ttLsO64/vHXk P0VVh8DZzPild9O2L49j3wlFNKsUw4BO34/yZ65euEjZYv9zHDl4Rh9TLlm42W6oPb XmZz0/gPTQgjJqOBc+ZtgwnR/fiJHAc5/KMyKb03Y92fWZVUdBKbHzZdG9Txj3q0dL sW9E/ub3hLENJI6RwV7NJ3/e/MkTmkCWEAmIibj+zFEgHb+6u85FLslk6gSUB0SVJb wc0jjc5wpUiug== Date: Wed, 21 May 2025 12:46:59 -0700 From: Namhyung Kim To: Mingwei Zhang Cc: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Sean Christopherson , Paolo Bonzini , Mark Rutland , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter , Liang@google.com, Kan , "H. Peter Anvin" , linux-perf-users@vger.kernel.org, linux-kernel@vger.kernel.org, kvm@vger.kernel.org, linux-kselftest@vger.kernel.org, Yongwei Ma , Xiong Zhang , Dapeng Mi , Jim Mattson , Sandipan Das , Zide Chen , Eranian Stephane , Shukla Manali , Nikunj Dadhania Subject: Re: [PATCH v4 04/38] perf: Add a EVENT_GUEST flag Message-ID: References: <20250324173121.1275209-1-mizhang@google.com> <20250324173121.1275209-5-mizhang@google.com> Precedence: bulk X-Mailing-List: linux-kselftest@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <20250324173121.1275209-5-mizhang@google.com> On Mon, Mar 24, 2025 at 05:30:44PM +0000, Mingwei Zhang wrote: > From: Kan Liang > > Current perf doesn't explicitly schedule out all exclude_guest events > while the guest is running. There is no problem with the current > emulated vPMU. Because perf owns all the PMU counters. It can mask the > counter which is assigned to an exclude_guest event when a guest is > running (Intel way), or set the corresponding HOSTONLY bit in evsentsel > (AMD way). The counter doesn't count when a guest is running. > > However, either way doesn't work with the introduced passthrough vPMU. > A guest owns all the PMU counters when it's running. The host should not > mask any counters. The counter may be used by the guest. The evsentsel > may be overwritten. > > Perf should explicitly schedule out all exclude_guest events to release > the PMU resources when entering a guest, and resume the counting when > exiting the guest. > > It's possible that an exclude_guest event is created when a guest is > running. The new event should not be scheduled in as well. > > The ctx time is shared among different PMUs. The time cannot be stopped > when a guest is running. It is required to calculate the time for events > from other PMUs, e.g., uncore events. Add timeguest to track the guest > run time. For an exclude_guest event, the elapsed time equals > the ctx time - guest time. > Cgroup has dedicated times. Use the same method to deduct the guest time > from the cgroup time as well. > > Co-developed-by: Peter Zijlstra (Intel) > Signed-off-by: Peter Zijlstra (Intel) > Signed-off-by: Kan Liang > Signed-off-by: Mingwei Zhang > --- > include/linux/perf_event.h | 6 ++ > kernel/events/core.c | 209 +++++++++++++++++++++++++++++-------- > 2 files changed, 169 insertions(+), 46 deletions(-) > > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h > index a2fd1bdc955c..7bda1e20be12 100644 > --- a/include/linux/perf_event.h > +++ b/include/linux/perf_event.h > @@ -999,6 +999,11 @@ struct perf_event_context { > */ > struct perf_time_ctx time; > > + /* > + * Context clock, runs when in the guest mode. > + */ > + struct perf_time_ctx timeguest; > + > /* > * These fields let us detect when two contexts have both > * been cloned (inherited) from a common ancestor. > @@ -1089,6 +1094,7 @@ struct bpf_perf_event_data_kern { > */ > struct perf_cgroup_info { > struct perf_time_ctx time; > + struct perf_time_ctx timeguest; > int active; > }; > > diff --git a/kernel/events/core.c b/kernel/events/core.c > index e38c8b5e8086..7a2115b2c5c1 100644 > --- a/kernel/events/core.c > +++ b/kernel/events/core.c > @@ -163,7 +163,8 @@ enum event_type_t { > /* see ctx_resched() for details */ > EVENT_CPU = 0x10, > EVENT_CGROUP = 0x20, > - EVENT_FLAGS = EVENT_CGROUP, > + EVENT_GUEST = 0x40, > + EVENT_FLAGS = EVENT_CGROUP | EVENT_GUEST, > /* compound helpers */ > EVENT_ALL = EVENT_FLEXIBLE | EVENT_PINNED, > EVENT_TIME_FROZEN = EVENT_TIME | EVENT_FROZEN, > @@ -435,6 +436,7 @@ static atomic_t nr_include_guest_events __read_mostly; > > static atomic_t nr_mediated_pmu_vms; > static DEFINE_MUTEX(perf_mediated_pmu_mutex); > +static DEFINE_PER_CPU(bool, perf_in_guest); > > /* !exclude_guest event of PMU with PERF_PMU_CAP_MEDIATED_VPMU */ > static inline bool is_include_guest_event(struct perf_event *event) > @@ -738,6 +740,9 @@ static bool perf_skip_pmu_ctx(struct perf_event_pmu_context *pmu_ctx, > { > if ((event_type & EVENT_CGROUP) && !pmu_ctx->nr_cgroups) > return true; > + if ((event_type & EVENT_GUEST) && > + !(pmu_ctx->pmu->capabilities & PERF_PMU_CAP_MEDIATED_VPMU)) > + return true; > return false; > } > > @@ -788,6 +793,39 @@ static inline void update_perf_time_ctx(struct perf_time_ctx *time, u64 now, boo > WRITE_ONCE(time->offset, time->time - time->stamp); > } > > +static_assert(offsetof(struct perf_event_context, timeguest) - > + offsetof(struct perf_event_context, time) == > + sizeof(struct perf_time_ctx)); > + > +#define T_TOTAL 0 > +#define T_GUEST 1 > + > +static inline u64 __perf_event_time_ctx(struct perf_event *event, > + struct perf_time_ctx *times) > +{ > + u64 time = times[T_TOTAL].time; > + > + if (event->attr.exclude_guest) > + time -= times[T_GUEST].time; > + > + return time; > +} > + > +static inline u64 __perf_event_time_ctx_now(struct perf_event *event, > + struct perf_time_ctx *times, > + u64 now) > +{ > + if (event->attr.exclude_guest && __this_cpu_read(perf_in_guest)) { > + /* > + * (now + times[total].offset) - (now + times[guest].offset) := > + * times[total].offset - times[guest].offset > + */ > + return READ_ONCE(times[T_TOTAL].offset) - READ_ONCE(times[T_GUEST].offset); So this will remove both time_enabled and time_running. I think it's fine as the events are not multiplexed, but some curious users may wonder why the time_enabled is less than expected. :) > + } > + > + return now + READ_ONCE(times[T_TOTAL].offset); > +} > + [SNIP] > @@ -6285,23 +6400,25 @@ void perf_event_update_userpage(struct perf_event *event) > if (!rb) > goto unlock; > > - /* > - * compute total_time_enabled, total_time_running > - * based on snapshot values taken when the event > - * was last scheduled in. > - * > - * we cannot simply called update_context_time() > - * because of locking issue as we can be called in > - * NMI context > - */ > - calc_timer_values(event, &now, &enabled, &running); > - > - userpg = rb->user_page; > /* > * Disable preemption to guarantee consistent time stamps are stored to > * the user page. > */ > preempt_disable(); > + > + /* > + * compute total_time_enabled, total_time_running > + * based on snapshot values taken when the event > + * was last scheduled in. > + * > + * we cannot simply called update_context_time() s/called/call. I know you just moved the code though. :) Thanks, Namhyung > + * because of locking issue as we can be called in > + * NMI context > + */ > + calc_timer_values(event, &now, &enabled, &running); > + > + userpg = rb->user_page; > + > ++userpg->lock; > barrier(); > userpg->index = perf_event_index(event); > -- > 2.49.0.395.g12beb8f557-goog >