Re: [RFC] perf_events: exclude_guest impact on time_enabled/time_running

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: "Liang, Kan" <kan.liang@linux.intel.com>
To: Stephane Eranian <eranian@google.com>,
	Peter Zijlstra <peterz@infradead.org>
Cc: LKML <linux-kernel@vger.kernel.org>,
	Ian Rogers <irogers@google.com>,
	"Liang, Kan" <kan.liang@intel.com>,
	Andi Kleen <ak@linux.intel.com>, Ingo Molnar <mingo@redhat.com>,
	"Narayan, Ananth" <ananth.narayan@amd.com>,
	"Bangoria, Ravikumar" <ravi.bangoria@amd.com>,
	Namhyung Kim <namhyung@google.com>,
	Mingwei Zhang <mizhang@google.com>,
	Dapeng Mi <dapeng1.mi@linux.intel.com>,
	Zhang Xiong <xiong.y.zhang@intel.com>
Subject: Re: [RFC] perf_events: exclude_guest impact on time_enabled/time_running
Date: Thu, 6 Jun 2024 11:48:33 -0400	[thread overview]
Message-ID: <4622060b-b758-4629-9aa4-cc8334111be0@linux.intel.com> (raw)
In-Reply-To: <CABPqkBQ3LQ_dXQSQVSrriinvSSXm2fHx4yOms=jRsa2WaXSsog@mail.gmail.com>



On 2024-06-06 3:57 a.m., Stephane Eranian wrote:
> Hi Peter,
> 
> In the context of the new vPMU passthru patch series, we have to look
> closer at the definition and implementation of the exclude_guest
> filter in the perf_event_attr structure. This filter has been in the
> kernel for many years. See patch:
> https://lore.kernel.org/all/20240506053020.3911940-8-mizhang@google.com/
> 
> The presumed  definition of the filter is that the user does not want
> the event to count while the processor is running in guest mode (i.e.,
> inside the virtual machine guest OS or guest user code).
> 
> The perf tool sets is by default on all core PMU events:
> $ perf stat -vv -e cycles sleep 0
> ------------------------------------------------------------
> perf_event_attr:
>   size                             112
>   sample_type                      IDENTIFIER
>   read_format                      TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
>   disabled                         1
>   inherit                          1
>   enable_on_exec                   1
>   exclude_guest                    1
> ------------------------------------------------------------
> 
> In the kernel, the way this is treated differs between AMD and Intel because AMD
> does provide a hardware filter for guest vs. host in the PMU counters
> whereas Intel
> does not. For the latter, the  kernel simply disables the event in the
> hardware counters,
> i.e., the event is not descheduled.  Both approaches produce pretty
> much the same
> desired effect, the event is not counted while in guest mode.
> 
> The issue I would like to raise has to do with the effects on
> time_enabled and time_running
> for exclude_guest=1 events.
> 
> Given the event is not scheduled out while in guest mode, even though
> it is stopped, both time_enabled and time_running continue ticking
> while in guest mode.  If a measurement is 10s
> long but only 5s are in non-guest mode, then time_enabled=10s,
> time_running=10s. The count
> represents 10s worth of non guest mode, of which only 5s were really
> actively monitoring, but
> the user has no way of determining this.


For the latest design/implementation, only the exclude_guest=1 host
event can be successfully created for the case.
The end user should not expect that anything is collected in the guest
mode. So both the time_enabled and the time_running will be 5s.

> 
> If we look at vPMU passthru, the host event must have exclude_guest=1
> to avoid going into
> an error state on context switch to the vCPU thread (with vPMU
> enabled). 

That's the design in V1. There is no error state anymore in V2 and later
as suggested by Sean.
https://lore.kernel.org/all/ZhmIrQQVgblrhCZs@google.com/

The VM cannot be created if there are host events with exclude_guest=0.
If a VM has been created, user cannot create an event with
exclude_guest=0. So nothing will be moved to an error state on context
switch to the vCPU thread.

> But this time,
> the event is scheduled out, that means that time_enabled keeps
> counting, but time_running
> stops. On context switch back in, the host event is scheduled again
> and time_running restarts
> ticking. For a 10s measurement, where 5s here in the guest, the event
> will come out with
> time_enabled=10s, time_running=5s, and the tool will scale it up
> because it thinks the event
> was multiplexed, when in fact it was not. This is not the intended
> outcome here. The tool should
> not scale the count, it was not multiplexed, it was descheduled
> because the filter forced it out.
> Note that if the event had been multiplexed while running on the host,
> then the scaling would be
> appropriate.

The scaling will not happen, since both time_enabled and time_running
should be the same when there are enough counter resources.

> 
> In that case, I argue, time_running should be updated to cover the
> time the event was not running. That would bring us back to the case I
> was describing earlier.
> 
> It boils down to the exact definition of exclude_guest and expected
> impact on time_enabled
> and time_running. Then, with or without vPMU passthru, we can fix the
> kernel to ensure a
> uniform behavior.

I think the time_enabled should be the one that has a controversial
definition.
Should the time in the guest mode count as the enabled time for an host
event that explicitly sets the exclude_guest=1?

I think the answer is NO. So I implemented it in the code.

> 
> What are your thoughts on this problem?
> 

Peter, please share your thoughts. We want to make sure the design is on
the right track.

Thanks,
Kan

next prev parent reply	other threads:[~2024-06-06 15:48 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-06-06  7:57 [RFC] perf_events: exclude_guest impact on time_enabled/time_running Stephane Eranian
2024-06-06 15:48 ` Liang, Kan [this message]
2024-06-06 17:08   ` Stephane Eranian
2024-06-06 17:33     ` Liang, Kan
2024-06-10 17:39       ` Liang, Kan
2024-06-10 20:54 ` Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4622060b-b758-4629-9aa4-cc8334111be0@linux.intel.com \
    --to=kan.liang@linux.intel.com \
    --cc=ak@linux.intel.com \
    --cc=ananth.narayan@amd.com \
    --cc=dapeng1.mi@linux.intel.com \
    --cc=eranian@google.com \
    --cc=irogers@google.com \
    --cc=kan.liang@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=mizhang@google.com \
    --cc=namhyung@google.com \
    --cc=peterz@infradead.org \
    --cc=ravi.bangoria@amd.com \
    --cc=xiong.y.zhang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox