From: "Mi, Dapeng" <dapeng1.mi@linux.intel.com>
To: Xiong Zhang <xiong.y.zhang@intel.com>, kvm@vger.kernel.org
Cc: weijiang.yang@intel.com, seanjc@google.com,
like.xu.linux@gmail.com, zhiyuan.lv@intel.com,
zhenyu.z.wang@intel.com, kan.liang@intel.com
Subject: Re: [PATCH v2] Documentation: KVM: Add vPMU implementation and gap document
Date: Wed, 2 Aug 2023 13:57:29 +0800 [thread overview]
Message-ID: <e0e478bd-282b-68fa-7c94-8efbcc450750@linux.intel.com> (raw)
In-Reply-To: <20230801035836.1048879-1-xiong.y.zhang@intel.com>
> Add a vPMU implementation and gap document to explain the vArch PMU and
> vLBR implementation in KVM, especially the current gaps in supporting
> coexisting host and guest perf events.
>
> Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
> ---
> Changelog:
> v1 -> v2:
> * Refactor perf scheduler section
> * Correct one sentence in vArch PMU section
> ---
> Documentation/virt/kvm/x86/index.rst | 1 +
> Documentation/virt/kvm/x86/pmu.rst | 325 +++++++++++++++++++++++++++
> 2 files changed, 326 insertions(+)
> create mode 100644 Documentation/virt/kvm/x86/pmu.rst
>
> diff --git a/Documentation/virt/kvm/x86/index.rst b/Documentation/virt/kvm/x86/index.rst
> index 9ece6b8dc817..02c1c7b01bf3 100644
> --- a/Documentation/virt/kvm/x86/index.rst
> +++ b/Documentation/virt/kvm/x86/index.rst
> @@ -14,5 +14,6 @@ KVM for x86 systems
> mmu
> msr
> nested-vmx
> + pmu
> running-nested-guests
> timekeeping
> diff --git a/Documentation/virt/kvm/x86/pmu.rst b/Documentation/virt/kvm/x86/pmu.rst
> new file mode 100644
> index 000000000000..1b83390bacbf
> --- /dev/null
> +++ b/Documentation/virt/kvm/x86/pmu.rst
> @@ -0,0 +1,325 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +==========================
> +PMU virtualization for X86
> +==========================
> +
> +:Author: Xiong Zhang <xiong.y.zhang@intel.com>
> +:Copyright: (c) 2023, Intel. All rights reserved.
> +
> +.. Contents
> +
> +1. Overview
> +2. Perf Scheduler Basic
> +3. Arch PMU virtualization
> +4. LBR virtualization
> +
> +1. Overview
> +===========
> +
> +KVM has supported PMU virtualization on x86 for many years and provides
> +an MSR-based Arch PMU interface to the guest. The major features include
> +Arch PMU v2, LBR and PEBS. Users can profile performance in the guest
> +the same way as on the host.
> +KVM is a normal perf subsystem user like any other. When the guest
> +accesses vPMU MSRs, KVM traps the access and creates a perf event for
> +it. This perf event takes part in perf scheduling to request PMU
> +resources and lets the guest use those resources.
> +
> +This document describes the x86 PMU virtualization architecture design
> +and its open issues. It is organized as follows: the next section
> +describes the Linux perf scheduler in more detail, as it plays a key
> +role in the vPMU implementation and allocates PMU resources for guest
> +usage. Then Arch PMU virtualization and LBR virtualization are
> +introduced; for each feature, sections cover the implementation
> +overview, and the expectations and gaps when host and guest perf
> +events coexist.
> +
> +2. Perf Scheduler Basic
> +=======================
> +
> +Perf subsystem users cannot get a PMU counter or other PMU resource
> +directly. A user first creates a perf event and specifies the event's
> +attributes, which are used to choose PMU counters; the perf event then
> +takes part in perf scheduling, and the perf scheduler assigns the
> +corresponding PMU counter to the perf event.
> +
> +A perf event is created by the perf_event_open() system call::
> +
> +    int syscall(SYS_perf_event_open, struct perf_event_attr *attr,
> +                pid_t pid, int cpu, int group_fd, unsigned long flags);
> +
> +    struct perf_event_attr {
> +        ......
> +        /* Major type: hardware/software/tracepoint/etc. */
> +        __u32 type;
> +        /* Type specific configuration information. */
> +        __u64 config;
> +        union {
> +            __u64 sample_period;
> +            __u64 sample_freq;
> +        };
> +        __u64 disabled       : 1,
> +              pinned         : 1,
> +              exclude_user   : 1,
> +              exclude_kernel : 1,
> +              exclude_host   : 1,
> +              exclude_guest  : 1;
> +        ......
> +    };
> +
> +The pid and cpu arguments allow specifying which process and CPU
> +to monitor:
> +pid == 0 and cpu == -1
> + This measures the calling process/thread on any CPU.
> +pid == 0 and cpu >= 0
> + This measures the calling process/thread only when running on
> + the specified CPU.
> +pid > 0 and cpu == -1
> + This measures the specified process/thread on any CPU.
> +pid > 0 and cpu >= 0
> + This measures the specified process/thread only when running
> + on the specified CPU.
> +pid == -1 and cpu >= 0
> + This measures all processes/threads on the specified CPU.
> +pid == -1 and cpu == -1
> + This setting is invalid and will return an error.
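The pid/cpu combinations above can be exercised from user space. A minimal sketch for the pid == 0, cpu == -1 case, counting instructions retired by the calling thread (illustrative only; count_instructions() and spin() are invented names, and the call returns -1 where perf_event_paranoid forbids unprivileged counting):

```c
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>

/* Count instructions retired by the calling thread while fn() runs.
 * pid == 0, cpu == -1: measure the calling process/thread on any CPU.
 * Returns -1 if perf_event_open() is unavailable. */
static long long count_instructions(void (*fn)(void))
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;           /* major type */
    attr.config = PERF_COUNT_HW_INSTRUCTIONS; /* type-specific config */
    attr.disabled = 1;                        /* enable explicitly below */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0)
        return -1;  /* e.g. restricted by perf_event_paranoid */

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    fn();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long count = 0;
    if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count))
        count = -1;
    close(fd);
    return count;
}

static void spin(void)
{
    for (volatile int i = 0; i < 100000; i++)
        ;
}
```

The other pid/cpu combinations only change the second and third syscall arguments.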
> +
> +The perf scheduler's responsibility is to choose which events are
> +active at any moment and to bind counters to perf events. As the
> +processor has a limited number of PMU counters and other resources,
> +only a limited number of perf events can be active at any moment; an
> +inactive perf event may become active in the next moment. The perf
> +scheduler defines rules to control this.
> +
> +The perf scheduler defines four types of perf event, determined by the
> +pid and cpu arguments of perf_event_open() plus perf_event_attr.pinned.
> +Their scheduling priorities are: per-cpu pinned > per-process pinned
> +> per-cpu flexible > per-process flexible. Higher-priority events can
> +preempt lower-priority events when resources are contended.
> +
> ++----------------------+------+------+--------+
> +|                      | pid  | cpu  | pinned |
> ++======================+======+======+========+
> +| Per-cpu pinned       | *    | >= 0 | 1      |
> ++----------------------+------+------+--------+
> +| Per-process pinned   | >= 0 | *    | 1      |
> ++----------------------+------+------+--------+
> +| Per-cpu flexible     | *    | >= 0 | 0      |
> ++----------------------+------+------+--------+
> +| Per-process flexible | >= 0 | *    | 0      |
> ++----------------------+------+------+--------+
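The priority ordering in the table can be encoded as a small comparison helper; a sketch under the same rules (the struct and function names here are invented for illustration):

```c
#include <stdbool.h>

/* Illustrative classification: priority derived from (per-cpu?, pinned?).
 * Order: per-cpu pinned > per-process pinned > per-cpu flexible >
 * per-process flexible. A higher number is scheduled first. */
struct sched_key {
    bool per_cpu; /* cpu >= 0 in perf_event_open() */
    bool pinned;  /* perf_event_attr.pinned */
};

static int sched_priority(struct sched_key k)
{
    if (k.pinned)
        return k.per_cpu ? 3 : 2;
    return k.per_cpu ? 1 : 0;
}

/* True if event a may preempt event b when resources are contended. */
static bool may_preempt(struct sched_key a, struct sched_key b)
{
    return sched_priority(a) > sched_priority(b);
}
```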
> +
> + struct perf_event {
> + struct list_head event_entry;
> + ......
> + struct pmu *pmu;
> + enum perf_event_state state;
> + local64_t count;
> + u64 total_time_enabled;
> + u64 total_time_running;
> + struct perf_event_attr attr;
> + ......
> + }
> +
> +A per-cpu perf event is linked into the per-cpu global variable
> +perf_cpu_context; a per-process perf event is linked into
> +task_struct->perf_event_context.
> +
> +Usually the following cases cause a perf event reschedule:
> +
> +1) A context switch from one task to a different task.
> +2) An event is manually enabled.
> +3) perf_event_open() is called with the disabled field of the
> + perf_event_attr argument set to 0.
> +
> +When perf_event_open() or perf_event_enable() is called and a perf
> +event reschedule is needed on a specific CPU, perf sends an IPI to the
> +target CPU. The IPI handler activates events ordered by event type,
> +iterating over all the eligible events in the per-cpu global variable
> +perf_cpu_context and in current->perf_event_context.
> +
> +When a perf event is scheduled out, the counter mapped to this event is
> +disabled, and the counter's settings and count value are saved. When a
> +perf event is scheduled in, the perf driver assigns a counter to this
> +event, and the counter's settings and count value are restored from the
> +last save.
> +
> +If an event cannot be scheduled because no resource is available for
> +it, the outcome depends on its type: a pinned event goes into the error
> +state and is excluded from perf scheduling, and the only way to recover
> +it is to re-enable it; a flexible event goes into the inactive state
> +and can be multiplexed with other events if needed.
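From user space, the flexible-event multiplexing case is visible through the PERF_FORMAT_TOTAL_TIME_ENABLED and PERF_FORMAT_TOTAL_TIME_RUNNING read-format fields; a sketch of the usual scaling computation (the struct and function names are illustrative, not perf API):

```c
#include <stdint.h>

/* read() layout when PERF_FORMAT_TOTAL_TIME_ENABLED and
 * PERF_FORMAT_TOTAL_TIME_RUNNING are set in attr.read_format. */
struct read_format {
    uint64_t value;        /* raw count */
    uint64_t time_enabled; /* ns the event was enabled */
    uint64_t time_running; /* ns it actually owned a counter */
};

/* Scale the raw value to compensate for time the flexible event was
 * multiplexed out. A pinned event that entered the error state instead
 * makes read() return 0 bytes; the caller must treat that as "event
 * dead, re-enable to recover". */
static double scaled_count(const struct read_format *rf)
{
    if (rf->time_running == 0)
        return 0.0; /* never got a counter */
    return (double)rf->value *
           (double)rf->time_enabled / (double)rf->time_running;
}
```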
> +
> +
> +3. Arch PMU virtualization
> +==========================
> +
> +3.1. Overview
> +-------------
> +
> +Once KVM/QEMU exposes the vCPU's Arch PMU capability to the guest, the
> +guest PMU driver accesses the Arch PMU MSRs (including the fixed and GP
> +counters) as the host does. All guest Arch PMU MSR accesses are
> +intercepted.
> +
> +When a guest virtual counter is enabled through a guest MSR write, the
> +KVM trap handler creates a kvm perf event through the perf subsystem.
> +The kvm perf event's attributes are derived from the guest virtual
> +counter's MSR settings.
> +
> +When the guest later changes the virtual counter's settings, the KVM
> +trap handler releases the old kvm perf event and creates a new kvm perf
> +event with the new settings.
> +
> +When the guest reads the virtual counter's count, the KVM trap handler
> +reads the kvm perf event's counter value and accumulates it into the
> +previously saved counter value.
> +
> +When the guest no longer accesses the virtual counter's MSRs within a
> +scheduling time slice and the virtual counter is disabled, KVM releases
> +the kvm perf event.
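The create/accumulate/release flow above can be modeled with a toy structure (hypothetical names, not KVM's actual data structures):

```c
#include <stdint.h>
#include <stdbool.h>

/* Toy model of a guest virtual counter backed by a kvm perf event. */
struct vcounter {
    uint64_t prev_count; /* counts accumulated by earlier perf events */
    uint64_t hw_count;   /* count of the currently backing perf event */
    bool     active;     /* backing kvm perf event is active */
};

/* Guest reconfigures the counter: KVM releases the old perf event and
 * folds its final count into prev_count before creating a new event. */
static void vcounter_reprogram(struct vcounter *vc)
{
    vc->prev_count += vc->hw_count;
    vc->hw_count = 0;
}

/* Guest read of the counter MSR: previously saved counts plus whatever
 * the backing perf event has counted so far (the hw_count part stays
 * frozen while the backing event is inactive). */
static uint64_t vcounter_read(const struct vcounter *vc)
{
    return vc->prev_count + vc->hw_count;
}
```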
> +
> + ----------------------------
> + | Guest |
> + | perf subsystem |
> + ----------------------------
> + | ^
> + vMSR | | vPMI
> + v |
> + ----------------------------
> + | vPMU KVM vCPU |
> + ----------------------------
> + | ^
> + Call | | Callbacks
> + v |
> + ---------------------------
> + | Host Linux Kernel |
> + | perf subsystem |
> + ---------------------------
> + | ^
> + MSR | | PMI
> + v |
> + --------------------
> + | PMU CPU |
> + --------------------
> +
> +Each guest virtual counter has a corresponding kvm perf event, and the
> +kvm perf event takes part in host perf scheduling and complies with the
> +host perf scheduling rules. While the kvm perf event is scheduled by
> +the host perf scheduler and is active, the guest virtual counter
> +supplies correct values. However, if another host perf event comes in
> +and takes over the kvm perf event's resource, the kvm perf event
> +becomes inactive, and the virtual counter keeps the value saved when
> +the kvm perf event was preempted. Guest perf doesn't notice that the
> +underlying virtual counter has stopped, so the final guest profiling
> +data is wrong.
> +
> +3.2. Host and Guest perf event contention
> +-----------------------------------------
> +
> +The kvm perf event is a per-process pinned event, so its priority is
> +second highest. When the kvm perf event is active, it can be preempted
> +by a host per-cpu pinned perf event, and it can preempt host flexible
> +perf events. Such preemption can be temporarily prohibited by disabling
> +host IRQs.
> +
> +The following results are expected when host and guest perf events
> +coexist, according to the perf scheduling rules:
> +
> +1) If host per-cpu pinned events occupy all the HW resources, the kvm
> + perf event cannot become active since no resource is available, and
> + the virtual counter value is always zero when the guest reads it.
> +2) If a host per-cpu pinned event releases a HW resource while the kvm
> + perf event is inactive, the kvm perf event can claim the HW resource
> + and become active. The guest then gets correct deltas from the guest
> + virtual counter while the kvm perf event is active, but the guest
> + total counter value is not correct, since counts were lost while the
> + kvm perf event was inactive.
> +3) If the kvm perf event is active and a host per-cpu pinned perf event
> + then becomes active and reclaims the kvm perf event's resource, the
> + kvm perf event becomes inactive. The virtual counter then stays
> + unchanged at the previously saved value when the guest reads it, so
> + the guest total counter is not correct.
> +4) If host flexible perf events occupy all the HW resources, the kvm
> + perf event can still become active by preempting a host flexible
> + perf event's resource, and the guest gets correct values from the
> + guest virtual counter.
> +5) If the kvm perf event is active and other host flexible perf events
> + then request to become active, the kvm perf event still owns the
> + resource and stays active, so the guest gets correct values from the
> + guest virtual counter.
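These outcomes follow mechanically from the priority ordering in the perf scheduler section; a self-contained sketch of the deciding comparison (the enum encoding is mine, not perf's):

```c
#include <stdbool.h>

/* Priority encoding from the perf scheduler section:
 * per-cpu pinned > per-process pinned > per-cpu flexible >
 * per-process flexible. */
enum ev_type {
    PER_PROCESS_FLEXIBLE = 0,
    PER_CPU_FLEXIBLE     = 1,
    PER_PROCESS_PINNED   = 2,  /* the kvm perf event */
    PER_CPU_PINNED       = 3,
};

/* With one contended HW counter, the kvm perf event (a per-process
 * pinned event) holds or wins the counter exactly when its priority is
 * higher than the competing host event's. */
static bool kvm_event_wins(enum ev_type host_event)
{
    return PER_PROCESS_PINNED > host_event;
}
```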
> +
> +3.3. vPMU Arch Gaps
> +-------------------
> +
> +The coexistence of host and guest perf events has gaps:
> +1) When the guest accesses PMU MSRs for the first time, KVM traps the
> + access and creates a kvm perf event, but this event may be inactive
inactive? It seems the event should enter the error state based on the
previous description?
> + because of contention with host perf events. The guest doesn't
> + notice this, and when the guest reads the virtual counter, the
> + returned value is zero.
> +2) When the kvm perf event is active, a host per-cpu pinned perf event
> + can reclaim the kvm perf event's resource at any time once resource
> + contention happens. The guest doesn't notice this either, and the
> + guest's subsequent counter accesses get wrong data.
> +
> +So the mailing list had a discussion titled "Reconsider the current
> +approach of vPMU":
> +
> +https://lore.kernel.org/lkml/810c3148-1791-de57-27c0-d1ac5ed35fb8@gmail.com/
> +
> +The major suggestion in this discussion was for the host to pass some
> +counters through to the guest, but this suggestion is not feasible for
> +the following reasons:
> +
> +a. The processor has several counters, but the counters are not equal;
> + some events must be bound to a specific counter.
> +b. If a special counter is passed through to the guest, the host cannot
> + support such events and loses some capability.
> +c. If a normal counter is passed through to the guest, the guest can
> + support general events only, and the guest has limited capability.
> +
> +So both host and guest lose capability in pass-through mode.
> +
> +4. LBR Virtualization
> +=====================
> +
> +4.1. Overview
> +-------------
> +
> +Once KVM/QEMU exposes the vCPU's LBR capability to the guest, the guest
> +LBR driver accesses the LBR MSRs (including IA32_DEBUGCTLMSR and the
> +record MSRs) as the host does. The first guest access to an LBR-related
> +MSR is always intercepted. The KVM trap handler creates a vLBR perf
> +event which enables the callstack mode and has no hardware counter
> +assigned. Host perf enables and schedules this event as usual.
> +
> +While the vLBR event is scheduled by the host perf scheduler and is
> +active, the host LBR MSRs are owned by the guest and passed through to
> +the guest, and the guest accesses them without VM-exits. However, if
> +another host LBR event comes in and takes over the LBR facility, the
> +vLBR event becomes inactive, and the guest's subsequent accesses to the
> +LBR MSRs are trapped and meaningless.
> +
> +Like the kvm perf event, the vLBR event is released when the guest has
> +not accessed LBR-related MSRs within a scheduling time slice and the
> +guest has cleared the LBR enable bit; the pass-through state of the LBR
> +MSRs is then canceled.
> +
> +4.2. Host and Guest LBR contention
> +----------------------------------
> +
> +The vLBR event is a per-process pinned event, so its priority is second
> +highest. The vLBR event contends for the LBR resource with other host
> +LBR events according to the perf scheduling rules: when the vLBR event
> +is active, it can be preempted by a host per-cpu pinned LBR event, and
> +it can preempt host flexible LBR events. Such preemption can be
> +temporarily prohibited by disabling host IRQs, as the perf scheduler
> +uses an IPI to change the LBR owner.
> +
> +The following results are expected when host and guest LBR events
> +coexist:
> +
> +1) If a host per-cpu pinned LBR event is active when the VM starts, the
> + guest vLBR event cannot preempt the LBR resource, so the guest
> + cannot use the LBR.
> +2) If host flexible LBR events are active when the VM starts, the guest
> + vLBR event can preempt the LBR, so the guest can use the LBR.
> +3) If a host per-cpu pinned LBR event becomes enabled while the guest
> + vLBR event is active, the guest vLBR event loses the LBR and the
> + guest cannot use the LBR anymore.
> +4) If a host flexible LBR event becomes enabled while the guest vLBR
> + event is active, the guest vLBR event keeps the LBR, and the guest
> + can still use the LBR.
> +5) If a host per-cpu pinned LBR event becomes inactive while the guest
> + vLBR event is inactive, the guest vLBR event can become active and
> + own the LBR, and the guest can use the LBR.
> +
> +4.3. vLBR Arch Gaps
> +-------------------
> +
> +As with the vPMU Arch gaps, the vLBR event can be preempted by a host
> +per-cpu pinned event at any time, or the vLBR event may be inactive
> +from creation, but the guest cannot notice this, so the guest will get
> +meaningless values while the vLBR event is inactive.
>
> base-commit: 88bb466c9dec4f70d682cf38c685324e7b1b3d60