Date: Wed, 2 Aug 2023 13:57:29 +0800
Subject: Re: [PATCH v2] Documentation: KVM: Add vPMU implementation and gap document
From: "Mi, Dapeng"
To: Xiong Zhang , kvm@vger.kernel.org
Cc: weijiang.yang@intel.com, seanjc@google.com, like.xu.linux@gmail.com,
    zhiyuan.lv@intel.com, zhenyu.z.wang@intel.com, kan.liang@intel.com
References: <20230801035836.1048879-1-xiong.y.zhang@intel.com>
In-Reply-To: <20230801035836.1048879-1-xiong.y.zhang@intel.com>
X-Mailing-List: kvm@vger.kernel.org

> Add a vPMU implementation and gap document to explain the vArch PMU and
> vLBR implementation in KVM, especially the current gaps in supporting
> host and guest perf event coexistence.
>
> Signed-off-by: Xiong Zhang
> ---
> Changelog:
> v1 -> v2:
>   * Refactor perf scheduler section
>   * Correct one sentence in vArch PMU section
> ---
>  Documentation/virt/kvm/x86/index.rst |   1 +
>  Documentation/virt/kvm/x86/pmu.rst   | 325 +++++++++++++++++++++++++++
>  2 files changed, 326 insertions(+)
>  create mode 100644 Documentation/virt/kvm/x86/pmu.rst
>
> diff --git a/Documentation/virt/kvm/x86/index.rst b/Documentation/virt/kvm/x86/index.rst
> index 9ece6b8dc817..02c1c7b01bf3 100644
> --- a/Documentation/virt/kvm/x86/index.rst
> +++ b/Documentation/virt/kvm/x86/index.rst
> @@ -14,5 +14,6 @@ KVM for x86 systems
>     mmu
>     msr
>     nested-vmx
> +   pmu
>     running-nested-guests
>     timekeeping
> diff --git a/Documentation/virt/kvm/x86/pmu.rst b/Documentation/virt/kvm/x86/pmu.rst
> new file mode 100644
> index 000000000000..1b83390bacbf
> --- /dev/null
> +++ b/Documentation/virt/kvm/x86/pmu.rst
> @@ -0,0 +1,325 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +==========================
> +PMU virtualization for X86
> +==========================
> +
> +:Author: Xiong Zhang
> +:Copyright: (c) 2023, Intel. All rights reserved.
> +
> +.. Contents
> +
> +1. Overview
> +2. Perf Scheduler Basics
> +3. Arch PMU virtualization
> +4. LBR virtualization
> +
> +1. Overview
> +===========
> +
> +KVM has supported PMU virtualization on x86 for many years and provides
> +an MSR-based Arch PMU interface to the guest. The major features include
> +Arch PMU v2, LBR and PEBS. Users can profile performance in the guest
> +with the same operations as on the host.
> +
> +KVM is a normal perf subsystem user, just like other perf subsystem
> +users. When the guest accesses vPMU MSRs, KVM traps the access and
> +creates a perf event for it. This perf event takes part in perf
> +scheduling to request PMU resources, and lets the guest use these
> +resources.
> +
> +This document describes the x86 PMU virtualization architecture design
> +and its open issues. It is organized as follows: the next section
> +describes the Linux perf scheduler in more detail, as it plays a key
> +role in the vPMU implementation and allocates PMU resources for guest
> +use. Then Arch PMU virtualization and LBR virtualization are
> +introduced; for each feature there are sections covering the
> +implementation overview, and the expectations and gaps when host and
> +guest perf events coexist.
> +
> +2. Perf Scheduler Basics
> +========================
> +
> +Perf subsystem users cannot get a PMU counter or resource directly. A
> +user first creates a perf event and specifies the event's attributes,
> +which are used to choose PMU counters; the perf event then takes part
> +in perf scheduling, and the perf scheduler assigns the corresponding
> +PMU counter to the perf event.
> +
> +A perf event is created by the perf_event_open() system call:
> +
> + int syscall(SYS_perf_event_open, struct perf_event_attr *attr,
> +             pid_t pid, int cpu, int group_fd, unsigned long flags)
> +
> + struct perf_event_attr {
> + ......
> + /* Major type: hardware/software/tracepoint/etc. */
> + __u32 type;
> + /* Type specific configuration information. */
> + __u64 config;
> + union {
> +         __u64 sample_period;
> +         __u64 sample_freq;
> + };
> + __u64 disabled       :1;
> +       pinned         :1;
> +       exclude_user   :1;
> +       exclude_kernel :1;
> +       exclude_host   :1;
> +       exclude_guest  :1;
> + ......
> + }
> +
> +The pid and cpu arguments allow specifying which process and CPU to
> +monitor:
> +
> +pid == 0 and cpu == -1
> +        This measures the calling process/thread on any CPU.
> +pid == 0 and cpu >= 0
> +        This measures the calling process/thread only when running on
> +        the specified CPU.
> +pid > 0 and cpu == -1
> +        This measures the specified process/thread on any CPU.
> +pid > 0 and cpu >= 0
> +        This measures the specified process/thread only when running
> +        on the specified CPU.
> +pid == -1 and cpu >= 0
> +        This measures all processes/threads on the specified CPU.
> +pid == -1 and cpu == -1
> +        This setting is invalid and will return an error.
> +
> +The perf scheduler's responsibility is to choose which events are
> +active at any moment and to bind counters to perf events. As the
> +processor has a limited number of PMU counters and other resources,
> +only a limited number of perf events can be active at one moment; an
> +inactive perf event may become active at the next moment. The perf
> +scheduler has defined rules to control this.
> +
> +The perf scheduler defines four types of perf event, determined by the
> +pid and cpu arguments of perf_event_open() plus perf_event_attr.pinned.
> +Their scheduling priority is: per-cpu pinned > per-process pinned >
> +per-cpu flexible > per-process flexible. High priority events can
> +preempt low priority events when resources contend.
> +
> +--------------------------------------------------------
> +|                      | pid  | cpu  | pinned |
> +--------------------------------------------------------
> +| Per-cpu pinned       | *    | >= 0 | 1      |
> +--------------------------------------------------------
> +| Per-process pinned   | >= 0 | *    | 1      |
> +--------------------------------------------------------
> +| Per-cpu flexible     | *    | >= 0 | 0      |
> +--------------------------------------------------------
> +| Per-process flexible | >= 0 | *    | 0      |
> +--------------------------------------------------------
> +
> + struct perf_event {
> +         struct list_head        event_entry;
> +         ......
> +         struct pmu              *pmu;
> +         enum perf_event_state   state;
> +         local64_t               count;
> +         u64                     total_time_enabled;
> +         u64                     total_time_running;
> +         struct perf_event_attr  attr;
> +         ......
> + }
> +
> +A per-cpu perf event is linked into the per-cpu global variable
> +perf_cpu_context; a per-process perf event is linked into
> +task_struct->perf_event_context.
> +
> +Usually the following cases cause a perf event reschedule:
> +
> +1) A context switch from one task to a different task.
> +2) An event is manually enabled.
> +3) A call to perf_event_open() with the disabled field of the
> +   perf_event_attr argument set to 0.
> +
> +When perf_event_open() or perf_event_enable() is called and a perf
> +event reschedule is needed on a specific CPU, perf sends an IPI to the
> +target CPU. The IPI handler activates events ordered by event type,
> +iterating over all eligible events in the per-cpu global variable
> +perf_cpu_context and in current->perf_event_context.
> +
> +When a perf event is scheduled out, the counter mapped to this event
> +is disabled, and the counter's settings and count value are saved.
> +When a perf event is scheduled in, the perf driver assigns a counter
> +to the event, and the counter's settings and count value are restored
> +from the last save.
> +
> +If an event cannot be scheduled because no resource is available for
> +it, a pinned event goes into the error state and is excluded from perf
> +scheduling; the only way to recover it is to re-enable it. A flexible
> +event goes into the inactive state and can be multiplexed with other
> +events if needed.
> +
> +
> +3. Arch PMU virtualization
> +==========================
> +
> +3.1. Overview
> +-------------
> +
> +Once KVM/QEMU exposes the vcpu's Arch PMU capability to the guest, the
> +guest PMU driver accesses the Arch PMU MSRs (including the fixed and
> +GP counters) as the host does. All guest Arch PMU MSR accesses are
> +intercepted.
> +
> +When a guest virtual counter is enabled through a guest MSR write, the
> +KVM trap creates a kvm perf event through the perf subsystem. The kvm
> +perf event's attributes are derived from the guest virtual counter's
> +MSR setting.
> +
> +When the guest changes the virtual counter's setting later, the KVM
> +trap releases the old kvm perf event and creates a new kvm perf event
> +with the new setting.
> +
> +When the guest reads the virtual counter's count, the KVM trap reads
> +the kvm perf event's counter value and accumulates it into the
> +previous counter value.
> +
> +When the guest no longer accesses the virtual counter's MSRs within a
> +scheduling time slice and the virtual counter is disabled, KVM
> +releases the kvm perf event.
> +
> +        ----------------------------
> +        |          Guest           |
> +        |      perf subsystem      |
> +        ----------------------------
> +              |           ^
> +         vMSR |           | vPMI
> +              v           |
> +        ----------------------------
> +        |   vPMU    KVM    vCPU    |
> +        ----------------------------
> +              |           ^
> +         Call |           | Callbacks
> +              v           |
> +        ---------------------------
> +        |    Host Linux Kernel    |
> +        |     perf subsystem      |
> +        ---------------------------
> +              |           ^
> +          MSR |           | PMI
> +              v           |
> +          --------------------
> +          |    PMU    CPU    |
> +          --------------------
> +
> +Each guest virtual counter has a corresponding kvm perf event, and the
> +kvm perf event joins the host perf scheduler and complies with the
> +host perf scheduler's rules. When the kvm perf event is scheduled by
> +the host perf scheduler and is active, the guest virtual counter
> +supplies the correct value. However, if another host perf event comes
> +in and takes over the kvm perf event's resource, the kvm perf event
> +becomes inactive, and the virtual counter keeps the value saved when
> +the kvm perf event was preempted. The guest perf doesn't notice that
> +the underlying virtual counter has stopped, so the final guest
> +profiling data is wrong.
> +
> +3.2. Host and Guest perf event contention
> +-----------------------------------------
> +
> +The kvm perf event is a per-process pinned event; its priority is
> +second-highest. When the kvm perf event is active, it can be preempted
> +by a host per-cpu pinned perf event, or it can preempt host flexible
> +perf events. Such preemption can be temporarily prohibited by
> +disabling host IRQs.
> +
> +The following results are expected when host and guest perf events
> +coexist, according to the perf scheduler rules:
> +
> +1) If host per-cpu pinned events occupy all the HW resources, the kvm
> +   perf event cannot become active as no resource is available, and
> +   the virtual counter value is always zero when the guest reads it.
> +2) If a host per-cpu pinned event releases its HW resource while the
> +   kvm perf event is inactive, the kvm perf event can claim the HW
> +   resource and become active. The guest then gets correct values from
> +   the guest virtual counter while the kvm perf event is active, but
> +   the guest's total counter value is not correct, since counts were
> +   lost while the kvm perf event was inactive.
> +3) If the kvm perf event is active and a host per-cpu pinned perf
> +   event then becomes active and reclaims the kvm perf event's
> +   resource, the kvm perf event becomes inactive. The virtual counter
> +   keeps the previously saved value when the guest reads it, so the
> +   guest's total counter value isn't correct.
> +4) If host flexible perf events occupy all the HW resources, the kvm
> +   perf event can become active by preempting a host flexible perf
> +   event's resource, and the guest gets correct values from the guest
> +   virtual counter.
> +5) If the kvm perf event is active and other host flexible perf events
> +   request to become active, the kvm perf event still owns the
> +   resource and stays active, so the guest gets correct values from
> +   the guest virtual counter.
> +
> +3.3. vPMU Arch Gaps
> +-------------------
> +
> +The coexistence of host and guest perf events has gaps:
> +
> +1) When the guest accesses PMU MSRs for the first time, KVM traps the
> +   access and creates a kvm perf event, but this event may be inactive
> +   because of contention with host perf events. The guest doesn't
> +   notice this, and when the guest reads the virtual counter, the
> +   returned value is zero.

Inactive? It seems the event should enter the error state, based on the
previous description?

> +2) When the kvm perf event is active, a host per-cpu pinned perf event
> +   can reclaim the kvm perf event's resource at any time once resource
> +   contention happens. The guest doesn't notice this either, and the
> +   guest's subsequent counter accesses get wrong data.
> +
> +So the mailing list had some discussion titled "Reconsider the current
> +approach of vPMU".
> +
> +https://lore.kernel.org/lkml/810c3148-1791-de57-27c0-d1ac5ed35fb8@gmail.com/
> +
> +The major suggestion in this discussion is that the host pass through
> +some counters to the guest, but this suggestion is not feasible, for
> +these reasons:
> +
> +a. The processor has several counters, but the counters are not equal;
> +   some events must be bound to a specific counter.
> +b. If a special counter is passed through to the guest, the host
> +   cannot support such events and loses some capability.
> +c. If a normal counter is passed through to the guest, the guest can
> +   support general events only, and the guest has limited capability.
> +
> +So both host and guest lose capability in pass-through mode.
> +
> +4. LBR Virtualization
> +=====================
> +
> +4.1. Overview
> +-------------
> +
> +The guest LBR driver accesses the LBR MSRs (including IA32_DEBUGCTLMSR
> +and the record MSRs) as the host does, once KVM/QEMU exposes the
> +vcpu's LBR capability to the guest. The first guest access to LBR
> +related MSRs is always intercepted. The KVM trap creates a vLBR perf
> +event which enables the callstack mode, and none of the hardware
> +counters are assigned to it. The host perf enables and schedules this
> +event as usual.
> +
> +When the vLBR event is scheduled by the host perf scheduler and is
> +active, the host LBR MSRs are owned by the guest and passed through to
> +the guest, so the guest accesses them without a VM exit. However, if
> +another host LBR event comes in and takes over the LBR facility, the
> +vLBR event becomes inactive, and the guest's subsequent accesses to
> +the LBR MSRs are trapped and meaningless.
> +
> +As with the kvm perf event, the vLBR event is released when the guest
> +doesn't access LBR-related MSRs within a scheduling time slice and the
> +guest has cleared the LBR enable bit; the pass-through state of the
> +LBR MSRs is then canceled.
> +
> +4.2. Host and Guest LBR contention
> +----------------------------------
> +
> +The vLBR event is a per-process pinned event; its priority is
> +second-highest. The vLBR event contends for the LBR resource together
> +with other host LBR events according to the perf scheduler rules: when
> +the vLBR event is active, it can be preempted by a host per-cpu pinned
> +LBR event, or it can preempt a host flexible LBR event. Such
> +preemption can be temporarily prohibited by disabling host IRQs, as
> +the perf scheduler uses an IPI to change the LBR owner.
> +
> +The following results are expected when host and guest LBR events
> +coexist:
> +
> +1) If a host per-cpu pinned LBR event is active when the VM starts,
> +   the guest vLBR event cannot preempt the LBR resource, so the guest
> +   cannot use LBR.
> +2) If host flexible LBR events are active when the VM starts, the
> +   guest vLBR event can preempt the LBR, so the guest can use LBR.
> +3) If a host per-cpu pinned LBR event becomes enabled while the guest
> +   vLBR event is active, the guest vLBR event loses the LBR and the
> +   guest cannot use LBR anymore.
> +4) If a host flexible LBR event becomes enabled while the guest vLBR
> +   event is active, the guest vLBR event keeps the LBR, and the guest
> +   can still use LBR.
> +5) If a host per-cpu pinned LBR event becomes inactive while the guest
> +   vLBR event is inactive, the guest vLBR event can become active and
> +   own the LBR, so the guest can use LBR.
> +
> +4.3. vLBR Arch Gaps
> +-------------------
> +
> +Like the vPMU Arch gaps, the vLBR event can be preempted by a host
> +per-cpu pinned event at any time, or the vLBR event may be inactive at
> +creation, but the guest cannot notice this, so the guest will get
> +meaningless values while the vLBR event is inactive.
>
> base-commit: 88bb466c9dec4f70d682cf38c685324e7b1b3d60