From: Gleb Natapov
Subject: Re: [PATCH 4/4] kvm: Implement PEBS virtualization
Date: Fri, 30 May 2014 11:21:37 +0300
Message-ID: <20140530082136.GA4715@minantech.com>
References: <1401412327-14810-1-git-send-email-andi@firstfloor.org>
 <1401412327-14810-5-git-send-email-andi@firstfloor.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=cp1255
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: peterz@infradead.org, pbonzini@redhat.com, eranian@google.com,
 kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Andi Kleen
To: Andi Kleen
Content-Disposition: inline
In-Reply-To: <1401412327-14810-5-git-send-email-andi@firstfloor.org>
Sender: linux-kernel-owner@vger.kernel.org
List-Id: kvm.vger.kernel.org

On Thu, May 29, 2014 at 06:12:07PM -0700, Andi Kleen wrote:
> From: Andi Kleen
> 
> PEBS (Precise Event Based Sampling) profiling is very powerful,
> allowing improved sampling precision and much additional information,
> like address or TSX abort profiling. cycles:p and :pp use PEBS.
> 
> This patch enables PEBS profiling in KVM guests.

That sounds really cool!

> 
> PEBS writes profiling records to a virtual address in memory. Since
> the guest controls the virtual address space, the PEBS record is
> directly delivered to the guest buffer. We set up the PEBS state so
> that it works correctly. The CPU cannot handle any kind of fault
> during these guest writes.
> 
> To avoid any problems with guest pages being swapped by the host we
> pin the pages when the PEBS buffer is set up, by intercepting
> that MSR.

It will prevent guest pages from being swapped out, but the shadow
paging code may still drop shadow PT pages that build the mapping from
the DS virtual address to the guest page. With EPT this is less likely
to happen (but still possible, IIRC, depending on memory pressure and
on how much memory the shadow paging code is allowed to use); without
EPT it will happen for sure.

> 
> Typically profilers only set up a single page, so pinning that is not
> a big problem. The pinning is currently limited to 17 pages (64K
> buffer + 1).
> 
> In theory the guest can change its own page tables after the PEBS
> setup. The host has no way to track that with EPT. But if a guest
> would do that it could only crash itself. It's not expected
> that normal profilers do that.

The spec says:

  The following restrictions should be applied to the DS save area.
  • The three DS save area sections should be allocated from a
    non-paged pool, and marked accessed and dirty. It is the
    responsibility of the operating system to keep the pages that
    contain the buffer present and to mark them accessed and dirty.
    The implication is that the operating system cannot do "lazy"
    page-table entry propagation for these pages.

There is nothing, as far as I can see, that says what will happen if
the condition is not met. I always interpreted it as undefined
behaviour, so anything can happen, including the CPU dying completely.
You are saying above, on one hand, that the CPU cannot handle any kind
of fault during a write to the DS area, but on the other hand that a
guest could only crash itself. Is this architecturally guaranteed?

> 
> The patch also adds the basic glue to enable the PEBS CPUIDs
> and other PEBS MSRs, and asks perf to enable PEBS as needed.
> 
> Due to various limitations it currently only works on Silvermont
> based systems.
> 
> This patch doesn't implement the extended MSRs some CPUs support.
> For example latency profiling on SLM will not work at this point.
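As an aside for readers following the interception flow described
above: the pinning is driven entirely by the guest's normal PEBS
programming sequence, which from the guest kernel's point of view is
just two MSR writes. A minimal sketch of the guest side (hypothetical
code, assuming a guest-allocated `ds' buffer and PEBS enabled on
counter 0 only):

    #include <asm/msr.h>

    /* Both writes are intercepted by KVM: the first one records
     * pmu->ds_area, the second one triggers the page pinning. */
    static void guest_enable_pebs(struct debug_store *ds)
    {
            wrmsrl(MSR_IA32_DS_AREA, (unsigned long)ds);
            wrmsrl(MSR_IA32_PEBS_ENABLE, 1);        /* counter 0 */
    }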
> 
> Timing:
> 
> The emulation is somewhat more expensive than a real PMU. This
> may trigger the expensive PMI detection in the guest.
> Usually this can be disabled with
> 
> echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent
> 
> Migration:
> 
> In theory it should be possible (as long as we migrate to
> a host with the same PEBS event and the same PEBS format), but I'm not
> sure the basic KVM PMU code supports it correctly: no code to
> save/restore state, unless I'm missing something. Once the PMU
> code grows proper migration support it should be straightforward
> to handle the PEBS state too.
> 
> Signed-off-by: Andi Kleen
> ---
>  arch/x86/include/asm/kvm_host.h       |   6 ++
>  arch/x86/include/uapi/asm/msr-index.h |   4 +
>  arch/x86/kvm/cpuid.c                  |  10 +-
>  arch/x86/kvm/pmu.c                    | 184 ++++++++++++++++++++++++++++++++--
>  arch/x86/kvm/vmx.c                    |   6 ++
>  5 files changed, 196 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 7de069af..d87cb66 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -319,6 +319,8 @@ struct kvm_pmc {
>  	struct kvm_vcpu *vcpu;
>  };
>  
> +#define MAX_PINNED_PAGES 17 /* 64k buffer + ds */
> +
>  struct kvm_pmu {
>  	unsigned nr_arch_gp_counters;
>  	unsigned nr_arch_fixed_counters;
> @@ -335,6 +337,10 @@ struct kvm_pmu {
>  	struct kvm_pmc fixed_counters[INTEL_PMC_MAX_FIXED];
>  	struct irq_work irq_work;
>  	u64 reprogram_pmi;
> +	u64 pebs_enable;
> +	u64 ds_area;
> +	struct page *pinned_pages[MAX_PINNED_PAGES];
> +	unsigned num_pinned_pages;
>  };
>  
>  enum {
> diff --git a/arch/x86/include/uapi/asm/msr-index.h b/arch/x86/include/uapi/asm/msr-index.h
> index fcf2b3a..409a582 100644
> --- a/arch/x86/include/uapi/asm/msr-index.h
> +++ b/arch/x86/include/uapi/asm/msr-index.h
> @@ -72,6 +72,10 @@
>  #define MSR_IA32_PEBS_ENABLE		0x000003f1
>  #define MSR_IA32_DS_AREA		0x00000600
>  #define MSR_IA32_PERF_CAPABILITIES	0x00000345
> +#define PERF_CAP_PEBS_TRAP		(1U << 6)
> +#define PERF_CAP_ARCH_REG		(1U << 7)
> +#define PERF_CAP_PEBS_FORMAT		(0xf << 8)
> +
>  #define MSR_PEBS_LD_LAT_THRESHOLD	0x000003f6
>  
>  #define MSR_MTRRfix64K_00000		0x00000250
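A note for readers on the capability bits added above:
PERF_CAP_PEBS_FORMAT is a four-bit field (bits 11:8 of
IA32_PERF_CAPABILITIES) holding the PEBS record format revision, not a
flag. A sketch of how a consumer would decode it, assuming only what
the masks encode:

    /* Sketch: read the PEBS record format from IA32_PERF_CAPABILITIES. */
    static unsigned int pebs_record_format(void)
    {
            u64 cap;

            rdmsrl(MSR_IA32_PERF_CAPABILITIES, cap);
            return (cap & PERF_CAP_PEBS_FORMAT) >> 8;   /* bits 11:8 */
    }

This is the field that kvm_pmu_get_msr() below passes through (masked)
from host_perf_cap, which is why migration requires the same PEBS
format on both hosts.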
> 
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index f47a104..c8cc76b 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -260,6 +260,10 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>  	unsigned f_rdtscp = kvm_x86_ops->rdtscp_supported() ? F(RDTSCP) : 0;
>  	unsigned f_invpcid = kvm_x86_ops->invpcid_supported() ? F(INVPCID) : 0;
>  	unsigned f_mpx = kvm_x86_ops->mpx_supported() ? F(MPX) : 0;
> +	bool pebs = perf_pebs_virtualization();
> +	unsigned f_ds = pebs ? F(DS) : 0;
> +	unsigned f_pdcm = pebs ? F(PDCM) : 0;
> +	unsigned f_dtes64 = pebs ? F(DTES64) : 0;
>  
>  	/* cpuid 1.edx */
>  	const u32 kvm_supported_word0_x86_features =
> @@ -268,7 +272,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>  		F(CX8) | F(APIC) | 0 /* Reserved */ | F(SEP) |
>  		F(MTRR) | F(PGE) | F(MCA) | F(CMOV) |
>  		F(PAT) | F(PSE36) | 0 /* PSN */ | F(CLFLUSH) |
> -		0 /* Reserved, DS, ACPI */ | F(MMX) |
> +		f_ds /* Reserved, ACPI */ | F(MMX) |
>  		F(FXSR) | F(XMM) | F(XMM2) | F(SELFSNOOP) |
>  		0 /* HTT, TM, Reserved, PBE */;
>  	/* cpuid 0x80000001.edx */
> @@ -283,10 +287,10 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>  		0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
>  	/* cpuid 1.ecx */
>  	const u32 kvm_supported_word4_x86_features =
> -		F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
> +		F(XMM3) | F(PCLMULQDQ) | f_dtes64 /* MONITOR */ |
>  		0 /* DS-CPL, VMX, SMX, EST */ |
>  		0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
> -		F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
> +		F(FMA) | F(CX16) | f_pdcm /* xTPR Update */ |
>  		F(PCID) | 0 /* Reserved, DCA */ | F(XMM4_1) |
>  		F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) |
>  		0 /* Reserved*/ | F(AES) | F(XSAVE) | 0 /* OSXSAVE */ | F(AVX) |
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 4c6f417..6362db7 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -15,9 +15,11 @@
>  #include <linux/types.h>
>  #include <linux/kvm_host.h>
>  #include <linux/perf_event.h>
> +#include <linux/highmem.h>
>  #include "x86.h"
>  #include "cpuid.h"
>  #include "lapic.h"
> +#include "mmu.h"
>  
>  static struct kvm_arch_event_perf_mapping {
>  	u8 eventsel;
> @@ -36,9 +38,23 @@ static struct kvm_arch_event_perf_mapping {
>  	[7] = { 0x00, 0x30, PERF_COUNT_HW_REF_CPU_CYCLES },
>  };
>  
> +struct debug_store {
> +	u64 bts_buffer_base;
> +	u64 bts_index;
> +	u64 bts_absolute_maximum;
> +	u64 bts_interrupt_threshold;
> +	u64 pebs_buffer_base;
> +	u64 pebs_index;
> +	u64 pebs_absolute_maximum;
> +	u64 pebs_interrupt_threshold;
> +	u64 pebs_event_reset[4];
> +};
> +
>  /* mapping between fixed pmc index and arch_events array */
>  int fixed_pmc_events[] = {1, 0, 7};
>  
> +static u64 host_perf_cap __read_mostly;
> +
>  static bool pmc_is_gp(struct kvm_pmc *pmc)
>  {
>  	return pmc->type == KVM_PMC_GP;
> @@ -108,7 +124,10 @@ static void kvm_perf_overflow(struct perf_event *perf_event,
>  {
>  	struct kvm_pmc *pmc = perf_event->overflow_handler_context;
>  	struct kvm_pmu *pmu = &pmc->vcpu->arch.pmu;
> -	__set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
> +	if (perf_event->attr.precise_ip)
> +		__set_bit(62, (unsigned long *)&pmu->global_status);
> +	else
> +		__set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
>  }
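(For readers: the magic 62 here is the OvfBuf bit of
IA32_PERF_GLOBAL_STATUS, i.e. the DS-buffer-overflow indication, as
opposed to a per-counter overflow bit, so a PEBS-induced PMI is
reported to the guest the same way real hardware reports a filled PEBS
buffer. Spelled out as a named constant, where the name below is
illustrative rather than an existing kernel define:

    /* IA32_PERF_GLOBAL_STATUS bit 62: DS area buffer overflow ("OvfBuf") */
    #define GLOBAL_STATUS_BUFFER_OVF    (1ULL << 62)
)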
>  
>  static void kvm_perf_overflow_intr(struct perf_event *perf_event,
> @@ -160,7 +179,7 @@ static void stop_counter(struct kvm_pmc *pmc)
>  
>  static void reprogram_counter(struct kvm_pmc *pmc, u32 type,
>  		unsigned config, bool exclude_user, bool exclude_kernel,
> -		bool intr, bool in_tx, bool in_tx_cp)
> +		bool intr, bool in_tx, bool in_tx_cp, bool pebs)
>  {
>  	struct perf_event *event;
>  	struct perf_event_attr attr = {
> @@ -177,18 +196,20 @@ static void reprogram_counter(struct kvm_pmc *pmc, u32 type,
>  		attr.config |= HSW_IN_TX;
>  	if (in_tx_cp)
>  		attr.config |= HSW_IN_TX_CHECKPOINTED;
> +	if (pebs)
> +		attr.precise_ip = 1;
>  
>  	attr.sample_period = (-pmc->counter) & pmc_bitmask(pmc);
>  
> -	event = perf_event_create_kernel_counter(&attr, -1, current,
> -						 intr ? kvm_perf_overflow_intr :
> -						 kvm_perf_overflow, pmc);
> +	event = __perf_event_create_kernel_counter(&attr, -1, current,
> +						   (intr || pebs) ?
> +						   kvm_perf_overflow_intr :
> +						   kvm_perf_overflow, pmc, true);
>  	if (IS_ERR(event)) {
>  		printk_once("kvm: pmu event creation failed %ld\n",
>  			    PTR_ERR(event));
>  		return;
>  	}
> -	event->guest_owned = true;
>  
>  	pmc->perf_event = event;
>  	clear_bit(pmc->idx, (unsigned long*)&pmc->vcpu->arch.pmu.reprogram_pmi);
> @@ -211,7 +232,8 @@ static unsigned find_arch_event(struct kvm_pmu *pmu, u8 event_select,
>  	return arch_events[i].event_type;
>  }
>  
> -static void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel)
> +static void reprogram_gp_counter(struct kvm_pmu *pmu, struct kvm_pmc *pmc,
> +				 u64 eventsel)
>  {
>  	unsigned config, type = PERF_TYPE_RAW;
>  	u8 event_select, unit_mask;
> @@ -248,7 +270,8 @@ static void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel)
>  			      !(eventsel & ARCH_PERFMON_EVENTSEL_OS),
>  			      eventsel & ARCH_PERFMON_EVENTSEL_INT,
>  			      (eventsel & HSW_IN_TX),
> -			      (eventsel & HSW_IN_TX_CHECKPOINTED));
> +			      (eventsel & HSW_IN_TX_CHECKPOINTED),
> +			      test_bit(pmc->idx, (unsigned long *)&pmu->pebs_enable));
>  }
>  
>  static void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 en_pmi, int idx)
> @@ -265,7 +288,7 @@ static void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 en_pmi, int idx)
>  		      arch_events[fixed_pmc_events[idx]].event_type,
>  		      !(en & 0x2), /* exclude user */
>  		      !(en & 0x1), /* exclude kernel */
> -		      pmi, false, false);
> +		      pmi, false, false, false);
>  }
>  
>  static inline u8 fixed_en_pmi(u64 ctrl, int idx)
> @@ -298,7 +321,7 @@ static void reprogram_idx(struct kvm_pmu *pmu, int idx)
>  		return;
>  
>  	if (pmc_is_gp(pmc))
> -		reprogram_gp_counter(pmc, pmc->eventsel);
> +		reprogram_gp_counter(pmu, pmc, pmc->eventsel);
>  	else {
>  		int fidx = idx - INTEL_PMC_IDX_FIXED;
>  		reprogram_fixed_counter(pmc,
> @@ -323,6 +346,12 @@ bool kvm_pmu_msr(struct kvm_vcpu *vcpu, u32 msr)
>  	int ret;
>  
>  	switch (msr) {
> +	case MSR_IA32_DS_AREA:
> +	case MSR_IA32_PEBS_ENABLE:
> +	case MSR_IA32_PERF_CAPABILITIES:
> +		ret = perf_pebs_virtualization() ? 1 : 0;
> +		break;
> +
>  	case MSR_CORE_PERF_FIXED_CTR_CTRL:
>  	case MSR_CORE_PERF_GLOBAL_STATUS:
>  	case MSR_CORE_PERF_GLOBAL_CTRL:
> @@ -356,6 +385,18 @@ int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data)
>  	case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
>  		*data = pmu->global_ovf_ctrl;
>  		return 0;
> +	case MSR_IA32_DS_AREA:
> +		*data = pmu->ds_area;
> +		return 0;
> +	case MSR_IA32_PEBS_ENABLE:
> +		*data = pmu->pebs_enable;
> +		return 0;
> +	case MSR_IA32_PERF_CAPABILITIES:
> +		/* Report host PEBS format to guest */
> +		*data = host_perf_cap &
> +			(PERF_CAP_PEBS_TRAP | PERF_CAP_ARCH_REG |
> +			 PERF_CAP_PEBS_FORMAT);
> +		return 0;
>  	default:
>  		if ((pmc = get_gp_pmc(pmu, index, MSR_IA32_PERFCTR0)) ||
>  		    (pmc = get_fixed_pmc(pmu, index))) {
> @@ -369,6 +410,109 @@ int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data)
>  	return 1;
>  }
>  
> +static void kvm_pmu_release_pin(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_pmu *pmu = &vcpu->arch.pmu;
> +	int i;
> +
> +	for (i = 0; i < pmu->num_pinned_pages; i++)
> +		put_page(pmu->pinned_pages[i]);
> +	pmu->num_pinned_pages = 0;
> +}
> +
> +static struct page *get_guest_page(struct kvm_vcpu *vcpu,
> +				   unsigned long addr)
> +{
> +	unsigned long pfn;
> +	struct x86_exception exception;
> +	gpa_t gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, addr,
> +						    PFERR_WRITE_MASK,
> +						    &exception);
> +
> +	if (gpa == UNMAPPED_GVA) {
> +		printk_once("Cannot translate guest page %lx\n", addr);
> +		return NULL;
> +	}
> +	pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
> +	if (is_error_noslot_pfn(pfn)) {
> +		printk_once("gfn_to_pfn failed for %llx\n", gpa);
> +		return NULL;
> +	}
> +	return pfn_to_page(pfn);
> +}
> +
> +static int pin_and_copy(struct kvm_vcpu *vcpu,
> +			unsigned long addr, void *dst, int len,
> +			struct page **p)
> +{
> +	unsigned long offset = addr & ~PAGE_MASK;
> +	void *map;
> +
> +	*p = get_guest_page(vcpu, addr);
> +	if (!*p)
> +		return -EIO;
> +	map = kmap(*p);
> +	memcpy(dst, map + offset, len);
> +	kunmap(map);
> +	return 0;
> +}
> +
> +/*
> + * Pin the DS area and the PEBS buffer while PEBS is active,
> + * because the CPU cannot tolerate EPT faults for PEBS updates.
> + *
> + * We assume that any guest who changes the DS buffer disables
> + * PEBS first and does not change the page tables during operation.
> + *
> + * When the guest violates these assumptions it may crash itself.
> + * This is expected to not happen with standard profilers.
> + *
> + * No need to clean up anything, as the caller will always eventually
> + * unpin pages.
> + */
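To make the split copy in kvm_pmu_pebs_pin() below concrete: the DS
area may straddle a page boundary, in which case two pages are pinned
for it. A worked example with a made-up guest address
(sizeof(struct debug_store) is 96 bytes given the layout quoted
earlier):

    /* Hypothetical DS area starting 16 bytes before a page boundary */
    unsigned long ds_area = 0x7f1234567ff0;
    unsigned long offset = ds_area & ~PAGE_MASK;            /* 0xff0 */
    unsigned int len = min_t(unsigned, PAGE_SIZE - offset,
                             sizeof(struct debug_store));   /* 16 bytes */
    /* The remaining 96 - 16 = 80 bytes are copied from the next page,
     * which occupies pinned_pages[1]; the PEBS buffer pages then start
     * at pinned_pages[2], leaving 15 of the MAX_PINNED_PAGES slots for
     * the buffer itself. */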
> +
> +static void kvm_pmu_pebs_pin(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_pmu *pmu = &vcpu->arch.pmu;
> +	struct debug_store ds;
> +	int pg;
> +	unsigned len;
> +	unsigned long offset;
> +	unsigned long addr;
> +
> +	offset = pmu->ds_area & ~PAGE_MASK;
> +	len = sizeof(struct debug_store);
> +	len = min_t(unsigned, PAGE_SIZE - offset, len);
> +	if (pin_and_copy(vcpu, pmu->ds_area, &ds, len,
> +			 &pmu->pinned_pages[0]) < 0) {
> +		printk_once("Cannot pin ds area %llx\n", pmu->ds_area);
> +		return;
> +	}
> +	pmu->num_pinned_pages++;
> +	if (len < sizeof(struct debug_store)) {
> +		if (pin_and_copy(vcpu, pmu->ds_area + len, (void *)&ds + len,
> +				 sizeof(struct debug_store) - len,
> +				 &pmu->pinned_pages[1]) < 0)
> +			return;
> +		pmu->num_pinned_pages++;
> +	}
> +
> +	pg = pmu->num_pinned_pages;
> +	for (addr = ds.pebs_buffer_base;
> +	     addr < ds.pebs_absolute_maximum && pg < MAX_PINNED_PAGES;
> +	     addr += PAGE_SIZE, pg++) {
> +		pmu->pinned_pages[pg] = get_guest_page(vcpu, addr);
> +		if (!pmu->pinned_pages[pg]) {
> +			printk_once("Cannot pin PEBS buffer %lx (%llx-%llx)\n",
> +				    addr,
> +				    ds.pebs_buffer_base,
> +				    ds.pebs_absolute_maximum);
> +			break;
> +		}
> +	}
> +	pmu->num_pinned_pages = pg;
> +}
> +
>  int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  {
>  	struct kvm_pmu *pmu = &vcpu->arch.pmu;
> @@ -407,6 +551,20 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  			return 0;
>  		}
>  		break;
> +	case MSR_IA32_DS_AREA:
> +		pmu->ds_area = data;
> +		return 0;
> +	case MSR_IA32_PEBS_ENABLE:
> +		if (data & ~0xf0000000fULL)
> +			break;
> +		if (data && data != pmu->pebs_enable) {
> +			kvm_pmu_release_pin(vcpu);
> +			kvm_pmu_pebs_pin(vcpu);
> +		} else if (data == 0 && pmu->pebs_enable) {
> +			kvm_pmu_release_pin(vcpu);
> +		}
> +		pmu->pebs_enable = data;
> +		return 0;
>  	default:
>  		if ((pmc = get_gp_pmc(pmu, index, MSR_IA32_PERFCTR0)) ||
>  		    (pmc = get_fixed_pmc(pmu, index))) {
> @@ -418,7 +576,7 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  			if (data == pmc->eventsel)
>  				return 0;
>  			if (!(data & pmu->reserved_bits)) {
> -				reprogram_gp_counter(pmc, data);
> +				reprogram_gp_counter(pmu, pmc, data);
>  				return 0;
>  			}
>  		}
> @@ -514,6 +672,9 @@ void kvm_pmu_init(struct kvm_vcpu *vcpu)
>  	}
>  	init_irq_work(&pmu->irq_work, trigger_pmi);
>  	kvm_pmu_cpuid_update(vcpu);
> +
> +	if (boot_cpu_has(X86_FEATURE_PDCM))
> +		rdmsrl_safe(MSR_IA32_PERF_CAPABILITIES, &host_perf_cap);
>  }
>  
>  void kvm_pmu_reset(struct kvm_vcpu *vcpu)
> @@ -538,6 +699,7 @@ void kvm_pmu_reset(struct kvm_vcpu *vcpu)
>  void kvm_pmu_destroy(struct kvm_vcpu *vcpu)
>  {
>  	kvm_pmu_reset(vcpu);
> +	kvm_pmu_release_pin(vcpu);
>  }
>  
>  void kvm_handle_pmu_event(struct kvm_vcpu *vcpu)
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 33e8c02..4f39917 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -7288,6 +7288,12 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
>  	atomic_switch_perf_msrs(vmx);
>  	debugctlmsr = get_debugctlmsr();
>  
> +	/* Move this somewhere else? */
> +	if (vcpu->arch.pmu.ds_area)
> +		add_atomic_switch_msr(vmx, MSR_IA32_DS_AREA,
> +				      vcpu->arch.pmu.ds_area,
> +				      perf_get_ds_area());
> +
>  	vmx->__launched = vmx->loaded_vmcs->launched;
>  	asm(
>  		/* Store host registers */
> -- 
> 1.9.0
> 

--
	Gleb.