* [PATCH RESEND v2 0/8] x86/pmu: Corner cases fixes and optimization
@ 2022-08-23 9:32 Like Xu
2022-08-23 9:32 ` [PATCH RESEND v2 1/8] perf/x86/core: Completely disable guest PEBS via guest's global_ctrl Like Xu
` (8 more replies)
0 siblings, 9 replies; 16+ messages in thread
From: Like Xu @ 2022-08-23 9:32 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini; +Cc: Jim Mattson, linux-kernel, kvm
Good, well-designed tests can help us find more bugs, especially when
the test steps differ from the Linux kernel's behaviour in the
timing of access to virtualized hw resources.
In this patch series, there are three small optimizations (006/007/008),
one hardware surprise (001), and most of these fixes have stepped
on my little toes.
Please feel free to run tests, add more or share comments.
Previous:
https://lore.kernel.org/kvm/20220721103549.49543-1-likexu@tencent.com/
https://lore.kernel.org/kvm/20220803130124.72340-1-likexu@tencent.com/
V2 -> V2 RESEND Changelog:
- The "pebs_capable" fix has been merged into the tip/perf tree;
- Moved the other two AMD vPMU optimizations here;
Like Xu (8):
perf/x86/core: Completely disable guest PEBS via guest's global_ctrl
KVM: x86/pmu: Avoid setting BIT_ULL(-1) to pmu->host_cross_mapped_mask
KVM: x86/pmu: Don't generate PEBS records for emulated instructions
KVM: x86/pmu: Avoid using PEBS perf_events for normal counters
KVM: x86/pmu: Defer reprogram_counter() to kvm_pmu_handle_event()
KVM: x86/pmu: Defer counter emulated overflow via pmc->stale_counter
KVM: x86/svm/pmu: Direct access pmu->gp_counter[] to implement
amd_*_to_pmc()
KVM: x86/svm/pmu: Rewrite get_gp_pmc_amd() for more counters
scalability
arch/x86/events/intel/core.c | 3 +-
arch/x86/include/asm/kvm_host.h | 6 +-
arch/x86/kvm/pmu.c | 47 ++++++++-----
arch/x86/kvm/pmu.h | 6 +-
arch/x86/kvm/svm/pmu.c | 116 +++++---------------------------
arch/x86/kvm/vmx/pmu_intel.c | 30 ++++-----
6 files changed, 73 insertions(+), 135 deletions(-)
--
2.37.2
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH RESEND v2 1/8] perf/x86/core: Completely disable guest PEBS via guest's global_ctrl
2022-08-23 9:32 [PATCH RESEND v2 0/8] x86/pmu: Corner cases fixes and optimization Like Xu
@ 2022-08-23 9:32 ` Like Xu
2022-08-30 17:40 ` Sean Christopherson
2022-08-23 9:32 ` [PATCH RESEND v2 2/8] KVM: x86/pmu: Avoid setting BIT_ULL(-1) to pmu->host_cross_mapped_mask Like Xu
` (7 subsequent siblings)
8 siblings, 1 reply; 16+ messages in thread
From: Like Xu @ 2022-08-23 9:32 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini; +Cc: Jim Mattson, linux-kernel, kvm
From: Like Xu <likexu@tencent.com>
When a guest PEBS counter is cross-mapped by a host counter, software
removes the corresponding bit from arr[global_ctrl].guest and expects
hardware to perform the enable-to-disable state change via the
msr_slot[] switch during the VMX transition.
In reality, if the user sets the counter overflow value small enough,
a tiny race window remains in which the previously PEBS-enabled counter
can write cross-mapped PEBS records into the guest's PEBS buffer:
arr[global_ctrl].guest is switched into the enabled state first (via the
prioritised switch_msr_special path) while arr[pebs_enable].guest is not
yet switched.
Close this window by also clearing the invalid bits in arr[global_ctrl].guest.
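The masking the patch adds can be sketched outside the kernel as follows;
this is a hedged stand-in for the intel_guest_get_msrs() logic (struct and
function names here are illustrative, not kernel API):

```c
#include <stdint.h>

/*
 * Sketch: given the guest images of GLOBAL_CTRL and PEBS_ENABLE and the
 * host_cross_mapped_mask, compute the values the atomic MSR switch should
 * load. Clearing the mask from *both* images closes the window in which
 * GLOBAL_CTRL is already live while PEBS_ENABLE is not.
 */
struct guest_switch {
	uint64_t global_ctrl;
	uint64_t pebs_enable;
};

static struct guest_switch mask_cross_mapped(uint64_t global_ctrl,
					     uint64_t pebs_enable,
					     uint64_t cross_mapped_mask)
{
	struct guest_switch s;

	/* Disable guest PEBS for cross-mapped PEBS counters ... */
	s.pebs_enable = pebs_enable & ~cross_mapped_mask;
	/* ... and drop the same bits from GLOBAL_CTRL (the new hunk). */
	s.global_ctrl = global_ctrl & ~cross_mapped_mask;
	/* Keep GLOBAL_CTRL bits set for counters still doing guest PEBS. */
	s.global_ctrl |= s.pebs_enable;
	return s;
}
```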
Fixes: 854250329c02 ("KVM: x86/pmu: Disable guest PEBS temporarily in two rare situations")
Signed-off-by: Like Xu <likexu@tencent.com>
---
arch/x86/events/intel/core.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 2db93498ff71..75cdd11ab014 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4052,8 +4052,9 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
/* Disable guest PEBS if host PEBS is enabled. */
arr[pebs_enable].guest = 0;
} else {
- /* Disable guest PEBS for cross-mapped PEBS counters. */
+ /* Disable guest PEBS thoroughly for cross-mapped PEBS counters. */
arr[pebs_enable].guest &= ~kvm_pmu->host_cross_mapped_mask;
+ arr[global_ctrl].guest &= ~kvm_pmu->host_cross_mapped_mask;
/* Set hw GLOBAL_CTRL bits for PEBS counter when it runs for guest */
arr[global_ctrl].guest |= arr[pebs_enable].guest;
}
--
2.37.2
* [PATCH RESEND v2 2/8] KVM: x86/pmu: Avoid setting BIT_ULL(-1) to pmu->host_cross_mapped_mask
2022-08-23 9:32 [PATCH RESEND v2 0/8] x86/pmu: Corner cases fixes and optimization Like Xu
2022-08-23 9:32 ` [PATCH RESEND v2 1/8] perf/x86/core: Completely disable guest PEBS via guest's global_ctrl Like Xu
@ 2022-08-23 9:32 ` Like Xu
2022-08-23 9:32 ` [PATCH RESEND v2 3/8] KVM: x86/pmu: Don't generate PEBS records for emulated instructions Like Xu
` (6 subsequent siblings)
8 siblings, 0 replies; 16+ messages in thread
From: Like Xu @ 2022-08-23 9:32 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini; +Cc: Jim Mattson, linux-kernel, kvm
From: Like Xu <likexu@tencent.com>
In the extreme case of host counter multiplexing and contention, the
perf_event requested for the guest's PEBS counter is not allocated to any
actual physical counter, in which case hw.idx is bookkept as -1,
resulting in an out-of-bounds shift (BIT_ULL(-1)) when updating
host_cross_mapped_mask.
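The guarded mask update can be sketched as below; a hedged illustration of
the fixed check, not the kernel code itself (the function name is made up):

```c
#include <stdint.h>

/*
 * Sketch: hw.idx is -1 when perf could not back the guest event with a
 * physical counter. Shifting by -1 is undefined behaviour, so the index
 * must be validated before building the mask bit, as the patch does with
 * "hw_idx != pmc->idx && hw_idx > -1".
 */
static uint64_t cross_mapped_bit(int guest_idx, int hw_idx)
{
	if (hw_idx != guest_idx && hw_idx > -1)
		return 1ULL << hw_idx;
	return 0;	/* not cross-mapped, or no physical counter at all */
}
```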
Fixes: 854250329c02 ("KVM: x86/pmu: Disable guest PEBS temporarily in two rare situations")
Signed-off-by: Like Xu <likexu@tencent.com>
---
arch/x86/kvm/vmx/pmu_intel.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index c399637a3a79..d595ff33d32d 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -776,20 +776,20 @@ static void intel_pmu_cleanup(struct kvm_vcpu *vcpu)
void intel_pmu_cross_mapped_check(struct kvm_pmu *pmu)
{
struct kvm_pmc *pmc = NULL;
- int bit;
+ int bit, hw_idx;
for_each_set_bit(bit, (unsigned long *)&pmu->global_ctrl,
X86_PMC_IDX_MAX) {
pmc = intel_pmc_idx_to_pmc(pmu, bit);
if (!pmc || !pmc_speculative_in_use(pmc) ||
- !intel_pmc_is_enabled(pmc))
+ !intel_pmc_is_enabled(pmc) || !pmc->perf_event)
continue;
- if (pmc->perf_event && pmc->idx != pmc->perf_event->hw.idx) {
- pmu->host_cross_mapped_mask |=
- BIT_ULL(pmc->perf_event->hw.idx);
- }
+ hw_idx = pmc->perf_event->hw.idx;
+ /* make it a little less dependent on perf's exact behavior */
+ if (hw_idx != pmc->idx && hw_idx > -1)
+ pmu->host_cross_mapped_mask |= BIT_ULL(hw_idx);
}
}
--
2.37.2
* [PATCH RESEND v2 3/8] KVM: x86/pmu: Don't generate PEBS records for emulated instructions
2022-08-23 9:32 [PATCH RESEND v2 0/8] x86/pmu: Corner cases fixes and optimization Like Xu
2022-08-23 9:32 ` [PATCH RESEND v2 1/8] perf/x86/core: Completely disable guest PEBS via guest's global_ctrl Like Xu
2022-08-23 9:32 ` [PATCH RESEND v2 2/8] KVM: x86/pmu: Avoid setting BIT_ULL(-1) to pmu->host_cross_mapped_mask Like Xu
@ 2022-08-23 9:32 ` Like Xu
2022-08-23 9:32 ` [PATCH RESEND v2 4/8] KVM: x86/pmu: Avoid using PEBS perf_events for normal counters Like Xu
` (5 subsequent siblings)
8 siblings, 0 replies; 16+ messages in thread
From: Like Xu @ 2022-08-23 9:32 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini; +Cc: Jim Mattson, linux-kernel, kvm
From: Like Xu <likexu@tencent.com>
KVM accumulates an enabled counter for the INSTRUCTIONS or
BRANCH_INSTRUCTIONS hw event on any KVM-emulated instruction and
generates an emulated overflow interrupt when the counter overflows.
In theory the same should happen when a PEBS counter overflows, but
that part of the underlying support (e.g. software injection of records
in the IRQ context, or a lazy approach) is currently missing.
In this case, KVM skips the injection of the BUFFER_OVF PMI (effectively
dropping one PEBS record) and lets the overflowed counter move on. The
loss of a single sample does not degrade overall accuracy, but it is
easily noticeable for certain specific instructions.
This issue is expected to be addressed, along with the issue of PEBS
cross-mapped counters, in a slow-path proposal.
Fixes: 79f3e3b58386 ("KVM: x86/pmu: Reprogram PEBS event to emulate guest PEBS counter")
Signed-off-by: Like Xu <likexu@tencent.com>
---
arch/x86/kvm/pmu.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 02f9e4f245bd..390d697efde1 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -106,9 +106,19 @@ static inline void __kvm_perf_overflow(struct kvm_pmc *pmc, bool in_pmi)
return;
if (pmc->perf_event && pmc->perf_event->attr.precise_ip) {
- /* Indicate PEBS overflow PMI to guest. */
- skip_pmi = __test_and_set_bit(GLOBAL_STATUS_BUFFER_OVF_BIT,
- (unsigned long *)&pmu->global_status);
+ if (!in_pmi) {
+ /*
+ * TODO: KVM is currently _choosing_ to not generate records
+ * for emulated instructions, avoiding BUFFER_OVF PMI when
+ * there are no records. Strictly speaking, it should be done
+ * as well in the right context to improve sampling accuracy.
+ */
+ skip_pmi = true;
+ } else {
+ /* Indicate PEBS overflow PMI to guest. */
+ skip_pmi = __test_and_set_bit(GLOBAL_STATUS_BUFFER_OVF_BIT,
+ (unsigned long *)&pmu->global_status);
+ }
} else {
__set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
}
--
2.37.2
* [PATCH RESEND v2 4/8] KVM: x86/pmu: Avoid using PEBS perf_events for normal counters
2022-08-23 9:32 [PATCH RESEND v2 0/8] x86/pmu: Corner cases fixes and optimization Like Xu
` (2 preceding siblings ...)
2022-08-23 9:32 ` [PATCH RESEND v2 3/8] KVM: x86/pmu: Don't generate PEBS records for emulated instructions Like Xu
@ 2022-08-23 9:32 ` Like Xu
2022-08-23 9:32 ` [PATCH RESEND v2 5/8] KVM: x86/pmu: Defer reprogram_counter() to kvm_pmu_handle_event() Like Xu
` (4 subsequent siblings)
8 siblings, 0 replies; 16+ messages in thread
From: Like Xu @ 2022-08-23 9:32 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini; +Cc: Jim Mattson, linux-kernel, kvm
From: Like Xu <likexu@tencent.com>
The check logic in pmc_resume_counter() that determines whether a
perf_event is reusable is incomplete, especially for a sequence
(uncommon but clearly valid) like:
- enabling a counter and its PEBS bit
- enabling global_ctrl
- running a workload
- disabling only the PEBS bit, leaving the global_ctrl bit enabled
In this corner case, a perf_event created for PEBS can be reused by a
normal counter before it has been released and recreated, and when this
normal counter overflows, it triggers a PEBS interrupt (precise_ip != 0).
To address this issue, the reuse check has been revamped and KVM now goes
back through reprogram_counter() when any bit of the guest PEBS_ENABLE
MSR changes, similar to what global_ctrl_changed() does.
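The corrected predicate amounts to requiring that the event's PEBS-ness
agree with the guest's current PEBS_ENABLE bit in both directions. A
hedged sketch (illustrative helper, not kernel code):

```c
#include <stdbool.h>

/*
 * Sketch of the revamped reuse check: a perf_event may only be reused
 * when its PEBS-ness (attr.precise_ip != 0) matches whether the guest
 * currently has the counter's PEBS_ENABLE bit set. The old check
 * (!pebs_bit && precise_ip) rejected only one of the two mismatched
 * combinations, letting a PEBS event serve a normal counter.
 */
static bool pebs_state_matches(bool pebs_enable_bit_set, int precise_ip)
{
	return pebs_enable_bit_set == (precise_ip != 0);
}
```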
Fixes: 79f3e3b58386 ("KVM: x86/pmu: Reprogram PEBS event to emulate guest PEBS counter")
Signed-off-by: Like Xu <likexu@tencent.com>
---
arch/x86/kvm/pmu.c | 4 ++--
arch/x86/kvm/vmx/pmu_intel.c | 14 +++++++-------
2 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 390d697efde1..d9b9a0f0db17 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -237,8 +237,8 @@ static bool pmc_resume_counter(struct kvm_pmc *pmc)
get_sample_period(pmc, pmc->counter)))
return false;
- if (!test_bit(pmc->idx, (unsigned long *)&pmc_to_pmu(pmc)->pebs_enable) &&
- pmc->perf_event->attr.precise_ip)
+ if (test_bit(pmc->idx, (unsigned long *)&pmc_to_pmu(pmc)->pebs_enable) !=
+ (!!pmc->perf_event->attr.precise_ip))
return false;
/* reuse perf_event to serve as pmc_reprogram_counter() does*/
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index d595ff33d32d..6242b0b81116 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -68,15 +68,11 @@ static struct kvm_pmc *intel_pmc_idx_to_pmc(struct kvm_pmu *pmu, int pmc_idx)
}
}
-/* function is called when global control register has been updated. */
-static void global_ctrl_changed(struct kvm_pmu *pmu, u64 data)
+static void reprogram_counters(struct kvm_pmu *pmu, u64 diff)
{
int bit;
- u64 diff = pmu->global_ctrl ^ data;
struct kvm_pmc *pmc;
- pmu->global_ctrl = data;
-
for_each_set_bit(bit, (unsigned long *)&diff, X86_PMC_IDX_MAX) {
pmc = intel_pmc_idx_to_pmc(pmu, bit);
if (pmc)
@@ -397,7 +393,7 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
struct kvm_pmc *pmc;
u32 msr = msr_info->index;
u64 data = msr_info->data;
- u64 reserved_bits;
+ u64 reserved_bits, diff;
switch (msr) {
case MSR_CORE_PERF_FIXED_CTR_CTRL:
@@ -418,7 +414,9 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
if (pmu->global_ctrl == data)
return 0;
if (kvm_valid_perf_global_ctrl(pmu, data)) {
- global_ctrl_changed(pmu, data);
+ diff = pmu->global_ctrl ^ data;
+ pmu->global_ctrl = data;
+ reprogram_counters(pmu, diff);
return 0;
}
break;
@@ -433,7 +431,9 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
if (pmu->pebs_enable == data)
return 0;
if (!(data & pmu->pebs_enable_mask)) {
+ diff = pmu->pebs_enable ^ data;
pmu->pebs_enable = data;
+ reprogram_counters(pmu, diff);
return 0;
}
break;
--
2.37.2
* [PATCH RESEND v2 5/8] KVM: x86/pmu: Defer reprogram_counter() to kvm_pmu_handle_event()
2022-08-23 9:32 [PATCH RESEND v2 0/8] x86/pmu: Corner cases fixes and optimization Like Xu
` (3 preceding siblings ...)
2022-08-23 9:32 ` [PATCH RESEND v2 4/8] KVM: x86/pmu: Avoid using PEBS perf_events for normal counters Like Xu
@ 2022-08-23 9:32 ` Like Xu
2022-08-30 17:50 ` Sean Christopherson
2022-08-23 9:32 ` [PATCH RESEND v2 6/8] KVM: x86/pmu: Defer counter emulated overflow via pmc->stale_counter Like Xu
` (3 subsequent siblings)
8 siblings, 1 reply; 16+ messages in thread
From: Like Xu @ 2022-08-23 9:32 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini; +Cc: Jim Mattson, linux-kernel, kvm
From: Like Xu <likexu@tencent.com>
Between a VM-exit and the subsequent VM-entry, requests from different
sources may try to create one or more perf_events via reprogram_counter(),
with later requests undoing the effects of earlier ones, in particular
through repeated calls into perf subsystem interfaces. These repetitive
calls can be elided because only the final state of the perf_event, and
the hardware resources it occupies, takes effect for the guest right
before VM-entry.
To realize this optimization, KVM marks the reprogramming requirement
via an inline version of reprogram_counter() and then defers the actual
work with the help of the vcpu's KVM_REQ_PMU request.
Opportunistically update related comments to avoid misunderstandings.
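The mark-then-handle pattern can be sketched in isolation as follows; a
hedged, simplified model of reprogram_pmi/KVM_REQ_PMU (globals and names
here are illustrative only):

```c
#include <stdint.h>

static uint64_t pending;	/* stands in for pmu->reprogram_pmi */
static int reprogram_calls[64];	/* how often each counter was rebuilt */

/* Cheap path, callable many times per exit: just set a bit. */
static void mark_reprogram(int idx)
{
	pending |= 1ULL << idx;	/* plus kvm_make_request(KVM_REQ_PMU) */
}

/* Expensive path, run once right before VM-entry. */
static void handle_pending(void)
{
	for (int i = 0; i < 64; i++) {
		if (pending & (1ULL << i)) {
			reprogram_calls[i]++;	/* one real reprogram */
			pending &= ~(1ULL << i);
		}
	}
}
```

However many times a counter is marked between exit and entry, only one
real reprogram happens, which is the point of the deferral.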
Signed-off-by: Like Xu <likexu@tencent.com>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/pmu.c | 16 +++++++++-------
arch/x86/kvm/pmu.h | 6 +++++-
3 files changed, 15 insertions(+), 8 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2c96c43c313a..4e568a7ef464 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -493,6 +493,7 @@ struct kvm_pmc {
struct perf_event *perf_event;
struct kvm_vcpu *vcpu;
/*
+ * only for creating or reusing perf_event,
* eventsel value for general purpose counters,
* ctrl value for fixed counters.
*/
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index d9b9a0f0db17..6940cbeee54d 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -101,7 +101,7 @@ static inline void __kvm_perf_overflow(struct kvm_pmc *pmc, bool in_pmi)
struct kvm_pmu *pmu = pmc_to_pmu(pmc);
bool skip_pmi = false;
- /* Ignore counters that have been reprogrammed already. */
+ /* Ignore counters that have not been reprogrammed. */
if (test_and_set_bit(pmc->idx, pmu->reprogram_pmi))
return;
@@ -293,7 +293,7 @@ static bool check_pmu_event_filter(struct kvm_pmc *pmc)
return allow_event;
}
-void reprogram_counter(struct kvm_pmc *pmc)
+static void __reprogram_counter(struct kvm_pmc *pmc)
{
struct kvm_pmu *pmu = pmc_to_pmu(pmc);
u64 eventsel = pmc->eventsel;
@@ -335,7 +335,6 @@ void reprogram_counter(struct kvm_pmc *pmc)
!(eventsel & ARCH_PERFMON_EVENTSEL_OS),
eventsel & ARCH_PERFMON_EVENTSEL_INT);
}
-EXPORT_SYMBOL_GPL(reprogram_counter);
void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)
{
@@ -345,11 +344,12 @@ void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)
for_each_set_bit(bit, pmu->reprogram_pmi, X86_PMC_IDX_MAX) {
struct kvm_pmc *pmc = static_call(kvm_x86_pmu_pmc_idx_to_pmc)(pmu, bit);
- if (unlikely(!pmc || !pmc->perf_event)) {
+ if (unlikely(!pmc)) {
clear_bit(bit, pmu->reprogram_pmi);
continue;
}
- reprogram_counter(pmc);
+
+ __reprogram_counter(pmc);
}
/*
@@ -527,7 +527,7 @@ static void kvm_pmu_incr_counter(struct kvm_pmc *pmc)
prev_count = pmc->counter;
pmc->counter = (pmc->counter + 1) & pmc_bitmask(pmc);
- reprogram_counter(pmc);
+ __reprogram_counter(pmc);
if (pmc->counter < prev_count)
__kvm_perf_overflow(pmc, false);
}
@@ -542,7 +542,9 @@ static inline bool eventsel_match_perf_hw_id(struct kvm_pmc *pmc,
static inline bool cpl_is_matched(struct kvm_pmc *pmc)
{
bool select_os, select_user;
- u64 config = pmc->current_config;
+ u64 config = pmc_is_gp(pmc) ? pmc->eventsel :
+ (u64)fixed_ctrl_field(pmc_to_pmu(pmc)->fixed_ctr_ctrl,
+ pmc->idx - INTEL_PMC_IDX_FIXED);
if (pmc_is_gp(pmc)) {
select_os = config & ARCH_PERFMON_EVENTSEL_OS;
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 5cc5721f260b..d193d1dc6de0 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -183,7 +183,11 @@ static inline void kvm_init_pmu_capability(void)
KVM_PMC_MAX_FIXED);
}
-void reprogram_counter(struct kvm_pmc *pmc);
+static inline void reprogram_counter(struct kvm_pmc *pmc)
+{
+ __set_bit(pmc->idx, pmc_to_pmu(pmc)->reprogram_pmi);
+ kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
+}
void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu);
void kvm_pmu_handle_event(struct kvm_vcpu *vcpu);
--
2.37.2
* [PATCH RESEND v2 6/8] KVM: x86/pmu: Defer counter emulated overflow via pmc->stale_counter
2022-08-23 9:32 [PATCH RESEND v2 0/8] x86/pmu: Corner cases fixes and optimization Like Xu
` (4 preceding siblings ...)
2022-08-23 9:32 ` [PATCH RESEND v2 5/8] KVM: x86/pmu: Defer reprogram_counter() to kvm_pmu_handle_event() Like Xu
@ 2022-08-23 9:32 ` Like Xu
2022-08-30 17:59 ` Sean Christopherson
2022-08-23 9:32 ` [PATCH RESEND v2 7/8] KVM: x86/svm/pmu: Direct access pmu->gp_counter[] to implement amd_*_to_pmc() Like Xu
` (2 subsequent siblings)
8 siblings, 1 reply; 16+ messages in thread
From: Like Xu @ 2022-08-23 9:32 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: Jim Mattson, linux-kernel, kvm, Wanpeng Li
From: Like Xu <likexu@tencent.com>
There are contextual restrictions on the functions that can be called
on the *_exit_handlers_fastpath path; for example, calling
pmc_reprogram_counter() triggers a host complaint like:
[*] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:580
[*] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 2981888, name: CPU 15/KVM
[*] preempt_count: 1, expected: 0
[*] RCU nest depth: 0, expected: 0
[*] INFO: lockdep is turned off.
[*] irq event stamp: 0
[*] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[*] hardirqs last disabled at (0): [<ffffffff8121222a>] copy_process+0x146a/0x62d0
[*] softirqs last enabled at (0): [<ffffffff81212269>] copy_process+0x14a9/0x62d0
[*] softirqs last disabled at (0): [<0000000000000000>] 0x0
[*] Preemption disabled at:
[*] [<ffffffffc2063fc1>] vcpu_enter_guest+0x1001/0x3dc0 [kvm]
[*] CPU: 17 PID: 2981888 Comm: CPU 15/KVM Kdump: 5.19.0-rc1-g239111db364c-dirty #2
[*] Call Trace:
[*] <TASK>
[*] dump_stack_lvl+0x6c/0x9b
[*] __might_resched.cold+0x22e/0x297
[*] __mutex_lock+0xc0/0x23b0
[*] perf_event_ctx_lock_nested+0x18f/0x340
[*] perf_event_pause+0x1a/0x110
[*] reprogram_counter+0x2af/0x1490 [kvm]
[*] kvm_pmu_trigger_event+0x429/0x950 [kvm]
[*] kvm_skip_emulated_instruction+0x48/0x90 [kvm]
[*] handle_fastpath_set_msr_irqoff+0x349/0x3b0 [kvm]
[*] vmx_vcpu_run+0x268e/0x3b80 [kvm_intel]
[*] vcpu_enter_guest+0x1d22/0x3dc0 [kvm]
A new stale_counter field is introduced to keep this part of the semantics
invariant. It records the counter value at the time of the emulated
increment and is used later in kvm_pmu_handle_event() to decide whether
to inject an emulated overflow interrupt, given that the internal count
from the perf_event has not yet been folded into pmc->counter, or that
the guest may update the value of a running counter directly.
Opportunistically shrink sizeof(struct kvm_pmc) a bit.
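The deferred overflow decision can be sketched as below; a hedged model
of kvm_pmu_incr_counter() plus the new kvm_pmu_handle_event() hunk
(struct and function names are illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

struct pmc_sketch {
	uint64_t counter;
	uint64_t stale_counter;
};

/* Fastpath-safe part: record the old value, bump the counter, and stop. */
static void incr_counter(struct pmc_sketch *pmc, uint64_t bitmask)
{
	pmc->stale_counter = pmc->counter;
	pmc->counter = (pmc->counter + 1) & bitmask;
	/* ...then reprogram_counter() just raises KVM_REQ_PMU. */
}

/* Deferred part: inject an emulated overflow iff the counter wrapped. */
static bool deferred_overflow(struct pmc_sketch *pmc)
{
	bool overflow = pmc->stale_counter &&
			pmc->counter < pmc->stale_counter;

	pmc->stale_counter = 0;
	return overflow;
}
```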
Suggested-by: Wanpeng Li <wanpengli@tencent.com>
Fixes: 9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions")
Signed-off-by: Like Xu <likexu@tencent.com>
---
arch/x86/include/asm/kvm_host.h | 5 +++--
arch/x86/kvm/pmu.c | 15 ++++++++-------
arch/x86/kvm/svm/pmu.c | 2 +-
arch/x86/kvm/vmx/pmu_intel.c | 4 ++--
4 files changed, 14 insertions(+), 12 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4e568a7ef464..ffd982bf015d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -488,7 +488,10 @@ enum pmc_type {
struct kvm_pmc {
enum pmc_type type;
u8 idx;
+ bool is_paused;
+ bool intr;
u64 counter;
+ u64 stale_counter;
u64 eventsel;
struct perf_event *perf_event;
struct kvm_vcpu *vcpu;
@@ -498,8 +501,6 @@ struct kvm_pmc {
* ctrl value for fixed counters.
*/
u64 current_config;
- bool is_paused;
- bool intr;
};
#define KVM_PMC_MAX_FIXED 3
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 6940cbeee54d..45d062cb1dd5 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -350,6 +350,12 @@ void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)
}
__reprogram_counter(pmc);
+
+ if (pmc->stale_counter) {
+ if (pmc->counter < pmc->stale_counter)
+ __kvm_perf_overflow(pmc, false);
+ pmc->stale_counter = 0;
+ }
}
/*
@@ -522,14 +528,9 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu)
static void kvm_pmu_incr_counter(struct kvm_pmc *pmc)
{
- u64 prev_count;
-
- prev_count = pmc->counter;
+ pmc->stale_counter = pmc->counter;
pmc->counter = (pmc->counter + 1) & pmc_bitmask(pmc);
-
- __reprogram_counter(pmc);
- if (pmc->counter < prev_count)
- __kvm_perf_overflow(pmc, false);
+ reprogram_counter(pmc);
}
static inline bool eventsel_match_perf_hw_id(struct kvm_pmc *pmc,
diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index f24613a108c5..e9c66dd659a6 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -290,7 +290,7 @@ static void amd_pmu_reset(struct kvm_vcpu *vcpu)
struct kvm_pmc *pmc = &pmu->gp_counters[i];
pmc_stop_counter(pmc);
- pmc->counter = pmc->eventsel = 0;
+ pmc->counter = pmc->stale_counter = pmc->eventsel = 0;
}
}
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 6242b0b81116..42b591755010 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -647,14 +647,14 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu)
pmc = &pmu->gp_counters[i];
pmc_stop_counter(pmc);
- pmc->counter = pmc->eventsel = 0;
+ pmc->counter = pmc->stale_counter = pmc->eventsel = 0;
}
for (i = 0; i < KVM_PMC_MAX_FIXED; i++) {
pmc = &pmu->fixed_counters[i];
pmc_stop_counter(pmc);
- pmc->counter = 0;
+ pmc->counter = pmc->stale_counter = 0;
}
pmu->fixed_ctr_ctrl = pmu->global_ctrl = pmu->global_status = 0;
--
2.37.2
* [PATCH RESEND v2 7/8] KVM: x86/svm/pmu: Direct access pmu->gp_counter[] to implement amd_*_to_pmc()
2022-08-23 9:32 [PATCH RESEND v2 0/8] x86/pmu: Corner cases fixes and optimization Like Xu
` (5 preceding siblings ...)
2022-08-23 9:32 ` [PATCH RESEND v2 6/8] KVM: x86/pmu: Defer counter emulated overflow via pmc->stale_counter Like Xu
@ 2022-08-23 9:32 ` Like Xu
2022-08-30 18:07 ` Sean Christopherson
2022-08-23 9:32 ` [PATCH RESEND v2 8/8] KVM: x86/svm/pmu: Rewrite get_gp_pmc_amd() for more counters scalability Like Xu
2022-08-30 17:29 ` [PATCH RESEND v2 0/8] x86/pmu: Corner cases fixes and optimization Sean Christopherson
8 siblings, 1 reply; 16+ messages in thread
From: Like Xu @ 2022-08-23 9:32 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini; +Cc: Jim Mattson, linux-kernel, kvm
From: Like Xu <likexu@tencent.com>
AMD only has gp counters, whose corresponding vPMCs are initialised and
stored in pmu->gp_counters[] in order of idx, so the array can be
indexed directly by any valid pmc->idx, without any help from other
interfaces at all. amd_rdpmc_ecx_to_pmc() can now reuse this part
of the code quite naturally.
Opportunistically apply array_index_nospec() to reduce the attack
surface of speculative execution, and remove the dead code.
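The bounds-check-plus-nospec pattern the patch uses can be sketched as
below. This is a hedged, simplified stand-in showing the shape of the
idea; the kernel's real array_index_nospec() uses a different branchless
mask construction:

```c
#include <stddef.h>

/*
 * Sketch: clamp an index with a data-dependent mask so that even a
 * mispredicted bounds check cannot speculatively read out of bounds.
 * Returns idx when idx < size, else 0. Assumes size <= SIZE_MAX / 2.
 */
static size_t index_nospec_sketch(size_t idx, size_t size)
{
	/* All-ones when idx < size, zero otherwise (no branch needed). */
	size_t mask = (size_t)0 - (size_t)(idx < size);

	return idx & mask;
}
```

The caller still performs the architectural `if (pmc_idx >= num_counters)
return NULL;` check; the mask only neutralizes the speculative path.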
Signed-off-by: Like Xu <likexu@tencent.com>
---
arch/x86/kvm/svm/pmu.c | 41 +++++------------------------------------
1 file changed, 5 insertions(+), 36 deletions(-)
diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index e9c66dd659a6..e57eb0555a04 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -33,23 +33,6 @@ enum index {
INDEX_ERROR,
};
-static unsigned int get_msr_base(struct kvm_pmu *pmu, enum pmu_type type)
-{
- struct kvm_vcpu *vcpu = pmu_to_vcpu(pmu);
-
- if (guest_cpuid_has(vcpu, X86_FEATURE_PERFCTR_CORE)) {
- if (type == PMU_TYPE_COUNTER)
- return MSR_F15H_PERF_CTR;
- else
- return MSR_F15H_PERF_CTL;
- } else {
- if (type == PMU_TYPE_COUNTER)
- return MSR_K7_PERFCTR0;
- else
- return MSR_K7_EVNTSEL0;
- }
-}
-
static enum index msr_to_index(u32 msr)
{
switch (msr) {
@@ -141,18 +124,12 @@ static bool amd_pmc_is_enabled(struct kvm_pmc *pmc)
static struct kvm_pmc *amd_pmc_idx_to_pmc(struct kvm_pmu *pmu, int pmc_idx)
{
- unsigned int base = get_msr_base(pmu, PMU_TYPE_COUNTER);
- struct kvm_vcpu *vcpu = pmu_to_vcpu(pmu);
+ unsigned int num_counters = pmu->nr_arch_gp_counters;
- if (guest_cpuid_has(vcpu, X86_FEATURE_PERFCTR_CORE)) {
- /*
- * The idx is contiguous. The MSRs are not. The counter MSRs
- * are interleaved with the event select MSRs.
- */
- pmc_idx *= 2;
- }
+ if (pmc_idx >= num_counters)
+ return NULL;
- return get_gp_pmc_amd(pmu, base + pmc_idx, PMU_TYPE_COUNTER);
+ return &pmu->gp_counters[array_index_nospec(pmc_idx, num_counters)];
}
static bool amd_is_valid_rdpmc_ecx(struct kvm_vcpu *vcpu, unsigned int idx)
@@ -168,15 +145,7 @@ static bool amd_is_valid_rdpmc_ecx(struct kvm_vcpu *vcpu, unsigned int idx)
static struct kvm_pmc *amd_rdpmc_ecx_to_pmc(struct kvm_vcpu *vcpu,
unsigned int idx, u64 *mask)
{
- struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
- struct kvm_pmc *counters;
-
- idx &= ~(3u << 30);
- if (idx >= pmu->nr_arch_gp_counters)
- return NULL;
- counters = pmu->gp_counters;
-
- return &counters[idx];
+ return amd_pmc_idx_to_pmc(vcpu_to_pmu(vcpu), idx & ~(3u << 30));
}
static bool amd_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
--
2.37.2
* [PATCH RESEND v2 8/8] KVM: x86/svm/pmu: Rewrite get_gp_pmc_amd() for more counters scalability
2022-08-23 9:32 [PATCH RESEND v2 0/8] x86/pmu: Corner cases fixes and optimization Like Xu
` (6 preceding siblings ...)
2022-08-23 9:32 ` [PATCH RESEND v2 7/8] KVM: x86/svm/pmu: Direct access pmu->gp_counter[] to implement amd_*_to_pmc() Like Xu
@ 2022-08-23 9:32 ` Like Xu
2022-08-30 18:24 ` Sean Christopherson
2022-08-30 17:29 ` [PATCH RESEND v2 0/8] x86/pmu: Corner cases fixes and optimization Sean Christopherson
8 siblings, 1 reply; 16+ messages in thread
From: Like Xu @ 2022-08-23 9:32 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini; +Cc: Jim Mattson, linux-kernel, kvm
From: Like Xu <likexu@tencent.com>
If the number of AMD gp counters continues to grow, the code will
become very clumsy and the switch-case design of the inline
get_gp_pmc_amd() will also bloat the kernel text size.
The code is taught to manage two groups of MSRs, each representing a
different version of the AMD PMU counter MSRs. The MSR addresses within
each group are contiguous, with no holes, and the two address ranges do
not intersect, but by design they differ in layout as follows:
[Group A : All counter MSRs are tightly bound to all event select MSRs ]
MSR_K7_EVNTSEL0 0xc0010000
MSR_K7_EVNTSELi 0xc0010000 + i
...
MSR_K7_EVNTSEL3 0xc0010003
MSR_K7_PERFCTR0 0xc0010004
MSR_K7_PERFCTRi 0xc0010004 + i
...
MSR_K7_PERFCTR3 0xc0010007
[Group B : The counter MSRs are interleaved with the event select MSRs ]
MSR_F15H_PERF_CTL0 0xc0010200
MSR_F15H_PERF_CTR0 (0xc0010200 + 1)
...
MSR_F15H_PERF_CTLi (0xc0010200 + 2 * i)
MSR_F15H_PERF_CTRi (0xc0010200 + 2 * i + 1)
...
MSR_F15H_PERF_CTL5 (0xc0010200 + 2 * 5)
MSR_F15H_PERF_CTR5 (0xc0010200 + 2 * 5 + 1)
Rewrite get_gp_pmc_amd() in this way: first determine which group of
registers is being accessed, then check whether the MSR matches the
requested type, apply the group's scaling ratio, and finally derive the
pmc_idx to pass into amd_pmc_idx_to_pmc().
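The index arithmetic for the two groups can be sketched standalone; a
hedged illustration using the MSR base addresses listed above (the
decode function itself is made up for this example):

```c
#include <stdbool.h>
#include <stdint.h>

#define MSR_K7_EVNTSEL0		0xc0010000u	/* Group A selects   */
#define MSR_K7_PERFCTR0		0xc0010004u	/* Group A counters  */
#define MSR_F15H_PERF_CTL0	0xc0010200u	/* Group B, 6 pairs  */

/*
 * Sketch: map an MSR address to (counter index, is-event-select).
 * Group B packs CTL/CTR pairs, so dividing the offset by two yields the
 * index, and the offset's parity distinguishes select (even) from
 * counter (odd) - exactly the check the rewritten function performs.
 */
static bool decode_amd_gp_msr(uint32_t msr, unsigned int *idx,
			      bool *is_evtsel)
{
	if (msr >= MSR_F15H_PERF_CTL0 && msr < MSR_F15H_PERF_CTL0 + 12) {
		*idx = (msr - MSR_F15H_PERF_CTL0) / 2;
		*is_evtsel = ((msr - MSR_F15H_PERF_CTL0) % 2) == 0;
		return true;
	}
	if (msr >= MSR_K7_EVNTSEL0 && msr < MSR_K7_EVNTSEL0 + 4) {
		*idx = msr - MSR_K7_EVNTSEL0;
		*is_evtsel = true;
		return true;
	}
	if (msr >= MSR_K7_PERFCTR0 && msr < MSR_K7_PERFCTR0 + 4) {
		*idx = msr - MSR_K7_PERFCTR0;
		*is_evtsel = false;
		return true;
	}
	return false;	/* not a gp counter MSR */
}
```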
Signed-off-by: Like Xu <likexu@tencent.com>
---
arch/x86/kvm/svm/pmu.c | 85 +++++++++---------------------------------
1 file changed, 17 insertions(+), 68 deletions(-)
diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index e57eb0555a04..c7ff6a910679 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -23,90 +23,49 @@ enum pmu_type {
PMU_TYPE_EVNTSEL,
};
-enum index {
- INDEX_ZERO = 0,
- INDEX_ONE,
- INDEX_TWO,
- INDEX_THREE,
- INDEX_FOUR,
- INDEX_FIVE,
- INDEX_ERROR,
-};
-
-static enum index msr_to_index(u32 msr)
+static struct kvm_pmc *amd_pmc_idx_to_pmc(struct kvm_pmu *pmu, int pmc_idx)
{
- switch (msr) {
- case MSR_F15H_PERF_CTL0:
- case MSR_F15H_PERF_CTR0:
- case MSR_K7_EVNTSEL0:
- case MSR_K7_PERFCTR0:
- return INDEX_ZERO;
- case MSR_F15H_PERF_CTL1:
- case MSR_F15H_PERF_CTR1:
- case MSR_K7_EVNTSEL1:
- case MSR_K7_PERFCTR1:
- return INDEX_ONE;
- case MSR_F15H_PERF_CTL2:
- case MSR_F15H_PERF_CTR2:
- case MSR_K7_EVNTSEL2:
- case MSR_K7_PERFCTR2:
- return INDEX_TWO;
- case MSR_F15H_PERF_CTL3:
- case MSR_F15H_PERF_CTR3:
- case MSR_K7_EVNTSEL3:
- case MSR_K7_PERFCTR3:
- return INDEX_THREE;
- case MSR_F15H_PERF_CTL4:
- case MSR_F15H_PERF_CTR4:
- return INDEX_FOUR;
- case MSR_F15H_PERF_CTL5:
- case MSR_F15H_PERF_CTR5:
- return INDEX_FIVE;
- default:
- return INDEX_ERROR;
- }
+ unsigned int num_counters = pmu->nr_arch_gp_counters;
+
+ if (pmc_idx >= num_counters)
+ return NULL;
+
+ return &pmu->gp_counters[array_index_nospec(pmc_idx, num_counters)];
}
static inline struct kvm_pmc *get_gp_pmc_amd(struct kvm_pmu *pmu, u32 msr,
enum pmu_type type)
{
struct kvm_vcpu *vcpu = pmu_to_vcpu(pmu);
+ unsigned int idx;
if (!vcpu->kvm->arch.enable_pmu)
return NULL;
switch (msr) {
- case MSR_F15H_PERF_CTL0:
- case MSR_F15H_PERF_CTL1:
- case MSR_F15H_PERF_CTL2:
- case MSR_F15H_PERF_CTL3:
- case MSR_F15H_PERF_CTL4:
- case MSR_F15H_PERF_CTL5:
+ case MSR_F15H_PERF_CTL0 ... MSR_F15H_PERF_CTR5:
if (!guest_cpuid_has(vcpu, X86_FEATURE_PERFCTR_CORE))
return NULL;
- fallthrough;
+ idx = (unsigned int)((msr - MSR_F15H_PERF_CTL0) / 2);
+ if ((msr == (MSR_F15H_PERF_CTL0 + 2 * idx)) !=
+ (type == PMU_TYPE_EVNTSEL))
+ return NULL;
+ break;
case MSR_K7_EVNTSEL0 ... MSR_K7_EVNTSEL3:
if (type != PMU_TYPE_EVNTSEL)
return NULL;
+ idx = msr - MSR_K7_EVNTSEL0;
break;
- case MSR_F15H_PERF_CTR0:
- case MSR_F15H_PERF_CTR1:
- case MSR_F15H_PERF_CTR2:
- case MSR_F15H_PERF_CTR3:
- case MSR_F15H_PERF_CTR4:
- case MSR_F15H_PERF_CTR5:
- if (!guest_cpuid_has(vcpu, X86_FEATURE_PERFCTR_CORE))
- return NULL;
- fallthrough;
case MSR_K7_PERFCTR0 ... MSR_K7_PERFCTR3:
if (type != PMU_TYPE_COUNTER)
return NULL;
+ idx = msr - MSR_K7_PERFCTR0;
break;
default:
return NULL;
}
- return &pmu->gp_counters[msr_to_index(msr)];
+ return amd_pmc_idx_to_pmc(pmu, idx);
}
static bool amd_hw_event_available(struct kvm_pmc *pmc)
@@ -122,16 +81,6 @@ static bool amd_pmc_is_enabled(struct kvm_pmc *pmc)
return true;
}
-static struct kvm_pmc *amd_pmc_idx_to_pmc(struct kvm_pmu *pmu, int pmc_idx)
-{
- unsigned int num_counters = pmu->nr_arch_gp_counters;
-
- if (pmc_idx >= num_counters)
- return NULL;
-
- return &pmu->gp_counters[array_index_nospec(pmc_idx, num_counters)];
-}
-
static bool amd_is_valid_rdpmc_ecx(struct kvm_vcpu *vcpu, unsigned int idx)
{
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
--
2.37.2
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH RESEND v2 0/8] x86/pmu: Corner cases fixes and optimization
2022-08-23 9:32 [PATCH RESEND v2 0/8] x86/pmu: Corner cases fixes and optimization Like Xu
` (7 preceding siblings ...)
2022-08-23 9:32 ` [PATCH RESEND v2 8/8] KVM: x86/svm/pmu: Rewrite get_gp_pmc_amd() for more counters scalability Like Xu
@ 2022-08-30 17:29 ` Sean Christopherson
2022-08-31 8:05 ` Like Xu
8 siblings, 1 reply; 16+ messages in thread
From: Sean Christopherson @ 2022-08-30 17:29 UTC (permalink / raw)
To: Like Xu; +Cc: Paolo Bonzini, Jim Mattson, linux-kernel, kvm
On Tue, Aug 23, 2022, Like Xu wrote:
> Well-designed tests can help us find more bugs, especially when
> the test steps differ from the Linux kernel behaviour in terms of the
> timing of access to virtualized hw resources.
>
> In this patch series, there are three small optimizations (006/007/008),
> one hardware surprise (001), and most of these fixes have stepped
> on my little toes.
>
> Please feel free to run tests, add more or share comments.
>
> Previous:
> https://lore.kernel.org/kvm/20220721103549.49543-1-likexu@tencent.com/
> https://lore.kernel.org/kvm/20220803130124.72340-1-likexu@tencent.com/
>
> V2 -> V2 RESEND Changelog:
> - The "pebs_capable" fix has been merged into the tip/perf tree;
> - Moved the other two AMD vPMU optimizations here;
This is not a RESEND. These things very much matter because I'm sitting here
trying to figure out whether I need to review this "v2", which is really v3, or
whether I can just review the real v2.
Don't add "RESEND" when you are submitting a modified version of your
patch or patch series - "RESEND" only applies to resubmission of a
patch or patch series which have not been modified in any way from the
previous submission.
* Re: [PATCH RESEND v2 1/8] perf/x86/core: Completely disable guest PEBS via guest's global_ctrl
2022-08-23 9:32 ` [PATCH RESEND v2 1/8] perf/x86/core: Completely disable guest PEBS via guest's global_ctrl Like Xu
@ 2022-08-30 17:40 ` Sean Christopherson
0 siblings, 0 replies; 16+ messages in thread
From: Sean Christopherson @ 2022-08-30 17:40 UTC (permalink / raw)
To: Like Xu; +Cc: Paolo Bonzini, Jim Mattson, linux-kernel, kvm
On Tue, Aug 23, 2022, Like Xu wrote:
> From: Like Xu <likexu@tencent.com>
>
> When a guest PEBS counter is cross-mapped by a host counter, software
> removes the corresponding bit in arr[global_ctrl].guest and expects
> hardware to perform the "enable to disable" state change via the
> msr_slot[] switch during the VMX transition.
>
> In reality, if the user sets the counter overflow value small enough,
> a tiny race window remains in which the previously PEBS-enabled counter
> can write cross-mapped PEBS records into the guest's PEBS buffer:
> arr[global_ctrl].guest is prioritised (the switch_msr_special logic)
> and switches into the enabled state while arr[pebs_enable].guest has not.
>
> Close this window by clearing invalid bits in the arr[global_ctrl].guest.
>
> Fixes: 854250329c02 ("KVM: x86/pmu: Disable guest PEBS temporarily in two rare situations")
> Signed-off-by: Like Xu <likexu@tencent.com>
> ---
> arch/x86/events/intel/core.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index 2db93498ff71..75cdd11ab014 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -4052,8 +4052,9 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
> /* Disable guest PEBS if host PEBS is enabled. */
> arr[pebs_enable].guest = 0;
> } else {
> - /* Disable guest PEBS for cross-mapped PEBS counters. */
> + /* Disable guest PEBS thoroughly for cross-mapped PEBS counters. */
> arr[pebs_enable].guest &= ~kvm_pmu->host_cross_mapped_mask;
> + arr[global_ctrl].guest &= ~kvm_pmu->host_cross_mapped_mask;
> /* Set hw GLOBAL_CTRL bits for PEBS counter when it runs for guest */
> arr[global_ctrl].guest |= arr[pebs_enable].guest;
> }
Please post this as a separate patch to the perf folks (Cc: kvm@).
* Re: [PATCH RESEND v2 5/8] KVM: x86/pmu: Defer reprogram_counter() to kvm_pmu_handle_event()
2022-08-23 9:32 ` [PATCH RESEND v2 5/8] KVM: x86/pmu: Defer reprogram_counter() to kvm_pmu_handle_event() Like Xu
@ 2022-08-30 17:50 ` Sean Christopherson
0 siblings, 0 replies; 16+ messages in thread
From: Sean Christopherson @ 2022-08-30 17:50 UTC (permalink / raw)
To: Like Xu; +Cc: Paolo Bonzini, Jim Mattson, linux-kernel, kvm
On Tue, Aug 23, 2022, Like Xu wrote:
> From: Like Xu <likexu@tencent.com>
>
> During a KVM-trap from vm-exit to vm-entry, requests from different
> sources will try to create one or more perf_events via reprogram_counter(),
> which will allow some predecessor actions to be undone posteriorly,
> especially repeated calls to some perf subsystem interfaces. These
> repetitive calls can be omitted because only the final state of the
> perf_event and the hardware resources it occupies will take effect
> for the guest right before the vm-entry.
>
> To realize this optimization, KVM marks the creation requirements via
> an inline version of reprogram_counter(), and then defers the actual
> execution with the help of vcpu KVM_REQ_PMU request.
Use imperative mood and state what change is being made, not what KVM's behavior
is as a result of the change.
And this is way more complicated than it needs to be, and it also neglects to
call out that the deferred logic is needed for a bug fix. IIUC:
Batch reprogramming PMU counters by setting KVM_REQ_PMU and thus deferring
reprogramming to kvm_pmu_handle_event() to avoid reprogramming a counter
multiple times during a single VM-Exit.
Deferring programming will also allow KVM to fix a bug where immediately
reprogramming a counter can result in sleeping (taking a mutex) while
interrupts are disabled in the VM-Exit fastpath.
> Opportunistically update related comments to avoid misunderstandings.
>
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index d9b9a0f0db17..6940cbeee54d 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -101,7 +101,7 @@ static inline void __kvm_perf_overflow(struct kvm_pmc *pmc, bool in_pmi)
> struct kvm_pmu *pmu = pmc_to_pmu(pmc);
> bool skip_pmi = false;
>
> - /* Ignore counters that have been reprogrammed already. */
> + /* Ignore counters that have not been reprogrammed. */
Eh, just drop this comment, it's fairly obvious what the code is doing and your
suggested comment is wrong in the sense that the counters haven't actually been
reprogrammed, i.e. it should be:
/* Ignore counters that don't need to be reprogrammed. */
but IMO that's pretty obvious.
> if (test_and_set_bit(pmc->idx, pmu->reprogram_pmi))
> return;
>
> @@ -293,7 +293,7 @@ static bool check_pmu_event_filter(struct kvm_pmc *pmc)
> return allow_event;
> }
>
> -void reprogram_counter(struct kvm_pmc *pmc)
> +static void __reprogram_counter(struct kvm_pmc *pmc)
This is misleading. Double-underscore variants are usually inner helpers, whereas
these have a different relationship.
Instead of renaming reprogram_counter(), how about introducing
kvm_pmu_request_counter_reprogam()
to make it obvious that KVM is _requesting_ a reprogram and not actually doing
the reprogram.
* Re: [PATCH RESEND v2 6/8] KVM: x86/pmu: Defer counter emulated overflow via pmc->stale_counter
2022-08-23 9:32 ` [PATCH RESEND v2 6/8] KVM: x86/pmu: Defer counter emulated overflow via pmc->stale_counter Like Xu
@ 2022-08-30 17:59 ` Sean Christopherson
0 siblings, 0 replies; 16+ messages in thread
From: Sean Christopherson @ 2022-08-30 17:59 UTC (permalink / raw)
To: Like Xu; +Cc: Paolo Bonzini, Jim Mattson, linux-kernel, kvm, Wanpeng Li
On Tue, Aug 23, 2022, Like Xu wrote:
> From: Like Xu <likexu@tencent.com>
>
> There are contextual restrictions on the functions that can be called
> in the *_exit_handlers_fastpath path, for example calling
> pmc_reprogram_counter() brings up a host complaint like:
State the actual problem instead of forcing the reader to decipher that from the
stacktrace.
> [*] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:580
> [*] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 2981888, name: CPU 15/KVM
> [*] preempt_count: 1, expected: 0
> [*] RCU nest depth: 0, expected: 0
> [*] INFO: lockdep is turned off.
> [*] irq event stamp: 0
> [*] hardirqs last enabled at (0): [<0000000000000000>] 0x0
> [*] hardirqs last disabled at (0): [<ffffffff8121222a>] copy_process+0x146a/0x62d0
> [*] softirqs last enabled at (0): [<ffffffff81212269>] copy_process+0x14a9/0x62d0
> [*] softirqs last disabled at (0): [<0000000000000000>] 0x0
> [*] Preemption disabled at:
> [*] [<ffffffffc2063fc1>] vcpu_enter_guest+0x1001/0x3dc0 [kvm]
> [*] CPU: 17 PID: 2981888 Comm: CPU 15/KVM Kdump: 5.19.0-rc1-g239111db364c-dirty #2
> [*] Call Trace:
> [*] <TASK>
> [*] dump_stack_lvl+0x6c/0x9b
> [*] __might_resched.cold+0x22e/0x297
> [*] __mutex_lock+0xc0/0x23b0
> [*] perf_event_ctx_lock_nested+0x18f/0x340
> [*] perf_event_pause+0x1a/0x110
> [*] reprogram_counter+0x2af/0x1490 [kvm]
> [*] kvm_pmu_trigger_event+0x429/0x950 [kvm]
> [*] kvm_skip_emulated_instruction+0x48/0x90 [kvm]
> [*] handle_fastpath_set_msr_irqoff+0x349/0x3b0 [kvm]
> [*] vmx_vcpu_run+0x268e/0x3b80 [kvm_intel]
> [*] vcpu_enter_guest+0x1d22/0x3dc0 [kvm]
>
> A new stale_counter field is introduced to keep this part of the semantics
> invariant. It records the current counter value and it's used to determine
> whether to inject an emulated overflow interrupt in the later
> kvm_pmu_handle_event(), given that the internal count value from its
> perf_event has not been added to pmc->counter in time, or the guest
> will update the value of a running counter directly.
Describe what the change is at a high level, don't give a play-by-play of the
code changes.
Defer reprogramming counters and handling overflow via KVM_REQ_PMU
when incrementing counters. KVM skips emulated WRMSR in the VM-Exit
fastpath, the fastpath runs with IRQs disabled, skipping instructions
can increment and reprogram counters, reprogramming counters can
sleep, and sleeping is disallowed while IRQs are disabled.
<stack trace>
Add a field to kvm_pmc to track the previous counter value in order
to defer overflow detection to kvm_pmu_handle_event() (reprogramming
must be done before handling overflow).
> Opportunistically shrink sizeof(struct kvm_pmc) a bit.
>
> Suggested-by: Wanpeng Li <wanpengli@tencent.com>
> Fixes: 9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions")
> Signed-off-by: Like Xu <likexu@tencent.com>
> ---
> arch/x86/include/asm/kvm_host.h | 5 +++--
> arch/x86/kvm/pmu.c | 15 ++++++++-------
> arch/x86/kvm/svm/pmu.c | 2 +-
> arch/x86/kvm/vmx/pmu_intel.c | 4 ++--
> 4 files changed, 14 insertions(+), 12 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 4e568a7ef464..ffd982bf015d 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -488,7 +488,10 @@ enum pmc_type {
> struct kvm_pmc {
> enum pmc_type type;
> u8 idx;
> + bool is_paused;
> + bool intr;
> u64 counter;
> + u64 stale_counter;
Use "prev_counter", "stale" makes it sound like a flag, e.g. "this counter is
stale".
> u64 eventsel;
> struct perf_event *perf_event;
> struct kvm_vcpu *vcpu;
> @@ -498,8 +501,6 @@ struct kvm_pmc {
> * ctrl value for fixed counters.
> */
> u64 current_config;
> - bool is_paused;
> - bool intr;
> };
>
> #define KVM_PMC_MAX_FIXED 3
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 6940cbeee54d..45d062cb1dd5 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -350,6 +350,12 @@ void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)
> }
>
> __reprogram_counter(pmc);
> +
> + if (pmc->stale_counter) {
This check is unnecessary. The values are unsigned, so counter can't be less than
the previous value if the previous value was '0'.
> + if (pmc->counter < pmc->stale_counter)
> + __kvm_perf_overflow(pmc, false);
> + pmc->stale_counter = 0;
> + }
* Re: [PATCH RESEND v2 7/8] KVM: x86/svm/pmu: Direct access pmu->gp_counter[] to implement amd_*_to_pmc()
2022-08-23 9:32 ` [PATCH RESEND v2 7/8] KVM: x86/svm/pmu: Direct access pmu->gp_counter[] to implement amd_*_to_pmc() Like Xu
@ 2022-08-30 18:07 ` Sean Christopherson
0 siblings, 0 replies; 16+ messages in thread
From: Sean Christopherson @ 2022-08-30 18:07 UTC (permalink / raw)
To: Like Xu; +Cc: Paolo Bonzini, Jim Mattson, linux-kernel, kvm
On Tue, Aug 23, 2022, Like Xu wrote:
> From: Like Xu <likexu@tencent.com>
>
> AMD only has gp counters, whose corresponding vPMCs are initialised
> and stored in pmu->gp_counter[] in order of idx, so we can access this
Avoid pronouns, and state what the patch is doing, not what it _can_ do. IIUC:
Access PMU counters on AMD by directly indexing the array of general
purpose counters instead of translating the PMC index to an MSR index.
AMD only supports gp counters, there's no need to translate a PMC index
to an MSR index and back to a PMC index.
> array directly based on any valid pmc->idx, without any help from other
> interfaces at all. The amd_rdpmc_ecx_to_pmc() can now reuse this part
> of the code quite naturally.
>
> Opportunistically apply array_index_nospec() to reduce the attack
> surface for speculative execution and remove the dead code.
>
> Signed-off-by: Like Xu <likexu@tencent.com>
> ---
> arch/x86/kvm/svm/pmu.c | 41 +++++------------------------------------
> 1 file changed, 5 insertions(+), 36 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
> index e9c66dd659a6..e57eb0555a04 100644
> --- a/arch/x86/kvm/svm/pmu.c
> +++ b/arch/x86/kvm/svm/pmu.c
> @@ -33,23 +33,6 @@ enum index {
> INDEX_ERROR,
> };
>
> -static unsigned int get_msr_base(struct kvm_pmu *pmu, enum pmu_type type)
> -{
> - struct kvm_vcpu *vcpu = pmu_to_vcpu(pmu);
> -
> - if (guest_cpuid_has(vcpu, X86_FEATURE_PERFCTR_CORE)) {
> - if (type == PMU_TYPE_COUNTER)
> - return MSR_F15H_PERF_CTR;
> - else
> - return MSR_F15H_PERF_CTL;
> - } else {
> - if (type == PMU_TYPE_COUNTER)
> - return MSR_K7_PERFCTR0;
> - else
> - return MSR_K7_EVNTSEL0;
> - }
> -}
> -
> static enum index msr_to_index(u32 msr)
> {
> switch (msr) {
> @@ -141,18 +124,12 @@ static bool amd_pmc_is_enabled(struct kvm_pmc *pmc)
>
> static struct kvm_pmc *amd_pmc_idx_to_pmc(struct kvm_pmu *pmu, int pmc_idx)
> {
> - unsigned int base = get_msr_base(pmu, PMU_TYPE_COUNTER);
> - struct kvm_vcpu *vcpu = pmu_to_vcpu(pmu);
> + unsigned int num_counters = pmu->nr_arch_gp_counters;
>
> - if (guest_cpuid_has(vcpu, X86_FEATURE_PERFCTR_CORE)) {
> - /*
> - * The idx is contiguous. The MSRs are not. The counter MSRs
> - * are interleaved with the event select MSRs.
> - */
> - pmc_idx *= 2;
> - }
> + if (pmc_idx >= num_counters)
> + return NULL;
>
> - return get_gp_pmc_amd(pmu, base + pmc_idx, PMU_TYPE_COUNTER);
> + return &pmu->gp_counters[array_index_nospec(pmc_idx, num_counters)];
> }
>
> static bool amd_is_valid_rdpmc_ecx(struct kvm_vcpu *vcpu, unsigned int idx)
> @@ -168,15 +145,7 @@ static bool amd_is_valid_rdpmc_ecx(struct kvm_vcpu *vcpu, unsigned int idx)
> static struct kvm_pmc *amd_rdpmc_ecx_to_pmc(struct kvm_vcpu *vcpu,
> unsigned int idx, u64 *mask)
> {
> - struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> - struct kvm_pmc *counters;
> -
> - idx &= ~(3u << 30);
> - if (idx >= pmu->nr_arch_gp_counters)
> - return NULL;
> - counters = pmu->gp_counters;
> -
> - return &counters[idx];
> + return amd_pmc_idx_to_pmc(vcpu_to_pmu(vcpu), idx & ~(3u << 30));
> }
>
> static bool amd_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
> --
> 2.37.2
>
* Re: [PATCH RESEND v2 8/8] KVM: x86/svm/pmu: Rewrite get_gp_pmc_amd() for more counters scalability
2022-08-23 9:32 ` [PATCH RESEND v2 8/8] KVM: x86/svm/pmu: Rewrite get_gp_pmc_amd() for more counters scalability Like Xu
@ 2022-08-30 18:24 ` Sean Christopherson
0 siblings, 0 replies; 16+ messages in thread
From: Sean Christopherson @ 2022-08-30 18:24 UTC (permalink / raw)
To: Like Xu; +Cc: Paolo Bonzini, Jim Mattson, linux-kernel, kvm
On Tue, Aug 23, 2022, Like Xu wrote:
> static inline struct kvm_pmc *get_gp_pmc_amd(struct kvm_pmu *pmu, u32 msr,
> enum pmu_type type)
> {
> struct kvm_vcpu *vcpu = pmu_to_vcpu(pmu);
> + unsigned int idx;
>
> if (!vcpu->kvm->arch.enable_pmu)
> return NULL;
>
> switch (msr) {
> - case MSR_F15H_PERF_CTL0:
> - case MSR_F15H_PERF_CTL1:
> - case MSR_F15H_PERF_CTL2:
> - case MSR_F15H_PERF_CTL3:
> - case MSR_F15H_PERF_CTL4:
> - case MSR_F15H_PERF_CTL5:
> + case MSR_F15H_PERF_CTL0 ... MSR_F15H_PERF_CTR5:
> if (!guest_cpuid_has(vcpu, X86_FEATURE_PERFCTR_CORE))
> return NULL;
> - fallthrough;
> + idx = (unsigned int)((msr - MSR_F15H_PERF_CTL0) / 2);
> + if ((msr == (MSR_F15H_PERF_CTL0 + 2 * idx)) !=
> + (type == PMU_TYPE_EVNTSEL))
This is more complicated than it needs to be. CTLn is even, CTRn is odd (I think
I got the logic right, but the below is untested).
And this all needs a comment.
/*
* Each PMU counter has a pair of CTL and CTR MSRs. CTLn MSRs
* (accessed via EVNTSEL) are even, CTRn MSRs are odd.
*/
idx = (unsigned int)((msr - MSR_F15H_PERF_CTL0) / 2);
if (!(msr & 0x1) != (type == PMU_TYPE_EVNTSEL))
return NULL;
> + return NULL;
> + break;
* Re: [PATCH RESEND v2 0/8] x86/pmu: Corner cases fixes and optimization
2022-08-30 17:29 ` [PATCH RESEND v2 0/8] x86/pmu: Corner cases fixes and optimization Sean Christopherson
@ 2022-08-31 8:05 ` Like Xu
0 siblings, 0 replies; 16+ messages in thread
From: Like Xu @ 2022-08-31 8:05 UTC (permalink / raw)
To: Sean Christopherson; +Cc: Paolo Bonzini, Jim Mattson, linux-kernel, kvm
On 31/8/2022 1:29 am, Sean Christopherson wrote:
> On Tue, Aug 23, 2022, Like Xu wrote:
>> Well-designed tests can help us find more bugs, especially when
>> the test steps differ from the Linux kernel behaviour in terms of the
>> timing of access to virtualized hw resources.
>>
>> In this patch series, there are three small optimizations (006/007/008),
>> one hardware surprise (001), and most of these fixes have stepped
>> on my little toes.
>>
>> Please feel free to run tests, add more or share comments.
>>
>> Previous:
>> https://lore.kernel.org/kvm/20220721103549.49543-1-likexu@tencent.com/
>> https://lore.kernel.org/kvm/20220803130124.72340-1-likexu@tencent.com/
>>
>> V2 -> V2 RESEND Changelog:
>> - The "pebs_capable" fix has been merged into the tip/perf tree;
>> - Moved the other two AMD vPMU optimizations here;
>
> This is not a RESEND. These things very much matter because I'm sitting here
> trying to figure out whether I need to review this "v2", which is really v3, or
> whether I can just review the real v2.
>
> Don't add "RESEND" when you are submitting a modified version of your
> patch or patch series - "RESEND" only applies to resubmission of a
> patch or patch series which have not been modified in any way from the
> previous submission.
Roger that. Thanks for your detailed comments; all of them will be applied in V3.
The new AMD vPMU testcases[1] will also help guard future related changes.
[1] https://lore.kernel.org/kvm/20220819110939.78013-1-likexu@tencent.com/
Thanks,
Like Xu
2022-08-23 9:32 [PATCH RESEND v2 0/8] x86/pmu: Corner cases fixes and optimization Like Xu
2022-08-23 9:32 ` [PATCH RESEND v2 1/8] perf/x86/core: Completely disable guest PEBS via guest's global_ctrl Like Xu
2022-08-30 17:40 ` Sean Christopherson
2022-08-23 9:32 ` [PATCH RESEND v2 2/8] KVM: x86/pmu: Avoid setting BIT_ULL(-1) to pmu->host_cross_mapped_mask Like Xu
2022-08-23 9:32 ` [PATCH RESEND v2 3/8] KVM: x86/pmu: Don't generate PEBS records for emulated instructions Like Xu
2022-08-23 9:32 ` [PATCH RESEND v2 4/8] KVM: x86/pmu: Avoid using PEBS perf_events for normal counters Like Xu
2022-08-23 9:32 ` [PATCH RESEND v2 5/8] KVM: x86/pmu: Defer reprogram_counter() to kvm_pmu_handle_event() Like Xu
2022-08-30 17:50 ` Sean Christopherson
2022-08-23 9:32 ` [PATCH RESEND v2 6/8] KVM: x86/pmu: Defer counter emulated overflow via pmc->stale_counter Like Xu
2022-08-30 17:59 ` Sean Christopherson
2022-08-23 9:32 ` [PATCH RESEND v2 7/8] KVM: x86/svm/pmu: Direct access pmu->gp_counter[] to implement amd_*_to_pmc() Like Xu
2022-08-30 18:07 ` Sean Christopherson
2022-08-23 9:32 ` [PATCH RESEND v2 8/8] KVM: x86/svm/pmu: Rewrite get_gp_pmc_amd() for more counters scalability Like Xu
2022-08-30 18:24 ` Sean Christopherson
2022-08-30 17:29 ` [PATCH RESEND v2 0/8] x86/pmu: Corner cases fixes and optimization Sean Christopherson
2022-08-31 8:05 ` Like Xu