* Re: [PATCH v4 3/4] perf/core: Remove pmu linear searching code [not found] ` <20230504110003.2548-4-ravi.bangoria@amd.com> @ 2023-05-24 21:41 ` Nathan Chancellor 2023-05-25 5:16 ` Ravi Bangoria 0 siblings, 1 reply; 19+ messages in thread From: Nathan Chancellor @ 2023-05-24 21:41 UTC (permalink / raw) To: Ravi Bangoria Cc: peterz, namhyung, eranian, acme, mark.rutland, jolsa, irogers, bp, kan.liang, adrian.hunter, maddy, x86, linux-perf-users, linux-kernel, sandipan.das, ananth.narayan, santosh.shukla, maz, oliver.upton, kvmarm Hi Ravi, + arm64 KVM folks On Thu, May 04, 2023 at 04:30:02PM +0530, Ravi Bangoria wrote: > Searching for the right pmu by iterating over all pmus is no longer > required since all pmus now *must* be present in the 'pmu_idr' list. > So, remove linear searching code. > > Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> > --- > kernel/events/core.c | 37 +++++++++++++------------------------ > 1 file changed, 13 insertions(+), 24 deletions(-) > > diff --git a/kernel/events/core.c b/kernel/events/core.c > index 0695bb9fbbb6..eba2b8595115 100644 > --- a/kernel/events/core.c > +++ b/kernel/events/core.c > @@ -11630,38 +11630,27 @@ static struct pmu *perf_init_event(struct perf_event *event) > } > > again: > + ret = -ENOENT; > rcu_read_lock(); > pmu = idr_find(&pmu_idr, type); > rcu_read_unlock(); > - if (pmu) { > - if (event->attr.type != type && type != PERF_TYPE_RAW && > - !(pmu->capabilities & PERF_PMU_CAP_EXTENDED_HW_TYPE)) > - goto fail; > - > - ret = perf_try_init_event(pmu, event); > - if (ret == -ENOENT && event->attr.type != type && !extended_type) { > - type = event->attr.type; > - goto again; > - } > + if (!pmu) > + goto fail; > > - if (ret) > - pmu = ERR_PTR(ret); > + if (event->attr.type != type && type != PERF_TYPE_RAW && > + !(pmu->capabilities & PERF_PMU_CAP_EXTENDED_HW_TYPE)) > + goto fail; > > - goto unlock; > + ret = perf_try_init_event(pmu, event); > + if (ret == -ENOENT && event->attr.type != type && !extended_type) { > + type = event->attr.type; > + goto again; > } > > - list_for_each_entry_rcu(pmu, &pmus, entry, lockdep_is_held(&pmus_srcu)) { > - ret = perf_try_init_event(pmu, event); > - if (!ret) > - goto unlock; > - > - if (ret != -ENOENT) { > - pmu = ERR_PTR(ret); > - goto unlock; > - } > - } > fail: > - pmu = ERR_PTR(-ENOENT); > + if (ret) > + pmu = ERR_PTR(ret); > + > unlock: > srcu_read_unlock(&pmus_srcu, idx); > > -- > 2.40.0 > My apologies if this has already been reported or fixed already, I did a search of lore.kernel.org and did not find anything. This patch as commit 9551fbb64d09 ("perf/core: Remove pmu linear searching code") in -next breaks starting QEMU with KVM enabled on two of my arm64 machines: $ qemu-system-aarch64 \ -display none \ -nodefaults \ -machine virt,gic-version=max \ -append 'console=ttyAMA0 earlycon' \ -kernel arch/arm64/boot/Image.gz \ -initrd rootfs.cpio \ -cpu host \ -enable-kvm \ -m 512m \ -smp 8 \ -serial mon:stdio qemu-system-aarch64: PMU: KVM_SET_DEVICE_ATTR: No such device qemu-system-aarch64: failed to set irq for PMU In the kernel log, I see [ 42.944952] kvm: pmu event creation failed -2 I am not sure if this issue is unexpected as a result of this change or if there is something that needs to change on the arm64 KVM side (it appears the kernel message comes from arch/arm64/kvm/pmu-emul.c). If there is any further information I can provide or patches I can test, I am more than happy to do so. Cheers, Nathan # bad: [cf09e328589a2ed7f6c8d90f2edb697fb4f8a96b] Add linux-next specific files for 20230524 # good: [27e462c8fad4bf04ec4f81f8539ce6fa947ead3a] Merge tag 'xtensa-20230523' of https://github.com/jcmvbkbc/linux-xtensa git bisect start 'cf09e328589a2ed7f6c8d90f2edb697fb4f8a96b' '27e462c8fad4bf04ec4f81f8539ce6fa947ead3a' # good: [a20d8ab9e26daaeeaf971139b736981cf164ab0a] Merge branch 'for-linux-next' of git://anongit.freedesktop.org/drm/drm-misc git bisect good a20d8ab9e26daaeeaf971139b736981cf164ab0a # good: [2714032dfd641b22695e14efd5f9dff08a5e3245] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input.git git bisect good 2714032dfd641b22695e14efd5f9dff08a5e3245 # bad: [b2bc2854ec87557033538aa9290f70b9141a6653] Merge branch 'for-leds-next' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/leds.git git bisect bad b2bc2854ec87557033538aa9290f70b9141a6653 # good: [20d4044f23c7724020b6c7d34ccee9bb929d1078] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git git bisect good 20d4044f23c7724020b6c7d34ccee9bb929d1078 # bad: [c3cab2fce7b318ee2edf148b1436f3a3864ae773] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git git bisect bad c3cab2fce7b318ee2edf148b1436f3a3864ae773 # bad: [af75e6871092fc1f9fa039132d5667f1e0a47a0a] Merge branch into tip/master: 'sched/core' git bisect bad af75e6871092fc1f9fa039132d5667f1e0a47a0a # good: [c50a7b40a2f50403c3f58f1c27a85e4c5d2e0865] Merge branch into tip/master: 'objtool/core' git bisect good c50a7b40a2f50403c3f58f1c27a85e4c5d2e0865 # good: [519fabc7aaba3f0847cf37d5f9a5740c370eb777] psi: remove 500ms min window size limitation for triggers git bisect good 519fabc7aaba3f0847cf37d5f9a5740c370eb777 # bad: [b85c6694924e9f09a40a2e0a3798f3945eaa6fda] Merge branch into tip/master: 'perf/core' git bisect bad b85c6694924e9f09a40a2e0a3798f3945eaa6fda # bad: [9551fbb64d094cc105964716224adeb7765df8fd] perf/core: Remove pmu linear searching code git bisect bad 9551fbb64d094cc105964716224adeb7765df8fd # good: [2fad201fe38ff9a692acedb1990ece2c52a29f95] perf/ibs: Fix interface via core pmu events git bisect good 2fad201fe38ff9a692acedb1990ece2c52a29f95 # first bad commit: [9551fbb64d094cc105964716224adeb7765df8fd] perf/core: Remove pmu linear searching code ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v4 3/4] perf/core: Remove pmu linear searching code 2023-05-24 21:41 ` [PATCH v4 3/4] perf/core: Remove pmu linear searching code Nathan Chancellor @ 2023-05-25 5:16 ` Ravi Bangoria 2023-05-25 7:11 ` Oliver Upton 0 siblings, 1 reply; 19+ messages in thread From: Ravi Bangoria @ 2023-05-25 5:16 UTC (permalink / raw) To: Nathan Chancellor Cc: peterz, namhyung, eranian, acme, mark.rutland, jolsa, irogers, bp, kan.liang, adrian.hunter, maddy, x86, linux-perf-users, linux-kernel, sandipan.das, ananth.narayan, santosh.shukla, maz, oliver.upton, kvmarm, Ravi Bangoria Hi Nathan, On 25-May-23 3:11 AM, Nathan Chancellor wrote: > Hi Ravi, > > + arm64 KVM folks > > On Thu, May 04, 2023 at 04:30:02PM +0530, Ravi Bangoria wrote: >> Searching for the right pmu by iterating over all pmus is no longer >> required since all pmus now *must* be present in the 'pmu_idr' list. >> So, remove linear searching code. >> >> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> >> --- >> kernel/events/core.c | 37 +++++++++++++------------------------ >> 1 file changed, 13 insertions(+), 24 deletions(-) >> >> diff --git a/kernel/events/core.c b/kernel/events/core.c >> index 0695bb9fbbb6..eba2b8595115 100644 >> --- a/kernel/events/core.c >> +++ b/kernel/events/core.c >> @@ -11630,38 +11630,27 @@ static struct pmu *perf_init_event(struct perf_event *event) >> } >> >> again: >> + ret = -ENOENT; >> rcu_read_lock(); >> pmu = idr_find(&pmu_idr, type); >> rcu_read_unlock(); >> - if (pmu) { >> - if (event->attr.type != type && type != PERF_TYPE_RAW && >> - !(pmu->capabilities & PERF_PMU_CAP_EXTENDED_HW_TYPE)) >> - goto fail; >> - >> - ret = perf_try_init_event(pmu, event); >> - if (ret == -ENOENT && event->attr.type != type && !extended_type) { >> - type = event->attr.type; >> - goto again; >> - } >> + if (!pmu) >> + goto fail; >> >> - if (ret) >> - pmu = ERR_PTR(ret); >> + if (event->attr.type != type && type != PERF_TYPE_RAW && >> + !(pmu->capabilities & PERF_PMU_CAP_EXTENDED_HW_TYPE)) >> + goto fail; >> >> - goto unlock; >> + ret = perf_try_init_event(pmu, event); >> + if (ret == -ENOENT && event->attr.type != type && !extended_type) { >> + type = event->attr.type; >> + goto again; >> } >> >> - list_for_each_entry_rcu(pmu, &pmus, entry, lockdep_is_held(&pmus_srcu)) { >> - ret = perf_try_init_event(pmu, event); >> - if (!ret) >> - goto unlock; >> - >> - if (ret != -ENOENT) { >> - pmu = ERR_PTR(ret); >> - goto unlock; >> - } >> - } >> fail: >> - pmu = ERR_PTR(-ENOENT); >> + if (ret) >> + pmu = ERR_PTR(ret); >> + >> unlock: >> srcu_read_unlock(&pmus_srcu, idx); >> >> -- >> 2.40.0 >> > > My apologies if this has already been reported or fixed already, I did a > search of lore.kernel.org and did not find anything. This patch as > commit 9551fbb64d09 ("perf/core: Remove pmu linear searching code") in > -next breaks starting QEMU with KVM enabled on two of my arm64 machines: > > $ qemu-system-aarch64 \ > -display none \ > -nodefaults \ > -machine virt,gic-version=max \ > -append 'console=ttyAMA0 earlycon' \ > -kernel arch/arm64/boot/Image.gz \ > -initrd rootfs.cpio \ > -cpu host \ > -enable-kvm \ > -m 512m \ > -smp 8 \ > -serial mon:stdio > qemu-system-aarch64: PMU: KVM_SET_DEVICE_ATTR: No such device > qemu-system-aarch64: failed to set irq for PMU > > In the kernel log, I see > > [ 42.944952] kvm: pmu event creation failed -2 > > I am not sure if this issue is unexpected as a result of this change or > if there is something that needs to change on the arm64 KVM side (it > appears the kernel message comes from arch/arm64/kvm/pmu-emul.c). Thanks for reporting it. Based on these detail, I feel the pmu registration failed in the host, most probably because pmu driver did not pass pmu name while calling perf_pmu_register(). Consequently kvm also failed while trying to use it for guest. Can you please check host kernel logs. I'm sorry but I neither have Arm board to try myself not I'm familiar with the Arm architecture. So I'll need your help to diagnose it. Ravi ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v4 3/4] perf/core: Remove pmu linear searching code 2023-05-25 5:16 ` Ravi Bangoria @ 2023-05-25 7:11 ` Oliver Upton 2023-05-25 14:20 ` Peter Zijlstra 2023-05-25 15:55 ` Nathan Chancellor 0 siblings, 2 replies; 19+ messages in thread From: Oliver Upton @ 2023-05-25 7:11 UTC (permalink / raw) To: Ravi Bangoria Cc: Nathan Chancellor, peterz, namhyung, eranian, acme, mark.rutland, jolsa, irogers, bp, kan.liang, adrian.hunter, maddy, x86, linux-perf-users, linux-kernel, sandipan.das, ananth.narayan, santosh.shukla, maz, kvmarm On Thu, May 25, 2023 at 10:46:01AM +0530, Ravi Bangoria wrote: > On 25-May-23 3:11 AM, Nathan Chancellor wrote: > > My apologies if this has already been reported or fixed already, I did a > > search of lore.kernel.org and did not find anything. This patch as > > commit 9551fbb64d09 ("perf/core: Remove pmu linear searching code") in > > -next breaks starting QEMU with KVM enabled on two of my arm64 machines: > > > > $ qemu-system-aarch64 \ > > -display none \ > > -nodefaults \ > > -machine virt,gic-version=max \ > > -append 'console=ttyAMA0 earlycon' \ > > -kernel arch/arm64/boot/Image.gz \ > > -initrd rootfs.cpio \ > > -cpu host \ > > -enable-kvm \ > > -m 512m \ > > -smp 8 \ > > -serial mon:stdio > > qemu-system-aarch64: PMU: KVM_SET_DEVICE_ATTR: No such device > > qemu-system-aarch64: failed to set irq for PMU > > > > In the kernel log, I see > > > > [ 42.944952] kvm: pmu event creation failed -2 > > > > I am not sure if this issue is unexpected as a result of this change or > > if there is something that needs to change on the arm64 KVM side (it > > appears the kernel message comes from arch/arm64/kvm/pmu-emul.c). > > Thanks for reporting it. > > Based on these detail, I feel the pmu registration failed in the host, > most probably because pmu driver did not pass pmu name while calling > perf_pmu_register(). Consequently kvm also failed while trying to use > it for guest. Can you please check host kernel logs. The PMUv3 driver does pass a name, but it relies on getting back an allocated pmu id as @type is -1 in the call to perf_pmu_register(). What actually broke is how KVM probes for a default core PMU to use for a guest. kvm_pmu_probe_armpmu() creates a counter w/ PERF_TYPE_RAW and reads the pmu from the returned perf_event. The linear search had the effect of eventually stumbling on the correct core PMU and succeeding. Perf folks: is this WAI for heterogenous systems? Either way, the whole KVM end of this scheme is a bit clunky, and I believe it to be unneccessary at this point as we maintain a list of core PMU instances that KVM is able to virtualize. We can just walk that to find a default PMU to use. Not seeing any issues on -next with the below diff. If this works for folks I can actually wrap it up in a patch and send it out. diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c index 45727d50d18d..cbc0b662b7f8 100644 --- a/arch/arm64/kvm/pmu-emul.c +++ b/arch/arm64/kvm/pmu-emul.c @@ -694,47 +694,26 @@ void kvm_host_pmu_init(struct arm_pmu *pmu) static struct arm_pmu *kvm_pmu_probe_armpmu(void) { - struct perf_event_attr attr = { }; - struct perf_event *event; - struct arm_pmu *pmu = NULL; - - /* - * Create a dummy event that only counts user cycles. As we'll never - * leave this function with the event being live, it will never - * count anything. But it allows us to probe some of the PMU - * details. Yes, this is terrible. - */ - attr.type = PERF_TYPE_RAW; - attr.size = sizeof(attr); - attr.pinned = 1; - attr.disabled = 0; - attr.exclude_user = 0; - attr.exclude_kernel = 1; - attr.exclude_hv = 1; - attr.exclude_host = 1; - attr.config = ARMV8_PMUV3_PERFCTR_CPU_CYCLES; - attr.sample_period = GENMASK(63, 0); + struct arm_pmu *arm_pmu = NULL, *tmp; + struct arm_pmu_entry *entry; + int cpu; - event = perf_event_create_kernel_counter(&attr, -1, current, - kvm_pmu_perf_overflow, &attr); + mutex_lock(&arm_pmus_lock); + cpu = get_cpu(); - if (IS_ERR(event)) { - pr_err_once("kvm: pmu event creation failed %ld\n", - PTR_ERR(event)); - return NULL; - } + list_for_each_entry(entry, &arm_pmus, entry) { + tmp = entry->arm_pmu; - if (event->pmu) { - pmu = to_arm_pmu(event->pmu); - if (pmu->pmuver == ID_AA64DFR0_EL1_PMUVer_NI || - pmu->pmuver == ID_AA64DFR0_EL1_PMUVer_IMP_DEF) - pmu = NULL; + if (cpumask_test_cpu(cpu, &tmp->supported_cpus)) { + arm_pmu = tmp; + break; + } } - perf_event_disable(event); - perf_event_release_kernel(event); + put_cpu(); + mutex_unlock(&arm_pmus_lock); - return pmu; + return arm_pmu; } u64 kvm_pmu_get_pmceid(struct kvm_vcpu *vcpu, bool pmceid1) -- Thanks, Oliver ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH v4 3/4] perf/core: Remove pmu linear searching code 2023-05-25 7:11 ` Oliver Upton @ 2023-05-25 14:20 ` Peter Zijlstra 2023-05-25 15:56 ` Oliver Upton 2023-05-25 15:55 ` Nathan Chancellor 1 sibling, 1 reply; 19+ messages in thread From: Peter Zijlstra @ 2023-05-25 14:20 UTC (permalink / raw) To: Oliver Upton Cc: Ravi Bangoria, Nathan Chancellor, namhyung, eranian, acme, mark.rutland, jolsa, irogers, bp, kan.liang, adrian.hunter, maddy, x86, linux-perf-users, linux-kernel, sandipan.das, ananth.narayan, santosh.shukla, maz, kvmarm On Thu, May 25, 2023 at 07:11:41AM +0000, Oliver Upton wrote: > The PMUv3 driver does pass a name, but it relies on getting back an > allocated pmu id as @type is -1 in the call to perf_pmu_register(). > > What actually broke is how KVM probes for a default core PMU to use for > a guest. kvm_pmu_probe_armpmu() creates a counter w/ PERF_TYPE_RAW and > reads the pmu from the returned perf_event. The linear search had the > effect of eventually stumbling on the correct core PMU and succeeding. > > Perf folks: is this WAI for heterogenous systems? TBH, I'm not sure. hetero and virt don't mix very well AFAIK and I'm not sure what ARM64 does here. IIRC the only way is to hard affine things; that is, force vCPU of 'type' to the pCPU mask of 'type' CPUs. If you don't do that; or let userspace 'override' that, things go sideways *real* fast. Mark gonna have to look at this. > Either way, the whole KVM end of this scheme is a bit clunky, and I > believe it to be unneccessary at this point as we maintain a list of > core PMU instances that KVM is able to virtualize. We can just walk > that to find a default PMU to use. > > Not seeing any issues on -next with the below diff. If this works for > folks I can actually wrap it up in a patch and send it out. > > diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c > index 45727d50d18d..cbc0b662b7f8 100644 > --- a/arch/arm64/kvm/pmu-emul.c > +++ b/arch/arm64/kvm/pmu-emul.c > @@ -694,47 +694,26 @@ void kvm_host_pmu_init(struct arm_pmu *pmu) > > static struct arm_pmu *kvm_pmu_probe_armpmu(void) > { > - struct perf_event_attr attr = { }; > - struct perf_event *event; > - struct arm_pmu *pmu = NULL; > - > - /* > - * Create a dummy event that only counts user cycles. As we'll never > - * leave this function with the event being live, it will never > - * count anything. But it allows us to probe some of the PMU > - * details. Yes, this is terrible. > - */ > - attr.type = PERF_TYPE_RAW; > - attr.size = sizeof(attr); > - attr.pinned = 1; > - attr.disabled = 0; > - attr.exclude_user = 0; > - attr.exclude_kernel = 1; > - attr.exclude_hv = 1; > - attr.exclude_host = 1; > - attr.config = ARMV8_PMUV3_PERFCTR_CPU_CYCLES; > - attr.sample_period = GENMASK(63, 0); > + struct arm_pmu *arm_pmu = NULL, *tmp; > + struct arm_pmu_entry *entry; > + int cpu; > > - event = perf_event_create_kernel_counter(&attr, -1, current, > - kvm_pmu_perf_overflow, &attr); > + mutex_lock(&arm_pmus_lock); > + cpu = get_cpu(); > > - if (IS_ERR(event)) { > - pr_err_once("kvm: pmu event creation failed %ld\n", > - PTR_ERR(event)); > - return NULL; > - } > + list_for_each_entry(entry, &arm_pmus, entry) { > + tmp = entry->arm_pmu; > > - if (event->pmu) { > - pmu = to_arm_pmu(event->pmu); > - if (pmu->pmuver == ID_AA64DFR0_EL1_PMUVer_NI || > - pmu->pmuver == ID_AA64DFR0_EL1_PMUVer_IMP_DEF) > - pmu = NULL; > + if (cpumask_test_cpu(cpu, &tmp->supported_cpus)) { > + arm_pmu = tmp; > + break; > + } > } > > - perf_event_disable(event); > - perf_event_release_kernel(event); > + put_cpu(); > + mutex_unlock(&arm_pmus_lock); > > - return pmu; > + return arm_pmu; > } ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v4 3/4] perf/core: Remove pmu linear searching code 2023-05-25 14:20 ` Peter Zijlstra @ 2023-05-25 15:56 ` Oliver Upton 2023-05-26 23:00 ` Ian Rogers 0 siblings, 1 reply; 19+ messages in thread From: Oliver Upton @ 2023-05-25 15:56 UTC (permalink / raw) To: Peter Zijlstra Cc: Ravi Bangoria, Nathan Chancellor, namhyung, eranian, acme, mark.rutland, jolsa, irogers, bp, kan.liang, adrian.hunter, maddy, x86, linux-perf-users, linux-kernel, sandipan.das, ananth.narayan, santosh.shukla, maz, kvmarm On Thu, May 25, 2023 at 04:20:31PM +0200, Peter Zijlstra wrote: > On Thu, May 25, 2023 at 07:11:41AM +0000, Oliver Upton wrote: > > > The PMUv3 driver does pass a name, but it relies on getting back an > > allocated pmu id as @type is -1 in the call to perf_pmu_register(). > > > > What actually broke is how KVM probes for a default core PMU to use for > > a guest. kvm_pmu_probe_armpmu() creates a counter w/ PERF_TYPE_RAW and > > reads the pmu from the returned perf_event. The linear search had the > > effect of eventually stumbling on the correct core PMU and succeeding. > > > > Perf folks: is this WAI for heterogenous systems? > > TBH, I'm not sure. hetero and virt don't mix very well AFAIK and I'm not > sure what ARM64 does here. > > IIRC the only way is to hard affine things; that is, force vCPU of > 'type' to the pCPU mask of 'type' CPUs. We provide absolutely no illusion of consistency across implementations. Userspace can select the PMU type, and then it is a userspace problem affining vCPUs to the right pCPUs. And if they get that wrong, we just bail and refuse to run the vCPU. > If you don't do that; or let userspace 'override' that, things go > sideways *real* fast. Oh yeah, and I wish PMUs were the only problem with these hetero systems... > Mark gonna have to look at this. Cool. I'll go ahead with the KVM cleanup regardless of the outcome. -- Thanks, Oliver ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v4 3/4] perf/core: Remove pmu linear searching code 2023-05-25 15:56 ` Oliver Upton @ 2023-05-26 23:00 ` Ian Rogers 2023-05-27 13:32 ` Marc Zyngier 0 siblings, 1 reply; 19+ messages in thread From: Ian Rogers @ 2023-05-26 23:00 UTC (permalink / raw) To: Oliver Upton Cc: Peter Zijlstra, Ravi Bangoria, Nathan Chancellor, namhyung, eranian, acme, mark.rutland, jolsa, bp, kan.liang, adrian.hunter, maddy, x86, linux-perf-users, linux-kernel, sandipan.das, ananth.narayan, santosh.shukla, maz, kvmarm On Thu, May 25, 2023 at 8:56 AM Oliver Upton <oliver.upton@linux.dev> wrote: > > On Thu, May 25, 2023 at 04:20:31PM +0200, Peter Zijlstra wrote: > > On Thu, May 25, 2023 at 07:11:41AM +0000, Oliver Upton wrote: > > > > > The PMUv3 driver does pass a name, but it relies on getting back an > > > allocated pmu id as @type is -1 in the call to perf_pmu_register(). > > > > > > What actually broke is how KVM probes for a default core PMU to use for > > > a guest. kvm_pmu_probe_armpmu() creates a counter w/ PERF_TYPE_RAW and > > > reads the pmu from the returned perf_event. The linear search had the > > > effect of eventually stumbling on the correct core PMU and succeeding. > > > > > > Perf folks: is this WAI for heterogenous systems? > > > > TBH, I'm not sure. hetero and virt don't mix very well AFAIK and I'm not > > sure what ARM64 does here. > > > > IIRC the only way is to hard affine things; that is, force vCPU of > > 'type' to the pCPU mask of 'type' CPUs. > > We provide absolutely no illusion of consistency across implementations. > Userspace can select the PMU type, and then it is a userspace problem > affining vCPUs to the right pCPUs. > > And if they get that wrong, we just bail and refuse to run the vCPU. > > > If you don't do that; or let userspace 'override' that, things go > > sideways *real* fast. > > Oh yeah, and I wish PMUs were the only problem with these hetero > systems... Just to add some context from what I understand. There are inbuilt type numbers for PMUs: https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/include/uapi/linux/perf_event.h?h=perf-tools-next#n34 so the PMU generally called /sys/devices/cpu should have type 4 (ARM give it another name). For heterogeneous ARM there is a single PMU and the same events are programmed regardless of whether it is a big or a little core - the cpumask lists all CPUs. On heterogeneous (aka hybrid) Intel there are two PMUs, the performance cores have a PMU called /sys/devices/cpu_core and it has type 4, the atom cores have a PMU of /sys/devices/cpu_atom and on my Alderlake the type number is 8. The cpu_core and cpu_atom PMUs list the CPUs that are valid for raw style events, where the config values in perf_event_attr contains all of the event programming data. There are also legacy events of PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE where to specify the PMU the type is encoded in the high (and unused) 32-bits of config - so the type would be something like PERF_TYPE_HARDWARE and then config would be "value | (4 << 32)" for the performance core or "value | (8 << 32)" for the atom. If the vCPU and pCPUs mappings vary then there is a chance to change the CPU mask on heterogeneous Intel, but it seems if the event is open and you move from between core types then things are going to break. Thanks, Ian > > Mark gonna have to look at this. > > Cool. I'll go ahead with the KVM cleanup regardless of the outcome. > > -- > Thanks, > Oliver ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v4 3/4] perf/core: Remove pmu linear searching code 2023-05-26 23:00 ` Ian Rogers @ 2023-05-27 13:32 ` Marc Zyngier 2023-05-27 17:00 ` Ian Rogers 0 siblings, 1 reply; 19+ messages in thread From: Marc Zyngier @ 2023-05-27 13:32 UTC (permalink / raw) To: Ian Rogers Cc: Oliver Upton, Peter Zijlstra, Ravi Bangoria, Nathan Chancellor, namhyung, eranian, acme, mark.rutland, jolsa, bp, kan.liang, adrian.hunter, maddy, x86, linux-perf-users, linux-kernel, sandipan.das, ananth.narayan, santosh.shukla, kvmarm On Sat, 27 May 2023 00:00:47 +0100, Ian Rogers <irogers@google.com> wrote: > > On Thu, May 25, 2023 at 8:56 AM Oliver Upton <oliver.upton@linux.dev> wrote: > > > > On Thu, May 25, 2023 at 04:20:31PM +0200, Peter Zijlstra wrote: > > > On Thu, May 25, 2023 at 07:11:41AM +0000, Oliver Upton wrote: > > > > > > > The PMUv3 driver does pass a name, but it relies on getting back an > > > > allocated pmu id as @type is -1 in the call to perf_pmu_register(). > > > > > > > > What actually broke is how KVM probes for a default core PMU to use for > > > > a guest. kvm_pmu_probe_armpmu() creates a counter w/ PERF_TYPE_RAW and > > > > reads the pmu from the returned perf_event. The linear search had the > > > > effect of eventually stumbling on the correct core PMU and succeeding. > > > > > > > > Perf folks: is this WAI for heterogenous systems? > > > > > > TBH, I'm not sure. hetero and virt don't mix very well AFAIK and I'm not > > > sure what ARM64 does here. > > > > > > IIRC the only way is to hard affine things; that is, force vCPU of > > > 'type' to the pCPU mask of 'type' CPUs. > > > > We provide absolutely no illusion of consistency across implementations. > > Userspace can select the PMU type, and then it is a userspace problem > > affining vCPUs to the right pCPUs. > > > > And if they get that wrong, we just bail and refuse to run the vCPU. > > > > > If you don't do that; or let userspace 'override' that, things go > > > sideways *real* fast. > > > > Oh yeah, and I wish PMUs were the only problem with these hetero > > systems... > > Just to add some context from what I understand. There are inbuilt > type numbers for PMUs: > https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/include/uapi/linux/perf_event.h?h=perf-tools-next#n34 > so the PMU generally called /sys/devices/cpu should have type 4 (ARM > give it another name). For heterogeneous ARM there is a single PMU and > the same events are programmed regardless of whether it is a big or a > little core - the cpumask lists all CPUs. I think you misunderstood the way heterogeneous arm64 systems are described . Each CPU type gets its own PMU type, and its own event list. Case in point: $ grep . /sys/devices/*pmu/{type,cpus} /sys/devices/apple_avalanche_pmu/type:9 /sys/devices/apple_blizzard_pmu/type:8 /sys/devices/apple_avalanche_pmu/cpus:4-9 /sys/devices/apple_blizzard_pmu/cpus:0-3 Type 4 (aka PERF_EVENT_RAW) is AFAICT just a way to encode the raw event number, nothing else. Thanks, M. -- Without deviation from the norm, progress is not possible. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v4 3/4] perf/core: Remove pmu linear searching code 2023-05-27 13:32 ` Marc Zyngier @ 2023-05-27 17:00 ` Ian Rogers 2023-05-27 17:05 ` Ian Rogers 2023-05-27 18:38 ` Marc Zyngier 0 siblings, 2 replies; 19+ messages in thread From: Ian Rogers @ 2023-05-27 17:00 UTC (permalink / raw) To: Marc Zyngier Cc: Oliver Upton, Peter Zijlstra, Ravi Bangoria, Nathan Chancellor, namhyung, eranian, acme, mark.rutland, jolsa, bp, kan.liang, adrian.hunter, maddy, x86, linux-perf-users, linux-kernel, sandipan.das, ananth.narayan, santosh.shukla, kvmarm On Sat, May 27, 2023 at 6:32 AM Marc Zyngier <maz@kernel.org> wrote: > > On Sat, 27 May 2023 00:00:47 +0100, > Ian Rogers <irogers@google.com> wrote: > > > > On Thu, May 25, 2023 at 8:56 AM Oliver Upton <oliver.upton@linux.dev> wrote: > > > > > > On Thu, May 25, 2023 at 04:20:31PM +0200, Peter Zijlstra wrote: > > > > On Thu, May 25, 2023 at 07:11:41AM +0000, Oliver Upton wrote: > > > > > > > > > The PMUv3 driver does pass a name, but it relies on getting back an > > > > > allocated pmu id as @type is -1 in the call to perf_pmu_register(). > > > > > > > > > > What actually broke is how KVM probes for a default core PMU to use for > > > > > a guest. kvm_pmu_probe_armpmu() creates a counter w/ PERF_TYPE_RAW and > > > > > reads the pmu from the returned perf_event. The linear search had the > > > > > effect of eventually stumbling on the correct core PMU and succeeding. > > > > > > > > > > Perf folks: is this WAI for heterogenous systems? > > > > > > > > TBH, I'm not sure. hetero and virt don't mix very well AFAIK and I'm not > > > > sure what ARM64 does here. > > > > > > > > IIRC the only way is to hard affine things; that is, force vCPU of > > > > 'type' to the pCPU mask of 'type' CPUs. > > > > > > We provide absolutely no illusion of consistency across implementations. > > > Userspace can select the PMU type, and then it is a userspace problem > > > affining vCPUs to the right pCPUs. > > > > > > And if they get that wrong, we just bail and refuse to run the vCPU. > > > > > > > If you don't do that; or let userspace 'override' that, things go > > > > sideways *real* fast. > > > > > > Oh yeah, and I wish PMUs were the only problem with these hetero > > > systems... > > > > Just to add some context from what I understand. There are inbuilt > > type numbers for PMUs: > > https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/include/uapi/linux/perf_event.h?h=perf-tools-next#n34 > > so the PMU generally called /sys/devices/cpu should have type 4 (ARM > > give it another name). For heterogeneous ARM there is a single PMU and > > the same events are programmed regardless of whether it is a big or a > > little core - the cpumask lists all CPUs. > > I think you misunderstood the way heterogeneous arm64 systems are > described . Each CPU type gets its own PMU type, and its own event > list. Case in point: > > $ grep . /sys/devices/*pmu/{type,cpus} > /sys/devices/apple_avalanche_pmu/type:9 > /sys/devices/apple_blizzard_pmu/type:8 > /sys/devices/apple_avalanche_pmu/cpus:4-9 > /sys/devices/apple_blizzard_pmu/cpus:0-3 > > Type 4 (aka PERF_EVENT_RAW) is AFAICT just a way to encode the raw > event number, nothing else. Which PMU will a raw event open on? Note, the raw events don't support the extended type that is present in PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/include/uapi/linux/perf_event.h#n41 as the bits are already in use for being just plain config values. I suspect not being type 4 is a bug on apple ARM here. Thanks, Ian > Thanks, > > M. > > -- > Without deviation from the norm, progress is not possible. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v4 3/4] perf/core: Remove pmu linear searching code 2023-05-27 17:00 ` Ian Rogers @ 2023-05-27 17:05 ` Ian Rogers 2023-05-27 18:38 ` Marc Zyngier 1 sibling, 0 replies; 19+ messages in thread From: Ian Rogers @ 2023-05-27 17:05 UTC (permalink / raw) To: Marc Zyngier Cc: Oliver Upton, Peter Zijlstra, Ravi Bangoria, Nathan Chancellor, namhyung, eranian, acme, mark.rutland, jolsa, bp, kan.liang, adrian.hunter, maddy, x86, linux-perf-users, linux-kernel, sandipan.das, ananth.narayan, santosh.shukla, kvmarm On Sat, May 27, 2023 at 10:00 AM Ian Rogers <irogers@google.com> wrote: > > On Sat, May 27, 2023 at 6:32 AM Marc Zyngier <maz@kernel.org> wrote: > > > > On Sat, 27 May 2023 00:00:47 +0100, > > Ian Rogers <irogers@google.com> wrote: > > > > > > On Thu, May 25, 2023 at 8:56 AM Oliver Upton <oliver.upton@linux.dev> wrote: > > > > > > > > On Thu, May 25, 2023 at 04:20:31PM +0200, Peter Zijlstra wrote: > > > > > On Thu, May 25, 2023 at 07:11:41AM +0000, Oliver Upton wrote: > > > > > > > > > > > The PMUv3 driver does pass a name, but it relies on getting back an > > > > > > allocated pmu id as @type is -1 in the call to perf_pmu_register(). > > > > > > > > > > > > What actually broke is how KVM probes for a default core PMU to use for > > > > > > a guest. kvm_pmu_probe_armpmu() creates a counter w/ PERF_TYPE_RAW and > > > > > > reads the pmu from the returned perf_event. The linear search had the > > > > > > effect of eventually stumbling on the correct core PMU and succeeding. > > > > > > > > > > > > Perf folks: is this WAI for heterogenous systems? > > > > > > > > > > TBH, I'm not sure. hetero and virt don't mix very well AFAIK and I'm not > > > > > sure what ARM64 does here. > > > > > > > > > > IIRC the only way is to hard affine things; that is, force vCPU of > > > > > 'type' to the pCPU mask of 'type' CPUs. > > > > > > > > We provide absolutely no illusion of consistency across implementations. > > > > Userspace can select the PMU type, and then it is a userspace problem > > > > affining vCPUs to the right pCPUs. > > > > > > > > And if they get that wrong, we just bail and refuse to run the vCPU. > > > > > > > > > If you don't do that; or let userspace 'override' that, things go > > > > > sideways *real* fast. > > > > > > > > Oh yeah, and I wish PMUs were the only problem with these hetero > > > > systems... > > > > > > Just to add some context from what I understand. There are inbuilt > > > type numbers for PMUs: > > > https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/include/uapi/linux/perf_event.h?h=perf-tools-next#n34 > > > so the PMU generally called /sys/devices/cpu should have type 4 (ARM > > > give it another name). For heterogeneous ARM there is a single PMU and > > > the same events are programmed regardless of whether it is a big or a > > > little core - the cpumask lists all CPUs. > > > > I think you misunderstood the way heterogeneous arm64 systems are > > described . Each CPU type gets its own PMU type, and its own event > > list. Case in point: > > > > $ grep . /sys/devices/*pmu/{type,cpus} > > /sys/devices/apple_avalanche_pmu/type:9 > > /sys/devices/apple_blizzard_pmu/type:8 > > /sys/devices/apple_avalanche_pmu/cpus:4-9 > > /sys/devices/apple_blizzard_pmu/cpus:0-3 > > > > Type 4 (aka PERF_EVENT_RAW) is AFAICT just a way to encode the raw > > event number, nothing else. > > Which PMU will a raw event open on? Note, the raw events don't support > the extended type that is present in PERF_TYPE_HARDWARE and > PERF_TYPE_HW_CACHE: > https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/include/uapi/linux/perf_event.h#n41 > as the bits are already in use for being just plain config values. I > suspect not being type 4 is a bug on apple ARM here. > > Thanks, > Ian Btw, thanks for correcting me! :-D These assumptions are under documented/tested and nobody wants to make interfaces/tools that are broken. One source of information is: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/Documentation/admin-guide/perf but it doesn't capture this. Thanks, Ian > > Thanks, > > > > M. > > > > -- > > Without deviation from the norm, progress is not possible. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v4 3/4] perf/core: Remove pmu linear searching code 2023-05-27 17:00 ` Ian Rogers 2023-05-27 17:05 ` Ian Rogers @ 2023-05-27 18:38 ` Marc Zyngier 2023-05-27 19:50 ` Ian Rogers 2023-05-30 7:45 ` Thomas Richter 1 sibling, 2 replies; 19+ messages in thread From: Marc Zyngier @ 2023-05-27 18:38 UTC (permalink / raw) To: Ian Rogers Cc: Oliver Upton, Peter Zijlstra, Ravi Bangoria, Nathan Chancellor, namhyung, eranian, acme, mark.rutland, jolsa, bp, kan.liang, adrian.hunter, maddy, x86, linux-perf-users, linux-kernel, sandipan.das, ananth.narayan, santosh.shukla, kvmarm On Sat, 27 May 2023 18:00:13 +0100, Ian Rogers <irogers@google.com> wrote: > > On Sat, May 27, 2023 at 6:32 AM Marc Zyngier <maz@kernel.org> wrote: > > > > On Sat, 27 May 2023 00:00:47 +0100, > > Ian Rogers <irogers@google.com> wrote: > > > > > > On Thu, May 25, 2023 at 8:56 AM Oliver Upton <oliver.upton@linux.dev> wrote: > > > > > > > > On Thu, May 25, 2023 at 04:20:31PM +0200, Peter Zijlstra wrote: > > > > > On Thu, May 25, 2023 at 07:11:41AM +0000, Oliver Upton wrote: > > > > > > > > > > > The PMUv3 driver does pass a name, but it relies on getting back an > > > > > > allocated pmu id as @type is -1 in the call to perf_pmu_register(). > > > > > > > > > > > > What actually broke is how KVM probes for a default core PMU to use for > > > > > > a guest. kvm_pmu_probe_armpmu() creates a counter w/ PERF_TYPE_RAW and > > > > > > reads the pmu from the returned perf_event. The linear search had the > > > > > > effect of eventually stumbling on the correct core PMU and succeeding. > > > > > > > > > > > > Perf folks: is this WAI for heterogenous systems? > > > > > > > > > > TBH, I'm not sure. hetero and virt don't mix very well AFAIK and I'm not > > > > > sure what ARM64 does here. > > > > > > > > > > IIRC the only way is to hard affine things; that is, force vCPU of > > > > > 'type' to the pCPU mask of 'type' CPUs. > > > > > > > > We provide absolutely no illusion of consistency across implementations. > > > > Userspace can select the PMU type, and then it is a userspace problem > > > > affining vCPUs to the right pCPUs. > > > > > > > > And if they get that wrong, we just bail and refuse to run the vCPU. > > > > > > > > > If you don't do that; or let userspace 'override' that, things go > > > > > sideways *real* fast. > > > > > > > > Oh yeah, and I wish PMUs were the only problem with these hetero > > > > systems... > > > > > > Just to add some context from what I understand. There are inbuilt > > > type numbers for PMUs: > > > https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/include/uapi/linux/perf_event.h?h=perf-tools-next#n34 > > > so the PMU generally called /sys/devices/cpu should have type 4 (ARM > > > give it another name). For heterogeneous ARM there is a single PMU and > > > the same events are programmed regardless of whether it is a big or a > > > little core - the cpumask lists all CPUs. > > > > I think you misunderstood the way heterogeneous arm64 systems are > > described . Each CPU type gets its own PMU type, and its own event > > list. Case in point: > > > > $ grep . /sys/devices/*pmu/{type,cpus} > > /sys/devices/apple_avalanche_pmu/type:9 > > /sys/devices/apple_blizzard_pmu/type:8 > > /sys/devices/apple_avalanche_pmu/cpus:4-9 > > /sys/devices/apple_blizzard_pmu/cpus:0-3 > > > > Type 4 (aka PERF_EVENT_RAW) is AFAICT just a way to encode the raw > > event number, nothing else. > > Which PMU will a raw event open on? On the PMU that matches the current CPU. > Note, the raw events don't support > the extended type that is present in PERF_TYPE_HARDWARE and > PERF_TYPE_HW_CACHE: > https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/include/uapi/linux/perf_event.h#n41 > as the bits are already in use for being just plain config values. I'm not sure how relevant this is to the numbering of PMUs on arm64. > I suspect not being type 4 is a bug on apple ARM here. If that's a bug on this machine, it's a bug on all machines, at which point it is the de-facto API: $ grep . /sys/devices/armv8*/{type,cpus} /sys/devices/armv8_cortex_a53/type:8 /sys/devices/armv8_cortex_a72/type:9 /sys/devices/armv8_cortex_a53/cpus:0-3 /sys/devices/armv8_cortex_a72/cpus:4-5 See, non-Apple HW. And now for a system with homogeneous CPUs: $ grep . /sys/devices/armv8*/{type,cpus} /sys/devices/armv8_pmuv3_0/type:8 /sys/devices/armv8_pmuv3_0/cpus:0-159 Still no type 4. I could go on for hours, I have plenty of HW around me! So whatever your source of information is, it doesn't match reality. Our PMUs are numbered arbitrarily, and have been so for... a very long time. At least since perf_pmu_register has supported dynamic registration (see 2e80a82a49c4c). Thanks, M. -- Without deviation from the norm, progress is not possible. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v4 3/4] perf/core: Remove pmu linear searching code 2023-05-27 18:38 ` Marc Zyngier @ 2023-05-27 19:50 ` Ian Rogers 2023-05-30 7:45 ` Thomas Richter 1 sibling, 0 replies; 19+ messages in thread From: Ian Rogers @ 2023-05-27 19:50 UTC (permalink / raw) To: Marc Zyngier Cc: Oliver Upton, Peter Zijlstra, Ravi Bangoria, Nathan Chancellor, namhyung, eranian, acme, mark.rutland, jolsa, bp, kan.liang, adrian.hunter, maddy, x86, linux-perf-users, linux-kernel, sandipan.das, ananth.narayan, santosh.shukla, kvmarm On Sat, May 27, 2023 at 11:38 AM Marc Zyngier <maz@kernel.org> wrote: > > On Sat, 27 May 2023 18:00:13 +0100, > Ian Rogers <irogers@google.com> wrote: > > > > On Sat, May 27, 2023 at 6:32 AM Marc Zyngier <maz@kernel.org> wrote: > > > > > > On Sat, 27 May 2023 00:00:47 +0100, > > > Ian Rogers <irogers@google.com> wrote: > > > > > > > > On Thu, May 25, 2023 at 8:56 AM Oliver Upton <oliver.upton@linux.dev> wrote: > > > > > > > > > > On Thu, May 25, 2023 at 04:20:31PM +0200, Peter Zijlstra wrote: > > > > > > On Thu, May 25, 2023 at 07:11:41AM +0000, Oliver Upton wrote: > > > > > > > > > > > > > The PMUv3 driver does pass a name, but it relies on getting back an > > > > > > > allocated pmu id as @type is -1 in the call to perf_pmu_register(). > > > > > > > > > > > > > > What actually broke is how KVM probes for a default core PMU to use for > > > > > > > a guest. kvm_pmu_probe_armpmu() creates a counter w/ PERF_TYPE_RAW and > > > > > > > reads the pmu from the returned perf_event. The linear search had the > > > > > > > effect of eventually stumbling on the correct core PMU and succeeding. > > > > > > > > > > > > > > Perf folks: is this WAI for heterogenous systems? > > > > > > > > > > > > TBH, I'm not sure. hetero and virt don't mix very well AFAIK and I'm not > > > > > > sure what ARM64 does here. > > > > > > > > > > > > IIRC the only way is to hard affine things; that is, force vCPU of > > > > > > 'type' to the pCPU mask of 'type' CPUs. > > > > > > > > > > We provide absolutely no illusion of consistency across implementations. > > > > > Userspace can select the PMU type, and then it is a userspace problem > > > > > affining vCPUs to the right pCPUs. > > > > > > > > > > And if they get that wrong, we just bail and refuse to run the vCPU. > > > > > > > > > > > If you don't do that; or let userspace 'override' that, things go > > > > > > sideways *real* fast. > > > > > > > > > > Oh yeah, and I wish PMUs were the only problem with these hetero > > > > > systems... > > > > > > > > Just to add some context from what I understand. There are inbuilt > > > > type numbers for PMUs: > > > > https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/include/uapi/linux/perf_event.h?h=perf-tools-next#n34 > > > > so the PMU generally called /sys/devices/cpu should have type 4 (ARM > > > > give it another name). For heterogeneous ARM there is a single PMU and > > > > the same events are programmed regardless of whether it is a big or a > > > > little core - the cpumask lists all CPUs. > > > > > > I think you misunderstood the way heterogeneous arm64 systems are > > > described . Each CPU type gets its own PMU type, and its own event > > > list. Case in point: > > > > > > $ grep . /sys/devices/*pmu/{type,cpus} > > > /sys/devices/apple_avalanche_pmu/type:9 > > > /sys/devices/apple_blizzard_pmu/type:8 > > > /sys/devices/apple_avalanche_pmu/cpus:4-9 > > > /sys/devices/apple_blizzard_pmu/cpus:0-3 > > > > > > Type 4 (aka PERF_EVENT_RAW) is AFAICT just a way to encode the raw > > > event number, nothing else. > > > > Which PMU will a raw event open on? > > On the PMU that matches the current CPU. Thanks. This should really be captured in the man page. Presumably that's the behavior with the -1 any CPU, and if a CPU is specified then the selected PMU will match that CPU? > > Note, the raw events don't support > > the extended type that is present in PERF_TYPE_HARDWARE and > > PERF_TYPE_HW_CACHE: > > https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/include/uapi/linux/perf_event.h#n41 > > as the bits are already in use for being just plain config values. > > I'm not sure how relevant this is to the numbering of PMUs on arm64. It matters because on the tool side you can group events, for example if you are measuring IPC then it makes sense that the instructions and cycles events are both multiplexed as part of a group. The instructions and cycles events have a type of PERF_TYPE_HARDWARE as this is hard coded into the event parser in perf. For perf we may specify this as: perf stat -e '{cycles,instructions}' ... When we open the events on a heterogeneous system the expectation is we get a group of cycles and instructions for each PMU, the extended type in the attribute for each event will be set to the type of the PMU the group is targeting. When we parse the events we have cycles events per PMU then instructions events per PMU, we then resort and regroup the events as you can't have a group spanning PMUs except in a few special cases like software events. Now let's say we want to do a raw event from the json files, on Intel let's say we choose the INST_RETIRED.ANY event and we want to measure it with cycles: perf stat -e '{cycles,inst_retired.any}' ... The INST_RETIRED.ANY event will be matched on Intel with the PMUs "cpu_core" and "cpu_atom" on heterogeneous systems. The cycles event will be opened on two PMUs and the extended type set to each one. So we have differing perf_event_attr types, but we know it is valid to group the events because the raw event encoding's PMU type can be matched to the extended type of the hardware event. To better support the wildcard opening, sorting and grouping operations of events I want to have an ability in the perf tool to map a type number to a list of PMUs. Mapping a hardware/hw_cache type number to PMUs should give the set of core PMUs for a heterogeneous system. On heterogeneous Intel mapping a raw event (type 4) will give "cpu_core" but it sounds like on ARM, if no sysfs PMU has type number 4, then we should map 4 to the set of core PMUs. Mapping a type number to a set of perf tool's representation of PMUs isn't current behavior, but hopefully you can see that ARM's behavior is differing from Intel's, for which I'd been basing the implementation so that we can support heterogeneous metrics, etc. I'm also confused what should be the behavior of something like: perf stat -e r0 ... The code is going to open the event on every core PMU with the config set to 0: https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/tools/perf/util/parse-events.c?h=perf-tools-next#n1485 But perhaps the intent on ARM is that a single raw type event is opened? It seems the every core PMU behavior is more useful. > > I suspect not being type 4 is a bug on apple ARM here. > > If that's a bug on this machine, it's a bug on all machines, at which > point it is the de-facto API: > > $ grep . /sys/devices/armv8*/{type,cpus} > /sys/devices/armv8_cortex_a53/type:8 > /sys/devices/armv8_cortex_a72/type:9 > /sys/devices/armv8_cortex_a53/cpus:0-3 > /sys/devices/armv8_cortex_a72/cpus:4-5 > > See, non-Apple HW. And now for a system with homogeneous CPUs: > > $ grep . /sys/devices/armv8*/{type,cpus} > /sys/devices/armv8_pmuv3_0/type:8 > /sys/devices/armv8_pmuv3_0/cpus:0-159 > > Still no type 4. I could go on for hours, I have plenty of HW around > me! This is good to know, thanks for doing the research. I agree that it is a de-facto API as this has happened. I'm hoping in the future we can compress some /sys/devices directories and use them as test input. It obviously won't be possible to open the events, etc. but we can test things like the event/metric parsing matches an expectation. Thanks, Ian > So whatever your source of information is, it doesn't match reality. > Our PMUs are numbered arbitrarily, and have been so for... a very long > time. At least since perf_pmu_register has supported dynamic > registration (see 2e80a82a49c4c). > > Thanks, > > M. > > -- > Without deviation from the norm, progress is not possible. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v4 3/4] perf/core: Remove pmu linear searching code 2023-05-27 18:38 ` Marc Zyngier 2023-05-27 19:50 ` Ian Rogers @ 2023-05-30 7:45 ` Thomas Richter 2023-05-30 14:00 ` Ian Rogers 1 sibling, 1 reply; 19+ messages in thread From: Thomas Richter @ 2023-05-30 7:45 UTC (permalink / raw) To: Marc Zyngier, Ian Rogers Cc: Oliver Upton, Peter Zijlstra, Ravi Bangoria, Nathan Chancellor, namhyung, eranian, acme, mark.rutland, jolsa, bp, kan.liang, adrian.hunter, maddy, x86, linux-perf-users, linux-kernel, sandipan.das, ananth.narayan, santosh.shukla, kvmarm On 5/27/23 20:38, Marc Zyngier wrote: > On Sat, 27 May 2023 18:00:13 +0100, > Ian Rogers <irogers@google.com> wrote: >> >> On Sat, May 27, 2023 at 6:32 AM Marc Zyngier <maz@kernel.org> wrote: >>> >>> On Sat, 27 May 2023 00:00:47 +0100, >>> Ian Rogers <irogers@google.com> wrote: >>>> >>>> On Thu, May 25, 2023 at 8:56 AM Oliver Upton <oliver.upton@linux.dev> wrote: >>>>> >>>>> On Thu, May 25, 2023 at 04:20:31PM +0200, Peter Zijlstra wrote: >>>>>> On Thu, May 25, 2023 at 07:11:41AM +0000, Oliver Upton wrote: >>>>>> >>>>>>> The PMUv3 driver does pass a name, but it relies on getting back an >>>>>>> allocated pmu id as @type is -1 in the call to perf_pmu_register(). >>>>>>> >>>>>>> What actually broke is how KVM probes for a default core PMU to use for >>>>>>> a guest. kvm_pmu_probe_armpmu() creates a counter w/ PERF_TYPE_RAW and >>>>>>> reads the pmu from the returned perf_event. The linear search had the >>>>>>> effect of eventually stumbling on the correct core PMU and succeeding. >>>>>>> >>>>>>> Perf folks: is this WAI for heterogenous systems? >>>>>> >>>>>> TBH, I'm not sure. hetero and virt don't mix very well AFAIK and I'm not >>>>>> sure what ARM64 does here. >>>>>> >>>>>> IIRC the only way is to hard affine things; that is, force vCPU of >>>>>> 'type' to the pCPU mask of 'type' CPUs. >>>>> >>>>> We provide absolutely no illusion of consistency across implementations. >>>>> Userspace can select the PMU type, and then it is a userspace problem >>>>> affining vCPUs to the right pCPUs. >>>>> >>>>> And if they get that wrong, we just bail and refuse to run the vCPU. >>>>> >>>>>> If you don't do that; or let userspace 'override' that, things go >>>>>> sideways *real* fast. >>>>> >>>>> Oh yeah, and I wish PMUs were the only problem with these hetero >>>>> systems... >>>> >>>> Just to add some context from what I understand. There are inbuilt >>>> type numbers for PMUs: >>>> https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/include/uapi/linux/perf_event.h?h=perf-tools-next#n34 >>>> so the PMU generally called /sys/devices/cpu should have type 4 (ARM >>>> give it another name). For heterogeneous ARM there is a single PMU and >>>> the same events are programmed regardless of whether it is a big or a >>>> little core - the cpumask lists all CPUs. >>> >>> I think you misunderstood the way heterogeneous arm64 systems are >>> described . Each CPU type gets its own PMU type, and its own event >>> list. Case in point: >>> >>> $ grep . /sys/devices/*pmu/{type,cpus} >>> /sys/devices/apple_avalanche_pmu/type:9 >>> /sys/devices/apple_blizzard_pmu/type:8 >>> /sys/devices/apple_avalanche_pmu/cpus:4-9 >>> /sys/devices/apple_blizzard_pmu/cpus:0-3 >>> >>> Type 4 (aka PERF_EVENT_RAW) is AFAICT just a way to encode the raw >>> event number, nothing else. >> >> Which PMU will a raw event open on? > > On the PMU that matches the current CPU. > >> Note, the raw events don't support >> the extended type that is present in PERF_TYPE_HARDWARE and >> PERF_TYPE_HW_CACHE: >> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/include/uapi/linux/perf_event.h#n41 >> as the bits are already in use for being just plain config values. > > I'm not sure how relevant this is to the numbering of PMUs on arm64. > >> I suspect not being type 4 is a bug on apple ARM here. > > If that's a bug on this machine, it's a bug on all machines, at which > point it is the de-facto API: > > $ grep . /sys/devices/armv8*/{type,cpus} > /sys/devices/armv8_cortex_a53/type:8 > /sys/devices/armv8_cortex_a72/type:9 > /sys/devices/armv8_cortex_a53/cpus:0-3 > /sys/devices/armv8_cortex_a72/cpus:4-5 > > See, non-Apple HW. And now for a system with homogeneous CPUs: > > $ grep . /sys/devices/armv8*/{type,cpus} > /sys/devices/armv8_pmuv3_0/type:8 > /sys/devices/armv8_pmuv3_0/cpus:0-159 > > Still no type 4. I could go on for hours, I have plenty of HW around > me! > > So whatever your source of information is, it doesn't match reality. > Our PMUs are numbered arbitrarily, and have been so for... a very long > time. At least since perf_pmu_register has supported dynamic > registration (see 2e80a82a49c4c). > > Thanks, > > M. > I agree with Marc, on s390 we have 5 different PMUs and all have arbitrary numbers and have totally different features: # ll /sys/devices/{cpum,pai}*/type -r--r--r-- 1 root root 4096 May 30 09:33 /sys/devices/cpum_cf_diag/type -r--r--r-- 1 root root 4096 May 30 09:33 /sys/devices/cpum_cf/type -r--r--r-- 1 root root 4096 May 30 09:33 /sys/devices/cpum_sf/type -r--r--r-- 1 root root 4096 May 30 09:33 /sys/devices/pai_crypto/type -r--r--r-- 1 root root 4096 May 30 09:33 /sys/devices/pai_ext/type # grep . /sys/devices/{cpum,pai}*/type /sys/devices/cpum_cf_diag/type:9 /sys/devices/cpum_cf/type:8 /sys/devices/cpum_sf/type:4 /sys/devices/pai_crypto/type:10 /sys/devices/pai_ext/type:11 # Thanks Thomas -- Thomas Richter, Dept 3303, IBM s390 Linux Development, Boeblingen, Germany -- Vorsitzender des Aufsichtsrats: Gregor Pillen Geschäftsführung: David Faller Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v4 3/4] perf/core: Remove pmu linear searching code 2023-05-30 7:45 ` Thomas Richter @ 2023-05-30 14:00 ` Ian Rogers 2023-05-31 9:09 ` Thomas Richter 0 siblings, 1 reply; 19+ messages in thread From: Ian Rogers @ 2023-05-30 14:00 UTC (permalink / raw) To: Thomas Richter Cc: Marc Zyngier, Oliver Upton, Peter Zijlstra, Ravi Bangoria, Nathan Chancellor, namhyung, eranian, acme, mark.rutland, jolsa, bp, kan.liang, adrian.hunter, maddy, x86, linux-perf-users, linux-kernel, sandipan.das, ananth.narayan, santosh.shukla, kvmarm On Tue, May 30, 2023 at 12:45 AM Thomas Richter <tmricht@linux.ibm.com> wrote: > > On 5/27/23 20:38, Marc Zyngier wrote: > > On Sat, 27 May 2023 18:00:13 +0100, > > Ian Rogers <irogers@google.com> wrote: > >> > >> On Sat, May 27, 2023 at 6:32 AM Marc Zyngier <maz@kernel.org> wrote: > >>> > >>> On Sat, 27 May 2023 00:00:47 +0100, > >>> Ian Rogers <irogers@google.com> wrote: > >>>> > >>>> On Thu, May 25, 2023 at 8:56 AM Oliver Upton <oliver.upton@linux.dev> wrote: > >>>>> > >>>>> On Thu, May 25, 2023 at 04:20:31PM +0200, Peter Zijlstra wrote: > >>>>>> On Thu, May 25, 2023 at 07:11:41AM +0000, Oliver Upton wrote: > >>>>>> > >>>>>>> The PMUv3 driver does pass a name, but it relies on getting back an > >>>>>>> allocated pmu id as @type is -1 in the call to perf_pmu_register(). > >>>>>>> > >>>>>>> What actually broke is how KVM probes for a default core PMU to use for > >>>>>>> a guest. kvm_pmu_probe_armpmu() creates a counter w/ PERF_TYPE_RAW and > >>>>>>> reads the pmu from the returned perf_event. The linear search had the > >>>>>>> effect of eventually stumbling on the correct core PMU and succeeding. > >>>>>>> > >>>>>>> Perf folks: is this WAI for heterogenous systems? > >>>>>> > >>>>>> TBH, I'm not sure. hetero and virt don't mix very well AFAIK and I'm not > >>>>>> sure what ARM64 does here. > >>>>>> > >>>>>> IIRC the only way is to hard affine things; that is, force vCPU of > >>>>>> 'type' to the pCPU mask of 'type' CPUs. > >>>>> > >>>>> We provide absolutely no illusion of consistency across implementations. > >>>>> Userspace can select the PMU type, and then it is a userspace problem > >>>>> affining vCPUs to the right pCPUs. > >>>>> > >>>>> And if they get that wrong, we just bail and refuse to run the vCPU. > >>>>> > >>>>>> If you don't do that; or let userspace 'override' that, things go > >>>>>> sideways *real* fast. > >>>>> > >>>>> Oh yeah, and I wish PMUs were the only problem with these hetero > >>>>> systems... > >>>> > >>>> Just to add some context from what I understand. There are inbuilt > >>>> type numbers for PMUs: > >>>> https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/include/uapi/linux/perf_event.h?h=perf-tools-next#n34 > >>>> so the PMU generally called /sys/devices/cpu should have type 4 (ARM > >>>> give it another name). For heterogeneous ARM there is a single PMU and > >>>> the same events are programmed regardless of whether it is a big or a > >>>> little core - the cpumask lists all CPUs. > >>> > >>> I think you misunderstood the way heterogeneous arm64 systems are > >>> described . Each CPU type gets its own PMU type, and its own event > >>> list. Case in point: > >>> > >>> $ grep . /sys/devices/*pmu/{type,cpus} > >>> /sys/devices/apple_avalanche_pmu/type:9 > >>> /sys/devices/apple_blizzard_pmu/type:8 > >>> /sys/devices/apple_avalanche_pmu/cpus:4-9 > >>> /sys/devices/apple_blizzard_pmu/cpus:0-3 > >>> > >>> Type 4 (aka PERF_EVENT_RAW) is AFAICT just a way to encode the raw > >>> event number, nothing else. > >> > >> Which PMU will a raw event open on? > > > > On the PMU that matches the current CPU. > > > >> Note, the raw events don't support > >> the extended type that is present in PERF_TYPE_HARDWARE and > >> PERF_TYPE_HW_CACHE: > >> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/include/uapi/linux/perf_event.h#n41 > >> as the bits are already in use for being just plain config values. > > > > I'm not sure how relevant this is to the numbering of PMUs on arm64. > > > >> I suspect not being type 4 is a bug on apple ARM here. > > > > If that's a bug on this machine, it's a bug on all machines, at which > > point it is the de-facto API: > > > > $ grep . /sys/devices/armv8*/{type,cpus} > > /sys/devices/armv8_cortex_a53/type:8 > > /sys/devices/armv8_cortex_a72/type:9 > > /sys/devices/armv8_cortex_a53/cpus:0-3 > > /sys/devices/armv8_cortex_a72/cpus:4-5 > > > > See, non-Apple HW. And now for a system with homogeneous CPUs: > > > > $ grep . /sys/devices/armv8*/{type,cpus} > > /sys/devices/armv8_pmuv3_0/type:8 > > /sys/devices/armv8_pmuv3_0/cpus:0-159 > > > > Still no type 4. I could go on for hours, I have plenty of HW around > > me! > > > > So whatever your source of information is, it doesn't match reality. > > Our PMUs are numbered arbitrarily, and have been so for... a very long > > time. At least since perf_pmu_register has supported dynamic > > registration (see 2e80a82a49c4c). > > > > Thanks, > > > > M. > > > > > I agree with Marc, > on s390 we have 5 different PMUs and all have arbitrary numbers > and have totally different features: > > # ll /sys/devices/{cpum,pai}*/type > -r--r--r-- 1 root root 4096 May 30 09:33 /sys/devices/cpum_cf_diag/type > -r--r--r-- 1 root root 4096 May 30 09:33 /sys/devices/cpum_cf/type > -r--r--r-- 1 root root 4096 May 30 09:33 /sys/devices/cpum_sf/type > -r--r--r-- 1 root root 4096 May 30 09:33 /sys/devices/pai_crypto/type > -r--r--r-- 1 root root 4096 May 30 09:33 /sys/devices/pai_ext/type > # grep . /sys/devices/{cpum,pai}*/type > /sys/devices/cpum_cf_diag/type:9 > /sys/devices/cpum_cf/type:8 > /sys/devices/cpum_sf/type:4 > /sys/devices/pai_crypto/type:10 > /sys/devices/pai_ext/type:11 > # Thanks Thomas, could you: $ ls /sys/devices/*/cpu* The assumption is that if a file called "cpus" exists then this a "core" PMU, while "cpumask" would indicate an uncore PMU. The exception is /sys/devices/cpu where there is no cpus but all online CPUs are assumed to be in "cpus" cpu map. On Intel one of the core PMUs has type 4 which aligns with the perf_event.h "ABI" constant PERF_TYPE_RAW, so opening a type 4 event will open it on that PMU. The question is I have is what should be the behavior be on non-Intel for type 4, for the big.little/hybrid case it seems opening it on core PMUs would make most sense and is what is done for PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE. Thanks, Ian > Thanks Thomas > -- > Thomas Richter, Dept 3303, IBM s390 Linux Development, Boeblingen, Germany > -- > Vorsitzender des Aufsichtsrats: Gregor Pillen > Geschäftsführung: David Faller > Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v4 3/4] perf/core: Remove pmu linear searching code 2023-05-30 14:00 ` Ian Rogers @ 2023-05-31 9:09 ` Thomas Richter 2023-05-31 20:20 ` Ian Rogers 0 siblings, 1 reply; 19+ messages in thread From: Thomas Richter @ 2023-05-31 9:09 UTC (permalink / raw) To: Ian Rogers Cc: Marc Zyngier, Oliver Upton, Peter Zijlstra, Ravi Bangoria, Nathan Chancellor, namhyung, eranian, acme, mark.rutland, jolsa, bp, kan.liang, adrian.hunter, maddy, x86, linux-perf-users, linux-kernel, sandipan.das, ananth.narayan, santosh.shukla, kvmarm On 5/30/23 16:00, Ian Rogers wrote: > ls /sys/devices/*/cpu* Hi Ian, here is the output yo requested: # ls /sys/devices/*/cpu* cpu0 cpu3 cpu6 hotplug modalias possible smt cpu1 cpu4 cpu7 isolated offline present uevent cpu2 cpu5 dispatching kernel_max online rescan vulnerabilities # In fact it is the same as # ls /sys/devices/system/cpu/ cpu0 cpu3 cpu6 hotplug modalias possible smt cpu1 cpu4 cpu7 isolated offline present uevent cpu2 cpu5 dispatching kernel_max online rescan vulnerabilities # This directory tree has nothing to do with events for perf, it is merely used to support CPU hotplug on s390. The PMUs on s390 are # ls -ld /sys/devices/{cpum_,pai_}* drwxr-xr-x 4 root root 0 May 25 15:45 /sys/devices/cpum_cf drwxr-xr-x 4 root root 0 May 25 15:45 /sys/devices/cpum_cf_diag drwxr-xr-x 4 root root 0 May 25 15:45 /sys/devices/cpum_sf drwxr-xr-x 4 root root 0 May 25 15:45 /sys/devices/pai_crypto drwxr-xr-x 4 root root 0 May 25 15:45 /sys/devices/pai_ext # I hope his helps. -- Thomas Richter, Dept 3303, IBM s390 Linux Development, Boeblingen, Germany -- Vorsitzender des Aufsichtsrats: Gregor Pillen Geschäftsführung: David Faller Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v4 3/4] perf/core: Remove pmu linear searching code 2023-05-31 9:09 ` Thomas Richter @ 2023-05-31 20:20 ` Ian Rogers 2023-06-01 11:02 ` Thomas Richter 0 siblings, 1 reply; 19+ messages in thread From: Ian Rogers @ 2023-05-31 20:20 UTC (permalink / raw) To: Thomas Richter Cc: Marc Zyngier, Oliver Upton, Peter Zijlstra, Ravi Bangoria, Nathan Chancellor, namhyung, eranian, acme, mark.rutland, jolsa, bp, kan.liang, adrian.hunter, maddy, x86, linux-perf-users, linux-kernel, sandipan.das, ananth.narayan, santosh.shukla, kvmarm On Wed, May 31, 2023 at 2:09 AM Thomas Richter <tmricht@linux.ibm.com> wrote: > > On 5/30/23 16:00, Ian Rogers wrote: > > ls /sys/devices/*/cpu* > > Hi Ian, > > here is the output yo requested: > > # ls /sys/devices/*/cpu* > cpu0 cpu3 cpu6 hotplug modalias possible smt > cpu1 cpu4 cpu7 isolated offline present uevent > cpu2 cpu5 dispatching kernel_max online rescan vulnerabilities > # > > In fact it is the same as > # ls /sys/devices/system/cpu/ > cpu0 cpu3 cpu6 hotplug modalias possible smt > cpu1 cpu4 cpu7 isolated offline present uevent > cpu2 cpu5 dispatching kernel_max online rescan vulnerabilities > # > This directory tree has nothing to do with events for perf, it is > merely used to support CPU hotplug on s390. > > The PMUs on s390 are > # ls -ld /sys/devices/{cpum_,pai_}* > drwxr-xr-x 4 root root 0 May 25 15:45 /sys/devices/cpum_cf > drwxr-xr-x 4 root root 0 May 25 15:45 /sys/devices/cpum_cf_diag > drwxr-xr-x 4 root root 0 May 25 15:45 /sys/devices/cpum_sf > drwxr-xr-x 4 root root 0 May 25 15:45 /sys/devices/pai_crypto > drwxr-xr-x 4 root root 0 May 25 15:45 /sys/devices/pai_ext > # > > I hope his helps. Thanks Thomas, The perf tool has a notion of "core" and "other" PMUs - other means uncore and things like interconnect PMUs. The distinction matters as PMUs may have a list of CPUs with them, for "other" PMUs the CPU list (known in the tool as the CPU map) is a suggestion of the CPU to open events on. For example, on my laptop: ``` $ cat /sys/devices/system/cpu/online 0-15 $ cat /sys/devices/uncore_imc_free_running_0/cpumask 0 ``` So I have 16 "CPUs" and the memory controller is suggesting opening the events on CPU 0. However, if I do: ``` $ sudo perf stat -e 'uncore_imc_free_running_0/data_read/' -C 8 -a -A sleep 1 Performance counter stats for 'system wide': CPU8 4,617.60 MiB uncore_imc_free_running_0/data_read/ 1.001094684 seconds time elapsed ``` Then things are good and the event was recorded on CPU 8, even though the cpumask of the PMU only said CPU 0. For "core" PMUs the CPU map works differently. A CPU outside of the map is an error. For example, if I have a heterogeneous system with 2 CPUs, the first CPU on 1 PMU and the second CPU on a different PMU, if I try to open events for the wrong PMU type for the CPU then it should fail. The CPU map should be interpreted as the set of CPUs events are valid on. The logic to determine if a PMU is "core" or "other" is here: https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/tools/perf/util/pmu.c?h=perf-tools-next#n1417 https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/tools/perf/util/pmu.c?h=perf-tools-next#n609 That is, a PMU is a "core" PMU if it is called "cpu" or if there is a file called "cpus" inside the PMU's sysfs directory. It seems on s390 none of the PMUs are considered "core" and this is why we're having issues. Looking at: https://lore.kernel.org/lkml/20180416132314.33249-1-tmricht@linux.ibm.com/ It would seem to make sense that "cpu", "cpum_cf" and "cpum_sf" should all cause "is_pmu_core" to return true, From your output I suspect the issue is that cpum_cf and cpum_sf both lack the "cpus" file - which is also true on homogeneous x86's "cpu" PMU. We can fix the code with: ``` diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c index 0520aa9fe991..bdc3f3b148fc 100644 --- a/tools/perf/util/pmu.c +++ b/tools/perf/util/pmu.c @@ -1414,7 +1414,9 @@ void perf_pmu__del_formats(struct list_head *formats) bool is_pmu_core(const char *name) { - return !strcmp(name, "cpu") || is_sysfs_pmu_core(name); + return !strcmp(name, "cpu") || + !strcmp(name, "cpum_cf") || !strcmp(name, "cpum_sf") || + is_sysfs_pmu_core(name); } bool perf_pmu__supports_legacy_cache(const struct perf_pmu *pmu) ``` Wdyt? Thanks, Ian > -- > Thomas Richter, Dept 3303, IBM s390 Linux Development, Boeblingen, Germany > -- > Vorsitzender des Aufsichtsrats: Gregor Pillen > Geschäftsführung: David Faller > Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 > ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH v4 3/4] perf/core: Remove pmu linear searching code 2023-05-31 20:20 ` Ian Rogers @ 2023-06-01 11:02 ` Thomas Richter 2023-06-01 11:18 ` Peter Zijlstra 0 siblings, 1 reply; 19+ messages in thread From: Thomas Richter @ 2023-06-01 11:02 UTC (permalink / raw) To: Ian Rogers Cc: Marc Zyngier, Oliver Upton, Peter Zijlstra, Ravi Bangoria, Nathan Chancellor, namhyung, eranian, acme, mark.rutland, jolsa, bp, kan.liang, adrian.hunter, maddy, x86, linux-perf-users, linux-kernel, sandipan.das, ananth.narayan, santosh.shukla, kvmarm On 5/31/23 22:20, Ian Rogers wrote: > On Wed, May 31, 2023 at 2:09 AM Thomas Richter <tmricht@linux.ibm.com> wrote: >> >> On 5/30/23 16:00, Ian Rogers wrote: >>> ls /sys/devices/*/cpu* >> >> Hi Ian, >> >> here is the output yo requested: >> >> # ls /sys/devices/*/cpu* >> cpu0 cpu3 cpu6 hotplug modalias possible smt >> cpu1 cpu4 cpu7 isolated offline present uevent >> cpu2 cpu5 dispatching kernel_max online rescan vulnerabilities >> # >> >> In fact it is the same as >> # ls /sys/devices/system/cpu/ >> cpu0 cpu3 cpu6 hotplug modalias possible smt >> cpu1 cpu4 cpu7 isolated offline present uevent >> cpu2 cpu5 dispatching kernel_max online rescan vulnerabilities >> # >> This directory tree has nothing to do with events for perf, it is >> merely used to support CPU hotplug on s390. >> >> The PMUs on s390 are >> # ls -ld /sys/devices/{cpum_,pai_}* >> drwxr-xr-x 4 root root 0 May 25 15:45 /sys/devices/cpum_cf >> drwxr-xr-x 4 root root 0 May 25 15:45 /sys/devices/cpum_cf_diag >> drwxr-xr-x 4 root root 0 May 25 15:45 /sys/devices/cpum_sf >> drwxr-xr-x 4 root root 0 May 25 15:45 /sys/devices/pai_crypto >> drwxr-xr-x 4 root root 0 May 25 15:45 /sys/devices/pai_ext >> # >> >> I hope his helps. > > Thanks Thomas, > > The perf tool has a notion of "core" and "other" PMUs - other means > uncore and things like interconnect PMUs. The distinction matters as > PMUs may have a list of CPUs with them, for "other" PMUs the CPU list > (known in the tool as the CPU map) is a suggestion of the CPU to open > events on. For example, on my laptop: > ``` > $ cat /sys/devices/system/cpu/online > 0-15 > $ cat /sys/devices/uncore_imc_free_running_0/cpumask > 0 > ``` > So I have 16 "CPUs" and the memory controller is suggesting opening > the events on CPU 0. However, if I do: > ``` > $ sudo perf stat -e 'uncore_imc_free_running_0/data_read/' -C 8 -a -A sleep 1 > > Performance counter stats for 'system wide': > > CPU8 4,617.60 MiB > uncore_imc_free_running_0/data_read/ > > > 1.001094684 seconds time elapsed > ``` > Then things are good and the event was recorded on CPU 8, even though > the cpumask of the PMU only said CPU 0. > > For "core" PMUs the CPU map works differently. A CPU outside of the > map is an error. For example, if I have a heterogeneous system with 2 > CPUs, the first CPU on 1 PMU and the second CPU on a different PMU, if > I try to open events for the wrong PMU type for the CPU then it > should fail. The CPU map should be interpreted as the set of CPUs > events are valid on. > > The logic to determine if a PMU is "core" or "other" is here: > https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/tools/perf/util/pmu.c?h=perf-tools-next#n1417 > https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/tools/perf/util/pmu.c?h=perf-tools-next#n609 > > That is, a PMU is a "core" PMU if it is called "cpu" or if there is a > file called "cpus" inside the PMU's sysfs directory. It seems on s390 > none of the PMUs are considered "core" and this is why we're having > issues. > > Looking at: > https://lore.kernel.org/lkml/20180416132314.33249-1-tmricht@linux.ibm.com/ > > It would seem to make sense that "cpu", "cpum_cf" and "cpum_sf" should > all cause "is_pmu_core" to return true, From your output I suspect the > issue is that cpum_cf and cpum_sf both lack the "cpus" file - which is > also true on homogeneous x86's "cpu" PMU. We can fix the code with: > > ``` > diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c > index 0520aa9fe991..bdc3f3b148fc 100644 > --- a/tools/perf/util/pmu.c > +++ b/tools/perf/util/pmu.c > @@ -1414,7 +1414,9 @@ void perf_pmu__del_formats(struct list_head *formats) > > bool is_pmu_core(const char *name) > { > - return !strcmp(name, "cpu") || is_sysfs_pmu_core(name); > + return !strcmp(name, "cpu") || > + !strcmp(name, "cpum_cf") || !strcmp(name, "cpum_sf") || > + is_sysfs_pmu_core(name); > } > > bool perf_pmu__supports_legacy_cache(const struct perf_pmu *pmu) > ``` > > Wdyt? Thanks, > Ian > Ian, thanks very much for your help. I changed the code as you suggested and added s390's PMU cpum_cf as in bool is_pmu_core(const char *name) { return !strcmp(name, "cpu") || !strcmp(name, "cpum_cf") || is_sysfs_pmu_core(name); } The result is much better, now the - test case 6.1 fails only 2 times and for a different reason: running test 2 'r1a' FAILED tests/parse-events.c:125 Raw PMU not matched Event test failure: test 2 'r1a' rc -1 ... running test 14 'r1a:kp' FAILED tests/parse-events.c:125 Raw PMU not matched Event test failure: test 14 'r1a:kp' rc -1 - test case 10.2 fails for an unknown event bp_l1_btb_correct. 10.2: PMU event map aliases : --- start --- Using CPUID IBM,8561,703,T01,3.6,002f testing aliases core PMU cpum_cf: no alias, alias_table->name=bp_l1_btb_correct testing core PMU cpum_cf aliases: failed ---- end ---- PMU events subtest 2: FAILED! - test case 68 now succeeds: # ./perf test 68 68: Parse and process metrics : Ok # Regarding test case 6.1: Raw PMU not matched is printed in tests/parse-events.c::125 Debugging revealed that function test_event() +--> parse_events() which returns 0 +--> e->check() ---> test__checkevent_raw() This function has TEST_ASSERT_VAL("Raw PMU not matched", raw_type_match); triggering -1 because the PMU has type 8 which is not PERF_TYPE_RAW Maybe make type >= PERF_TYPE_RAW to succeed the test? Regarding test case 10.2: the event bp_l1_btb_correct does not exist on s390. It is defined in - pmu-events/empty-pmu-events.c - pmu-events/arch/test/test_soc/cpu/branch.json Not sure which one is used. But this event is unknown on s390 What is the best way to fix these issues. PS: I have the feeling, it gets complicated to have multiple hardware PMUs per platform. -- Thomas Richter, Dept 3303, IBM s390 Linux Development, Boeblingen, Germany -- Vorsitzender des Aufsichtsrats: Gregor Pillen Geschäftsführung: David Faller Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v4 3/4] perf/core: Remove pmu linear searching code 2023-06-01 11:02 ` Thomas Richter @ 2023-06-01 11:18 ` Peter Zijlstra 2023-06-01 11:20 ` Peter Zijlstra 0 siblings, 1 reply; 19+ messages in thread From: Peter Zijlstra @ 2023-06-01 11:18 UTC (permalink / raw) To: Thomas Richter Cc: Ian Rogers, Marc Zyngier, Oliver Upton, Ravi Bangoria, Nathan Chancellor, namhyung, eranian, acme, mark.rutland, jolsa, bp, kan.liang, adrian.hunter, maddy, x86, linux-perf-users, linux-kernel, sandipan.das, ananth.narayan, santosh.shukla, kvmarm On Thu, Jun 01, 2023 at 01:02:30PM +0200, Thomas Richter wrote: > PS: I have the feeling, it gets complicated to have multiple hardware PMUs > per platform. Recently someone was poking around giving the pmu device a parent, this would, I think, result in sysfs links, which could be used to decide what's what. Core pmus would have the CPU device as their parent, while memory controller thingies would link to the relevant node or something. Ofc. all that's future-work/pending. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v4 3/4] perf/core: Remove pmu linear searching code 2023-06-01 11:18 ` Peter Zijlstra @ 2023-06-01 11:20 ` Peter Zijlstra 0 siblings, 0 replies; 19+ messages in thread From: Peter Zijlstra @ 2023-06-01 11:20 UTC (permalink / raw) To: Thomas Richter Cc: Ian Rogers, Marc Zyngier, Oliver Upton, Ravi Bangoria, Nathan Chancellor, namhyung, eranian, acme, mark.rutland, jolsa, bp, kan.liang, adrian.hunter, maddy, x86, linux-perf-users, linux-kernel, sandipan.das, ananth.narayan, santosh.shukla, kvmarm On Thu, Jun 01, 2023 at 01:18:56PM +0200, Peter Zijlstra wrote: > On Thu, Jun 01, 2023 at 01:02:30PM +0200, Thomas Richter wrote: > > > PS: I have the feeling, it gets complicated to have multiple hardware PMUs > > per platform. > > Recently someone was poking around giving the pmu device a parent, this > would, I think, result in sysfs links, which could be used to decide > what's what. > > Core pmus would have the CPU device as their parent, while memory > controller thingies would link to the relevant node or something. > > Ofc. all that's future-work/pending. https://lkml.kernel.org/r/20230404134225.13408-1-Jonathan.Cameron%40huawei.com https://lkml.kernel.org/r/20230531110011.13963-2-Jonathan.Cameron%40huawei.com ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v4 3/4] perf/core: Remove pmu linear searching code 2023-05-25 7:11 ` Oliver Upton 2023-05-25 14:20 ` Peter Zijlstra @ 2023-05-25 15:55 ` Nathan Chancellor 1 sibling, 0 replies; 19+ messages in thread From: Nathan Chancellor @ 2023-05-25 15:55 UTC (permalink / raw) To: Oliver Upton Cc: Ravi Bangoria, peterz, namhyung, eranian, acme, mark.rutland, jolsa, irogers, bp, kan.liang, adrian.hunter, maddy, x86, linux-perf-users, linux-kernel, sandipan.das, ananth.narayan, santosh.shukla, maz, kvmarm On Thu, May 25, 2023 at 07:11:41AM +0000, Oliver Upton wrote: > On Thu, May 25, 2023 at 10:46:01AM +0530, Ravi Bangoria wrote: > > On 25-May-23 3:11 AM, Nathan Chancellor wrote: > > > My apologies if this has already been reported or fixed already, I did a > > > search of lore.kernel.org and did not find anything. This patch as > > > commit 9551fbb64d09 ("perf/core: Remove pmu linear searching code") in > > > -next breaks starting QEMU with KVM enabled on two of my arm64 machines: > > > > > > $ qemu-system-aarch64 \ > > > -display none \ > > > -nodefaults \ > > > -machine virt,gic-version=max \ > > > -append 'console=ttyAMA0 earlycon' \ > > > -kernel arch/arm64/boot/Image.gz \ > > > -initrd rootfs.cpio \ > > > -cpu host \ > > > -enable-kvm \ > > > -m 512m \ > > > -smp 8 \ > > > -serial mon:stdio > > > qemu-system-aarch64: PMU: KVM_SET_DEVICE_ATTR: No such device > > > qemu-system-aarch64: failed to set irq for PMU > > > > > > In the kernel log, I see > > > > > > [ 42.944952] kvm: pmu event creation failed -2 > > > > > > I am not sure if this issue is unexpected as a result of this change or > > > if there is something that needs to change on the arm64 KVM side (it > > > appears the kernel message comes from arch/arm64/kvm/pmu-emul.c). > > > > Thanks for reporting it. > > > > Based on these detail, I feel the pmu registration failed in the host, > > most probably because pmu driver did not pass pmu name while calling > > perf_pmu_register(). Consequently kvm also failed while trying to use > > it for guest. Can you please check host kernel logs. > > The PMUv3 driver does pass a name, but it relies on getting back an > allocated pmu id as @type is -1 in the call to perf_pmu_register(). > > What actually broke is how KVM probes for a default core PMU to use for > a guest. kvm_pmu_probe_armpmu() creates a counter w/ PERF_TYPE_RAW and > reads the pmu from the returned perf_event. The linear search had the > effect of eventually stumbling on the correct core PMU and succeeding. > > Perf folks: is this WAI for heterogenous systems? > > Either way, the whole KVM end of this scheme is a bit clunky, and I > believe it to be unneccessary at this point as we maintain a list of > core PMU instances that KVM is able to virtualize. We can just walk > that to find a default PMU to use. > > Not seeing any issues on -next with the below diff. If this works for > folks I can actually wrap it up in a patch and send it out. I can start QEMU on both the machines that had issues and my machines continue to run without any visible issues but I have never done any profile work within them. If there is any further testing or validation that I should do, I am more than happy to do so. Until then, consider it: Tested-by: Nathan Chancellor <nathan@kernel.org> > diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c > index 45727d50d18d..cbc0b662b7f8 100644 > --- a/arch/arm64/kvm/pmu-emul.c > +++ b/arch/arm64/kvm/pmu-emul.c > @@ -694,47 +694,26 @@ void kvm_host_pmu_init(struct arm_pmu *pmu) > > static struct arm_pmu *kvm_pmu_probe_armpmu(void) > { > - struct perf_event_attr attr = { }; > - struct perf_event *event; > - struct arm_pmu *pmu = NULL; > - > - /* > - * Create a dummy event that only counts user cycles. As we'll never > - * leave this function with the event being live, it will never > - * count anything. But it allows us to probe some of the PMU > - * details. Yes, this is terrible. > - */ > - attr.type = PERF_TYPE_RAW; > - attr.size = sizeof(attr); > - attr.pinned = 1; > - attr.disabled = 0; > - attr.exclude_user = 0; > - attr.exclude_kernel = 1; > - attr.exclude_hv = 1; > - attr.exclude_host = 1; > - attr.config = ARMV8_PMUV3_PERFCTR_CPU_CYCLES; > - attr.sample_period = GENMASK(63, 0); > + struct arm_pmu *arm_pmu = NULL, *tmp; > + struct arm_pmu_entry *entry; > + int cpu; > > - event = perf_event_create_kernel_counter(&attr, -1, current, > - kvm_pmu_perf_overflow, &attr); > + mutex_lock(&arm_pmus_lock); > + cpu = get_cpu(); > > - if (IS_ERR(event)) { > - pr_err_once("kvm: pmu event creation failed %ld\n", > - PTR_ERR(event)); > - return NULL; > - } > + list_for_each_entry(entry, &arm_pmus, entry) { > + tmp = entry->arm_pmu; > > - if (event->pmu) { > - pmu = to_arm_pmu(event->pmu); > - if (pmu->pmuver == ID_AA64DFR0_EL1_PMUVer_NI || > - pmu->pmuver == ID_AA64DFR0_EL1_PMUVer_IMP_DEF) > - pmu = NULL; > + if (cpumask_test_cpu(cpu, &tmp->supported_cpus)) { > + arm_pmu = tmp; > + break; > + } > } > > - perf_event_disable(event); > - perf_event_release_kernel(event); > + put_cpu(); > + mutex_unlock(&arm_pmus_lock); > > - return pmu; > + return arm_pmu; > } > > u64 kvm_pmu_get_pmceid(struct kvm_vcpu *vcpu, bool pmceid1) > > -- > Thanks, > Oliver ^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2023-06-01 11:20 UTC | newest]
Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20230504110003.2548-1-ravi.bangoria@amd.com>
[not found] ` <20230504110003.2548-4-ravi.bangoria@amd.com>
2023-05-24 21:41 ` [PATCH v4 3/4] perf/core: Remove pmu linear searching code Nathan Chancellor
2023-05-25 5:16 ` Ravi Bangoria
2023-05-25 7:11 ` Oliver Upton
2023-05-25 14:20 ` Peter Zijlstra
2023-05-25 15:56 ` Oliver Upton
2023-05-26 23:00 ` Ian Rogers
2023-05-27 13:32 ` Marc Zyngier
2023-05-27 17:00 ` Ian Rogers
2023-05-27 17:05 ` Ian Rogers
2023-05-27 18:38 ` Marc Zyngier
2023-05-27 19:50 ` Ian Rogers
2023-05-30 7:45 ` Thomas Richter
2023-05-30 14:00 ` Ian Rogers
2023-05-31 9:09 ` Thomas Richter
2023-05-31 20:20 ` Ian Rogers
2023-06-01 11:02 ` Thomas Richter
2023-06-01 11:18 ` Peter Zijlstra
2023-06-01 11:20 ` Peter Zijlstra
2023-05-25 15:55 ` Nathan Chancellor
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox