From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 21BC9C77B7A for ; Mon, 17 Apr 2023 17:31:25 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229738AbjDQRbX (ORCPT ); Mon, 17 Apr 2023 13:31:23 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:54620 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229658AbjDQRbW (ORCPT ); Mon, 17 Apr 2023 13:31:22 -0400 Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 7C743E6; Mon, 17 Apr 2023 10:31:19 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1681752679; x=1713288679; h=message-id:date:mime-version:subject:to:cc:references: from:in-reply-to:content-transfer-encoding; bh=g4AzBVEKauljUK1zQa8QNsPzUyopAAURR4dQIly5CDI=; b=A1xU4mZipM6cmKyvxx8D8+r9oJdbU/SMnmupxPJfV2NVgMZiniIAC8zQ 5IdmJW/VBYxzVcKR4T/83lfsz4ph0iiWDg4+WrzQ6NVU+Juo6TcaBghru eCfv8UjnP07/1VevSo7EtVW9dCAy54D7E9eoa3nxw62KUZFtdGXRxF75e k/EZjeCWbmKtemlBA63UoU9YIX1p9T7bJaL/Gpi/t3J8v6B12gXIiU1XE HN/zyTOlj+vaLgPGLh7E5LtiGsQquRrN/lPzTiBqI+plFyBuQc17U2TBO GnosaWcYYAS4JVVWUVA0Kr/nDw6ddNon9Q/xWGxz4BiLYCssyXvRadVCG Q==; X-IronPort-AV: E=McAfee;i="6600,9927,10683"; a="346800365" X-IronPort-AV: E=Sophos;i="5.99,204,1677571200"; d="scan'208";a="346800365" Received: from orsmga002.jf.intel.com ([10.7.209.21]) by orsmga102.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 17 Apr 2023 10:31:18 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10683"; a="690742740" X-IronPort-AV: E=Sophos;i="5.99,204,1677571200"; d="scan'208";a="690742740" Received: from linux.intel.com ([10.54.29.200]) by orsmga002.jf.intel.com with ESMTP; 17 Apr 2023 10:31:18 -0700 Received: from [10.212.151.175] (kliang2-mobl1.ccr.corp.intel.com [10.212.151.175]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by linux.intel.com (Postfix) with ESMTPS id E7CF5580677; Mon, 17 Apr 2023 10:31:16 -0700 (PDT) Message-ID: Date: Mon, 17 Apr 2023 13:31:15 -0400 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.10.0 Subject: Re: [PATCH v2] perf stat: Introduce skippable evsels Content-Language: en-US To: Ian Rogers Cc: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Adrian Hunter , Florian Fischer , linux-perf-users@vger.kernel.org, linux-kernel@vger.kernel.org References: <20230414051922.3625666-1-irogers@google.com> <56ac61a0-ccf0-210e-e429-11062a07b8bf@linux.intel.com> <5031f492-9734-be75-3283-5961771d87c8@linux.intel.com> From: "Liang, Kan" In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: X-Mailing-List: linux-perf-users@vger.kernel.org On 2023-04-17 11:59 a.m., Ian Rogers wrote: > On Mon, Apr 17, 2023 at 6:58 AM Liang, Kan wrote: >> >> >> >> On 2023-04-14 7:03 p.m., Ian Rogers wrote: >>> On Fri, Apr 14, 2023 at 11:07 AM Liang, Kan wrote: >>>> >>>> >>>> >>>> On 2023-04-14 1:19 a.m., Ian Rogers wrote: >>>>> Perf stat with no arguments will use default events and metrics. These >>>>> events may fail to open even with kernel and hypervisor disabled. When >>>>> these fail then the permissions error appears even though they were >>>>> implicitly selected. This is particularly a problem with the automatic >>>>> selection of the TopdownL1 metric group on certain architectures like >>>>> Skylake: >>>>> >>>>> ``` >>>>> $ perf stat true >>>>> Error: >>>>> Access to performance monitoring and observability operations is limited. >>>>> Consider adjusting /proc/sys/kernel/perf_event_paranoid setting to open >>>>> access to performance monitoring and observability operations for processes >>>>> without CAP_PERFMON, CAP_SYS_PTRACE or CAP_SYS_ADMIN Linux capability. >>>>> More information can be found at 'Perf events and tool security' document: >>>>> https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html >>>>> perf_event_paranoid setting is 2: >>>>> -1: Allow use of (almost) all events by all users >>>>> Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK >>>>>> = 0: Disallow raw and ftrace function tracepoint access >>>>>> = 1: Disallow CPU event access >>>>>> = 2: Disallow kernel profiling >>>>> To make the adjusted perf_event_paranoid setting permanent preserve it >>>>> in /etc/sysctl.conf (e.g. kernel.perf_event_paranoid = ) >>>>> ``` >>>>> >>>>> This patch adds skippable evsels that when they fail to open won't >>>>> fail and won't appear in output. The TopdownL1 events, from the metric >>>>> group, are marked as skippable. This turns the failure above to: >>>>> >>>>> ``` >>>>> $ perf stat true >>>>> >>>>> Performance counter stats for 'true': >>>>> >>>>> 1.26 msec task-clock:u # 0.328 CPUs utilized >>>>> 0 context-switches:u # 0.000 /sec >>>>> 0 cpu-migrations:u # 0.000 /sec >>>>> 49 page-faults:u # 38.930 K/sec >>>>> 176,449 cycles:u # 0.140 GHz (48.99%) >>>> >>>> Multiplexing? >>>> >>>> Thanks, >>>> Kan >>> >>> I may have been running a test in the background otherwise I can't >>> explain it. Repeating the test yields no multiplexing: >> >> >> The above multiplexing should be on a Skylake (since there is no >> topdownL1 printed), but the test which you repeat seems on a Tigerlake >> (has topdownL1). Could you please double check on a Skylake? > > In the best circumstances (ie no EBS_Mode, no other events, nmi > watchdog disabled) Skylake has multiplexing for TopdownL1: > No I mean the perf stat default on Skylake. With this patch, on a Skylake machine, there should be no TopdownL1 event and no multiplexing. >From your test result in the v2 description, we can see that there is no TopdownL1, which is good and expected. However, there is a (48.99%) with cycles:u event, which implies multiplexing. Could you please check what's the problem here? Also, if it's because of the backgroud, all the events should be multiplexing. But it looks like only cycle:u has multiplexing. Other events, instructions:u, branches:u and branch-misses:u work without multiplexing. That's very strange. > ``` > $ sudo perf stat --metric-no-group -M TopdownL1 --metric-no-group -a sleep 1 > > Performance counter stats for 'system wide': > > 500,145,019 INST_RETIRED.ANY # 14.2 % > tma_retiring (71.07%) > 2,402,640,337 CPU_CLK_UNHALTED.THREAD_ANY # 41.1 % > tma_frontend_bound > # 36.2 % > tma_backend_bound > # 8.4 % > tma_bad_speculation (85.63%) > 1,976,769,761 IDQ_UOPS_NOT_DELIVERED.CORE > (85.81%) > 114,069,133 INT_MISC.RECOVERY_CYCLES_ANY > (85.83%) > 684,605,487 UOPS_RETIRED.RETIRE_SLOTS > (85.83%) > 49,695,823 UOPS_RETIRED.MACRO_FUSED > (85.83%) > 860,603,122 UOPS_ISSUED.ANY > (56.70%) > > 1.014011174 seconds time elapsed > ``` > > but this isn't a regression: > https://lore.kernel.org/lkml/20200520072814.128267-1-irogers@google.com/ > > I think this is off-topic for this change. > > Thanks, > Ian > >> Thanks, >> Kan >>> >>> ``` >>> $ perf stat true >>> >>> Performance counter stats for 'true': >>> >>> 0.78 msec task-clock:u # 0.383 >>> CPUs utilized >>> 0 context-switches:u # 0.000 >>> /sec >>> 0 cpu-migrations:u # 0.000 >>> /sec >>> 47 page-faults:u # 60.174 >>> K/sec >>> 233,420 cycles:u # 0.299 GHz >>> 133,318 instructions:u # 0.57 >>> insn per cycle >>> 31,396 branches:u # 40.196 >>> M/sec >>> 2,334 branch-misses:u # 7.43% of >>> all branches >>> 1,167,100 TOPDOWN.SLOTS:u # 12.2 % >>> tma_retiring >>> # 28.9 % >>> tma_backend_bound >>> # 41.0 % >>> tma_frontend_bound >>> # 18.0 % >>> tma_bad_speculation >>> 141,882 topdown-retiring:u >>> 480,570 topdown-fe-bound:u >>> 320,380 topdown-be-bound:u >>> 224,266 topdown-bad-spec:u >>> 2,173 INT_MISC.UOP_DROPPING:u # 2.782 >>> M/sec >>> 3,323 cpu/INT_MISC.RECOVERY_CYCLES,cmask=1,edge/u # >>> 4.254 M/sec >>> >>> >>> 0.002036744 seconds time elapsed >>> >>> 0.002252000 seconds user >>> 0.000000000 seconds sys >>> ``` >>> This is apparently on a TigerLake, which has TopdownL1 with the perf stat default. I don't have a problem with this. Thanks, Kan >>> Thanks, >>> Ian >>> >>>>> 122,905 instructions:u # 0.70 insn per cycle >>>>> 28,264 branches:u # 22.456 M/sec >>>>> 2,405 branch-misses:u # 8.51% of all branches >>>>> >>>>> 0.003834565 seconds time elapsed >>>>> >>>>> 0.000000000 seconds user >>>>> 0.004130000 seconds sys >>>>> ``` >>>>> >>>>> When the events can have kernel/hypervisor disabled, like on >>>>> Tigerlake, then it continues to succeed as: >>>>> >>>>> ``` >>>>> $ perf stat true >>>>> >>>>> Performance counter stats for 'true': >>>>> >>>>> 0.57 msec task-clock:u # 0.385 CPUs utilized >>>>> 0 context-switches:u # 0.000 /sec >>>>> 0 cpu-migrations:u # 0.000 /sec >>>>> 47 page-faults:u # 82.329 K/sec >>>>> 287,017 cycles:u # 0.503 GHz >>>>> 133,318 instructions:u # 0.46 insn per cycle >>>>> 31,396 branches:u # 54.996 M/sec >>>>> 2,442 branch-misses:u # 7.78% of all branches >>>>> 998,790 TOPDOWN.SLOTS:u # 14.5 % tma_retiring >>>>> # 27.6 % tma_backend_bound >>>>> # 40.9 % tma_frontend_bound >>>>> # 17.0 % tma_bad_speculation >>>>> 144,922 topdown-retiring:u >>>>> 411,266 topdown-fe-bound:u >>>>> 258,510 topdown-be-bound:u >>>>> 184,090 topdown-bad-spec:u >>>>> 2,585 INT_MISC.UOP_DROPPING:u # 4.528 M/sec >>>>> 3,434 cpu/INT_MISC.RECOVERY_CYCLES,cmask=1,edge/u # 6.015 M/sec >>>>> >>>>> 0.001480954 seconds time elapsed >>>>> >>>>> 0.000000000 seconds user >>>>> 0.001686000 seconds sys >>>>> ``` >>>>> >>>>> And this likewise works if paranoia allows or running as root. >>>>> >>>>> v2. Don't display the skipped events as or . >>>>> >>>>> Signed-off-by: Ian Rogers >>>>> --- >>>>> tools/perf/builtin-stat.c | 39 ++++++++++++++++++++++++++-------- >>>>> tools/perf/util/evsel.c | 15 +++++++++++-- >>>>> tools/perf/util/evsel.h | 1 + >>>>> tools/perf/util/stat-display.c | 4 ++++ >>>>> 4 files changed, 48 insertions(+), 11 deletions(-) >>>>> >>>>> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c >>>>> index d3cbee7460fc..7a641a67486d 100644 >>>>> --- a/tools/perf/builtin-stat.c >>>>> +++ b/tools/perf/builtin-stat.c >>>>> @@ -667,6 +667,13 @@ static enum counter_recovery stat_handle_error(struct evsel *counter) >>>>> evsel_list->core.threads->err_thread = -1; >>>>> return COUNTER_RETRY; >>>>> } >>>>> + } else if (counter->skippable) { >>>>> + if (verbose > 0) >>>>> + ui__warning("skipping event %s that kernel failed to open .\n", >>>>> + evsel__name(counter)); >>>>> + counter->supported = false; >>>>> + counter->errored = true; >>>>> + return COUNTER_SKIP; >>>>> } >>>>> >>>>> evsel__open_strerror(counter, &target, errno, msg, sizeof(msg)); >>>>> @@ -1885,15 +1892,29 @@ static int add_default_attributes(void) >>>>> * Add TopdownL1 metrics if they exist. To minimize >>>>> * multiplexing, don't request threshold computation. >>>>> */ >>>>> - if (metricgroup__has_metric("TopdownL1") && >>>>> - metricgroup__parse_groups(evsel_list, "TopdownL1", >>>>> - /*metric_no_group=*/false, >>>>> - /*metric_no_merge=*/false, >>>>> - /*metric_no_threshold=*/true, >>>>> - stat_config.user_requested_cpu_list, >>>>> - stat_config.system_wide, >>>>> - &stat_config.metric_events) < 0) >>>>> - return -1; >>>>> + if (metricgroup__has_metric("TopdownL1")) { >>>>> + struct evlist *metric_evlist = evlist__new(); >>>>> + struct evsel *metric_evsel; >>>>> + >>>>> + if (!metric_evlist) >>>>> + return -1; >>>>> + >>>>> + if (metricgroup__parse_groups(metric_evlist, "TopdownL1", >>>>> + /*metric_no_group=*/false, >>>>> + /*metric_no_merge=*/false, >>>>> + /*metric_no_threshold=*/true, >>>>> + stat_config.user_requested_cpu_list, >>>>> + stat_config.system_wide, >>>>> + &stat_config.metric_events) < 0) >>>>> + return -1; >>>>> + >>>>> + evlist__for_each_entry(metric_evlist, metric_evsel) { >>>>> + metric_evsel->skippable = true; >>>>> + } >>>>> + evlist__splice_list_tail(evsel_list, &metric_evlist->core.entries); >>>>> + evlist__delete(metric_evlist); >>>>> + } >>>>> + >>>>> /* Platform specific attrs */ >>>>> if (evlist__add_default_attrs(evsel_list, default_null_attrs) < 0) >>>>> return -1; >>>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c >>>>> index a85a987128aa..83a65f771666 100644 >>>>> --- a/tools/perf/util/evsel.c >>>>> +++ b/tools/perf/util/evsel.c >>>>> @@ -290,6 +290,7 @@ void evsel__init(struct evsel *evsel, >>>>> evsel->per_pkg_mask = NULL; >>>>> evsel->collect_stat = false; >>>>> evsel->pmu_name = NULL; >>>>> + evsel->skippable = false; >>>>> } >>>>> >>>>> struct evsel *evsel__new_idx(struct perf_event_attr *attr, int idx) >>>>> @@ -1720,9 +1721,13 @@ static int get_group_fd(struct evsel *evsel, int cpu_map_idx, int thread) >>>>> return -1; >>>>> >>>>> fd = FD(leader, cpu_map_idx, thread); >>>>> - BUG_ON(fd == -1); >>>>> + BUG_ON(fd == -1 && !leader->skippable); >>>>> >>>>> - return fd; >>>>> + /* >>>>> + * When the leader has been skipped, return -2 to distinguish from no >>>>> + * group leader case. >>>>> + */ >>>>> + return fd == -1 ? -2 : fd; >>>>> } >>>>> >>>>> static void evsel__remove_fd(struct evsel *pos, int nr_cpus, int nr_threads, int thread_idx) >>>>> @@ -2104,6 +2109,12 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus, >>>>> >>>>> group_fd = get_group_fd(evsel, idx, thread); >>>>> >>>>> + if (group_fd == -2) { >>>>> + pr_debug("broken group leader for %s\n", evsel->name); >>>>> + err = -EINVAL; >>>>> + goto out_close; >>>>> + } >>>>> + >>>>> test_attr__ready(); >>>>> >>>>> /* Debug message used by test scripts */ >>>>> diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h >>>>> index 68072ec655ce..98afe3351176 100644 >>>>> --- a/tools/perf/util/evsel.h >>>>> +++ b/tools/perf/util/evsel.h >>>>> @@ -95,6 +95,7 @@ struct evsel { >>>>> bool weak_group; >>>>> bool bpf_counter; >>>>> bool use_config_name; >>>>> + bool skippable; >>>>> int bpf_fd; >>>>> struct bpf_object *bpf_obj; >>>>> struct list_head config_terms; >>>>> diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c >>>>> index e6035ecbeee8..6b46bbb3d322 100644 >>>>> --- a/tools/perf/util/stat-display.c >>>>> +++ b/tools/perf/util/stat-display.c >>>>> @@ -810,6 +810,10 @@ static bool should_skip_zero_counter(struct perf_stat_config *config, >>>>> struct perf_cpu cpu; >>>>> int idx; >>>>> >>>>> + /* Skip counters that were speculatively/default enabled rather than requested. */ >>>>> + if (counter->skippable) >>>>> + return true; >>>>> + >>>>> /* >>>>> * Skip value 0 when enabling --per-thread globally, >>>>> * otherwise it will have too many 0 output.