From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-11.2 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,INCLUDES_PATCH,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id CB5FDC43381 for ; Mon, 15 Mar 2021 13:11:08 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 954F264EB3 for ; Mon, 15 Mar 2021 13:11:08 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230000AbhCONKi (ORCPT ); Mon, 15 Mar 2021 09:10:38 -0400 Received: from mail.kernel.org ([198.145.29.99]:38748 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229807AbhCONKR (ORCPT ); Mon, 15 Mar 2021 09:10:17 -0400 Received: by mail.kernel.org (Postfix) with ESMTPSA id AA49A64E74; Mon, 15 Mar 2021 13:10:16 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1615813817; bh=ICd/Atj2FZXZ0iL/M9SMQR97uK2V17AYDzyyz2GmxsA=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=acX2vCxhcUiwzCPU2wbxWQavkq6vI2lb8d8timNCAEOGqa8nGqSYqB77e7TeZwDBC zMm/WhnabaLO+X1UMMywJJaRLpIoXMgKvlWd34huPEGv7o49OBelCHQGQLvx7WZJjz S9/81127GYN3W4tEGz+eGLFvm3EC2OIxsB0qCltXr783Vh61OFIIKyF8FtB2XSZ/nW K79QEcMdjzjM1IRoVasGVIDZs4+WnDRsSXBHVSmaw1SFr8DBXu4Kzsx24Pa5/U4t8U yTn/uqoIOc1PID2WzWpMRf5/9yjFB/VXD9wvnhlyv28D2zF7oHDwYqtOp1z5G8aVMi pt8WQKrdEBpGA== Received: by quaco.ghostprotocols.net (Postfix, from userid 1000) id 680E440647; Mon, 15 Mar 2021 10:10:14 -0300 (-03) Date: Mon, 15 Mar 2021 10:10:14 -0300 From: Arnaldo Carvalho de Melo To: Song Liu , Andi Kleen , Jiri Olsa Cc: linux-kernel , Kernel Team , "namhyung@kernel.org" , "linux-perf-users@ghostprotocols.net" Subject: Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF Message-ID: References: <20210312020257.197137-1-songliubraving@fb.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Url: http://acmel.wordpress.com Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Em Fri, Mar 12, 2021 at 06:52:39PM +0000, Song Liu escreveu: > > On Mar 12, 2021, at 6:24 AM, Arnaldo Carvalho de Melo wrote: > > Em Thu, Mar 11, 2021 at 06:02:57PM -0800, Song Liu escreveu: > >> perf uses performance monitoring counters (PMCs) to monitor system > >> performance. The PMCs are limited hardware resources. For example, > >> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu. > >> > >> Modern data center systems use these PMCs in many different ways: > >> system level monitoring, (maybe nested) container level monitoring, per > >> process monitoring, profiling (in sample mode), etc. In some cases, > >> there are more active perf_events than available hardware PMCs. To allow > >> all perf_events to have a chance to run, it is necessary to do expensive > >> time multiplexing of events. > >> > >> On the other hand, many monitoring tools count the common metrics (cycles, > >> instructions). It is a waste to have multiple tools create multiple > >> perf_events of "cycles" and occupy multiple PMCs. > >> > >> bperf tries to reduce such wastes by allowing multiple perf_events of > >> "cycles" or "instructions" (at different scopes) to share PMUs. Instead > >> of having each perf-stat session to read its own perf_events, bperf uses > >> BPF programs to read the perf_events and aggregate readings to BPF maps. > >> Then, the perf-stat session(s) reads the values from these BPF maps. > >> > >> Please refer to the comment before the definition of bperf_ops for the > >> description of bperf architecture. > >> > >> bperf is off by default. To enable it, pass --use-bpf option to perf-stat. > >> bperf uses a BPF hashmap to share information about BPF programs and maps > >> used by bperf. This map is pinned to bpffs. The default address is > >> /sys/fs/bpf/bperf_attr_map. The user could change the address with option > >> --attr-map. > >> > >> --- > >> Known limitations: > >> 1. Do not support per cgroup events; > >> 2. Do not support monitoring of BPF program (perf-stat -b); > >> 3. Do not support event groups. > >> The following commands have been tested: > >> > >> perf stat --use-bpf -e cycles -a > >> perf stat --use-bpf -e cycles -C 1,3,4 > >> perf stat --use-bpf -e cycles -p 123 > >> perf stat --use-bpf -e cycles -t 100,101 > >> @@ -925,15 +931,15 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx) > >> /* > >> * Enable counters and exec the command: > >> */ > >> - t0 = rdclock(); > >> - clock_gettime(CLOCK_MONOTONIC, &ref_time); > >> - > >> if (forks) { > >> evlist__start_workload(evsel_list); > >> err = enable_counters(); > >> if (err) > >> return -1; > >> > >> + t0 = rdclock(); > >> + clock_gettime(CLOCK_MONOTONIC, &ref_time); > >> + > >> if (interval || timeout || evlist__ctlfd_initialized(evsel_list)) > >> status = dispatch_events(forks, timeout, interval, ×); > >> if (child_pid != -1) { > >> @@ -954,6 +960,10 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx) > >> err = enable_counters(); > >> if (err) > >> return -1; > >> + > >> + t0 = rdclock(); > >> + clock_gettime(CLOCK_MONOTONIC, &ref_time); > >> + > >> status = dispatch_events(forks, timeout, interval, ×); > >> } > >> > > > > The above two hunks seems out of place, i.e. can they go to a different > > patch and with an explanation about why this is needed? > > Actually, I am still debating whether we want the above change in a separate > patch. It is related to the following change. > > [...] > > >> + /* > >> + * Attahcing the skeleton takes non-trivial time (0.2s+ on a kernel > >> + * with some debug options enabled). This shows as a longer first > >> + * interval: > >> + * > >> + * # perf stat -e cycles -a -I 1000 > >> + * # time counts unit events > >> + * 1.267634674 26,259,166,523 cycles > >> + * 2.271637827 22,550,822,286 cycles > >> + * 3.275406553 22,852,583,744 cycles > >> + * > >> + * Fix this by zeroing accum_readings after attaching the program. > >> + */ > >> + bperf_sync_counters(evsel); > >> + entry_cnt = bpf_map__max_entries(skel->maps.accum_readings); > >> + memset(values, 0, sizeof(struct bpf_perf_event_value) * num_cpu_bpf); > >> + > >> + for (i = 0; i < entry_cnt; i++) { > >> + bpf_map_update_elem(bpf_map__fd(skel->maps.accum_readings), > >> + &i, values, BPF_ANY); > >> + } > >> + return 0; > >> +} > > Attaching the skeleton takes non-trivial time, so that we get a bigger first > interval (1.26s in the example above). To fix this, in __run_perf_stat(), we > get t0 and ref_time after enable_counters(). > > Maybe a comment in __run_perf_stat() is better than a separate patch? I still think that there is value in taking those measurements after we enable the counters, as, for instance, for interval mode we want measurements with the counters enabled, whatever happens before the counters are enabled is just startup time, etc. Jiri, Andi? - Arnaldo