From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4C9F2274FF1; Wed, 16 Jul 2025 20:28:07 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1752697688; cv=none; b=N0jD6LN41vfKAmGHNR8zPL9Gi/o50dx0iMc0pdteAwsZzLp+6997qbtHv4xIP5lLYvCyGFxIjgQr2R7QPST+NGN4luTGeJm9i4YCn4DXa9Om1bWWez0jT3LqXtSyNd953yNLfs3qN6R5vzVUaSSYREf8pvpJqmqCZGE4ZBrZhZc= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1752697688; c=relaxed/simple; bh=brHo0opNvg7cbMIzp5qYocuQ4Uf/Pi0EVD1qO32ApB0=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=MnOXAmjP4/GxpvMdvkB8Ed0BpG6tQQeg/ruUzXlEI2py9da01c+t3E8YcrwlTtf3Ykv6OB4Y3O9FD8m38IMFiRrFDDD0WPubzwQnrbm4+lz0DoW59nOoUSM2LP1AGCJzbArGeSpR2c5OIzdnKAOHyPsEpVygEanFaJaiZAYRjLM= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=XS9wW+kC; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="XS9wW+kC" Received: by smtp.kernel.org (Postfix) with ESMTPSA id E5057C4CEE7; Wed, 16 Jul 2025 20:28:06 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1752697687; bh=brHo0opNvg7cbMIzp5qYocuQ4Uf/Pi0EVD1qO32ApB0=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=XS9wW+kCAdkIs5Bt2XOj2EhOPkuGPB92ZEvnDz/jfOInfmJCwXIbUYlJOwBm5c5C4 SxQhGI5YL9BpuiZX7QsNK5MSlb0s7xfn0e6vkX/uW8mbEVLs2aXFfxpa252kxruxTq 8BL8JKk3oewMZzEva9iMDYJ5o+TTm7B65eS9c2402Dy+XHytd5KUOprfJjZZNIWElb 6h4FWhOqCuVNNwsbSzyN2HBWpIH1uIFjlZlqrXW0F3n2xNrFhuyUkSX0VFJZ6NuRg8 3M5TPVpBWcerUE4g3ExO4XEFF//YqHQ40uaXOy4/HN5Vv0WfZYp1yadLcRiN5+0tDy 8HPsURqRtwzdw== Date: Wed, 16 Jul 2025 13:28:05 -0700 From: Namhyung Kim To: Ian Rogers Cc: Thomas Falcon , Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Adrian Hunter , Kan Liang , Ben Gainey , James Clark , Howard Chu , Weilin Wang , Levi Yun , "Dr. David Alan Gilbert" , Zhongqiu Han , Blake Jones , Yicong Yang , Anubhav Shelat , Thomas Richter , Jean-Philippe Romain , Song Liu , linux-perf-users@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH v1 12/12] perf parse-events: Support user CPUs mixed with threads/processes Message-ID: References: <20250627192417.1157736-1-irogers@google.com> <20250627192417.1157736-13-irogers@google.com> Precedence: bulk X-Mailing-List: linux-perf-users@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <20250627192417.1157736-13-irogers@google.com> On Fri, Jun 27, 2025 at 12:24:17PM -0700, Ian Rogers wrote: > Counting events system-wide with a specified CPU prior to this change worked: > ``` > $ perf stat -e 'msr/tsc/,msr/tsc,cpu=cpu_core/,msr/tsc,cpu=cpu_atom/' -a sleep 1 > > Performance counter stats for 'system wide': > > 59,393,419,099 msr/tsc/ > 33,927,965,927 msr/tsc,cpu=cpu_core/ > 25,465,608,044 msr/tsc,cpu=cpu_atom/ > ``` > > However, when counting with process the counts became system wide: > ``` > $ perf stat -e 'msr/tsc/,msr/tsc,cpu=cpu_core/,msr/tsc,cpu=cpu_atom/' perf test -F 10 > 10.1: Basic parsing test : Ok > 10.2: Parsing without PMU name : Ok > 10.3: Parsing with PMU name : Ok > > Performance counter stats for 'perf test -F 10': > > 59,233,549 msr/tsc/ > 59,227,556 msr/tsc,cpu=cpu_core/ > 59,224,053 msr/tsc,cpu=cpu_atom/ > ``` > > Make the handling of CPU maps with event parsing clearer. When an > event is parsed creating an evsel the cpus should be either the PMU's > cpumask or user specified CPUs. > > Update perf_evlist__propagate_maps so that it doesn't clobber the user > specified CPUs. Try to make the behavior clearer, firstly fix up > missing cpumasks. Next, perform sanity checks and adjustments from the > global evlist CPU requests and for the PMU including simplifying to > the "any CPU"(-1) value. Finally remove the event if the cpumask is > empty. > > So that events are opened with a CPU and a thread change stat's > create_perf_stat_counter to give both. > > With the change things are fixed: > ``` > $ perf stat --no-scale -e 'msr/tsc/,msr/tsc,cpu=cpu_core/,msr/tsc,cpu=cpu_atom/' perf test -F 10 > 10.1: Basic parsing test : Ok > 10.2: Parsing without PMU name : Ok > 10.3: Parsing with PMU name : Ok > > Performance counter stats for 'perf test -F 10': > > 63,704,975 msr/tsc/ > 47,060,704 msr/tsc,cpu=cpu_core/ (4.62%) > 16,640,591 msr/tsc,cpu=cpu_atom/ (2.18%) > ``` > > However, note the "--no-scale" option is used. This is necessary as > the running time for the event on the counter isn't the same as the > enabled time because the thread doesn't necessarily run on the CPUs > specified for the counter. All counter values are scaled with: > > scaled_value = value * time_enabled / time_running > > and so without --no-scale the scaled_value becomes very large. This > problem already exists on hybrid systems for the same reason. Here are > 2 runs of the same code with an instructions event that counts the > same on both types of core, there is no real multiplexing happening on > the event: This is unfortunate. The event is in a task context but also has a CPU constraint. So it's not schedulable on other CPUs. A problem is that it cannot distinguish the real multiplexing.. > > ``` > $ perf stat -e instructions perf test -F 10 > ... > Performance counter stats for 'perf test -F 10': > > 87,896,447 cpu_atom/instructions/ (14.37%) > 98,171,964 cpu_core/instructions/ (85.63%) > ... > $ perf stat --no-scale -e instructions perf test -F 10 > ... > Performance counter stats for 'perf test -F 10': > > 13,069,890 cpu_atom/instructions/ (19.32%) > 83,460,274 cpu_core/instructions/ (80.68%) > ... > ``` > The scaling has inflated per-PMU instruction counts and the overall > count by 2x. > > To fix this the kernel needs changing when a task+CPU event (or just > task event on hybrid) is scheduled out. A fix could be that the state > isn't inactive but off for such events, so that time_enabled counts > don't accumulate on them. Right, maybe we need to add a new state (UNSCHEDULABLE?) to skip updating the enabled time. Thanks, Namhyung > > Signed-off-by: Ian Rogers > --- > tools/lib/perf/evlist.c | 118 ++++++++++++++++++++++----------- > tools/perf/util/parse-events.c | 10 ++- > tools/perf/util/stat.c | 6 +- > 3 files changed, 86 insertions(+), 48 deletions(-) > > diff --git a/tools/lib/perf/evlist.c b/tools/lib/perf/evlist.c > index 9d9dec21f510..2d2236400220 100644 > --- a/tools/lib/perf/evlist.c > +++ b/tools/lib/perf/evlist.c > @@ -36,49 +36,87 @@ void perf_evlist__init(struct perf_evlist *evlist) > static void __perf_evlist__propagate_maps(struct perf_evlist *evlist, > struct perf_evsel *evsel) > { > - if (evsel->system_wide) { > - /* System wide: set the cpu map of the evsel to all online CPUs. */ > - perf_cpu_map__put(evsel->cpus); > - evsel->cpus = perf_cpu_map__new_online_cpus(); > - } else if (evlist->has_user_cpus && evsel->is_pmu_core) { > - /* > - * User requested CPUs on a core PMU, ensure the requested CPUs > - * are valid by intersecting with those of the PMU. > - */ > + if (perf_cpu_map__is_empty(evsel->cpus)) { > + if (perf_cpu_map__is_empty(evsel->pmu_cpus)) { > + /* > + * Assume the unset PMU cpus were for a system-wide > + * event, like a software or tracepoint. > + */ > + evsel->pmu_cpus = perf_cpu_map__new_online_cpus(); > + } > + if (evlist->has_user_cpus && !evsel->system_wide) { > + /* > + * Use the user CPUs unless the evsel is set to be > + * system wide, such as the dummy event. > + */ > + evsel->cpus = perf_cpu_map__get(evlist->user_requested_cpus); > + } else { > + /* > + * System wide and other modes, assume the cpu map > + * should be set to all PMU CPUs. > + */ > + evsel->cpus = perf_cpu_map__get(evsel->pmu_cpus); > + } > + } > + /* > + * Avoid "any CPU"(-1) for uncore and PMUs that require a CPU, even if > + * requested. > + */ > + if (evsel->requires_cpu && perf_cpu_map__has_any_cpu(evsel->cpus)) { > perf_cpu_map__put(evsel->cpus); > - evsel->cpus = perf_cpu_map__intersect(evlist->user_requested_cpus, evsel->pmu_cpus); > + evsel->cpus = perf_cpu_map__get(evsel->pmu_cpus); > + } > > - /* > - * Empty cpu lists would eventually get opened as "any" so remove > - * genuinely empty ones before they're opened in the wrong place. > - */ > - if (perf_cpu_map__is_empty(evsel->cpus)) { > - struct perf_evsel *next = perf_evlist__next(evlist, evsel); > - > - perf_evlist__remove(evlist, evsel); > - /* Keep idx contiguous */ > - if (next) > - list_for_each_entry_from(next, &evlist->entries, node) > - next->idx--; > + /* > + * Globally requested CPUs replace user requested unless the evsel is > + * set to be system wide. > + */ > + if (evlist->has_user_cpus && !evsel->system_wide) { > + assert(!perf_cpu_map__has_any_cpu(evlist->user_requested_cpus)); > + if (!perf_cpu_map__equal(evsel->cpus, evlist->user_requested_cpus)) { > + perf_cpu_map__put(evsel->cpus); > + evsel->cpus = perf_cpu_map__get(evlist->user_requested_cpus); > } > - } else if (!evsel->pmu_cpus || evlist->has_user_cpus || > - (!evsel->requires_cpu && perf_cpu_map__has_any_cpu(evlist->user_requested_cpus))) { > - /* > - * The PMU didn't specify a default cpu map, this isn't a core > - * event and the user requested CPUs or the evlist user > - * requested CPUs have the "any CPU" (aka dummy) CPU value. In > - * which case use the user requested CPUs rather than the PMU > - * ones. > - */ > + } > + > + /* Ensure cpus only references valid PMU CPUs. */ > + if (!perf_cpu_map__has_any_cpu(evsel->cpus) && > + !perf_cpu_map__is_subset(evsel->pmu_cpus, evsel->cpus)) { > + struct perf_cpu_map *tmp = perf_cpu_map__intersect(evsel->pmu_cpus, evsel->cpus); > + > perf_cpu_map__put(evsel->cpus); > - evsel->cpus = perf_cpu_map__get(evlist->user_requested_cpus); > - } else if (evsel->cpus != evsel->pmu_cpus) { > - /* > - * No user requested cpu map but the PMU cpu map doesn't match > - * the evsel's. Reset it back to the PMU cpu map. > - */ > + evsel->cpus = tmp; > + } > + > + /* > + * Was event requested on all the PMU's CPUs but the user requested is > + * any CPU (-1)? If so switch to using any CPU (-1) to reduce the number > + * of events. > + */ > + if (!evsel->system_wide && > + perf_cpu_map__equal(evsel->cpus, evsel->pmu_cpus) && > + perf_cpu_map__has_any_cpu(evlist->user_requested_cpus)) { > perf_cpu_map__put(evsel->cpus); > - evsel->cpus = perf_cpu_map__get(evsel->pmu_cpus); > + evsel->cpus = perf_cpu_map__get(evlist->user_requested_cpus); > + } > + > + /* Sanity check assert before the evsel is potentially removed. */ > + assert(!evsel->requires_cpu || !perf_cpu_map__has_any_cpu(evsel->cpus)); > + > + /* > + * Empty cpu lists would eventually get opened as "any" so remove > + * genuinely empty ones before they're opened in the wrong place. > + */ > + if (perf_cpu_map__is_empty(evsel->cpus)) { > + struct perf_evsel *next = perf_evlist__next(evlist, evsel); > + > + perf_evlist__remove(evlist, evsel); > + /* Keep idx contiguous */ > + if (next) > + list_for_each_entry_from(next, &evlist->entries, node) > + next->idx--; > + > + return; > } > > if (evsel->system_wide) { > @@ -98,6 +136,10 @@ static void perf_evlist__propagate_maps(struct perf_evlist *evlist) > > evlist->needs_map_propagation = true; > > + /* Clear the all_cpus set which will be merged into during propagation. */ > + perf_cpu_map__put(evlist->all_cpus); > + evlist->all_cpus = NULL; > + > list_for_each_entry_safe(evsel, n, &evlist->entries, node) > __perf_evlist__propagate_maps(evlist, evsel); > } > diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c > index 4092a43aa84e..0ff7ae75d8f9 100644 > --- a/tools/perf/util/parse-events.c > +++ b/tools/perf/util/parse-events.c > @@ -313,20 +313,18 @@ __add_event(struct list_head *list, int *idx, > if (pmu) { > is_pmu_core = pmu->is_core; > pmu_cpus = perf_cpu_map__get(pmu->cpus); > + if (perf_cpu_map__is_empty(pmu_cpus)) > + pmu_cpus = cpu_map__online(); > } else { > is_pmu_core = (attr->type == PERF_TYPE_HARDWARE || > attr->type == PERF_TYPE_HW_CACHE); > pmu_cpus = is_pmu_core ? cpu_map__online() : NULL; > } > > - if (has_user_cpus) { > + if (has_user_cpus) > cpus = perf_cpu_map__get(user_cpus); > - /* Existing behavior that pmu_cpus matches the given user ones. */ > - perf_cpu_map__put(pmu_cpus); > - pmu_cpus = perf_cpu_map__get(user_cpus); > - } else { > + else > cpus = perf_cpu_map__get(pmu_cpus); > - } > > if (init_attr) > event_attr_init(attr); > diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c > index 355a7d5c8ab8..8d3bcdb69d37 100644 > --- a/tools/perf/util/stat.c > +++ b/tools/perf/util/stat.c > @@ -769,8 +769,6 @@ int create_perf_stat_counter(struct evsel *evsel, > attr->enable_on_exec = 1; > } > > - if (target__has_cpu(target) && !target__has_per_thread(target)) > - return evsel__open_per_cpu(evsel, evsel__cpus(evsel), cpu_map_idx); > - > - return evsel__open_per_thread(evsel, evsel->core.threads); > + return evsel__open_per_cpu_and_thread(evsel, evsel__cpus(evsel), cpu_map_idx, > + evsel->core.threads); > } > -- > 2.50.0.727.gbf7dc18ff4-goog >