From: "Liang, Kan" <kan.liang@linux.intel.com>
To: Ian Rogers <irogers@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
Ingo Molnar <mingo@redhat.com>,
Arnaldo Carvalho de Melo <acme@kernel.org>,
Namhyung Kim <namhyung@kernel.org>,
Mark Rutland <mark.rutland@arm.com>,
Alexander Shishkin <alexander.shishkin@linux.intel.com>,
Jiri Olsa <jolsa@kernel.org>,
Adrian Hunter <adrian.hunter@intel.com>,
Bjorn Helgaas <bhelgaas@google.com>,
Jonathan Corbet <corbet@lwn.net>,
James Clark <james.clark@arm.com>,
Ravi Bangoria <ravi.bangoria@amd.com>,
Dominique Martinet <asmadeus@codewreck.org>,
linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org,
Dhananjay Ugwekar <Dhananjay.Ugwekar@amd.com>,
ananth.narayan@amd.com, gautham.shenoy@amd.com,
kprateek.nayak@amd.com, sandipan.das@amd.com
Subject: Re: [PATCH v2 3/6] perf pmu: Add support for event.cpus files in sysfs
Date: Mon, 22 Jul 2024 12:45:32 -0400 [thread overview]
Message-ID: <aef9d7fd-ac33-45a3-91ed-2b4d3f0f70b5@linux.intel.com> (raw)
In-Reply-To: <CAP-5=fVdiFYVP8RsE+JuhOCoGSyybv0ZUn=Sixv1ZcEe8G7=2A@mail.gmail.com>
On 2024-07-22 11:43 a.m., Ian Rogers wrote:
> On Mon, Jul 22, 2024 at 6:57 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>
>>
>>
>> On 2024-07-19 6:02 p.m., Ian Rogers wrote:
>>> On Fri, Jul 19, 2024 at 9:35 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>>> On 2024-07-19 10:59 a.m., Ian Rogers wrote:
>>>>> Thanks Kan. I'm not wondering about a case of 2 CPUs, say on CPU0 and
>>>>> solely its perf event context, I want to know its core power and
>>>>> package power as a group so I never record one without the other. That
>>>>> grouping wouldn't be possible with 2 PMUs.
>>>>
>>>> For power, to be honest, I don't think it improves anything. It gives
>>>> users a false image that perf can group these counters.
>>>> But the truth is that perf cannot. The power counters are all
>>>> free-running counters. It's impossible to co-schedule them (which
>>>> requires a global mechanism to disable/enable all counters, e.g.,
>>>> GLOBAL_CTRL for core PMU). The kernel still has to read the counters one
>>>> by one while the counters keep running. There are no differences with or
>>>> without a group for the power events.
>>>
>>> Ok, so power should copy cstate with _core, _pkg, etc. I agree the
>>> difference is small and I like the idea of being consistent. Do we
>>> want to add "event.cpus" support to the tool anyway for potential
>>> future uses?
>>
>> The only thing I can imagine is that it may be used to disclose the
>> event constraint information, Or even more to update/override the event
>> constraint information (which requires kernel update.). But what I'm
>> worried about is that it may be abused. It's very easy to confuse an
>> event and a counter in a PMU.
>
> So you mean if you have a dual socket machine and an uncore PMU with a
> cpumask of "0,48" you worry that someone setting an event on CPU 47
> may think they are getting a CPU on the second socket? Perhaps if the
> user can express an intent to the tool, say "perf stat
> -randomly-select-uncore-cpus ...", then this can be avoided. I'm not
> sure I'm worried about this as specifying the CPU for an event to use
> is already something of a more niche/advanced thing to be doing.
>
The perf tool can specify any CPU the users want, as long as the kernel
can respond properly. I'm not worried about it. The "event.cpus" is
exposed by the kernel. The main concern is also from the kernel side.
Some drivers may use it to distinguish different scopes of counters. So
they can combine various types of counters into a single PMU, which may
break some rules they don't realize. An example is the above 'group' rule.
>>> This would at least avoid problems with newer kernels and
>>> older perf tools were we to find a good use for it.
>>>
>>>>> My understanding had been that for core PMUs a "perf stat -C" option
>>>>> would choose the particular CPU to count the event on, for an uncore
>>>>> PMU the -C option would override the cpumask's "default" value. We
>>>>> have code to validate this:
>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/evlist.c?h=perf-tools-next#n2522
>>>>> But it seems now that overriding an uncore PMU's default CPU is
>>>>> ignored.
>>>>
>>>> For the uncore driver, no matter what -C set, it writes the default CPU
>>>> back.
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/arch/x86/events/intel/uncore.c#n760
>>>>
>>>>> If you did:
>>>>> $ perf stat -C 1 -e data_read -a sleep 0.1
>>>>> Then the tool thinks data_read is on CPU1 and will set its thread
>>>>> affinity to CPU1 to avoid IPIs. It seems to fix this we need to just
>>>>> throw away the -C option.
>>>> The perf tool can still read the the counter from CPU1 and no IPIs
>>>> because of the PMU_EV_CAP_READ_ACTIVE_PKG().
>>>>
>>>> Not quite sure, but it seems only the open and close may be impacted and
>>>> silently changed to CPU0.
>>>
>>> There's also enable/disable. Andi did the work and there were some
>>> notable gains but likely more on core events. Ultimately it'd be nice
>>> to be opening, closing.. everything in parallel given the calls are
>>> slow and the work is embarrassingly parallel.
>>> It feels like the cpumasks for uncore could still do with some cleanup
>>> wrt -C I'm just unsure at the moment what this should be. Tbh, I'm
>>> tempted to rewrite evlist propagate maps as someone may look at it and
>>> think I believe in what it is doing. The parallel stuff we should grab
>>> Riccardo's past work.
>>
>> For core PMU, the parallel may not work, because the core PMU is usually
>> MSR based. Perf has to access the MSRs on the specific CPU. IPIs may be
>> triggered if the users try to operate them from the other CPUs.
>
> Right, I think the idea would be to have as many threads as you have
> CPUs then give each thread affinity to a different CPU. The work done
> on the thread would match the CPU they have affinity with to avoid the
> IPIs. Because of the use of RCU in the kernel perf code it is possible
> to hit RCU synchronize where IPIs are sent after 200ms IIRC. If you
> get an RCU synchronize needing an IPI then "200ms x num CPUs" can mean
> seconds of delay in a serial implementation (500 CPUs would be 100
> seconds). With parallel code a worst case slow down shouldn't increase
> with the number of CPUs. On laptops, .. this doesn't matter much.
>
>> But the parallel is good for the counters in the MMIO space. The
>> counters can be accessed from any CPU. There are more and more counters
>> which are moved to the MMIO space, e.g., new uncore PMUs, IOMMU PMU,
>> TMPI (for power), etc.
>
> Sounds good but I'm wondering how we can get the tool in the right
> place for doing affinity games. For the MMIO case life's good and we
> don't care. How can the tool know one case from another? Should the
> tool always be just following the cpumask? What about on ARM where the
> cpumasks are broken?
The cpumask cannot tell such information. For example, the counters of
the uncore PMUs can be located in three different places, MSRs, the PCI
config space, and the MMIO space. But the scope of the uncore counters
is the same. So the cpumask is the same for various uncore PMUs.
To get the information, the kernel has to be updated, e.g., add a
caps/counter_type in sysfs.
Thanks,
Kan
>
> Thanks,
> Ian
>
>> Thanks,
>> Kan
>>>
>>> Thanks,
>>> Ian
>>>
>>>
>>>> Thanks,
>>>> Kan
>>>>>
>>>>>>> 2) do the /sys/devices/<pmu>/events/event.(unit|scale|per-pkg|snapshot)
>>>>>>> files parse correctly and have a corresponding event.
>>>>>>> 3) keep adding opening events on the PMU to a group to make sure that
>>>>>>> when counters are exhausted the perf_event_open fails (I've seen this
>>>>>>> bug on AMD)
>>>>>>> 4) are the values in the type file unique
>>>>>>>
>>>>>>
>>>>>> The rest sounds good to me.
>>>>>
>>>>> Cool. Let me know if you can think of more.
>>>>>
>>>>> Thanks,
>>>>> Ian
>>>>>
>>>>>> Thanks,
>>>>>> Kan
>>>>>
>>>
>
next prev parent reply other threads:[~2024-07-22 16:45 UTC|newest]
Thread overview: 34+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-07-18 0:30 [PATCH v2 0/6] Add support for sysfs event.cpus and cpu event term Ian Rogers
2024-07-18 0:30 ` [PATCH v2 1/6] perf pmu: Merge boolean sysfs event option parsing Ian Rogers
2024-07-18 0:30 ` [PATCH v2 2/6] perf parse-events: Pass cpu_list as a perf_cpu_map in __add_event Ian Rogers
2024-07-18 0:30 ` [PATCH v2 3/6] perf pmu: Add support for event.cpus files in sysfs Ian Rogers
2024-07-18 14:33 ` Liang, Kan
2024-07-18 15:39 ` Ian Rogers
2024-07-18 17:47 ` Liang, Kan
2024-07-18 20:50 ` Ian Rogers
2024-07-19 13:55 ` Liang, Kan
2024-07-19 14:59 ` Ian Rogers
2024-07-19 16:35 ` Liang, Kan
2024-07-19 22:02 ` Ian Rogers
2024-07-22 13:57 ` Liang, Kan
2024-07-22 15:43 ` Ian Rogers
2024-07-22 16:45 ` Liang, Kan [this message]
2024-07-26 7:06 ` Dhananjay Ugwekar
2024-07-26 7:09 ` Dhananjay Ugwekar
2024-07-26 7:52 ` Ian Rogers
2024-07-26 8:17 ` Dhananjay Ugwekar
2024-07-26 14:07 ` Liang, Kan
2024-07-18 0:30 ` [PATCH v2 4/6] libperf cpumap: Add ability to create CPU from a single CPU number Ian Rogers
2024-07-18 0:30 ` [PATCH v2 5/6] perf parse-events: Set is_pmu_core for legacy hardware events Ian Rogers
2024-07-18 0:30 ` [PATCH v2 6/6] perf parse-events: Add "cpu" term to set the CPU an event is recorded on Ian Rogers
2024-07-18 14:09 ` James Clark
2024-07-18 15:07 ` Ian Rogers
2024-07-18 14:41 ` Liang, Kan
2024-07-18 15:12 ` Ian Rogers
2024-07-18 18:02 ` Liang, Kan
2024-07-18 21:06 ` Ian Rogers
2024-07-19 14:14 ` Liang, Kan
2024-07-19 15:01 ` Ian Rogers
2024-07-19 16:42 ` Liang, Kan
2024-07-18 13:42 ` [PATCH v2 0/6] Add support for sysfs event.cpus and cpu event term Dhananjay Ugwekar
2024-07-18 15:00 ` Ian Rogers
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aef9d7fd-ac33-45a3-91ed-2b4d3f0f70b5@linux.intel.com \
--to=kan.liang@linux.intel.com \
--cc=Dhananjay.Ugwekar@amd.com \
--cc=acme@kernel.org \
--cc=adrian.hunter@intel.com \
--cc=alexander.shishkin@linux.intel.com \
--cc=ananth.narayan@amd.com \
--cc=asmadeus@codewreck.org \
--cc=bhelgaas@google.com \
--cc=corbet@lwn.net \
--cc=gautham.shenoy@amd.com \
--cc=irogers@google.com \
--cc=james.clark@arm.com \
--cc=jolsa@kernel.org \
--cc=kprateek.nayak@amd.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-perf-users@vger.kernel.org \
--cc=mark.rutland@arm.com \
--cc=mingo@redhat.com \
--cc=namhyung@kernel.org \
--cc=peterz@infradead.org \
--cc=ravi.bangoria@amd.com \
--cc=sandipan.das@amd.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).