public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH V2 0/3] Support auto counter reload
@ 2024-10-10 19:28 kan.liang
  2024-10-10 19:28 ` [PATCH V2 1/3] perf/x86/intel: Fix ARCH_PERFMON_NUM_COUNTER_LEAF kan.liang
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: kan.liang @ 2024-10-10 19:28 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, irogers, adrian.hunter, ak,
	linux-kernel
  Cc: eranian, thomas.falcon, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Changes since V1:
- Add a check to the reload value which cannot exceeds the max period
- Avoid invoking intel_pmu_enable_acr() for the perf metrics event.
- Update comments explain to case which the event->attr.config2 exceeds
  the group size

The relative rates among two or more events are useful for performance
analysis, e.g., a high branch miss rate may indicate a performance
issue. Usually, the samples with a relative rate that exceeds some
threshold are more useful. However, the traditional sampling takes
samples of events separately. To get the relative rates among two or
more events, a high sample rate is required, which can bring high
overhead. Many samples taken in the non-hotspot area are also dropped
(useless) in the post-process.

Auto Counter Reload (ACR) provides a means for software to specify that,
for each supported counter, the hardware should automatically reload the
counter to a specified initial value upon overflow of chosen counters.
This mechanism enables software to sample based on the relative rate of
two (or more) events, such that a sample (PMI or PEBS) is taken only if
the rate of one event exceeds some threshold relative to the rate of
another event. Taking a PMI or PEBS only when the relative rate of
perfmon events crosses a threshold can have significantly less
performance overhead than other techniques.

The details can be found at Intel Architecture Instruction Set
Extensions and Future Features (053) 8.7 AUTO COUNTER RELOAD.

Examples:

Here is the snippet of the mispredict.c. Since the array has random
numbers, jumps are random and often mispredicted.
The mispredicted rate depends on the compared value.

For the Loop1, ~11% of all branches are mispredicted.
For the Loop2, ~21% of all branches are mispredicted.

main()
{
...
        for (i = 0; i < N; i++)
                data[i] = rand() % 256;
...
        /* Loop 1 */
        for (k = 0; k < 50; k++)
                for (i = 0; i < N; i++)
                        if (data[i] >= 64)
                                sum += data[i];
...

...
        /* Loop 2 */
        for (k = 0; k < 50; k++)
                for (i = 0; i < N; i++)
                        if (data[i] >= 128)
                                sum += data[i];
...
}

Usually, a code with a high branch miss rate means a bad performance.
To understand the branch miss rate of the codes, the traditional method
usually sample both branches and branch-misses events. E.g.,
perf record -e "{cpu_atom/branch-misses/ppu, cpu_atom/branch-instructions/u}"
               -c 1000000 -- ./mispredict

[ perf record: Woken up 4 times to write data ]
[ perf record: Captured and wrote 0.925 MB perf.data (5106 samples) ]
The 5106 samples are from both events and spread in both Loops.
In the post process stage, a user can know that the Loop 2 has a 21%
branch miss rate. Then they can focus on the samples of branch-misses
events for the Loop 2.

With this patch, the user can generate the samples only when the branch
miss rate > 20%.
perf record -e "{cpu_atom/branch-misses,period=200000,acr_mask=0x2/ppu,
                 cpu_atom/branch-instructions,period=1000000,acr_mask=0x3/u}"
                -- ./mispredict
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.098 MB perf.data (2498 samples) ]

 $perf report

Percent       │154:   movl    $0x0,-0x14(%rbp)
              │     ↓ jmp     1af
              │     for (i = j; i < N; i++)
              │15d:   mov     -0x10(%rbp),%eax
              │       mov     %eax,-0x18(%rbp)
              │     ↓ jmp     1a2
              │     if (data[i] >= 128)
              │165:   mov     -0x18(%rbp),%eax
              │       cltq
              │       lea     0x0(,%rax,4),%rdx
              │       mov     -0x8(%rbp),%rax
              │       add     %rdx,%rax
              │       mov     (%rax),%eax
              │    ┌──cmp     $0x7f,%eax
100.00   0.00 │    ├──jle     19e
              │    │sum += data[i];

The 2498 samples are all from the branch-misses events for the Loop 2.

The number of samples and overhead is significantly reduced without
losing any information.

Kan Liang (3):
  perf/x86/intel: Fix ARCH_PERFMON_NUM_COUNTER_LEAF
  perf/x86/intel: Add the enumeration and flag for the auto counter
    reload
  perf/x86/intel: Support auto counter reload

 arch/x86/events/intel/core.c       | 262 ++++++++++++++++++++++++++++-
 arch/x86/events/perf_event.h       |  21 +++
 arch/x86/events/perf_event_flags.h |   2 +-
 arch/x86/include/asm/msr-index.h   |   4 +
 arch/x86/include/asm/perf_event.h  |   4 +-
 include/linux/perf_event.h         |   2 +
 6 files changed, 288 insertions(+), 7 deletions(-)

-- 
2.38.1


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2025-03-14 18:48 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-10 19:28 [PATCH V2 0/3] Support auto counter reload kan.liang
2024-10-10 19:28 ` [PATCH V2 1/3] perf/x86/intel: Fix ARCH_PERFMON_NUM_COUNTER_LEAF kan.liang
2024-10-10 19:28 ` [PATCH V2 2/3] perf/x86/intel: Add the enumeration and flag for the auto counter reload kan.liang
2024-10-10 19:28 ` [PATCH V2 3/3] perf/x86/intel: Support " kan.liang
2025-03-14 10:20   ` Peter Zijlstra
2025-03-14 13:48     ` Liang, Kan
2025-03-14 18:48       ` Liang, Kan
2024-11-04 20:37 ` [PATCH V2 0/3] " Liang, Kan
2025-03-14  9:51 ` Ingo Molnar
2025-03-14 13:06   ` Liang, Kan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox