Linux Perf Users
* [Patch v8 00/23] Support SIMD/eGPRs/SSP registers sampling for perf
@ 2026-05-29  7:56 Dapeng Mi
From: Dapeng Mi @ 2026-05-29  7:56 UTC
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Dapeng Mi

Patch layout:
- Patches 1-6: Bug fixes and cleanup needed before enabling XSAVES-based
  sampling in NMI context
- Patches 7-9: FPU-related preparation, including xsaves_nmi() and
  related cleanup/optimization
- Patches 10-11: PMI-based XMM sampling support through the existing
  sample_regs_intr/sample_regs_user interfaces for both
  PERF_SAMPLE_REGS_INTR and PERF_SAMPLE_REGS_USER
- Patches 12-19: New SIMD register interface and support for
  XMM/YMM/ZMM/OPMASK, APX eGPRs, and SSP through that interface
- Patch 20: Extend arch PEBS to support YMM/ZMM/OPMASK, APX eGPRs, and
  SSP with the new interface
- Patch 21: Enable new interface-based sampling
- Patches 22-23: arch PEBS bug fix and sanity check

Changes since V7:
- Validate the return value of intel_pmu_init_hybrid() (Patch 01/23).
- Replace pt_regs with x86_perf_regs in xen_pmu_irq_handler()
  (Patch 06/23).
- Improve event_has_extended_regs() (Patch 09/23).
- Explicitly ensure the allocated XSAVE area is 64-byte aligned
  (Patch 10/23, Sashiko).
- Clear the SIMD register pointers in x86_user_regs to avoid exposing
  stale register data to user space (Patch 11/23, Sashiko).
- Refine the SIMD register interface and sample data layout, and add the
  missing SIMD data reservation in perf_prepare_sample() for non-x86
  architectures (Patch 12/23, Sashiko).
- Improve perf_simd_reg_validate() for x86 (Patch 13/23, Sashiko).
- Refine SSP sampling and ensure the GPR sub-group flag is set for PEBS
  (Patch 19/23, Sashiko).
- Fix the incorrect large-PEBS check for XMM (Patch 20/23, Sashiko).
- Fix missing handling in x86_pmu_handle_guest_pebs() for back-to-back
  PMI detection (Patch 22/23, Sashiko).
- Strengthen the PEBS record header sanity checks to prevent invalid
  memory access (Patch 23/23, Sashiko).

Changes since V6:
- Fix a potential overwrite issue in the hybrid PMU structure (patch 01/24)
- Restrict PEBS events to work on GP counters if no PEBS baseline is
  suggested (patch 02/24)
- Use per-cpu x86_intr_regs for perf_event_nmi_handler() instead of a
  temporary variable (patch 06/24)
- Add helper update_fpu_state_and_flag() to ensure TIF_NEED_FPU_LOAD is
  set after save_fpregs_to_fpstate() call (patch 09/24)
- Optimize and simplify x86_pmu_sample_xregs(), etc. (patch 11/24)
- Add macro word_for_each_set_bit() to simplify u64 set-bit iteration
  (patch 13/24)
- Add sanity check for PEBS fragment size (patch 24/24)

Changes since V5:
- Introduce 3 commits to fix newly found PEBS issues (Patch 01~03/19)
- Address Peter's comments, including:
  * Fully support user-regs sampling of the SIMD/eGPRs/SSP registers
  * Adjust newly added fields in perf_event_attr to avoid holes
  * Fix the endian issue introduced by for_each_set_bit() in
    event/core.c
  * Remove some unnecessary macros from UAPI header perf_regs.h
  * Enhance b2b NMI detection for all PEBS handlers so that all PEBS
    handlers behave identically
- Split out the perf-tools patches, which will be posted in a separate
  patchset later

Changes since V4:
- Rewrite some function comments and commit messages (Dave)
- Add arch-PEBS based SIMD/eGPRs/SSP sampling support (Patch 15/19)
- Fix the "suspicious NMI" warning observed on PTL/NVL P-core and DMR by
  activating the back-to-back NMI detection mechanism (Patch 16/19)
- Fix some minor issues in the perf-tool patches (Patch 18/19)

Changes since V3:
- Drop the SIMD registers if an NMI hits kernel mode for REGS_USER.
- Only dump the available regs, rather than zeroing and dumping the
  unavailable regs. The dumped registers may therefore be a subset of
  the requested registers.
- Some minor updates to address Dapeng's comments in V3.

Changes since V2:
- Use the FPU format for the x86_pmu.ext_regs_mask as well
- Add a check before invoking xsaves_nmi()
- Add perf_simd_reg_check() to retrieve the number of available
  registers. If the kernel fails to get the requested registers, e.g.,
  XSAVES fails, nothing is dumped to userspace (V2 dumped all 0s).
- Add POC perf tool patches

Changes since V1:
- Apply the new interfaces to configure and dump the SIMD registers
- Utilize the existing FPU functions, e.g., xstate_calculate_size() and
  get_xsave_addr().


This series adds support on x86 for sampling SIMD registers, APX eGPRs,
and SSP with both PMI-based and PEBS-based sampling.

Starting with Intel Ice Lake, PEBS can sample XMM registers, but PMI-based
XMM sampling is still not available. On newer Intel platforms with
architectural PEBS support, such as Clearwater Forest and Diamond Rapids,
the hardware also gains support for sampling additional SIMD state
(XMM/YMM/ZMM/OPMASK), APX extended GPRs, and SSP.

To support these registers consistently across both PMI and PEBS, this
series makes the following changes:

1. Adds a new perf_event_attr interface for SIMD register selection.
   The existing sample_regs_user/sample_regs_intr bitmaps do not have
   enough space to represent the full SIMD register set, so this series
   introduces dedicated fields for SIMD and predicate register masks and
   element widths.

2. Introduces a new sample data layout for SIMD register data.
   SIMD register payload is appended after the GPR payload, and a new ABI
   flag, PERF_SAMPLE_REGS_ABI_SIMD, indicates its presence.

3. Adds xsaves_nmi() to allow SIMD/eGPR/SSP sampling from PMI handlers in
   NMI context.

4. Extends the arch PEBS path to support YMM/ZMM/OPMASK, APX eGPRs, and
   SSP sampling.


New perf_event_attr fields
--------------------------

This series adds the following fields to perf_event_attr:

    /*
     * Defines the sampling SIMD/PRED(predicate) register bitmaps and
     * qword (8-byte) lengths.
     *
     * sample_simd_regs_enabled != 0 indicates SIMD/PRED registers are
     * requested. The register bitmaps and element sizes are described by:
     *
     *   sample_simd_{vec,pred}_reg_{intr,user}
     *   sample_simd_{vec,pred}_reg_qwords
     *
     * sample_simd_regs_enabled == 0 indicates no SIMD/PRED registers are
     * requested.
     */
    __u16 sample_simd_regs_enabled;
    __u16 sample_simd_pred_reg_qwords;
    __u16 sample_simd_vec_reg_qwords;
    __u16 __reserved_4;

    __u32 sample_simd_pred_reg_intr;
    __u32 sample_simd_pred_reg_user;
    __u64 sample_simd_vec_reg_intr;
    __u64 sample_simd_vec_reg_user;

Field semantics:
- sample_simd_vec_reg_qwords: qword count for regular SIMD registers
- sample_simd_pred_reg_qwords: qword count for predicate registers
- sample_simd_vec_reg_{intr,user}: SIMD register masks for
  PERF_SAMPLE_REGS_INTR and PERF_SAMPLE_REGS_USER
- sample_simd_pred_reg_{intr,user}: predicate register masks for
  PERF_SAMPLE_REGS_INTR and PERF_SAMPLE_REGS_USER
- sample_simd_regs_enabled: indicates whether the new SIMD fields are in use

Examples:

To sample ZMM registers for PERF_SAMPLE_REGS_INTR:

    sample_simd_regs_enabled = 1
    sample_simd_vec_reg_qwords = 8          // 512 bits = 8 qwords
    sample_simd_vec_reg_intr = 0xffffffff   // zmm0-zmm31

To sample OPMASK registers for PERF_SAMPLE_REGS_USER:

    sample_simd_regs_enabled = 1
    sample_simd_pred_reg_qwords = 1         // 64 bits = 1 qword
    sample_simd_pred_reg_user = 0xff        // opmask0-opmask7

After introducing these fields, bits [63:32] in sample_regs_user and
sample_regs_intr are reclaimed for APX eGPRs and SSP instead of the
previous XMM0-XMM15 encoding.
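
The two examples above can be sketched in C as follows. This is only an
illustration: struct simd_attr_sketch is a hypothetical stand-in that copies
the field names and widths listed in this cover letter, not the real
perf_event_attr, and the helper names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the new perf_event_attr fields. Names and
 * widths follow the cover letter; this is NOT the real UAPI struct.
 */
struct simd_attr_sketch {
	uint16_t sample_simd_regs_enabled;
	uint16_t sample_simd_pred_reg_qwords;
	uint16_t sample_simd_vec_reg_qwords;
	uint16_t reserved_4;

	uint32_t sample_simd_pred_reg_intr;
	uint32_t sample_simd_pred_reg_user;
	uint64_t sample_simd_vec_reg_intr;
	uint64_t sample_simd_vec_reg_user;
};

/* Portable population count, i.e. weight(mask). */
static unsigned int weight64(uint64_t v)
{
	unsigned int n = 0;

	for (; v; v &= v - 1)
		n++;
	return n;
}

/* Request zmm0-zmm31 for PERF_SAMPLE_REGS_INTR. */
static struct simd_attr_sketch request_zmm_intr(void)
{
	struct simd_attr_sketch a = {0};

	a.sample_simd_regs_enabled = 1;
	a.sample_simd_vec_reg_qwords = 8;		/* 512 bits = 8 qwords */
	a.sample_simd_vec_reg_intr = 0xffffffffULL;	/* zmm0-zmm31 */
	return a;
}

/* Request opmask0-opmask7 for PERF_SAMPLE_REGS_USER. */
static struct simd_attr_sketch request_opmask_user(void)
{
	struct simd_attr_sketch a = {0};

	a.sample_simd_regs_enabled = 1;
	a.sample_simd_pred_reg_qwords = 1;		/* 64 bits = 1 qword */
	a.sample_simd_pred_reg_user = 0xff;		/* opmask0-opmask7 */
	return a;
}

/* Upper bound on SIMD data qwords per sample for the INTR masks. */
static uint64_t max_intr_qwords(const struct simd_attr_sketch *a)
{
	assert(a != NULL);
	return (uint64_t)weight64(a->sample_simd_vec_reg_intr) *
		       a->sample_simd_vec_reg_qwords +
	       (uint64_t)weight64(a->sample_simd_pred_reg_intr) *
		       a->sample_simd_pred_reg_qwords;
}

/* Upper bound on SIMD data qwords per sample for the USER masks. */
static uint64_t max_user_qwords(const struct simd_attr_sketch *a)
{
	assert(a != NULL);
	return (uint64_t)weight64(a->sample_simd_vec_reg_user) *
		       a->sample_simd_vec_reg_qwords +
	       (uint64_t)weight64(a->sample_simd_pred_reg_user) *
		       a->sample_simd_pred_reg_qwords;
}
```

With these settings the ZMM request can produce at most 32 * 8 = 256 qwords
of SIMD data per sample and the OPMASK request at most 8 * 1 = 8 qwords. The
kernel may dump fewer registers than requested.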

Discussion of the new SIMD register interface is available at:
https://lore.kernel.org/lkml/20250617081458.GI1613376@noisy.programming.kicks-ass.net/

Sample data layout
------------------

SIMD register data is appended after the GPR data.

For PERF_SAMPLE_REGS_USER:

    { u64 abi;                      // enum perf_sample_regs_abi
      u64 regs[weight(mask)];
      struct {
            u64 nr_vectors;         // 0 ... weight(sample_simd_vec_reg_user)
            u64 vector_qwords;      // 0 ... sample_simd_vec_reg_qwords
            u64 nr_pred;            // 0 ... weight(sample_simd_pred_reg_user)
            u64 pred_qwords;        // 0 ... sample_simd_pred_reg_qwords
            u64 data[nr_vectors * vector_qwords +
                     nr_pred * pred_qwords];
      } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
    }

For PERF_SAMPLE_REGS_INTR:

    { u64 abi;                      // enum perf_sample_regs_abi
      u64 regs[weight(mask)];
      struct {
            u64 nr_vectors;         // 0 ... weight(sample_simd_vec_reg_intr)
            u64 vector_qwords;      // 0 ... sample_simd_vec_reg_qwords
            u64 nr_pred;            // 0 ... weight(sample_simd_pred_reg_intr)
            u64 pred_qwords;        // 0 ... sample_simd_pred_reg_qwords
            u64 data[nr_vectors * vector_qwords +
                     nr_pred * pred_qwords];
      } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
    }

PERF_SAMPLE_REGS_ABI_SIMD indicates that SIMD register data is present.

The metadata fields are encoded as u64 to keep perf tool parsing and
cross-endian support straightforward.
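
As a rough sketch of how a consumer might walk this layout, the C fragment
below parses one regs block from an array of u64 words. Several things here
are assumptions rather than part of the series: SKETCH_REGS_ABI_SIMD is a
placeholder bit value (the real PERF_SAMPLE_REGS_ABI_SIMD comes from the UAPI
header), the struct and function names are hypothetical, and the "abi == 0
means no register data follows" rule is the existing perf behaviour, assumed
to be unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: placeholder value, not the real PERF_SAMPLE_REGS_ABI_SIMD. */
#define SKETCH_REGS_ABI_SIMD (1ULL << 2)

/* Hypothetical view onto one parsed regs block. */
struct simd_block_view {
	const uint64_t *gprs;		/* weight(mask) qwords */
	uint64_t nr_vectors, vector_qwords, nr_pred, pred_qwords;
	const uint64_t *vec_data;	/* nr_vectors * vector_qwords qwords */
	const uint64_t *pred_data;	/* nr_pred * pred_qwords qwords */
};

/* Portable population count, i.e. weight(mask) in the layout above. */
static unsigned int weight64(uint64_t v)
{
	unsigned int n = 0;

	for (; v; v &= v - 1)
		n++;
	return n;
}

/*
 * Parse one regs block starting at buf[0] (the abi word).
 * Returns the number of qwords consumed, or 0 if the buffer is too short.
 * On failure *out may be partially filled and must not be used.
 */
static size_t parse_regs_block(const uint64_t *buf, size_t len,
			       uint64_t gpr_mask, struct simd_block_view *out)
{
	size_t pos = 0, ngpr, nvec, npred;
	uint64_t abi;

	assert(buf != NULL && out != NULL);
	*out = (struct simd_block_view){0};

	if (len < 1)
		return 0;
	abi = buf[pos++];
	if (abi == 0)			/* assumed: no register data follows */
		return pos;

	ngpr = weight64(gpr_mask);
	if (len - pos < ngpr)
		return 0;
	out->gprs = buf + pos;
	pos += ngpr;

	if (!(abi & SKETCH_REGS_ABI_SIMD))
		return pos;		/* GPR data only */

	if (len - pos < 4)
		return 0;
	out->nr_vectors = buf[pos];
	out->vector_qwords = buf[pos + 1];
	out->nr_pred = buf[pos + 2];
	out->pred_qwords = buf[pos + 3];
	pos += 4;

	/* Bound-check each product by division so it cannot overflow. */
	if (out->vector_qwords &&
	    out->nr_vectors > (len - pos) / out->vector_qwords)
		return 0;
	nvec = (size_t)(out->nr_vectors * out->vector_qwords);
	if (out->pred_qwords &&
	    out->nr_pred > (len - pos - nvec) / out->pred_qwords)
		return 0;
	npred = (size_t)(out->nr_pred * out->pred_qwords);

	out->vec_data = buf + pos;
	pos += nvec;
	out->pred_data = buf + pos;
	pos += npred;
	return pos;
}
```

The parser reads the actual counts from the metadata fields instead of
recomputing them from the attr masks, since the dumped registers may be a
subset of the requested ones.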

Example
-------

  $ perf record -I?
  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27
  R28 R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7

  $ perf record --user-regs=?
  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27
  R28 R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7

  $ perf record -e branches:p \
        -Iax,bx,r8,r16,r31,ssp,xmm,ymm,zmm,opmask \
        -c 100000 ./test
  $ perf report -D

  ...
  14027761992115 0xcf30 [0x8a8]: PERF_RECORD_SAMPLE(IP, 0x1): 29964/29964:
  0xffffffff9f085e24 period: 100000 addr: 0
  ... intr regs: mask 0x18001010003 ABI 64-bit
  .... AX    0xdffffc0000000000
  .... BX    0xffff8882297685e8
  .... R8    0x0000000000000000
  .... R16   0x0000000000000000
  .... R31   0x0000000000000000
  .... SSP   0x0000000000000000
  ... SIMD ABI nr_vectors 32 vector_qwords 8 nr_pred 8 pred_qwords 1
  .... ZMM[0][0] 0x616c2f656d6f682f
  .... ZMM[0][1] 0x696c2f7265737562
  ...
  .... ZMM[31][7] 0x0000000000000000
  .... OPMASK[0] 0x00000000fffffe00
  ....
  .... OPMASK[7] 0x0000000000000000
  ...
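
The numbers in the dump above can be cross-checked with a few lines of C.
This is only an arithmetic sketch with hypothetical helper names: it lists
the bits set in the intr regs mask and computes the SIMD data size from the
four metadata fields. It does not assume any particular bit-to-register
mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store the indices of the set bits of mask in out[]; return the count. */
static size_t set_bit_indices(uint64_t mask, unsigned int out[64])
{
	size_t n = 0;

	assert(out != NULL);
	for (unsigned int bit = 0; bit < 64; bit++)
		if (mask & (1ULL << bit))
			out[n++] = bit;
	return n;
}

/* Qwords of SIMD data described by the four metadata fields. */
static uint64_t simd_payload_qwords(uint64_t nr_vectors, uint64_t vector_qwords,
				    uint64_t nr_pred, uint64_t pred_qwords)
{
	return nr_vectors * vector_qwords + nr_pred * pred_qwords;
}
```

For mask 0x18001010003 this yields six set bits (0, 1, 16, 24, 39 and 40),
matching the six GPR-side values printed (AX, BX, R8, R16, R31, SSP). For
nr_vectors 32, vector_qwords 8, nr_pred 8, pred_qwords 1 it yields
32 * 8 + 8 * 1 = 264 qwords, i.e. 2112 bytes of SIMD data.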

Testing
-------

The following intr-regs, user-regs, and combined sampling tests were run
on DMR and NVL. The sampled register data was reported correctly and no
issues were observed.

  $ ./perf record -e branches:p \
        -Iax,bx,r8,r16,r31,ssp,xmm,ymm,zmm,opmask -b -c 10000 sleep 1

  $ ./perf record -e branches \
        -Iax,bx,r8,r16,r31,ssp,xmm,ymm,zmm,opmask -b -c 10000 sleep 1

  $ ./perf record -e branches:p \
        --user-regs=ax,bx,r8,r16,r31,ssp,xmm,ymm,zmm,opmask \
        -b -c 10000 sleep 1

  $ ./perf record -e branches \
        --user-regs=ax,bx,r8,r16,r31,ssp,xmm,ymm,zmm,opmask \
        -b -c 10000 sleep 1

  $ ./perf record -e branches:p \
        -Ixmm,ymm,zmm,opmask \
        --user-regs=ax,bx,r8,r16,r31,ssp \
        -b -c 10000 sleep 1

  $ ./perf record -e branches:p \
        --user-regs=xmm,ymm,zmm,opmask \
        -Iax,bx,r8,r16,r31,ssp \
        -b -c 10000 sleep 1

  $ ./perf record -e branches:p \
        -Iax,bx,r9,r17,r30,ssp \
        --user-regs=ax,bx,r8,r16,r31,ssp \
        -b -c 10000 sleep 1

  $ ./perf record -e branches:p \
        -Ixmm,opmask --user-regs=zmm \
        -b -c 10000 taskset -c 0 sleep 1


History:
  v7: https://lore.kernel.org/all/20260324004118.3772171-1-dapeng1.mi@linux.intel.com/
  v6: https://lore.kernel.org/all/20260209072047.2180332-1-dapeng1.mi@linux.intel.com/
  v5: https://lore.kernel.org/all/20251203065500.2597594-1-dapeng1.mi@linux.intel.com/
  v4: https://lore.kernel.org/all/20250925061213.178796-1-dapeng1.mi@linux.intel.com/
  v3: https://lore.kernel.org/lkml/20250815213435.1702022-1-kan.liang@linux.intel.com/
  v2: https://lore.kernel.org/lkml/20250626195610.405379-1-kan.liang@linux.intel.com/
  v1: https://lore.kernel.org/lkml/20250613134943.3186517-1-kan.liang@linux.intel.com/

Dapeng Mi (19):
  perf/x86/intel: Validate return value of intel_pmu_init_hybrid()
  perf/x86: Move hybrid PMU initialization before x86_pmu_starting_cpu()
  perf/x86/intel: Enable large PEBS sampling for XMMs
  perf/x86/intel: Convert x86_perf_regs to per-cpu variables
  perf: Eliminate duplicate arch-specific functions definations
  perf/x86: Use x86_perf_regs in the x86 nmi handlers
  x86/fpu: Ensure TIF_NEED_FPU_LOAD is set after saving FPU state
  perf/x86: Enable XMM Register Sampling for Non-PEBS Events
  perf/x86: Enable XMM register sampling for REGS_USER case
  perf/x86: Support XMM sampling using sample_simd_vec_reg_* fields
  perf/x86: Support YMM sampling using sample_simd_vec_reg_* fields
  perf/x86: Support ZMM sampling using sample_simd_vec_reg_* fields
  perf/x86: Support OPMASK sampling using sample_simd_pred_reg_* fields
  perf: Enhance perf_reg_validate() with simd_enabled argument
  perf/x86: Support eGPRs sampling using sample_regs_* fields
  perf/x86: Support SSP sampling using sample_regs_* fields
  perf/x86/intel: Support arch-PEBS based SIMD/eGPRs/SSP sampling
  perf/x86: Activate back-to-back NMI detection for arch-PEBS induced
    NMIs
  perf/x86/intel: Add sanity check for PEBS fragment size

Kan Liang (4):
  x86/fpu/xstate: Add xsaves_nmi() helper
  perf: Move and enhance has_extended_regs() for arch-specific use
  perf: Add sampling support for SIMD registers
  perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS capability

 arch/arm/kernel/perf_regs.c           |   8 +-
 arch/arm64/kernel/perf_regs.c         |   8 +-
 arch/csky/kernel/perf_regs.c          |   8 +-
 arch/loongarch/kernel/perf_regs.c     |   8 +-
 arch/mips/kernel/perf_regs.c          |   8 +-
 arch/parisc/kernel/perf_regs.c        |   8 +-
 arch/powerpc/perf/perf_regs.c         |   2 +-
 arch/riscv/kernel/perf_regs.c         |   8 +-
 arch/s390/kernel/perf_regs.c          |   2 +-
 arch/x86/events/core.c                | 415 +++++++++++++++++++++++++-
 arch/x86/events/intel/core.c          | 232 ++++++++++++--
 arch/x86/events/intel/ds.c            | 235 +++++++++++----
 arch/x86/events/perf_event.h          |  85 +++++-
 arch/x86/include/asm/fpu/sched.h      |   5 +-
 arch/x86/include/asm/fpu/xstate.h     |   3 +
 arch/x86/include/asm/msr-index.h      |   7 +
 arch/x86/include/asm/perf_event.h     |  35 ++-
 arch/x86/include/uapi/asm/perf_regs.h |  51 ++++
 arch/x86/kernel/fpu/core.c            |  27 +-
 arch/x86/kernel/fpu/xstate.c          |  25 +-
 arch/x86/kernel/perf_regs.c           | 163 ++++++++--
 arch/x86/xen/pmu.c                    |   5 +-
 include/linux/perf_event.h            |  19 ++
 include/linux/perf_regs.h             |  38 +--
 include/uapi/linux/perf_event.h       |  49 ++-
 kernel/events/core.c                  | 189 ++++++++++--
 26 files changed, 1418 insertions(+), 225 deletions(-)


base-commit: 66cc29745f2f5815482587bb9fbc1e8a3e6fcf00
-- 
2.34.1

