[Patch v8 00/23] Support SIMD/eGPRs/SSP registers sampling for perf

* [Patch v8 00/23] Support SIMD/eGPRs/SSP registers sampling for perf
@ 2026-05-29  7:56 Dapeng Mi
  2026-05-29  7:56 ` [Patch v8 01/23] perf/x86/intel: Validate return value of intel_pmu_init_hybrid() Dapeng Mi
                   ` (23 more replies)
  0 siblings, 24 replies; 44+ messages in thread
From: Dapeng Mi @ 2026-05-29  7:56 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Dapeng Mi

Patch layout:
- Patches 1-6: Bug fixes and cleanup needed before enabling XSAVES-based
  sampling in NMI context
- Patches 7-9: FPU-related preparation, including xsaves_nmi() and
  related cleanup/optimization
- Patches 10-11: PMI-based XMM sampling support through the existing
  sample_regs_intr/sample_regs_user interfaces for both
  PERF_SAMPLE_REGS_INTR and PERF_SAMPLE_REGS_USER
- Patches 12-19: New SIMD register interface and support for
  XMM/YMM/ZMM/OPMASK, APX eGPRs, and SSP through that interface
- Patch 20: Extend arch PEBS to support YMM/ZMM/OPMASK, APX eGPRs, and
  SSP with the new interface
- Patch 21: Enable new interface-based sampling
- Patches 22-23: arch PEBS bug fix and sanity check

Changes since V7:
- Validate the return value of intel_pmu_init_hybrid() (Patch 01/23).
- Replace pt_regs with x86_perf_regs in xen_pmu_irq_handler()
  (Patch 06/23).
- Improve event_has_extended_regs() (Patch 09/23).
- Explicitly ensure the allocated XSAVE area is 64-byte aligned
  (Patch 10/23, Sashiko).
- Clear the SIMD register pointers in x86_user_regs to avoid exposing
  stale register data to user space (Patch 11/23, Sashiko).
- Refine the SIMD register interface and sample data layout, and add the
  missing SIMD data reservation in perf_prepare_sample() for non-x86
  architectures (Patch 12/23, Sashiko).
- Improve perf_simd_reg_validate() for x86 (Patch 13/23, Sashiko).
- Refine SSP sampling and ensure the GPR sub-group flag is set for PEBS
  (Patch 19/23, Sashiko).
- Fix the incorrect large-PEBS check for XMM (Patch 20/23, Sashiko).
- Fix missing handling in x86_pmu_handle_guest_pebs() for back-to-back
  PMI detection (Patch 22/23, Sashiko).
- Strengthen the PEBS record header sanity checks to prevent invalid
  memory access (Patch 23/23, Sashiko).

Changes since V6:
- Fix a potential overwrite issue in the hybrid PMU structure (patch 01/24)
- Restrict PEBS events to GP counters if no PEBS baseline is suggested
  (patch 02/24)
- Use per-cpu x86_intr_regs for perf_event_nmi_handler() instead of a
  temporary variable (patch 06/24)
- Add helper update_fpu_state_and_flag() to ensure TIF_NEED_FPU_LOAD is
  set after save_fpregs_to_fpstate() call (patch 09/24)
- Optimize and simplify x86_pmu_sample_xregs(), etc. (patch 11/24)
- Add macro word_for_each_set_bit() to simplify u64 set-bit iteration
  (patch 13/24)
- Add sanity check for PEBS fragment size (patch 24/24)

Changes since V5:
- Introduce 3 commits to fix newly found PEBS issues (Patch 01~03/19)
- Address Peter's comments, including:
  * Fully support user-regs sampling of the SIMD/eGPRs/SSP registers
  * Adjust newly added fields in perf_event_attr to avoid holes
  * Fix the endian issue introduced by for_each_set_bit() in
    event/core.c
  * Remove some unnecessary macros from UAPI header perf_regs.h
  * Enhance b2b NMI detection for all PEBS handlers so that they all
    behave identically
- Split out the perf-tools patches, which will be posted in a separate
  patchset later

Changes since V4:
- Rewrite some function comments and commit messages (Dave)
- Add arch-PEBS based SIMD/eGPRs/SSP sampling support (Patch 15/19)
- Fix the "suspicious NMI" warning observed on PTL/NVL P-cores and DMR by
  activating the back-to-back NMI detection mechanism (Patch 16/19)
- Fix some minor issues in the perf-tool patches (Patch 18/19)

Changes since V3:
- Drop the SIMD registers if an NMI hits kernel mode for REGS_USER.
- Only dump the available regs, rather than zeroing and dumping the
  unavailable regs. The dumped registers may therefore be a subset of
  the requested registers.
- Some minor updates to address Dapeng's comments in V3.

Changes since V2:
- Use the FPU format for the x86_pmu.ext_regs_mask as well
- Add a check before invoking xsaves_nmi()
- Add perf_simd_reg_check() to retrieve the number of available
  registers. If the kernel fails to get the requested registers, e.g.,
  because XSAVES fails, nothing is dumped to user space (V2 dumped all 0s).
- Add POC perf tool patches

Changes since V1:
- Apply the new interfaces to configure and dump the SIMD registers
- Utilize the existing FPU functions, e.g., xstate_calculate_size,
  get_xsave_addr().


This series adds support on x86 for sampling SIMD registers, APX eGPRs,
and SSP with both PMI-based and PEBS-based sampling.

Starting with Intel Ice Lake, PEBS can sample XMM registers, but PMI-based
XMM sampling is still not available. On newer Intel platforms with
architectural PEBS support, such as Clearwater Forest and Diamond Rapids,
the hardware also gains support for sampling additional SIMD state
(XMM/YMM/ZMM/OPMASK), APX extended GPRs, and SSP.

To support these registers consistently across both PMI and PEBS, this
series makes the following changes:

1. Adds a new perf_event_attr interface for SIMD register selection.
   The existing sample_regs_user/sample_regs_intr bitmaps do not have
   enough space to represent the full SIMD register set, so this series
   introduces dedicated fields for SIMD and predicate register masks and
   element widths.

2. Introduces a new sample data layout for SIMD register data.
   SIMD register payload is appended after the GPR payload, and a new ABI
   flag, PERF_SAMPLE_REGS_ABI_SIMD, indicates its presence.

3. Adds xsaves_nmi() to allow SIMD/eGPR/SSP sampling from PMI handlers in
   NMI context.

4. Extends the arch PEBS path to support YMM/ZMM/OPMASK, APX eGPRs, and
   SSP sampling.


New perf_event_attr fields
--------------------------

This series adds the following fields to perf_event_attr:

    /*
     * Defines the sampling SIMD/PRED(predicate) register bitmaps and
     * qword (8-byte) lengths.
     *
     * sample_simd_regs_enabled != 0 indicates SIMD/PRED registers are
     * requested. The register bitmaps and element sizes are described by:
     *
     *   sample_simd_{vec,pred}_reg_{intr,user}
     *   sample_simd_{vec,pred}_reg_qwords
     *
     * sample_simd_regs_enabled == 0 indicates no SIMD/PRED registers are
     * requested.
     */
    __u16 sample_simd_regs_enabled;
    __u16 sample_simd_pred_reg_qwords;
    __u16 sample_simd_vec_reg_qwords;
    __u16 __reserved_4;

    __u32 sample_simd_pred_reg_intr;
    __u32 sample_simd_pred_reg_user;
    __u64 sample_simd_vec_reg_intr;
    __u64 sample_simd_vec_reg_user;

Field semantics:
- sample_simd_vec_reg_qwords: qword count for regular SIMD registers
- sample_simd_pred_reg_qwords: qword count for predicate registers
- sample_simd_vec_reg_{intr,user}: SIMD register masks for
  PERF_SAMPLE_REGS_INTR and PERF_SAMPLE_REGS_USER
- sample_simd_pred_reg_{intr,user}: predicate register masks for
  PERF_SAMPLE_REGS_INTR and PERF_SAMPLE_REGS_USER
- sample_simd_regs_enabled: indicates whether the new SIMD fields are in use

Examples:

To sample ZMM registers for PERF_SAMPLE_REGS_INTR:

    sample_simd_regs_enabled = 1
    sample_simd_vec_reg_qwords = 8          // 512 bits = 8 qwords
    sample_simd_vec_reg_intr = 0xffffffff   // zmm0-zmm31

To sample OPMASK registers for PERF_SAMPLE_REGS_USER:

    sample_simd_regs_enabled = 1
    sample_simd_pred_reg_qwords = 1         // 64 bits = 1 qword
    sample_simd_pred_reg_user = 0xff        // opmask0-opmask7
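
The two configurations above can be sketched in user-space C. This is only
an illustration: struct simd_attr_sketch is a hypothetical mirror of the new
fields (it is not the real perf_event_attr), and simd_max_qwords() is a
made-up helper that computes the largest SIMD block a sample could carry,
assuming 4 metadata qwords plus one entry per requested register qword.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the new fields, not the real UAPI struct. */
struct simd_attr_sketch {
	uint16_t sample_simd_regs_enabled;
	uint16_t sample_simd_pred_reg_qwords;
	uint16_t sample_simd_vec_reg_qwords;
	uint16_t reserved;
	uint32_t sample_simd_pred_reg_intr;
	uint32_t sample_simd_pred_reg_user;
	uint64_t sample_simd_vec_reg_intr;
	uint64_t sample_simd_vec_reg_user;
};

/* Number of set bits in a register mask. */
static unsigned int weight64(uint64_t mask)
{
	unsigned int n = 0;

	for (; mask; mask &= mask - 1)
		n++;
	return n;
}

/* ZMM0-ZMM31 for PERF_SAMPLE_REGS_INTR: 512 bits = 8 qwords each. */
static struct simd_attr_sketch zmm_intr_request(void)
{
	struct simd_attr_sketch a = {
		.sample_simd_regs_enabled = 1,
		.sample_simd_vec_reg_qwords = 8,
		.sample_simd_vec_reg_intr = 0xffffffffULL,
	};
	return a;
}

/* OPMASK0-OPMASK7 for PERF_SAMPLE_REGS_USER: 64 bits = 1 qword each. */
static struct simd_attr_sketch opmask_user_request(void)
{
	struct simd_attr_sketch a = {
		.sample_simd_regs_enabled = 1,
		.sample_simd_pred_reg_qwords = 1,
		.sample_simd_pred_reg_user = 0xff,
	};
	return a;
}

/*
 * Upper bound, in qwords, of the SIMD block for one register set:
 * 4 metadata qwords plus the requested vector and predicate payload.
 * The kernel may dump fewer registers than requested.
 */
static uint64_t simd_max_qwords(const struct simd_attr_sketch *a,
				uint64_t vec_mask, uint32_t pred_mask)
{
	assert(a != 0);
	if (!a->sample_simd_regs_enabled)
		return 0;
	return 4 +
	       (uint64_t)weight64(vec_mask) * a->sample_simd_vec_reg_qwords +
	       (uint64_t)weight64(pred_mask) * a->sample_simd_pred_reg_qwords;
}
```

Under these assumptions, the ZMM request gives at most 4 + 32 * 8 = 260
qwords and the OPMASK request at most 4 + 8 * 1 = 12 qwords.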

After introducing these fields, bits [63:32] in sample_regs_user and
sample_regs_intr are reclaimed for APX eGPRs and SSP instead of the
previous XMM0-XMM15 encoding.

Discussion of the new SIMD register interface is available at:
https://lore.kernel.org/lkml/20250617081458.GI1613376@noisy.programming.kicks-ass.net/

Sample data layout
------------------

SIMD register data is appended after the GPR data.

For PERF_SAMPLE_REGS_USER:

    { u64 abi;                      // enum perf_sample_regs_abi
      u64 regs[weight(mask)];
      struct {
            u64 nr_vectors;         // 0 ... weight(sample_simd_vec_reg_user)
            u64 vector_qwords;      // 0 ... sample_simd_vec_reg_qwords
            u64 nr_pred;            // 0 ... weight(sample_simd_pred_reg_user)
            u64 pred_qwords;        // 0 ... sample_simd_pred_reg_qwords
            u64 data[nr_vectors * vector_qwords +
                     nr_pred * pred_qwords];
      } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
    }

For PERF_SAMPLE_REGS_INTR:

    { u64 abi;                      // enum perf_sample_regs_abi
      u64 regs[weight(mask)];
      struct {
            u64 nr_vectors;         // 0 ... weight(sample_simd_vec_reg_intr)
            u64 vector_qwords;      // 0 ... sample_simd_vec_reg_qwords
            u64 nr_pred;            // 0 ... weight(sample_simd_pred_reg_intr)
            u64 pred_qwords;        // 0 ... sample_simd_pred_reg_qwords
            u64 data[nr_vectors * vector_qwords +
                     nr_pred * pred_qwords];
      } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
    }

PERF_SAMPLE_REGS_ABI_SIMD indicates that SIMD register data is present.

The metadata fields are encoded as u64 to keep perf tool parsing and
cross-endian support straightforward.
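
A parser for this layout might look like the sketch below. Several points
are assumptions, not taken from the series: SKETCH_REGS_ABI_SIMD is a
placeholder bit (the real PERF_SAMPLE_REGS_ABI_SIMD value comes from the
UAPI header), struct simd_view and parse_regs_block() are hypothetical
names, and vector data is assumed to precede predicate data inside data[].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder bit; the real PERF_SAMPLE_REGS_ABI_SIMD value may differ. */
#define SKETCH_REGS_ABI_SIMD	(1ULL << 2)

struct simd_view {
	uint64_t abi;
	const uint64_t *regs;		/* GPR payload, nr_gprs entries */
	uint64_t nr_vectors, vector_qwords;
	uint64_t nr_pred, pred_qwords;
	const uint64_t *vec_data;	/* nr_vectors * vector_qwords entries */
	const uint64_t *pred_data;	/* nr_pred * pred_qwords entries */
};

/*
 * Walk one regs block of len qwords. nr_gprs is weight(mask) of the
 * matching sample_regs_{user,intr}. Returns the number of qwords
 * consumed, or 0 if the buffer is too short.
 */
static size_t parse_regs_block(const uint64_t *buf, size_t len,
			       size_t nr_gprs, struct simd_view *out)
{
	size_t pos = 0, vec, pred;

	assert(buf && out);
	*out = (struct simd_view){0};

	if (len < 1)
		return 0;
	out->abi = buf[pos++];
	if (out->abi == 0)	/* PERF_SAMPLE_REGS_ABI_NONE: nothing follows */
		return pos;

	if (len - pos < nr_gprs)
		return 0;
	out->regs = buf + pos;
	pos += nr_gprs;

	if (!(out->abi & SKETCH_REGS_ABI_SIMD))
		return pos;

	/* Four u64 metadata fields precede the SIMD data. */
	if (len - pos < 4)
		return 0;
	out->nr_vectors = buf[pos++];
	out->vector_qwords = buf[pos++];
	out->nr_pred = buf[pos++];
	out->pred_qwords = buf[pos++];

	/* Bounds checks are written as divisions to avoid overflow. */
	if (out->vector_qwords &&
	    out->nr_vectors > (len - pos) / out->vector_qwords)
		return 0;
	vec = (size_t)(out->nr_vectors * out->vector_qwords);
	if (out->pred_qwords &&
	    out->nr_pred > (len - pos - vec) / out->pred_qwords)
		return 0;
	pred = (size_t)(out->nr_pred * out->pred_qwords);

	out->vec_data = buf + pos;
	out->pred_data = buf + pos + vec;
	return pos + vec + pred;
}
```

Because the metadata carries the actual counts, a consumer does not need to
assume that every requested register was dumped.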

Example
-------

  $ perf record -I?
  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27
  R28 R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7

  $ perf record --user-regs=?
  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27
  R28 R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7

  $ perf record -e branches:p \
        -Iax,bx,r8,r16,r31,ssp,xmm,ymm,zmm,opmask \
        -c 100000 ./test
  $ perf report -D

  ...
  14027761992115 0xcf30 [0x8a8]: PERF_RECORD_SAMPLE(IP, 0x1): 29964/29964:
  0xffffffff9f085e24 period: 100000 addr: 0
  ... intr regs: mask 0x18001010003 ABI 64-bit
  .... AX    0xdffffc0000000000
  .... BX    0xffff8882297685e8
  .... R8    0x0000000000000000
  .... R16   0x0000000000000000
  .... R31   0x0000000000000000
  .... SSP   0x0000000000000000
  ... SIMD ABI nr_vectors 32 vector_qwords 8 nr_pred 8 pred_qwords 1
  .... ZMM[0][0] 0x616c2f656d6f682f
  .... ZMM[0][1] 0x696c2f7265737562
  ...
  .... ZMM[31][7] 0x0000000000000000
  .... OPMASK[0] 0x00000000fffffe00
  ....
  .... OPMASK[7] 0x0000000000000000
  ...
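
The "intr regs" mask in the output above (0x18001010003) can be decoded
with a small sketch like the one below. Bits 0-23 follow the existing x86
perf register numbering; the positions used for R16-R31 (bits 24-39) and
SSP (bit 40) are inferred from this example output and are an assumption.
reg_names and decode_gpr_mask() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bits 24-40 are inferred from the example output, not from the header. */
static const char *const reg_names[] = {
	"AX", "BX", "CX", "DX", "SI", "DI", "BP", "SP",
	"IP", "FLAGS", "CS", "SS", "DS", "ES", "FS", "GS",
	"R8", "R9", "R10", "R11", "R12", "R13", "R14", "R15",
	"R16", "R17", "R18", "R19", "R20", "R21", "R22", "R23",
	"R24", "R25", "R26", "R27", "R28", "R29", "R30", "R31",
	"SSP",
};

#define NR_REG_NAMES	(sizeof(reg_names) / sizeof(reg_names[0]))

/* Store the names of the set bits in out[]; returns how many were stored. */
static size_t decode_gpr_mask(uint64_t mask, const char **out, size_t max)
{
	size_t n = 0;
	unsigned int bit;

	assert(out != NULL);
	for (bit = 0; bit < 64 && n < max; bit++) {
		if (!(mask & (1ULL << bit)))
			continue;
		out[n++] = bit < NR_REG_NAMES ? reg_names[bit] : "?";
	}
	return n;
}
```

With these assumptions, 0x18001010003 has bits 0, 1, 16, 24, 39, and 40
set, which matches the six registers listed: AX, BX, R8, R16, R31, and SSP.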

Testing
-------

The following intr-regs, user-regs, and combined sampling tests were run
on DMR and NVL. The sampled register data was reported correctly and no
issues were observed.

  $ ./perf record -e branches:p \
        -Iax,bx,r8,r16,r31,ssp,xmm,ymm,zmm,opmask -b -c 10000 sleep 1

  $ ./perf record -e branches \
        -Iax,bx,r8,r16,r31,ssp,xmm,ymm,zmm,opmask -b -c 10000 sleep 1

  $ ./perf record -e branches:p \
        --user-regs=ax,bx,r8,r16,r31,ssp,xmm,ymm,zmm,opmask \
        -b -c 10000 sleep 1

  $ ./perf record -e branches \
        --user-regs=ax,bx,r8,r16,r31,ssp,xmm,ymm,zmm,opmask \
        -b -c 10000 sleep 1

  $ ./perf record -e branches:p \
        -Ixmm,ymm,zmm,opmask \
        --user-regs=ax,bx,r8,r16,r31,ssp \
        -b -c 10000 sleep 1

  $ ./perf record -e branches:p \
        --user-regs=xmm,ymm,zmm,opmask \
        -Iax,bx,r8,r16,r31,ssp \
        -b -c 10000 sleep 1

  $ ./perf record -e branches:p \
        -Iax,bx,r9,r17,r30,ssp \
        --user-regs=ax,bx,r8,r16,r31,ssp \
        -b -c 10000 sleep 1

  $ ./perf record -e branches:p \
        -Ixmm,opmask --user-regs=zmm \
        -b -c 10000 taskset -c 0 sleep 1


History:
  v7: https://lore.kernel.org/all/20260324004118.3772171-1-dapeng1.mi@linux.intel.com/
  v6: https://lore.kernel.org/all/20260209072047.2180332-1-dapeng1.mi@linux.intel.com/
  v5: https://lore.kernel.org/all/20251203065500.2597594-1-dapeng1.mi@linux.intel.com/
  v4: https://lore.kernel.org/all/20250925061213.178796-1-dapeng1.mi@linux.intel.com/
  v3: https://lore.kernel.org/lkml/20250815213435.1702022-1-kan.liang@linux.intel.com/
  v2: https://lore.kernel.org/lkml/20250626195610.405379-1-kan.liang@linux.intel.com/
  v1: https://lore.kernel.org/lkml/20250613134943.3186517-1-kan.liang@linux.intel.com/

Dapeng Mi (19):
  perf/x86/intel: Validate return value of intel_pmu_init_hybrid()
  perf/x86: Move hybrid PMU initialization before x86_pmu_starting_cpu()
  perf/x86/intel: Enable large PEBS sampling for XMMs
  perf/x86/intel: Convert x86_perf_regs to per-cpu variables
  perf: Eliminate duplicate arch-specific functions definations
  perf/x86: Use x86_perf_regs in the x86 nmi handlers
  x86/fpu: Ensure TIF_NEED_FPU_LOAD is set after saving FPU state
  perf/x86: Enable XMM Register Sampling for Non-PEBS Events
  perf/x86: Enable XMM register sampling for REGS_USER case
  perf/x86: Support XMM sampling using sample_simd_vec_reg_* fields
  perf/x86: Support YMM sampling using sample_simd_vec_reg_* fields
  perf/x86: Support ZMM sampling using sample_simd_vec_reg_* fields
  perf/x86: Support OPMASK sampling using sample_simd_pred_reg_* fields
  perf: Enhance perf_reg_validate() with simd_enabled argument
  perf/x86: Support eGPRs sampling using sample_regs_* fields
  perf/x86: Support SSP sampling using sample_regs_* fields
  perf/x86/intel: Support arch-PEBS based SIMD/eGPRs/SSP sampling
  perf/x86: Activate back-to-back NMI detection for arch-PEBS induced
    NMIs
  perf/x86/intel: Add sanity check for PEBS fragment size

Kan Liang (4):
  x86/fpu/xstate: Add xsaves_nmi() helper
  perf: Move and enhance has_extended_regs() for arch-specific use
  perf: Add sampling support for SIMD registers
  perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS capability

 arch/arm/kernel/perf_regs.c           |   8 +-
 arch/arm64/kernel/perf_regs.c         |   8 +-
 arch/csky/kernel/perf_regs.c          |   8 +-
 arch/loongarch/kernel/perf_regs.c     |   8 +-
 arch/mips/kernel/perf_regs.c          |   8 +-
 arch/parisc/kernel/perf_regs.c        |   8 +-
 arch/powerpc/perf/perf_regs.c         |   2 +-
 arch/riscv/kernel/perf_regs.c         |   8 +-
 arch/s390/kernel/perf_regs.c          |   2 +-
 arch/x86/events/core.c                | 415 +++++++++++++++++++++++++-
 arch/x86/events/intel/core.c          | 232 ++++++++++++--
 arch/x86/events/intel/ds.c            | 235 +++++++++++----
 arch/x86/events/perf_event.h          |  85 +++++-
 arch/x86/include/asm/fpu/sched.h      |   5 +-
 arch/x86/include/asm/fpu/xstate.h     |   3 +
 arch/x86/include/asm/msr-index.h      |   7 +
 arch/x86/include/asm/perf_event.h     |  35 ++-
 arch/x86/include/uapi/asm/perf_regs.h |  51 ++++
 arch/x86/kernel/fpu/core.c            |  27 +-
 arch/x86/kernel/fpu/xstate.c          |  25 +-
 arch/x86/kernel/perf_regs.c           | 163 ++++++++--
 arch/x86/xen/pmu.c                    |   5 +-
 include/linux/perf_event.h            |  19 ++
 include/linux/perf_regs.h             |  38 +--
 include/uapi/linux/perf_event.h       |  49 ++-
 kernel/events/core.c                  | 189 ++++++++++--
 26 files changed, 1418 insertions(+), 225 deletions(-)


base-commit: 66cc29745f2f5815482587bb9fbc1e8a3e6fcf00
-- 
2.34.1



* [Patch v8 01/23] perf/x86/intel: Validate return value of intel_pmu_init_hybrid()
  2026-05-29  7:56 [Patch v8 00/23] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
@ 2026-05-29  7:56 ` Dapeng Mi
  2026-05-29  8:53   ` sashiko-bot
  2026-05-29 11:11   ` Peter Zijlstra
  2026-05-29  7:56 ` [Patch v8 02/23] perf/x86: Move hybrid PMU initialization before x86_pmu_starting_cpu() Dapeng Mi
                   ` (22 subsequent siblings)
  23 siblings, 2 replies; 44+ messages in thread
From: Dapeng Mi @ 2026-05-29  7:56 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Dapeng Mi

The memory allocation for the x86_pmu.hybrid_pmu[] array in
intel_pmu_init_hybrid() can theoretically fail due to memory shortages.
If this occurs, the initialization of the x86 hybrid PMU would fail.

Currently, the code does not check the return value of the
intel_pmu_init_hybrid() function, which could lead to attempts to access
the uninitialized x86_pmu.hybrid_pmu[] array, potentially causing a
system panic.

So, add a check for the return value of intel_pmu_init_hybrid() to
prevent invalid memory access in such scenarios.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---

V8: New patch.

 arch/x86/events/intel/core.c | 29 ++++++++++++++++++++++-------
 1 file changed, 22 insertions(+), 7 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 0217e701aeeb..85c329bd52be 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -7870,6 +7870,7 @@ __init int intel_pmu_init(void)
 	int version, i;
 	char *name;
 	struct x86_hybrid_pmu *pmu;
+	int ret;
 
 	/* Architectural Perfmon was introduced starting with Core "Yonah" */
 	if (!cpu_has(&boot_cpu_data, X86_FEATURE_ARCH_PERFMON)) {
@@ -8545,7 +8546,9 @@ __init int intel_pmu_init(void)
 		 *
 		 * Initialize the common PerfMon capabilities here.
 		 */
-		intel_pmu_init_hybrid(hybrid_big_small);
+		ret = intel_pmu_init_hybrid(hybrid_big_small);
+		if (ret < 0)
+			return ret;
 
 		x86_pmu.pebs_latency_data = grt_latency_data;
 		x86_pmu.get_event_constraints = adl_get_event_constraints;
@@ -8603,7 +8606,9 @@ __init int intel_pmu_init(void)
 	case INTEL_METEORLAKE:
 	case INTEL_METEORLAKE_L:
 	case INTEL_ARROWLAKE_U:
-		intel_pmu_init_hybrid(hybrid_big_small);
+		ret = intel_pmu_init_hybrid(hybrid_big_small);
+		if (ret < 0)
+			return ret;
 
 		x86_pmu.pebs_latency_data = cmt_latency_data;
 		x86_pmu.get_event_constraints = mtl_get_event_constraints;
@@ -8634,7 +8639,9 @@ __init int intel_pmu_init(void)
 		pr_cont("Pantherlake Hybrid events, ");
 		name = "pantherlake_hybrid";
 
-		intel_pmu_init_hybrid(hybrid_big_small);
+		ret = intel_pmu_init_hybrid(hybrid_big_small);
+		if (ret < 0)
+			return ret;
 
 		/* Initialize big core specific PerfMon capabilities.*/
 		pmu = &x86_pmu.hybrid_pmu[X86_HYBRID_PMU_CORE_IDX];
@@ -8649,7 +8656,9 @@ __init int intel_pmu_init(void)
 		pr_cont("Arrowlake Hybrid events, ");
 		name = "arrowlake_hybrid";
 
-		intel_pmu_init_hybrid(hybrid_big_small);
+		ret = intel_pmu_init_hybrid(hybrid_big_small);
+		if (ret < 0)
+			return ret;
 
 		/* Initialize big core specific PerfMon capabilities.*/
 		pmu = &x86_pmu.hybrid_pmu[X86_HYBRID_PMU_CORE_IDX];
@@ -8666,7 +8675,9 @@ __init int intel_pmu_init(void)
 		pr_cont("Lunarlake Hybrid events, ");
 		name = "lunarlake_hybrid";
 
-		intel_pmu_init_hybrid(hybrid_big_small);
+		ret = intel_pmu_init_hybrid(hybrid_big_small);
+		if (ret < 0)
+			return ret;
 
 		/* Initialize big core specific PerfMon capabilities.*/
 		pmu = &x86_pmu.hybrid_pmu[X86_HYBRID_PMU_CORE_IDX];
@@ -8691,7 +8702,9 @@ __init int intel_pmu_init(void)
 		break;
 
 	case INTEL_ARROWLAKE_H:
-		intel_pmu_init_hybrid(hybrid_big_small_tiny);
+		ret = intel_pmu_init_hybrid(hybrid_big_small_tiny);
+		if (ret < 0)
+			return ret;
 
 		x86_pmu.pebs_latency_data = arl_h_latency_data;
 		x86_pmu.get_event_constraints = arl_h_get_event_constraints;
@@ -8726,7 +8739,9 @@ __init int intel_pmu_init(void)
 	case INTEL_NOVALAKE_L:
 		pr_cont("Novalake Hybrid events, ");
 		name = "novalake_hybrid";
-		intel_pmu_init_hybrid(hybrid_big_small);
+		ret = intel_pmu_init_hybrid(hybrid_big_small);
+		if (ret < 0)
+			return ret;
 
 		x86_pmu.pebs_latency_data = nvl_latency_data;
 		x86_pmu.get_event_constraints = mtl_get_event_constraints;
-- 
2.34.1



* Re: [Patch v8 01/23] perf/x86/intel: Validate return value of intel_pmu_init_hybrid()
  2026-05-29  7:56 ` [Patch v8 01/23] perf/x86/intel: Validate return value of intel_pmu_init_hybrid() Dapeng Mi
@ 2026-05-29  8:53   ` sashiko-bot
  2026-05-29 11:11   ` Peter Zijlstra
  1 sibling, 0 replies; 44+ messages in thread
From: sashiko-bot @ 2026-05-29  8:53 UTC (permalink / raw)
  To: Dapeng Mi; +Cc: linux-perf-users

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Medium] The patch introduces new early returns on `intel_pmu_init_hybrid()` failure, which leaks previously allocated resources such as the LBR kmem_cache.
--

commit a727042ce1ace4502dce4dc723479df37ff50da5
Author: Dapeng Mi <dapeng1.mi@linux.intel.com>

perf/x86/intel: Validate return value of intel_pmu_init_hybrid()

This commit adds checks for the return value of intel_pmu_init_hybrid()
in intel_pmu_init() to prevent attempts to access an uninitialized
x86_pmu.hybrid_pmu[] array if memory allocation fails.

> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index 0217e701aeebb..85c329bd52be3 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
[ ... ]
> @@ -8545,7 +8546,9 @@ __init int intel_pmu_init(void)
>  		 *
>  		 * Initialize the common PerfMon capabilities here.
>  		 */
> -		intel_pmu_init_hybrid(hybrid_big_small);
> +		ret = intel_pmu_init_hybrid(hybrid_big_small);
> +		if (ret < 0)
> +			return ret;

[Severity: Medium]
Does returning early here leak the LBR kmem_cache?

Earlier in intel_pmu_init(), if the CPU supports it, the LBR cache is
allocated:

	if (boot_cpu_has(X86_FEATURE_ARCH_LBR))
		intel_pmu_arch_lbr_init();

When intel_pmu_init() returns an error, its caller init_hw_perf_events()
will zero out the global x86_pmu struct, but it doesn't appear to call
kmem_cache_destroy() on the newly allocated task_ctx_cache. Could this
leave the cache leaked?

[ ... ]

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260529075645.580362-1-dapeng1.mi@linux.intel.com?part=1


* Re: [Patch v8 01/23] perf/x86/intel: Validate return value of intel_pmu_init_hybrid()
  2026-05-29  7:56 ` [Patch v8 01/23] perf/x86/intel: Validate return value of intel_pmu_init_hybrid() Dapeng Mi
  2026-05-29  8:53   ` sashiko-bot
@ 2026-05-29 11:11   ` Peter Zijlstra
  1 sibling, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2026-05-29 11:11 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao

On Fri, May 29, 2026 at 03:56:23PM +0800, Dapeng Mi wrote:
> The memory allocation for the x86_pmu.hybrid_pmu[] array in
> intel_pmu_init_hybrid() can theoretically fail due to memory shortages.
> If this occurs, the initialization of the x86 hybrid PMU would fail.
> 
> Currently, the code does not check the return value of the
> intel_pmu_init_hybrid() function, which could lead to attempts to access
> the uninitialized x86_pmu.hybrid_pmu[] array, potentially causing a
> system panic.
> 
> So, add a check for the return value of intel_pmu_init_hybrid() to
> prevent invalid memory access in such scenarios.
> 
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> ---
> 
> V8: New patch.
> 
>  arch/x86/events/intel/core.c | 29 ++++++++++++++++++++++-------
>  1 file changed, 22 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index 0217e701aeeb..85c329bd52be 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -7870,6 +7870,7 @@ __init int intel_pmu_init(void)
>  	int version, i;
>  	char *name;
>  	struct x86_hybrid_pmu *pmu;
> +	int ret;
>  
>  	/* Architectural Perfmon was introduced starting with Core "Yonah" */
>  	if (!cpu_has(&boot_cpu_data, X86_FEATURE_ARCH_PERFMON)) {
> @@ -8545,7 +8546,9 @@ __init int intel_pmu_init(void)
>  		 *
>  		 * Initialize the common PerfMon capabilities here.
>  		 */
> -		intel_pmu_init_hybrid(hybrid_big_small);
> +		ret = intel_pmu_init_hybrid(hybrid_big_small);
> +		if (ret < 0)
> +			return ret;

Sashiko notes this will leak the intel_pmu_arch_lbr_init() kmem_cache.

I'm not entirely sure we care much about that, but it's not really nice.


* [Patch v8 02/23] perf/x86: Move hybrid PMU initialization before x86_pmu_starting_cpu()
  2026-05-29  7:56 [Patch v8 00/23] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
  2026-05-29  7:56 ` [Patch v8 01/23] perf/x86/intel: Validate return value of intel_pmu_init_hybrid() Dapeng Mi
@ 2026-05-29  7:56 ` Dapeng Mi
  2026-05-29  8:51   ` sashiko-bot
  2026-05-29  7:56 ` [Patch v8 03/23] perf/x86/intel: Enable large PEBS sampling for XMMs Dapeng Mi
                   ` (21 subsequent siblings)
  23 siblings, 1 reply; 44+ messages in thread
From: Dapeng Mi @ 2026-05-29  7:56 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Dapeng Mi

The current approach initializes hybrid PMU structures immediately before
registering them. This is risky as it can lead to key fields, such as
'capabilities', being inadvertently overwritten.

Although no issues have arisen so far, this method is not ideal. It makes
the PMU structure fields susceptible to being overwritten, especially with
future changes that might initialize fields like 'capabilities' within
init_hybrid_pmu() called by x86_pmu_starting_cpu().

To mitigate this potential problem, move the default hybrid structure
initialization before calling x86_pmu_starting_cpu().
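
The ordering hazard can be illustrated with a minimal user-space sketch.
Everything here is hypothetical (struct pmu_sketch, CAP_FROM_STARTING_CPU,
and both init functions); it only models a template copy that happens
after, versus before, a callback has already set a capability bit.

```c
#include <assert.h>

#define CAP_FROM_STARTING_CPU	0x1u	/* hypothetical capability bit */

struct pmu_sketch {
	unsigned int capabilities;
};

/* Stands in for a hotplug callback that initializes 'capabilities'. */
static void starting_cpu(struct pmu_sketch *p)
{
	assert(p != 0);
	p->capabilities |= CAP_FROM_STARTING_CPU;
}

/* Old order: the template copy runs last and wipes the callback's bit. */
static unsigned int init_copy_late(void)
{
	struct pmu_sketch tmpl = { 0 }, hybrid = { 0 };

	starting_cpu(&hybrid);
	hybrid = tmpl;
	return hybrid.capabilities;
}

/* New order: copy the template first, then let the callback refine it. */
static unsigned int init_copy_early(void)
{
	struct pmu_sketch tmpl = { 0 }, hybrid = { 0 };

	hybrid = tmpl;
	starting_cpu(&hybrid);
	return hybrid.capabilities;
}
```

In this model init_copy_late() returns 0 while init_copy_early() keeps the
bit, which is the behavior the reordering is meant to guarantee.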

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/core.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 4b9e105309c6..17e122e27e0b 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2195,8 +2195,20 @@ static int __init init_hw_perf_events(void)
 
 	pmu.attr_update = x86_pmu.attr_update;
 
-	if (!is_hybrid())
+	if (!is_hybrid()) {
 		x86_pmu_show_pmu_cap(NULL);
+	} else {
+		int i;
+
+		/*
+		 * Init default ops.
+		 * Must be called before registering x86_pmu_starting_cpu(),
+		 * otherwise some key PMU fields, e.g., capabilities
+		 * initialized in x86_pmu_starting_cpu(), would be overwritten.
+		 */
+		for (i = 0; i < x86_pmu.num_hybrid_pmus; i++)
+			x86_pmu.hybrid_pmu[i].pmu = pmu;
+	}
 
 	if (!x86_pmu.read)
 		x86_pmu.read = _x86_pmu_read;
@@ -2243,7 +2255,6 @@ static int __init init_hw_perf_events(void)
 		for (i = 0; i < x86_pmu.num_hybrid_pmus; i++) {
 			hybrid_pmu = &x86_pmu.hybrid_pmu[i];
 
-			hybrid_pmu->pmu = pmu;
 			hybrid_pmu->pmu.type = -1;
 			hybrid_pmu->pmu.attr_update = x86_pmu.attr_update;
 			hybrid_pmu->pmu.capabilities |= PERF_PMU_CAP_EXTENDED_HW_TYPE;
-- 
2.34.1



* Re: [Patch v8 02/23] perf/x86: Move hybrid PMU initialization before x86_pmu_starting_cpu()
  2026-05-29  7:56 ` [Patch v8 02/23] perf/x86: Move hybrid PMU initialization before x86_pmu_starting_cpu() Dapeng Mi
@ 2026-05-29  8:51   ` sashiko-bot
  0 siblings, 0 replies; 44+ messages in thread
From: sashiko-bot @ 2026-05-29  8:51 UTC (permalink / raw)
  To: Dapeng Mi; +Cc: linux-perf-users

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] Use-after-free in `intel_pmu_cpu_dead()` triggered during error rollback if `perf_pmu_register()` fails.
--

commit a6b6c338247bf2e5f630f8fc0cce5709a4fbe801
Author: Dapeng Mi <dapeng1.mi@linux.intel.com>

perf/x86: Move hybrid PMU initialization before x86_pmu_starting_cpu()

This patch moves the default hybrid PMU structure initialization before
calling x86_pmu_starting_cpu(). This change prevents key fields like
capabilities from being inadvertently overwritten during initialization.

> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 4b9e105309c6a..17e122e27e0b7 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
[ ... ]
> @@ -2195,8 +2195,20 @@ static int __init init_hw_perf_events(void)
>  
>  	pmu.attr_update = x86_pmu.attr_update;
>  
> -	if (!is_hybrid())
> +	if (!is_hybrid()) {
>  		x86_pmu_show_pmu_cap(NULL);
> +	} else {
> +		int i;
> +
> +		/*
> +		 * Init default ops.
> +		 * Must be called before registering x86_pmu_starting_cpu(),
> +		 * otherwise some key PMU fields, e.g., capabilities
> +		 * initialized in x86_pmu_starting_cpu(), would be overwritten.
> +		 */
> +		for (i = 0; i < x86_pmu.num_hybrid_pmus; i++)
> +			x86_pmu.hybrid_pmu[i].pmu = pmu;

[Severity: High]
This is a pre-existing issue, but I noticed a potential use-after-free
during error rollback.

During init_hw_perf_events(), the CPU hotplug states are set up and
x86_pmu_starting_cpu() sets cpuc->pmu to point to elements within the
dynamically allocated x86_pmu.hybrid_pmu array.

If the subsequent perf_pmu_register() loop fails, the error path frees the
x86_pmu.hybrid_pmu array via kfree() and jumps to out2 to unregister the
CPU hotplug states:

    if (i < x86_pmu.num_hybrid_pmus) {
        ...
        kfree(x86_pmu.hybrid_pmu);
        ...
        goto out2;
    }
    ...
    out:
        cpuhp_remove_state(CPUHP_PERF_X86_PREPARE);

Tearing down the CPUHP_PERF_X86_PREPARE state triggers x86_pmu_dead_cpu(),
which eventually calls intel_pmu_cpu_dead().

Could tearing down CPUHP_PERF_X86_PREPARE evaluate the dangling cpuc->pmu
pointer in intel_pmu_cpu_dead(), causing cpumask_clear_cpu() to write to
the array that was just freed?

> +	}
>  
>  	if (!x86_pmu.read)
>  		x86_pmu.read = _x86_pmu_read;

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260529075645.580362-1-dapeng1.mi@linux.intel.com?part=2


* [Patch v8 03/23] perf/x86/intel: Enable large PEBS sampling for XMMs
  2026-05-29  7:56 [Patch v8 00/23] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
  2026-05-29  7:56 ` [Patch v8 01/23] perf/x86/intel: Validate return value of intel_pmu_init_hybrid() Dapeng Mi
  2026-05-29  7:56 ` [Patch v8 02/23] perf/x86: Move hybrid PMU initialization before x86_pmu_starting_cpu() Dapeng Mi
@ 2026-05-29  7:56 ` Dapeng Mi
  2026-05-29  7:56 ` [Patch v8 04/23] perf/x86/intel: Convert x86_perf_regs to per-cpu variables Dapeng Mi
                   ` (20 subsequent siblings)
  23 siblings, 0 replies; 44+ messages in thread
From: Dapeng Mi @ 2026-05-29  7:56 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Dapeng Mi

Modern PEBS hardware can sample XMM registers directly, so large PEBS can
be enabled for XMM registers just as it is for GPRs.
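
The effect of the relaxed check can be sketched in user space as follows.
The SKETCH_* masks are illustrative stand-ins for PEBS_GP_REGS and
PERF_REG_EXTENDED_MASK (assumed here to cover AX..FLAGS plus R8..R15, and
bits 32-63, respectively); the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed stand-in for PEBS_GP_REGS: bits 0-9 and 16-23. */
#define SKETCH_GP_REGS		(0x3ffULL | (0xffULL << 16))
/* Assumed stand-in for PERF_REG_EXTENDED_MASK: bits 32-63 (XMM). */
#define SKETCH_EXTENDED_MASK	(~((1ULL << 32) - 1))

/* Old check: any bit outside the GPR set disables large PEBS. */
static bool large_pebs_intr_old(uint64_t sample_regs_intr)
{
	return !(sample_regs_intr & ~SKETCH_GP_REGS);
}

/* New check: XMM bits are also acceptable for large PEBS. */
static bool large_pebs_intr_new(uint64_t sample_regs_intr)
{
	uint64_t allowed = SKETCH_GP_REGS | SKETCH_EXTENDED_MASK;

	assert((allowed & (1ULL << 10)) == 0);	/* CS stays excluded */
	return !(sample_regs_intr & ~allowed);
}
```

A mask requesting AX plus XMM0 (bit 0 and bits 32-33) fails the old check
but passes the new one, while a mask requesting CS (bit 10) fails both.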

Reported-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/intel/core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 85c329bd52be..92cb9a716e83 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4702,7 +4702,8 @@ static unsigned long intel_pmu_large_pebs_flags(struct perf_event *event)
 		flags &= ~PERF_SAMPLE_REGS_USER;
 	if (event->attr.sample_regs_user & ~PEBS_GP_REGS)
 		flags &= ~PERF_SAMPLE_REGS_USER;
-	if (event->attr.sample_regs_intr & ~PEBS_GP_REGS)
+	if (event->attr.sample_regs_intr &
+	    ~(PEBS_GP_REGS | PERF_REG_EXTENDED_MASK))
 		flags &= ~PERF_SAMPLE_REGS_INTR;
 	return flags;
 }
-- 
2.34.1



* [Patch v8 04/23] perf/x86/intel: Convert x86_perf_regs to per-cpu variables
  2026-05-29  7:56 [Patch v8 00/23] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (2 preceding siblings ...)
  2026-05-29  7:56 ` [Patch v8 03/23] perf/x86/intel: Enable large PEBS sampling for XMMs Dapeng Mi
@ 2026-05-29  7:56 ` Dapeng Mi
  2026-05-29  7:56 ` [Patch v8 05/23] perf: Eliminate duplicate arch-specific functions definations Dapeng Mi
                   ` (19 subsequent siblings)
  23 siblings, 0 replies; 44+ messages in thread
From: Dapeng Mi @ 2026-05-29  7:56 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Dapeng Mi

Currently, the intel_pmu_drain_pebs_icl() and intel_pmu_drain_arch_pebs()
helpers define many temporary variables. Upcoming patches will add new
fields like *ymm_regs and *zmm_regs to the x86_perf_regs structure to
support sampling for these SIMD registers. This would increase the stack
size consumed by these helpers, potentially triggering the warning:
"the frame size of 1048 bytes is larger than 1024 bytes
 [-Wframe-larger-than=]".

To eliminate this warning, convert x86_perf_regs to per-cpu variables.

Note that the *_drain_pebs() helpers can't be interrupted by other NMIs,
since they run either in NMI context already or with the PMU disabled.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/intel/ds.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index cb72af9b61ce..a31648d2adb1 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -2933,6 +2933,8 @@ __intel_pmu_pebs_last_event(struct perf_event *event,
 	}
 }
 
+static DEFINE_PER_CPU(struct x86_perf_regs, x86_pebs_regs);
+
 static __always_inline void
 __intel_pmu_pebs_events(struct perf_event *event,
 			struct pt_regs *iregs,
@@ -2942,8 +2944,8 @@ __intel_pmu_pebs_events(struct perf_event *event,
 			setup_fn setup_sample)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
-	struct x86_perf_regs perf_regs;
-	struct pt_regs *regs = &perf_regs.regs;
+	struct x86_perf_regs *perf_regs = this_cpu_ptr(&x86_pebs_regs);
+	struct pt_regs *regs = &perf_regs->regs;
 	void *at = get_next_pebs_record_by_bit(base, top, bit);
 	int cnt = count;
 
@@ -3191,8 +3193,8 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
 	void *last[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS];
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	struct debug_store *ds = cpuc->ds;
-	struct x86_perf_regs perf_regs;
-	struct pt_regs *regs = &perf_regs.regs;
+	struct x86_perf_regs *perf_regs = this_cpu_ptr(&x86_pebs_regs);
+	struct pt_regs *regs = &perf_regs->regs;
 	struct pebs_basic *basic;
 	void *base, *at, *top;
 	u64 mask;
@@ -3242,8 +3244,8 @@ static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
 	void *last[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS];
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	union arch_pebs_index index;
-	struct x86_perf_regs perf_regs;
-	struct pt_regs *regs = &perf_regs.regs;
+	struct x86_perf_regs *perf_regs = this_cpu_ptr(&x86_pebs_regs);
+	struct pt_regs *regs = &perf_regs->regs;
 	void *base, *at, *top;
 	u64 mask;
 
-- 
2.34.1



* [Patch v8 05/23] perf: Eliminate duplicate arch-specific functions definations
  2026-05-29  7:56 [Patch v8 00/23] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (3 preceding siblings ...)
  2026-05-29  7:56 ` [Patch v8 04/23] perf/x86/intel: Convert x86_perf_regs to per-cpu variables Dapeng Mi
@ 2026-05-29  7:56 ` Dapeng Mi
  2026-05-29  7:56 ` [Patch v8 06/23] perf/x86: Use x86_perf_regs in the x86 nmi handlers Dapeng Mi
                   ` (18 subsequent siblings)
  23 siblings, 0 replies; 44+ messages in thread
From: Dapeng Mi @ 2026-05-29  7:56 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Dapeng Mi

Define common default __weak functions for perf_reg_value(),
perf_reg_validate(), perf_reg_abi() and perf_get_regs_user(). This
eliminates the duplicated arch-specific definitions.

No functional change intended.
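
For readers unfamiliar with weak symbols, here is a minimal user-space
sketch of the mechanism, assuming GCC or Clang. The name
demo_reg_validate() is hypothetical; it only mirrors the shape of the
perf_reg_validate() default added in kernel/events/core.c below.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Generic fallback. In the kernel, __weak expands to
 * __attribute__((weak)); it is spelled out here for user space.
 */
__attribute__((weak)) int demo_reg_validate(uint64_t mask)
{
	/* No register sampling support: only an empty mask is accepted. */
	return mask ? -ENOSYS : 0;
}
```

If another object file in the same link provides a regular (strong)
definition of demo_reg_validate(), the linker picks that one and drops
the weak fallback. That is how an architecture's perf_regs.c keeps
overriding the generic defaults.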

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/arm/kernel/perf_regs.c       |  6 ------
 arch/arm64/kernel/perf_regs.c     |  6 ------
 arch/csky/kernel/perf_regs.c      |  6 ------
 arch/loongarch/kernel/perf_regs.c |  6 ------
 arch/mips/kernel/perf_regs.c      |  6 ------
 arch/parisc/kernel/perf_regs.c    |  6 ------
 arch/riscv/kernel/perf_regs.c     |  6 ------
 arch/x86/kernel/perf_regs.c       |  6 ------
 include/linux/perf_regs.h         | 32 ++++++-------------------------
 kernel/events/core.c              | 22 +++++++++++++++++++++
 10 files changed, 28 insertions(+), 74 deletions(-)

diff --git a/arch/arm/kernel/perf_regs.c b/arch/arm/kernel/perf_regs.c
index 0529f90395c9..d575a4c3ca56 100644
--- a/arch/arm/kernel/perf_regs.c
+++ b/arch/arm/kernel/perf_regs.c
@@ -31,9 +31,3 @@ u64 perf_reg_abi(struct task_struct *task)
 	return PERF_SAMPLE_REGS_ABI_32;
 }
 
-void perf_get_regs_user(struct perf_regs *regs_user,
-			struct pt_regs *regs)
-{
-	regs_user->regs = task_pt_regs(current);
-	regs_user->abi = perf_reg_abi(current);
-}
diff --git a/arch/arm64/kernel/perf_regs.c b/arch/arm64/kernel/perf_regs.c
index b4eece3eb17d..70e2f13f587f 100644
--- a/arch/arm64/kernel/perf_regs.c
+++ b/arch/arm64/kernel/perf_regs.c
@@ -98,9 +98,3 @@ u64 perf_reg_abi(struct task_struct *task)
 		return PERF_SAMPLE_REGS_ABI_64;
 }
 
-void perf_get_regs_user(struct perf_regs *regs_user,
-			struct pt_regs *regs)
-{
-	regs_user->regs = task_pt_regs(current);
-	regs_user->abi = perf_reg_abi(current);
-}
diff --git a/arch/csky/kernel/perf_regs.c b/arch/csky/kernel/perf_regs.c
index 09b7f88a2d6a..94601f37b596 100644
--- a/arch/csky/kernel/perf_regs.c
+++ b/arch/csky/kernel/perf_regs.c
@@ -31,9 +31,3 @@ u64 perf_reg_abi(struct task_struct *task)
 	return PERF_SAMPLE_REGS_ABI_32;
 }
 
-void perf_get_regs_user(struct perf_regs *regs_user,
-			struct pt_regs *regs)
-{
-	regs_user->regs = task_pt_regs(current);
-	regs_user->abi = perf_reg_abi(current);
-}
diff --git a/arch/loongarch/kernel/perf_regs.c b/arch/loongarch/kernel/perf_regs.c
index 263ac4ab5af6..8dd604f01745 100644
--- a/arch/loongarch/kernel/perf_regs.c
+++ b/arch/loongarch/kernel/perf_regs.c
@@ -45,9 +45,3 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 	return regs->regs[idx];
 }
 
-void perf_get_regs_user(struct perf_regs *regs_user,
-			struct pt_regs *regs)
-{
-	regs_user->regs = task_pt_regs(current);
-	regs_user->abi = perf_reg_abi(current);
-}
diff --git a/arch/mips/kernel/perf_regs.c b/arch/mips/kernel/perf_regs.c
index e686780d1647..7736d3c5ebd2 100644
--- a/arch/mips/kernel/perf_regs.c
+++ b/arch/mips/kernel/perf_regs.c
@@ -60,9 +60,3 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 	return (s64)v; /* Sign extend if 32-bit. */
 }
 
-void perf_get_regs_user(struct perf_regs *regs_user,
-			struct pt_regs *regs)
-{
-	regs_user->regs = task_pt_regs(current);
-	regs_user->abi = perf_reg_abi(current);
-}
diff --git a/arch/parisc/kernel/perf_regs.c b/arch/parisc/kernel/perf_regs.c
index 10a1a5f06a18..b9fe1f2fcb9b 100644
--- a/arch/parisc/kernel/perf_regs.c
+++ b/arch/parisc/kernel/perf_regs.c
@@ -53,9 +53,3 @@ u64 perf_reg_abi(struct task_struct *task)
 	return PERF_SAMPLE_REGS_ABI_64;
 }
 
-void perf_get_regs_user(struct perf_regs *regs_user,
-			struct pt_regs *regs)
-{
-	regs_user->regs = task_pt_regs(current);
-	regs_user->abi = perf_reg_abi(current);
-}
diff --git a/arch/riscv/kernel/perf_regs.c b/arch/riscv/kernel/perf_regs.c
index fd304a248de6..3bba8deababb 100644
--- a/arch/riscv/kernel/perf_regs.c
+++ b/arch/riscv/kernel/perf_regs.c
@@ -35,9 +35,3 @@ u64 perf_reg_abi(struct task_struct *task)
 #endif
 }
 
-void perf_get_regs_user(struct perf_regs *regs_user,
-			struct pt_regs *regs)
-{
-	regs_user->regs = task_pt_regs(current);
-	regs_user->abi = perf_reg_abi(current);
-}
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 624703af80a1..81204cb7f723 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -100,12 +100,6 @@ u64 perf_reg_abi(struct task_struct *task)
 	return PERF_SAMPLE_REGS_ABI_32;
 }
 
-void perf_get_regs_user(struct perf_regs *regs_user,
-			struct pt_regs *regs)
-{
-	regs_user->regs = task_pt_regs(current);
-	regs_user->abi = perf_reg_abi(current);
-}
 #else /* CONFIG_X86_64 */
 #define REG_NOSUPPORT ((1ULL << PERF_REG_X86_DS) | \
 		       (1ULL << PERF_REG_X86_ES) | \
diff --git a/include/linux/perf_regs.h b/include/linux/perf_regs.h
index f632c5725f16..144bcc3ff19f 100644
--- a/include/linux/perf_regs.h
+++ b/include/linux/perf_regs.h
@@ -9,6 +9,12 @@ struct perf_regs {
 	struct pt_regs	*regs;
 };
 
+u64 perf_reg_value(struct pt_regs *regs, int idx);
+int perf_reg_validate(u64 mask);
+u64 perf_reg_abi(struct task_struct *task);
+void perf_get_regs_user(struct perf_regs *regs_user,
+			struct pt_regs *regs);
+
 #ifdef CONFIG_HAVE_PERF_REGS
 #include <asm/perf_regs.h>
 
@@ -16,35 +22,9 @@ struct perf_regs {
 #define PERF_REG_EXTENDED_MASK	0
 #endif
 
-u64 perf_reg_value(struct pt_regs *regs, int idx);
-int perf_reg_validate(u64 mask);
-u64 perf_reg_abi(struct task_struct *task);
-void perf_get_regs_user(struct perf_regs *regs_user,
-			struct pt_regs *regs);
 #else
 
 #define PERF_REG_EXTENDED_MASK	0
 
-static inline u64 perf_reg_value(struct pt_regs *regs, int idx)
-{
-	return 0;
-}
-
-static inline int perf_reg_validate(u64 mask)
-{
-	return mask ? -ENOSYS : 0;
-}
-
-static inline u64 perf_reg_abi(struct task_struct *task)
-{
-	return PERF_SAMPLE_REGS_ABI_NONE;
-}
-
-static inline void perf_get_regs_user(struct perf_regs *regs_user,
-				      struct pt_regs *regs)
-{
-	regs_user->regs = task_pt_regs(current);
-	regs_user->abi = perf_reg_abi(current);
-}
 #endif /* CONFIG_HAVE_PERF_REGS */
 #endif /* _LINUX_PERF_REGS_H */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 7935d5663944..fedc0cf65f9e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7769,6 +7769,28 @@ unsigned long perf_instruction_pointer(struct perf_event *event,
 	return perf_arch_instruction_pointer(regs);
 }
 
+u64 __weak perf_reg_value(struct pt_regs *regs, int idx)
+{
+	return 0;
+}
+
+int __weak perf_reg_validate(u64 mask)
+{
+	return mask ? -ENOSYS : 0;
+}
+
+u64 __weak perf_reg_abi(struct task_struct *task)
+{
+	return PERF_SAMPLE_REGS_ABI_NONE;
+}
+
+void __weak perf_get_regs_user(struct perf_regs *regs_user,
+			       struct pt_regs *regs)
+{
+	regs_user->regs = task_pt_regs(current);
+	regs_user->abi = perf_reg_abi(current);
+}
+
 static void
 perf_output_sample_regs(struct perf_output_handle *handle,
 			struct pt_regs *regs, u64 mask)
-- 
2.34.1



* [Patch v8 06/23] perf/x86: Use x86_perf_regs in the x86 nmi handlers
  2026-05-29  7:56 [Patch v8 00/23] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (4 preceding siblings ...)
  2026-05-29  7:56 ` [Patch v8 05/23] perf: Eliminate duplicate arch-specific functions definations Dapeng Mi
@ 2026-05-29  7:56 ` Dapeng Mi
  2026-05-29  7:56 ` [Patch v8 07/23] x86/fpu/xstate: Add xsaves_nmi() helper Dapeng Mi
                   ` (17 subsequent siblings)
  23 siblings, 0 replies; 44+ messages in thread
From: Dapeng Mi @ 2026-05-29  7:56 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Dapeng Mi, Kan Liang

More and more registers will need to be captured on counter overflow,
e.g. additional vector registers and SSP. The generic pt_regs struct
cannot store all of them, so use the x86-specific x86_perf_regs instead.

The struct pt_regs *regs is still passed to x86_pmu_handle_irq(). There
is no functional change for the existing code.

AMD IBS's NMI handler doesn't utilize the static call
x86_pmu_handle_irq(). The x86_perf_regs struct doesn't apply to the AMD
IBS. It can be added separately later when AMD IBS supports more regs.
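
The wrapper approach can be sketched in user space as follows. The
demo_* types and helpers are simplified, hypothetical stand-ins (the
real structures have many more fields). The sketch shows how a handler
that only receives a pt_regs pointer can get back to the enclosing
structure with a container_of()-style offset calculation.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct pt_regs. */
struct demo_pt_regs {
	unsigned long ip;
	unsigned long sp;
};

/* Simplified stand-in for struct x86_perf_regs. */
struct demo_x86_perf_regs {
	struct demo_pt_regs regs;	/* what the handlers are handed */
	uint64_t *xmm_regs;		/* extra state pt_regs cannot hold */
};

/* Copy the interrupted registers into the wrapper, return the embedded copy. */
static struct demo_pt_regs *demo_wrap(struct demo_x86_perf_regs *wrapper,
				      const struct demo_pt_regs *regs)
{
	wrapper->regs = *regs;
	wrapper->xmm_regs = NULL;
	return &wrapper->regs;
}

/* User-space equivalent of container_of(regs, struct ..., regs). */
static struct demo_x86_perf_regs *demo_to_perf_regs(struct demo_pt_regs *regs)
{
	return (struct demo_x86_perf_regs *)
		((char *)regs - offsetof(struct demo_x86_perf_regs, regs));
}
```

The recovery step is only valid when the pointer really refers to the
member embedded in a wrapper, which is presumably why every entry point
that feeds the handlers has to copy into a wrapper first.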

Co-developed-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---

V8: Use x86_perf_regs for xen_pmu_irq_handler() as well.

 arch/x86/events/core.c | 5 ++++-
 arch/x86/xen/pmu.c     | 5 ++++-
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 17e122e27e0b..17c8f44ee43b 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1788,9 +1788,11 @@ void perf_put_guest_lvtpc(void)
 EXPORT_SYMBOL_FOR_KVM(perf_put_guest_lvtpc);
 #endif /* CONFIG_PERF_GUEST_MEDIATED_PMU */
 
+static DEFINE_PER_CPU(struct x86_perf_regs, x86_intr_regs);
 static int
 perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
 {
+	struct x86_perf_regs *x86_regs = this_cpu_ptr(&x86_intr_regs);
 	u64 start_clock;
 	u64 finish_clock;
 	int ret;
@@ -1814,7 +1816,8 @@ perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
 		return NMI_DONE;
 
 	start_clock = sched_clock();
-	ret = static_call(x86_pmu_handle_irq)(regs);
+	x86_regs->regs = *regs;
+	ret = static_call(x86_pmu_handle_irq)(&x86_regs->regs);
 	finish_clock = sched_clock();
 
 	perf_sample_event_took(finish_clock - start_clock);
diff --git a/arch/x86/xen/pmu.c b/arch/x86/xen/pmu.c
index 8f89ce0b67e3..de3d7d391a5e 100644
--- a/arch/x86/xen/pmu.c
+++ b/arch/x86/xen/pmu.c
@@ -455,12 +455,14 @@ static void xen_convert_regs(const struct xen_pmu_regs *xen_regs,
 	}
 }
 
+static DEFINE_PER_CPU(struct x86_perf_regs, x86_xen_intr_regs);
 irqreturn_t xen_pmu_irq_handler(int irq, void *dev_id)
 {
 	int err, ret = IRQ_NONE;
 	struct pt_regs regs = {0};
 	const struct xen_pmu_data *xenpmu_data = get_xenpmu_data();
 	uint8_t xenpmu_flags = get_xenpmu_flags();
+	struct x86_perf_regs *x86_regs = this_cpu_ptr(&x86_xen_intr_regs);
 
 	if (!xenpmu_data) {
 		pr_warn_once("%s: pmudata not initialized\n", __func__);
@@ -471,7 +473,8 @@ irqreturn_t xen_pmu_irq_handler(int irq, void *dev_id)
 		xenpmu_flags | XENPMU_IRQ_PROCESSING;
 	xen_convert_regs(&xenpmu_data->pmu.r.regs, &regs,
 			 xenpmu_data->pmu.pmu_flags);
-	if (x86_pmu.handle_irq(&regs))
+	x86_regs->regs = regs;
+	if (x86_pmu.handle_irq(&x86_regs->regs))
 		ret = IRQ_HANDLED;
 
 	/* Write out cached context to HW */
-- 
2.34.1



* [Patch v8 07/23] x86/fpu/xstate: Add xsaves_nmi() helper
  2026-05-29  7:56 [Patch v8 00/23] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (5 preceding siblings ...)
  2026-05-29  7:56 ` [Patch v8 06/23] perf/x86: Use x86_perf_regs in the x86 nmi handlers Dapeng Mi
@ 2026-05-29  7:56 ` Dapeng Mi
  2026-05-29  8:56   ` sashiko-bot
  2026-05-29 11:32   ` Peter Zijlstra
  2026-05-29  7:56 ` [Patch v8 08/23] x86/fpu: Ensure TIF_NEED_FPU_LOAD is set after saving FPU state Dapeng Mi
                   ` (16 subsequent siblings)
  23 siblings, 2 replies; 44+ messages in thread
From: Dapeng Mi @ 2026-05-29  7:56 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

Add xsaves_nmi() to save the supported xsave states in an NMI handler.

This function is similar to xsaves(), but should only be called within
an NMI handler. It returns the actual register contents at the moment
the NMI occurs.

Currently the perf subsystem is the sole user of this helper. It uses
this function to snapshot the SIMD (XMM/YMM/ZMM) and APX eGPR registers;
that support is added in subsequent patches.
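
XSAVES takes its 64-bit component mask split across EDX:EAX, which is
why the helper below passes (u32)mask and (u32)(mask >> 32). A small
user-space sketch of that split follows. The demo_* names are
hypothetical, and DEMO_HIGH_BIT is an arbitrary bit used only to
exercise the upper half, not a real xfeature.

```c
#include <stdint.h>

#define DEMO_XFEATURE_SSE  (1ULL << 1)	/* XMM state component */
#define DEMO_XFEATURE_YMM  (1ULL << 2)	/* upper halves of YMM */
#define DEMO_HIGH_BIT      (1ULL << 32)	/* arbitrary, for illustration only */

struct demo_rfbm {
	uint32_t lo;	/* goes into EAX */
	uint32_t hi;	/* goes into EDX */
};

/* Split a 64-bit feature mask into the two 32-bit halves. */
static struct demo_rfbm demo_split_mask(uint64_t mask)
{
	struct demo_rfbm r = {
		.lo = (uint32_t)mask,
		.hi = (uint32_t)(mask >> 32),
	};
	return r;
}
```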

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/include/asm/fpu/xstate.h |  1 +
 arch/x86/kernel/fpu/xstate.c      | 23 +++++++++++++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 7a7dc9d56027..38fa8ff26559 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -110,6 +110,7 @@ int xfeature_size(int xfeature_nr);
 
 void xsaves(struct xregs_state *xsave, u64 mask);
 void xrstors(struct xregs_state *xsave, u64 mask);
+void xsaves_nmi(struct xregs_state *xsave, u64 mask);
 
 int xfd_enable_feature(u64 xfd_err);
 
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index a7b6524a9dea..4394091c4791 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1474,6 +1474,29 @@ void xrstors(struct xregs_state *xstate, u64 mask)
 	WARN_ON_ONCE(err);
 }
 
+/**
+ * xsaves_nmi - Save selected components to a kernel xstate buffer in NMI
+ * @xstate:	Pointer to the buffer
+ * @mask:	Feature mask to select the components to save
+ *
+ * This function is similar to xsaves(), but should only be called within
+ * a NMI handler. This function returns the actual register contents at
+ * the moment the NMI occurs.
+ *
+ * Currently, the perf subsystem is the sole user of this helper. It uses
+ * the function to snapshot SIMD (XMM/YMM/ZMM) and APX eGPRs registers.
+ */
+void xsaves_nmi(struct xregs_state *xstate, u64 mask)
+{
+	int err;
+
+	if (!in_nmi())
+		return;
+
+	XSTATE_OP(XSAVES, xstate, (u32)mask, (u32)(mask >> 32), err);
+	WARN_ON_ONCE(err);
+}
+
 #if IS_ENABLED(CONFIG_KVM)
 void fpstate_clear_xstate_component(struct fpstate *fpstate, unsigned int xfeature)
 {
-- 
2.34.1



* Re: [Patch v8 07/23] x86/fpu/xstate: Add xsaves_nmi() helper
  2026-05-29  7:56 ` [Patch v8 07/23] x86/fpu/xstate: Add xsaves_nmi() helper Dapeng Mi
@ 2026-05-29  8:56   ` sashiko-bot
  2026-05-29 11:32   ` Peter Zijlstra
  1 sibling, 0 replies; 44+ messages in thread
From: sashiko-bot @ 2026-05-29  8:56 UTC (permalink / raw)
  To: Dapeng Mi; +Cc: linux-perf-users

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] Executing XSAVES blindly from an NMI context captures in-flight kernel FPU state, leading to a severe information leak via perf REGS_INTR.
--

commit d34a5f4e0fb88fb36d75fc06f907af1f26922ffe
Author: Kan Liang <kan.liang@linux.intel.com>

x86/fpu/xstate: Add xsaves_nmi() helper

This commit introduces the xsaves_nmi() function to allow saving supported
xsave states directly from an NMI handler. It is currently intended for use
by the perf subsystem to snapshot SIMD and APX eGPRs registers.

> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index a7b6524a9dea2..4394091c47910 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c

[ ... ]

> @@ -1474,6 +1474,29 @@ void xrstors(struct xregs_state *xstate, u64 mask)
>  	WARN_ON_ONCE(err);
>  }
>  
> +/**
> + * xsaves_nmi - Save selected components to a kernel xstate buffer in NMI
> + * @xstate:	Pointer to the buffer
> + * @mask:	Feature mask to select the components to save
> + *
> + * This function is similar to xsaves(), but should only be called within
> + * a NMI handler. This function returns the actual register contents at
> + * the moment the NMI occurs.
> + *
> + * Currently, the perf subsystem is the sole user of this helper. It uses
> + * the function to snapshot SIMD (XMM/YMM/ZMM) and APX eGPRs registers.
> + */
> +void xsaves_nmi(struct xregs_state *xstate, u64 mask)
> +{
> +	int err;
> +
> +	if (!in_nmi())
> +		return;
> +
> +	XSTATE_OP(XSAVES, xstate, (u32)mask, (u32)(mask >> 32), err);

[Severity: High]
Could executing XSAVES directly from an NMI context capture in-flight kernel
FPU state and lead to an information leak via perf REGS_INTR?

If an unprivileged user creates a hardware perf event with exclude_kernel=1
and requests PERF_SAMPLE_REGS_INTR, they bypass the perf_allow_kernel()
privilege check. Due to hardware skid, a PMI might be delivered after the CPU
has already transitioned into kernel space.

If the interrupted kernel context is actively using the FPU (for example,
inside kernel_fpu_begin() for cryptographic operations), the hardware FPU
registers would contain kernel data.

Since this helper captures the hardware FPU state unconditionally, and the
perf subsystem does not discard or filter REGS_INTR samples based on
privilege when a skid into kernel mode occurs, will this export sensitive
kernel SIMD registers to the unprivileged user's ring buffer?

> +	WARN_ON_ONCE(err);
> +}
> +

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260529075645.580362-1-dapeng1.mi@linux.intel.com?part=7


* Re: [Patch v8 07/23] x86/fpu/xstate: Add xsaves_nmi() helper
  2026-05-29  7:56 ` [Patch v8 07/23] x86/fpu/xstate: Add xsaves_nmi() helper Dapeng Mi
  2026-05-29  8:56   ` sashiko-bot
@ 2026-05-29 11:32   ` Peter Zijlstra
  1 sibling, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2026-05-29 11:32 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang

On Fri, May 29, 2026 at 03:56:29PM +0800, Dapeng Mi wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
> 
> Add xsaves_nmi() to save supported xsave states in NMI handler.
> 
> This function is similar to xsaves(), but should only be called within
> a NMI handler. This function returns the actual register contents at
> the moment the NMI occurs.
> 
> Currently the perf subsystem is the sole user of this helper. It uses
> this function to snapshot SIMD (XMM/YMM/ZMM) and APX eGPRs registers
> which would be added in subsequent patches.
> 
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> ---
>  arch/x86/include/asm/fpu/xstate.h |  1 +
>  arch/x86/kernel/fpu/xstate.c      | 23 +++++++++++++++++++++++
>  2 files changed, 24 insertions(+)
> 
> diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
> index 7a7dc9d56027..38fa8ff26559 100644
> --- a/arch/x86/include/asm/fpu/xstate.h
> +++ b/arch/x86/include/asm/fpu/xstate.h
> @@ -110,6 +110,7 @@ int xfeature_size(int xfeature_nr);
>  
>  void xsaves(struct xregs_state *xsave, u64 mask);
>  void xrstors(struct xregs_state *xsave, u64 mask);
> +void xsaves_nmi(struct xregs_state *xsave, u64 mask);
>  
>  int xfd_enable_feature(u64 xfd_err);
>  
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index a7b6524a9dea..4394091c4791 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -1474,6 +1474,29 @@ void xrstors(struct xregs_state *xstate, u64 mask)
>  	WARN_ON_ONCE(err);
>  }
>  
> +/**
> + * xsaves_nmi - Save selected components to a kernel xstate buffer in NMI
> + * @xstate:	Pointer to the buffer
> + * @mask:	Feature mask to select the components to save
> + *
> + * This function is similar to xsaves(), but should only be called within
> + * a NMI handler. This function returns the actual register contents at
> + * the moment the NMI occurs.
> + *
> + * Currently, the perf subsystem is the sole user of this helper. It uses
> + * the function to snapshot SIMD (XMM/YMM/ZMM) and APX eGPRs registers.
> + */
> +void xsaves_nmi(struct xregs_state *xstate, u64 mask)
> +{
> +	int err;
> +
> +	if (!in_nmi())
> +		return;
> +
> +	XSTATE_OP(XSAVES, xstate, (u32)mask, (u32)(mask >> 32), err);
> +	WARN_ON_ONCE(err);
> +}

Sashiko raises a fun point vs skid; if an exclude_kernel=1 event trips
inside the kernel this can potentially leak a whole pile of kernel regs.

But of course the same thing is true for the existing setup. So perhaps
that doesn't need to concern us now.

There used to be discussions about this case, and I think we had generic
code to sanitize such boundary events, but I can't seem to find that in
the current tree.

Mark, ISTR you were involved at some point, any idea what happened?


* [Patch v8 08/23] x86/fpu: Ensure TIF_NEED_FPU_LOAD is set after saving FPU state
  2026-05-29  7:56 [Patch v8 00/23] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (6 preceding siblings ...)
  2026-05-29  7:56 ` [Patch v8 07/23] x86/fpu/xstate: Add xsaves_nmi() helper Dapeng Mi
@ 2026-05-29  7:56 ` Dapeng Mi
  2026-05-29  7:56 ` [Patch v8 09/23] perf: Move and enhance has_extended_regs() for arch-specific use Dapeng Mi
                   ` (15 subsequent siblings)
  23 siblings, 0 replies; 44+ messages in thread
From: Dapeng Mi @ 2026-05-29  7:56 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Dapeng Mi

Following Peter and Dave's suggestion, ensure that the TIF_NEED_FPU_LOAD
flag is always set after saving the FPU state. This guarantees that the
user space FPU state has been saved whenever the TIF_NEED_FPU_LOAD flag
is set.

A subsequent patch will verify if the user space FPU state can be
retrieved from the saved task FPU state in the NMI context by checking
the TIF_NEED_FPU_LOAD flag.
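
The intended ordering can be modelled in user space as below. This is
only a sketch under stated assumptions: the demo_* names are
hypothetical, and C11 release/acquire atomics are used here to make the
"save first, publish the flag last" rule explicit. The kernel code in
this patch does not use these primitives.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DEMO_STATE_SIZE 16

struct demo_task {
	uint8_t saved_fpstate[DEMO_STATE_SIZE];	/* models fpu->fpstate->regs */
	atomic_bool need_fpu_load;		/* models TIF_NEED_FPU_LOAD */
};

/* Writer: save the live registers first, then set the flag. */
static void demo_save_then_flag(struct demo_task *t, const uint8_t *live_regs)
{
	memcpy(t->saved_fpstate, live_regs, DEMO_STATE_SIZE);
	atomic_store_explicit(&t->need_fpu_load, true, memory_order_release);
}

/* Reader (NMI model): trust the saved copy only once the flag is set. */
static const uint8_t *demo_pick_user_state(struct demo_task *t,
					   const uint8_t *live_regs)
{
	if (atomic_load_explicit(&t->need_fpu_load, memory_order_acquire))
		return t->saved_fpstate;
	return live_regs;
}
```

With the opposite order (flag first, save second), a reader that
interrupts the writer between the two steps would see the flag set and
read a buffer that has not been filled yet.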

See the link below for more background on the suggestion.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/all/20251204154721.GB2619703@noisy.programming.kicks-ass.net/
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/include/asm/fpu/sched.h |  5 +++--
 arch/x86/kernel/fpu/core.c       | 27 ++++++++++++++++++++-------
 2 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/fpu/sched.h b/arch/x86/include/asm/fpu/sched.h
index 89004f4ca208..dcb2fa5f06d6 100644
--- a/arch/x86/include/asm/fpu/sched.h
+++ b/arch/x86/include/asm/fpu/sched.h
@@ -10,6 +10,8 @@
 #include <asm/trace/fpu.h>
 
 extern void save_fpregs_to_fpstate(struct fpu *fpu);
+extern void update_fpu_state_and_flag(struct fpu *fpu,
+				      struct task_struct *task);
 extern void fpu__drop(struct task_struct *tsk);
 extern int  fpu_clone(struct task_struct *dst, u64 clone_flags, bool minimal,
 		      unsigned long shstk_addr);
@@ -36,8 +38,7 @@ static inline void switch_fpu(struct task_struct *old, int cpu)
 	    !(old->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		struct fpu *old_fpu = x86_task_fpu(old);
 
-		set_tsk_thread_flag(old, TIF_NEED_FPU_LOAD);
-		save_fpregs_to_fpstate(old_fpu);
+		update_fpu_state_and_flag(old_fpu, old);
 		/*
 		 * The save operation preserved register state, so the
 		 * fpu_fpregs_owner_ctx is still @old_fpu. Store the
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 608983806fd7..48d1ab50a961 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -213,6 +213,19 @@ void restore_fpregs_from_fpstate(struct fpstate *fpstate, u64 mask)
 	}
 }
 
+/*
+ * Save the FPU register state in fpu->fpstate->regs and set
+ * TIF_NEED_FPU_LOAD subsequently.
+ *
+ * Must be called with fpregs_lock() held, ensuring flag
+ * TIF_NEED_FPU_LOAD is set last.
+ */
+void update_fpu_state_and_flag(struct fpu *fpu, struct task_struct *task)
+{
+	save_fpregs_to_fpstate(fpu);
+	set_tsk_thread_flag(task, TIF_NEED_FPU_LOAD);
+}
+
 void fpu_reset_from_exception_fixup(void)
 {
 	restore_fpregs_from_fpstate(&init_fpstate, XFEATURE_MASK_FPSTATE);
@@ -379,17 +392,19 @@ int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest)
 
 	fpregs_lock();
 	if (!cur_fps->is_confidential && !test_thread_flag(TIF_NEED_FPU_LOAD))
-		save_fpregs_to_fpstate(fpu);
+		update_fpu_state_and_flag(fpu, current);
 
 	/* Swap fpstate */
 	if (enter_guest) {
-		fpu->__task_fpstate = cur_fps;
+		WRITE_ONCE(fpu->__task_fpstate, cur_fps);
+		barrier();
 		fpu->fpstate = guest_fps;
 		guest_fps->in_use = true;
 	} else {
 		guest_fps->in_use = false;
 		fpu->fpstate = fpu->__task_fpstate;
-		fpu->__task_fpstate = NULL;
+		barrier();
+		WRITE_ONCE(fpu->__task_fpstate, NULL);
 	}
 
 	cur_fps = fpu->fpstate;
@@ -481,10 +496,8 @@ void kernel_fpu_begin_mask(unsigned int kfpu_mask)
 	this_cpu_write(kernel_fpu_allowed, false);
 
 	if (!(current->flags & (PF_KTHREAD | PF_USER_WORKER)) &&
-	    !test_thread_flag(TIF_NEED_FPU_LOAD)) {
-		set_thread_flag(TIF_NEED_FPU_LOAD);
-		save_fpregs_to_fpstate(x86_task_fpu(current));
-	}
+	    !test_thread_flag(TIF_NEED_FPU_LOAD))
+		update_fpu_state_and_flag(x86_task_fpu(current), current);
 	__cpu_invalidate_fpregs_state();
 
 	/* Put sane initial values into the control registers. */
-- 
2.34.1



* [Patch v8 09/23] perf: Move and enhance has_extended_regs() for arch-specific use
  2026-05-29  7:56 [Patch v8 00/23] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (7 preceding siblings ...)
  2026-05-29  7:56 ` [Patch v8 08/23] x86/fpu: Ensure TIF_NEED_FPU_LOAD is set after saving FPU state Dapeng Mi
@ 2026-05-29  7:56 ` Dapeng Mi
  2026-05-29  7:56 ` [Patch v8 10/23] perf/x86: Enable XMM Register Sampling for Non-PEBS Events Dapeng Mi
                   ` (14 subsequent siblings)
  23 siblings, 0 replies; 44+ messages in thread
From: Dapeng Mi @ 2026-05-29  7:56 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

Move has_extended_regs() to include/linux/perf_event.h so it can be used
by arch-specific code.

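While moving it, tighten the check so that a register mask only counts
when the matching PERF_SAMPLE_REGS_USER or PERF_SAMPLE_REGS_INTR bit is
set in sample_type, and rename it to event_has_extended_regs() to match
existing perf event helper naming.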
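
The difference between the old and the new check can be modelled in
user space. This is only a sketch, not the kernel code: struct
attr_model, the MODEL_* constants and their values, and the helper
names are assumptions made for illustration (the real
PERF_REG_EXTENDED_MASK is architecture-specific). It shows that the new
helper ignores a register mask whose sample type is not requested.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins; the values are assumptions, not kernel definitions. */
#define MODEL_SAMPLE_REGS_USER	(1ULL << 0)
#define MODEL_SAMPLE_REGS_INTR	(1ULL << 1)
#define MODEL_REG_EXTENDED_MASK	(~((1ULL << 32) - 1))

struct attr_model {
	uint64_t sample_type;
	uint64_t sample_regs_user;
	uint64_t sample_regs_intr;
};

/* Old logic: looks at the register masks only. */
static inline bool old_has_extended_regs(const struct attr_model *attr)
{
	return (attr->sample_regs_user & MODEL_REG_EXTENDED_MASK) ||
	       (attr->sample_regs_intr & MODEL_REG_EXTENDED_MASK);
}

/* New logic: a mask counts only if its sample type is requested. */
static inline bool new_has_extended_regs(const struct attr_model *attr)
{
	return ((attr->sample_type & MODEL_SAMPLE_REGS_USER) &&
		(attr->sample_regs_user & MODEL_REG_EXTENDED_MASK)) ||
	       ((attr->sample_type & MODEL_SAMPLE_REGS_INTR) &&
		(attr->sample_regs_intr & MODEL_REG_EXTENDED_MASK));
}

/* A mask left set while sample_type is 0 is seen by the old check only. */
static inline void check_stale_mask(void)
{
	struct attr_model attr = { .sample_regs_user = 1ULL << 32 };

	assert(old_has_extended_regs(&attr));
	assert(!new_has_extended_regs(&attr));
}
```

Calling check_stale_mask() exercises the one case where the two
helpers disagree in this model.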

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 include/linux/perf_event.h | 10 ++++++++++
 kernel/events/core.c       |  8 +-------
 2 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 48d851fbd8ea..fb38affa7352 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1534,6 +1534,16 @@ perf_event__output_id_sample(struct perf_event *event,
 extern void
 perf_log_lost_samples(struct perf_event *event, u64 lost);
 
+static inline bool event_has_extended_regs(struct perf_event *event)
+{
+	struct perf_event_attr *attr = &event->attr;
+
+	return ((attr->sample_type & PERF_SAMPLE_REGS_USER) &&
+		(attr->sample_regs_user & PERF_REG_EXTENDED_MASK)) ||
+	       ((attr->sample_type & PERF_SAMPLE_REGS_INTR) &&
+		(attr->sample_regs_intr & PERF_REG_EXTENDED_MASK));
+}
+
 static inline bool event_has_any_exclude_flag(struct perf_event *event)
 {
 	struct perf_event_attr *attr = &event->attr;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index fedc0cf65f9e..2ce553db4dcb 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -13056,12 +13056,6 @@ int perf_pmu_unregister(struct pmu *pmu)
 }
 EXPORT_SYMBOL_GPL(perf_pmu_unregister);
 
-static inline bool has_extended_regs(struct perf_event *event)
-{
-	return (event->attr.sample_regs_user & PERF_REG_EXTENDED_MASK) ||
-	       (event->attr.sample_regs_intr & PERF_REG_EXTENDED_MASK);
-}
-
 static int perf_try_init_event(struct pmu *pmu, struct perf_event *event)
 {
 	struct perf_event_context *ctx = NULL;
@@ -13096,7 +13090,7 @@ static int perf_try_init_event(struct pmu *pmu, struct perf_event *event)
 		goto err_pmu;
 
 	if (!(pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS) &&
-	    has_extended_regs(event)) {
+	    event_has_extended_regs(event)) {
 		ret = -EOPNOTSUPP;
 		goto err_destroy;
 	}
-- 
2.34.1



* [Patch v8 10/23] perf/x86: Enable XMM Register Sampling for Non-PEBS Events
  2026-05-29  7:56 [Patch v8 00/23] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (8 preceding siblings ...)
  2026-05-29  7:56 ` [Patch v8 09/23] perf: Move and enhance has_extended_regs() for arch-specific use Dapeng Mi
@ 2026-05-29  7:56 ` Dapeng Mi
  2026-05-29  9:02   ` sashiko-bot
  2026-05-29 11:38   ` Peter Zijlstra
  2026-05-29  7:56 ` [Patch v8 11/23] perf/x86: Enable XMM register sampling for REGS_USER case Dapeng Mi
                   ` (13 subsequent siblings)
  23 siblings, 2 replies; 44+ messages in thread
From: Dapeng Mi @ 2026-05-29  7:56 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Dapeng Mi, Kan Liang

Previously, XMM register sampling was only available for PEBS events
starting from Icelake. Extend the support to non-PEBS events by using
the xsaves instruction, thereby completing the feature set.

To implement this, a 64-byte aligned buffer is required. A per-CPU
ext_regs_buf is introduced to store SIMD and other registers, with an
approximate size of 2K. The buffer is allocated using kzalloc_node()
with 64 extra bytes, and get_ext_regs_buf() rounds the pointer up with
PTR_ALIGN() so that the XSAVE area is always 64-byte aligned.
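
The diff in this patch adds 64 bytes to the allocation size and applies
PTR_ALIGN(buf, 64) when the buffer is fetched. A user-space sketch of
that over-allocate-and-round-up pattern follows. It assumes calloc() in
place of kzalloc_node(), and align_up() and alloc_aligned_area() are
hypothetical names, not kernel helpers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define XSAVE_ALIGN 64u

/* Round a pointer up to the next multiple of align (a power of two). */
static inline void *align_up(void *ptr, uintptr_t align)
{
	uintptr_t addr = (uintptr_t)ptr;

	assert(align && (align & (align - 1)) == 0);
	return (void *)((addr + align - 1) & ~(align - 1));
}

/*
 * Allocate payload + XSAVE_ALIGN bytes so that an aligned region of
 * payload bytes always fits inside the allocation. *raw receives the
 * pointer that must later be passed to free().
 */
static inline void *alloc_aligned_area(size_t payload, void **raw)
{
	*raw = calloc(1, payload + XSAVE_ALIGN);
	return *raw ? align_up(*raw, XSAVE_ALIGN) : NULL;
}
```

The raw pointer has to be kept because free() (like kfree()) needs the
address that was originally returned, which is why the patch stores the
unaligned pointer per CPU and aligns it only on use.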

XMM sampling for non-PEBS events is supported in the REGS_INTR case.
Support for REGS_USER will be added in a subsequent patch. For PEBS
events, XMM register sampling data is directly retrieved from PEBS
records.

Future support for additional vector registers (YMM/ZMM/OPMASK) is
planned. An `ext_regs_mask` is added to track the supported vector
register groups.

Co-developed-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/core.c            | 181 ++++++++++++++++++++++++++++--
 arch/x86/events/intel/core.c      |  36 +++++-
 arch/x86/events/intel/ds.c        |  18 ++-
 arch/x86/events/perf_event.h      |  13 +++
 arch/x86/include/asm/fpu/xstate.h |   2 +
 arch/x86/include/asm/perf_event.h |   5 +-
 arch/x86/kernel/fpu/xstate.c      |   2 +-
 7 files changed, 240 insertions(+), 17 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 17c8f44ee43b..c219a563434d 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -410,6 +410,56 @@ set_ext_hw_attr(struct hw_perf_event *hwc, struct perf_event *event)
 	return x86_pmu_extra_regs(val, event);
 }
 
+static DEFINE_PER_CPU(struct xregs_state *, ext_regs_buf);
+
+static void release_ext_regs_buffers(void)
+{
+	int cpu;
+
+	if (!x86_pmu.ext_regs_mask)
+		return;
+
+	for_each_possible_cpu(cpu) {
+		kfree(per_cpu(ext_regs_buf, cpu));
+		per_cpu(ext_regs_buf, cpu) = NULL;
+	}
+}
+
+static void reserve_ext_regs_buffers(void)
+{
+	bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED);
+	unsigned int size;
+	int cpu;
+
+	if (!x86_pmu.ext_regs_mask)
+		return;
+
+	/* +64 bytes for the 64 bytes alignment request of xsave area. */
+	size = xstate_calculate_size(x86_pmu.ext_regs_mask, compacted) + 64;
+
+	for_each_possible_cpu(cpu) {
+		per_cpu(ext_regs_buf, cpu) = kzalloc_node(size, GFP_KERNEL,
+							  cpu_to_node(cpu));
+		if (WARN_ON_ONCE(!per_cpu(ext_regs_buf, cpu)))
+			goto err;
+	}
+
+	return;
+
+err:
+	release_ext_regs_buffers();
+}
+
+static inline struct xregs_state *get_ext_regs_buf(int cpu)
+{
+	void *buf = per_cpu(ext_regs_buf, cpu);
+	struct xregs_state *xsave;
+
+	xsave = buf ? PTR_ALIGN(buf, 64) : NULL;
+
+	return xsave;
+}
+
 int x86_reserve_hardware(void)
 {
 	int err = 0;
@@ -422,6 +472,7 @@ int x86_reserve_hardware(void)
 			} else {
 				reserve_ds_buffers();
 				reserve_lbr_buffers();
+				reserve_ext_regs_buffers();
 			}
 		}
 		if (!err)
@@ -438,6 +489,7 @@ void x86_release_hardware(void)
 		release_pmc_hardware();
 		release_ds_buffers();
 		release_lbr_buffers();
+		release_ext_regs_buffers();
 		mutex_unlock(&pmc_reserve_mutex);
 	}
 }
@@ -655,18 +707,26 @@ int x86_pmu_hw_config(struct perf_event *event)
 			return -EINVAL;
 	}
 
-	/* sample_regs_user never support XMM registers */
-	if (unlikely(event->attr.sample_regs_user & PERF_REG_EXTENDED_MASK))
-		return -EINVAL;
-	/*
-	 * Besides the general purpose registers, XMM registers may
-	 * be collected in PEBS on some platforms, e.g. Icelake
-	 */
-	if (unlikely(event->attr.sample_regs_intr & PERF_REG_EXTENDED_MASK)) {
-		if (!(event->pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS))
-			return -EINVAL;
+	if (event->attr.sample_type & PERF_SAMPLE_REGS_INTR) {
+		/*
+		 * Besides the general purpose registers, XMM registers may
+		 * be collected as well.
+		 */
+		if (event->attr.sample_regs_intr & PERF_REG_EXTENDED_MASK) {
+			if (!(event->pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS))
+				return -EINVAL;
+			if (is_sampling_event(event) && !event->attr.precise_ip &&
+			    !this_cpu_has(X86_FEATURE_XSAVES))
+				return -EINVAL;
+		}
+	}
 
-		if (!event->attr.precise_ip)
+	if (event->attr.sample_type & PERF_SAMPLE_REGS_USER) {
+		/*
+		 * Currently XMM registers sampling for REGS_USER is not
+		 * supported yet.
+		 */
+		if (event->attr.sample_regs_user & PERF_REG_EXTENDED_MASK)
 			return -EINVAL;
 	}
 
@@ -1705,6 +1765,105 @@ static void x86_pmu_del(struct perf_event *event, int flags)
 	static_call_cond(x86_pmu_del)(event);
 }
 
+void x86_pmu_clear_perf_regs(struct pt_regs *regs)
+{
+	struct x86_perf_regs *perf_regs = container_of(regs, struct x86_perf_regs, regs);
+
+	perf_regs->xmm_regs = NULL;
+}
+
+static void update_perf_regs(struct x86_perf_regs *perf_regs,
+			     struct xregs_state *xsave, u64 bitmap)
+{
+	u64 mask;
+
+	if (!xsave)
+		return;
+
+	/* Filtered by what XSAVE really gives */
+	mask = bitmap & xsave->header.xfeatures;
+
+	if (mask & XFEATURE_MASK_SSE)
+		perf_regs->xmm_space = xsave->i387.xmm_space;
+}
+
+/*
+ * The x86 specific variant of perf_sample_regs_intr().
+ * It would be extended to add more SIMD registers sampling support
+ * in later patches.
+ */
+static void x86_pmu_update_regs_intr(struct perf_event *event,
+				     struct perf_sample_data *data,
+				     struct pt_regs *regs)
+{
+	data->regs_intr.regs = regs;
+	data->regs_intr.abi  = perf_reg_abi(current);
+
+	data->dyn_size += sizeof(u64);
+	if (data->regs_intr.regs) {
+		data->dyn_size += hweight64(event->attr.sample_regs_intr) *
+				  sizeof(u64);
+	}
+
+	/*
+	 * Set PERF_SAMPLE_REGS_INTR to bypass perf_sample_regs_intr() call
+	 * in perf_prepare_sample() function.
+	 */
+	data->sample_flags |= PERF_SAMPLE_REGS_INTR;
+}
+
+static void x86_pmu_sample_xregs(struct perf_event *event,
+				 struct perf_sample_data *data,
+				 u64 ignore_mask)
+{
+	struct xregs_state *xsave = get_ext_regs_buf(smp_processor_id());
+	u64 sample_type = event->attr.sample_type;
+	struct x86_perf_regs *perf_regs;
+	u64 intr_mask = 0;
+	u64 mask = 0;
+
+	if (WARN_ON_ONCE(!xsave) || !in_nmi())
+		return;
+
+	if ((sample_type & PERF_SAMPLE_REGS_INTR) &&
+	    (event->attr.sample_regs_intr & PERF_REG_EXTENDED_MASK))
+		mask |= XFEATURE_MASK_SSE;
+
+	mask &= x86_pmu.ext_regs_mask;
+
+	if (sample_type & PERF_SAMPLE_REGS_INTR)
+		intr_mask = mask & ~ignore_mask;
+
+	if (intr_mask) {
+		perf_regs = container_of(data->regs_intr.regs,
+					 struct x86_perf_regs, regs);
+		xsave->header.xfeatures = 0;
+		xsaves_nmi(xsave, mask);
+		update_perf_regs(perf_regs, xsave, intr_mask);
+	}
+}
+
+void x86_pmu_update_perf_regs(struct perf_event *event,
+			      struct perf_sample_data *data,
+			      struct pt_regs *regs,
+			      u64 ignore_mask)
+{
+	u64 sample_type = event->attr.sample_type;
+
+	if (!((sample_type & PERF_SAMPLE_REGS_INTR) &&
+	      (event->attr.sample_regs_intr & PERF_REG_EXTENDED_MASK)))
+		return;
+
+	if (sample_type & PERF_SAMPLE_REGS_INTR)
+		x86_pmu_update_regs_intr(event, data, regs);
+
+	/*
+	 * ignore_mask indicates the PEBS sampled extended regs
+	 * which are unnecessary to sample again.
+	 */
+	x86_pmu_sample_xregs(event, data, ignore_mask);
+}
+
 int x86_pmu_handle_irq(struct pt_regs *regs)
 {
 	struct perf_sample_data data;
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 92cb9a716e83..f5d458e3ba3f 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3928,6 +3928,9 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
 		if (has_branch_stack(event))
 			intel_pmu_lbr_save_brstack(&data, cpuc, event);
 
+		x86_pmu_clear_perf_regs(regs);
+		x86_pmu_update_perf_regs(event, &data, regs, 0);
+
 		perf_event_overflow(event, &data, regs);
 	}
 
@@ -6176,8 +6179,37 @@ static inline void __intel_update_large_pebs_flags(struct pmu *pmu)
 	}
 }
 
-#define counter_mask(_gp, _fixed) ((_gp) | ((u64)(_fixed) << INTEL_PMC_IDX_FIXED))
+static void intel_extended_regs_init(struct pmu *pmu)
+{
+	struct pmu *dest_pmu = pmu ? pmu : x86_get_pmu(smp_processor_id());
+
+	/*
+	 * Extend the vector registers support to non-PEBS.
+	 * The feature is limited to newer Intel machines with
+	 * PEBS V4+ or archPerfmonExt (0x23) enabled for now.
+	 * In theory, the vector registers can be retrieved as
+	 * long as the CPU supports. The support for the old
+	 * generations may be added later if there is a
+	 * requirement.
+	 * Only support the extension when XSAVES is available.
+	 */
+	if (!boot_cpu_has(X86_FEATURE_XSAVES))
+		return;
 
+	if (!boot_cpu_has(X86_FEATURE_XMM) ||
+	    !cpu_has_xfeatures(XFEATURE_MASK_SSE, NULL))
+		return;
+
+	/*
+	 * On current hybrid platforms, P-cores and E-cores expose the same
+	 * XSAVE feature set. Therefore, using the global x86_pmu.ext_regs_mask
+	 * is sufficient to represent the hardware-supported XSAVE features.
+	 */
+	x86_pmu.ext_regs_mask |= XFEATURE_MASK_SSE;
+	dest_pmu->capabilities |= PERF_PMU_CAP_EXTENDED_REGS;
+}
+
+#define counter_mask(_gp, _fixed) ((_gp) | ((u64)(_fixed) << INTEL_PMC_IDX_FIXED))
 static void update_pmu_cap(struct pmu *pmu)
 {
 	unsigned int eax, ebx, ecx, edx;
@@ -6241,6 +6273,8 @@ static void update_pmu_cap(struct pmu *pmu)
 		/* Perf Metric (Bit 15) and PEBS via PT (Bit 16) are hybrid enumeration */
 		rdmsrq(MSR_IA32_PERF_CAPABILITIES, hybrid(pmu, intel_cap).capabilities);
 	}
+
+	intel_extended_regs_init(pmu);
 }
 
 static void intel_pmu_check_hybrid_pmus(struct x86_hybrid_pmu *pmu)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index a31648d2adb1..4f72ce6a9585 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -2524,6 +2524,7 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
 	struct pebs_meminfo *meminfo = NULL;
 	struct pebs_gprs *gprs = NULL;
 	struct x86_perf_regs *perf_regs;
+	u64 ignore_mask = 0;
 	u64 format_group;
 	u16 retire;
 
@@ -2531,7 +2532,7 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
 		return;
 
 	perf_regs = container_of(regs, struct x86_perf_regs, regs);
-	perf_regs->xmm_regs = NULL;
+	x86_pmu_clear_perf_regs(regs);
 
 	format_group = basic->format_group;
 
@@ -2578,6 +2579,7 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
 	if (format_group & PEBS_DATACFG_XMMS) {
 		struct pebs_xmm *xmm = next_record;
 
+		ignore_mask |= XFEATURE_MASK_SSE;
 		next_record = xmm + 1;
 		perf_regs->xmm_regs = xmm->xmm;
 	}
@@ -2616,6 +2618,8 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
 		next_record += nr * sizeof(u64);
 	}
 
+	x86_pmu_update_perf_regs(event, data, regs, ignore_mask);
+
 	WARN_ONCE(next_record != __pebs + basic->format_size,
 			"PEBS record size %u, expected %llu, config %llx\n",
 			basic->format_size,
@@ -2641,6 +2645,7 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
 	struct arch_pebs_aux *meminfo = NULL;
 	struct arch_pebs_gprs *gprs = NULL;
 	struct x86_perf_regs *perf_regs;
+	u64 ignore_mask = 0;
 	void *next_record;
 	void *at = __pebs;
 
@@ -2648,7 +2653,7 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
 		return;
 
 	perf_regs = container_of(regs, struct x86_perf_regs, regs);
-	perf_regs->xmm_regs = NULL;
+	x86_pmu_clear_perf_regs(regs);
 
 	__setup_perf_sample_data(event, iregs, data);
 
@@ -2703,6 +2708,7 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
 
 		next_record += sizeof(struct arch_pebs_xer_header);
 
+		ignore_mask |= XFEATURE_MASK_SSE;
 		xmm = next_record;
 		perf_regs->xmm_regs = xmm->xmm;
 		next_record = xmm + 1;
@@ -2750,6 +2756,8 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
 		at = at + header->size;
 		goto again;
 	}
+
+	x86_pmu_update_perf_regs(event, data, regs, ignore_mask);
 }
 
 static inline void *
@@ -3412,7 +3420,11 @@ static void __init intel_ds_pebs_init(void)
 				x86_pmu.flags |= PMU_FL_PEBS_ALL;
 				x86_pmu.pebs_capable = ~0ULL;
 				pebs_qual = "-baseline";
-				x86_get_pmu(smp_processor_id())->capabilities |= PERF_PMU_CAP_EXTENDED_REGS;
+				if (boot_cpu_has(X86_FEATURE_XSAVES)) {
+					x86_pmu.ext_regs_mask |= XFEATURE_MASK_SSE;
+					x86_get_pmu(smp_processor_id())->capabilities |=
+							PERF_PMU_CAP_EXTENDED_REGS;
+				}
 			} else {
 				/* Only basic record supported */
 				x86_pmu.large_pebs_flags &=
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index eae24bb35dc1..cff5fbac000b 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1028,6 +1028,12 @@ struct x86_pmu {
 	struct extra_reg *extra_regs;
 	unsigned int flags;
 
+	/*
+	 * Extended regs, e.g., vector registers
+	 * Utilize the same format as the XFEATURE_MASK_*
+	 */
+	u64		ext_regs_mask;
+
 	/*
 	 * Intel host/guest support (KVM)
 	 */
@@ -1314,6 +1320,13 @@ void x86_pmu_enable_event(struct perf_event *event);
 
 int x86_pmu_handle_irq(struct pt_regs *regs);
 
+void x86_pmu_clear_perf_regs(struct pt_regs *regs);
+
+void x86_pmu_update_perf_regs(struct perf_event *event,
+			      struct perf_sample_data *data,
+			      struct pt_regs *regs,
+			      u64 ignore_mask);
+
 void x86_pmu_show_pmu_cap(struct pmu *pmu);
 
 static inline int x86_pmu_num_counters(struct pmu *pmu)
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 38fa8ff26559..19dec5f0b1c7 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -112,6 +112,8 @@ void xsaves(struct xregs_state *xsave, u64 mask);
 void xrstors(struct xregs_state *xsave, u64 mask);
 void xsaves_nmi(struct xregs_state *xsave, u64 mask);
 
+unsigned int xstate_calculate_size(u64 xfeatures, bool compacted);
+
 int xfd_enable_feature(u64 xfd_err);
 
 #ifdef CONFIG_X86_64
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 752cb319d5ea..e47a963a7cf0 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -726,7 +726,10 @@ extern void perf_events_lapic_init(void);
 struct pt_regs;
 struct x86_perf_regs {
 	struct pt_regs	regs;
-	u64		*xmm_regs;
+	union {
+		u64	*xmm_regs;
+		u32	*xmm_space;	/* for xsaves */
+	};
 };
 
 extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 4394091c4791..4cef802c2e02 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -587,7 +587,7 @@ static bool __init check_xstate_against_struct(int nr)
 	return true;
 }
 
-static unsigned int xstate_calculate_size(u64 xfeatures, bool compacted)
+unsigned int xstate_calculate_size(u64 xfeatures, bool compacted)
 {
 	unsigned int topmost = fls64(xfeatures) -  1;
 	unsigned int offset, i;
-- 
2.34.1



* Re: [Patch v8 10/23] perf/x86: Enable XMM Register Sampling for Non-PEBS Events
  2026-05-29  7:56 ` [Patch v8 10/23] perf/x86: Enable XMM Register Sampling for Non-PEBS Events Dapeng Mi
@ 2026-05-29  9:02   ` sashiko-bot
  2026-05-29 11:38   ` Peter Zijlstra
  1 sibling, 0 replies; 44+ messages in thread
From: sashiko-bot @ 2026-05-29  9:02 UTC (permalink / raw)
  To: Dapeng Mi; +Cc: linux-perf-users

Thank you for your contribution! Sashiko AI review found 3 potential issues to consider:

New issues:
- [High] reserve_ext_regs_buffers() fails to propagate memory allocation errors, leading to a predictable WARN_ON_ONCE in NMI context.
- [High] PEBS events on older platforms (e.g., Skylake) improperly accept `PERF_REG_EXTENDED_MASK`, bypassing capability checks.

Pre-existing issues:
- [High] Software PMU events (or hardware events requesting `PERF_SAMPLE_REGS_USER`) can request XMM registers, causing `perf_reg_value()` to perform an out-of-bounds read on the kernel stack.
--

commit cee33ccbb52239e18206f79aef67c4e7ac96ca59
Author: Dapeng Mi <dapeng1.mi@linux.intel.com>

perf/x86: Enable XMM Register Sampling for Non-PEBS Events

This commit extends XMM register sampling to non-PEBS events by utilizing the
xsaves instruction. It introduces a per-CPU 64-byte aligned buffer to store
SIMD and other registers, allocating it during hardware reservation. The XMM
data is then retrieved during the REGS_INTR case.

> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 17c8f44ee43b3..c219a563434d4 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c

[ ... ]

> @@ -410,6 +410,56 @@ set_ext_hw_attr(struct hw_perf_event *hwc, struct perf_event *event)

[ ... ]

> +static void reserve_ext_regs_buffers(void)
> +{
> +	bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED);
> +	unsigned int size;
> +	int cpu;
> +
> +	if (!x86_pmu.ext_regs_mask)
> +		return;
> +
> +	/* +64 bytes for the 64 bytes alignment request of xsave area. */
> +	size = xstate_calculate_size(x86_pmu.ext_regs_mask, compacted) + 64;
> +
> +	for_each_possible_cpu(cpu) {
> +		per_cpu(ext_regs_buf, cpu) = kzalloc_node(size, GFP_KERNEL,
> +							  cpu_to_node(cpu));
> +		if (WARN_ON_ONCE(!per_cpu(ext_regs_buf, cpu)))
> +			goto err;
> +	}
> +
> +	return;
> +
> +err:
> +	release_ext_regs_buffers();
> +}

[Severity: High]
Does reserve_ext_regs_buffers() need to propagate the allocation failure back
to the caller?

When kzalloc_node() fails, this function frees any previously allocated buffers
and returns void. The caller, x86_reserve_hardware(), then assumes success
and allows the perf_event_open syscall to succeed. Later, when an event
requiring extended registers triggers a PMI, x86_pmu_sample_xregs() retrieves
a NULL buffer and triggers a WARN_ON_ONCE in NMI context.

Could this allow a local user to deliberately exhaust memory and repeatedly
open a perf event to trigger the warning? If the system has panic_on_warn
enabled, this could escalate to a kernel panic.
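
One possible shape for propagating the failure, sketched in user space
under assumptions: model_reserve_bufs(), model_release_bufs(),
NR_MODEL_CPUS and the fail_at parameter are hypothetical, and calloc()
stands in for kzalloc_node(). This is not a proposed patch, only an
illustration of returning -ENOMEM instead of void.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define NR_MODEL_CPUS 4

static void *model_bufs[NR_MODEL_CPUS];

/* Free every per-CPU buffer and clear the slots. */
static inline void model_release_bufs(void)
{
	for (int cpu = 0; cpu < NR_MODEL_CPUS; cpu++) {
		free(model_bufs[cpu]);
		model_bufs[cpu] = NULL;
	}
}

/*
 * Allocate one buffer per modelled CPU. fail_at simulates an allocation
 * failure on that CPU (pass -1 for none). On failure everything already
 * allocated is released and -ENOMEM is returned to the caller.
 */
static inline int model_reserve_bufs(size_t size, int fail_at)
{
	assert(size > 0);

	for (int cpu = 0; cpu < NR_MODEL_CPUS; cpu++) {
		model_bufs[cpu] = (cpu == fail_at) ? NULL : calloc(1, size);
		if (!model_bufs[cpu]) {
			model_release_bufs();
			return -ENOMEM;
		}
	}

	return 0;
}
```

With an int return value, a caller in the position of
x86_reserve_hardware() could fail event creation up front rather than
discovering the NULL buffer at sample time.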

[ ... ]

> @@ -655,18 +707,26 @@ int x86_pmu_hw_config(struct perf_event *event)

[ ... ]

> +	if (event->attr.sample_type & PERF_SAMPLE_REGS_INTR) {
> +		/*
> +		 * Besides the general purpose registers, XMM registers may
> +		 * be collected as well.
> +		 */
> +		if (event->attr.sample_regs_intr & PERF_REG_EXTENDED_MASK) {
> +			if (!(event->pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS))
> +				return -EINVAL;
> +			if (is_sampling_event(event) && !event->attr.precise_ip &&
> +			    !this_cpu_has(X86_FEATURE_XSAVES))
> +				return -EINVAL;
> +		}
> +	}
>  
> -		if (!event->attr.precise_ip)
> +	if (event->attr.sample_type & PERF_SAMPLE_REGS_USER) {
> +		/*
> +		 * Currently XMM registers sampling for REGS_USER is not
> +		 * supported yet.
> +		 */
> +		if (event->attr.sample_regs_user & PERF_REG_EXTENDED_MASK)
>  			return -EINVAL;
>  	}

[Severity: High]
This isn't a bug introduced by this patch, but can software PMU events request
XMM registers and cause an out-of-bounds read on the kernel stack?

While x86_pmu_hw_config() correctly blocks PERF_REG_EXTENDED_MASK for
unsupported scenarios here, software events bypass x86_pmu_hw_config()
completely. The function perf_reg_validate() checks requested registers against
PERF_REG_X86_RESERVED, which fails to exclude bits 32-63 (XMM registers).

If a software event requests XMM registers via PERF_SAMPLE_REGS_USER, and the
event overflows, a standard pt_regs pointing to the top of the kernel stack
is passed to perf_reg_value().

When it uses container_of(regs, struct x86_perf_regs, regs) and reads
perf_regs->xmm_regs, does it read past the end of the pt_regs on the stack,
potentially leaking kernel memory or triggering a page fault?

I note this is addressed later in the patch series by commit "perf/x86: Enable
XMM register sampling for REGS_USER case", but wanted to mention it here.

[ ... ]

> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index 92cb9a716e83d..f5d458e3ba3f6 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c

[ ... ]

> @@ -6176,8 +6179,37 @@ static inline void __intel_update_large_pebs_flags(struct pmu *pmu)

[ ... ]

> +static void intel_extended_regs_init(struct pmu *pmu)
> +{
> +	struct pmu *dest_pmu = pmu ? pmu : x86_get_pmu(smp_processor_id());
> +
> +	/*
> +	 * Extend the vector registers support to non-PEBS.
> +	 * The feature is limited to newer Intel machines with
> +	 * PEBS V4+ or archPerfmonExt (0x23) enabled for now.
> +	 * In theory, the vector registers can be retrieved as
> +	 * long as the CPU supports. The support for the old
> +	 * generations may be added later if there is a
> +	 * requirement.
> +	 * Only support the extension when XSAVES is available.
> +	 */
> +	if (!boot_cpu_has(X86_FEATURE_XSAVES))
> +		return;
>  
> +	if (!boot_cpu_has(X86_FEATURE_XMM) ||
> +	    !cpu_has_xfeatures(XFEATURE_MASK_SSE, NULL))
> +		return;
> +
> +	/*
> +	 * On current hybrid platforms, P-cores and E-cores expose the same
> +	 * XSAVE feature set. Therefore, using the global x86_pmu.ext_regs_mask
> +	 * is sufficient to represent the hardware-supported XSAVE features.
> +	 */
> +	x86_pmu.ext_regs_mask |= XFEATURE_MASK_SSE;
> +	dest_pmu->capabilities |= PERF_PMU_CAP_EXTENDED_REGS;
> +}

[Severity: High]
Will this allow PEBS events on older platforms like Skylake to bypass capability
checks?

Before this change, PERF_PMU_CAP_EXTENDED_REGS was only set for CPUs with PEBS
format >= 4 (Icelake+). By introducing intel_extended_regs_init(), the
capability is now set globally for any CPU supporting XSAVES, including
Skylake.

Looking back at x86_pmu_hw_config(), if an event is a PEBS event
(precise_ip > 0), it passes the capability check because the PMU now advertises
it.

Since Skylake's PEBS v3 hardware cannot sample XMM registers, does this result
in a silent ABI regression where the kernel returns zeroes in the sample
payload instead of returning -EINVAL at event creation? Should
x86_pmu_hw_config() enforce that PEBS XMM requests specifically require PEBS
v4+ support?
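
A simplified model of the check being asked about, under stated
assumptions: struct pmu_model, struct event_model and model_hw_config()
are hypothetical, the PEBS format threshold of 4 is taken from the
paragraph above, and arch PEBS and the is_sampling_event() condition
are deliberately left out. It is a sketch of the idea, not the fix.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct pmu_model {
	bool cap_extended_regs;	/* PMU advertises extended regs */
	bool has_xsaves;	/* XSAVES available for the PMI path */
	unsigned int pebs_format;
};

struct event_model {
	bool wants_xmm;		/* extended bits set in a register mask */
	unsigned int precise_ip;	/* non-zero means a PEBS event */
};

/* Return 0 if the request can be honoured, -EINVAL otherwise. */
static inline int model_hw_config(const struct pmu_model *pmu,
				  const struct event_model *ev)
{
	if (!ev->wants_xmm)
		return 0;
	if (!pmu->cap_extended_regs)
		return -EINVAL;
	/* PEBS events: XMM comes from the record, so the format must carry it. */
	if (ev->precise_ip)
		return pmu->pebs_format >= 4 ? 0 : -EINVAL;
	/* Non-PEBS events: XMM is read with XSAVES in the PMI handler. */
	return pmu->has_xsaves ? 0 : -EINVAL;
}

/* A PEBS v3 style PMU with XSAVES must still reject PEBS XMM sampling. */
static inline void check_old_pebs_rejected(void)
{
	struct pmu_model pmu = { true, true, 3 };
	struct event_model ev = { true, 1 };

	assert(model_hw_config(&pmu, &ev) == -EINVAL);
}
```

In this model the same PMU still accepts a non-PEBS XMM request, which
is the behaviour the patch intends to add.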

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260529075645.580362-1-dapeng1.mi@linux.intel.com?part=10


* Re: [Patch v8 10/23] perf/x86: Enable XMM Register Sampling for Non-PEBS Events
  2026-05-29  7:56 ` [Patch v8 10/23] perf/x86: Enable XMM Register Sampling for Non-PEBS Events Dapeng Mi
  2026-05-29  9:02   ` sashiko-bot
@ 2026-05-29 11:38   ` Peter Zijlstra
  1 sibling, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2026-05-29 11:38 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang

On Fri, May 29, 2026 at 03:56:32PM +0800, Dapeng Mi wrote:
> Previously, XMM register sampling was only available for PEBS events
> starting from Icelake. Currently the support is now extended to non-PEBS
> events by utilizing the xsaves instruction, thereby completing the
> feature set.
> 
> To implement this, a 64-byte aligned buffer is required. A per-CPU
> ext_regs_buf is introduced to store SIMD and other registers, with an
> approximate size of 2K. The buffer is allocated using kzalloc_node(),
> ensuring natural and 64-byte alignment for all kmalloc() allocations
> with powers of 2.
> 
> XMM sampling for non-PEBS events is supported in the REGS_INTR case.
> Support for REGS_USER will be added in a subsequent patch. For PEBS
> events, XMM register sampling data is directly retrieved from PEBS
> records.
> 
> Future support for additional vector registers (YMM/ZMM/OPMASK) is
> planned. An `ext_regs_mask` is added to track the supported vector
> register groups.
> 
> Co-developed-by: Kan Liang <kan.liang@linux.intel.com>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>

I suspect Sashiko's last point is valid and using XMM sampling on older
PEBS will not do the right thing.

Creating PEBS events with XMM reg sampling should fail if the hardware
doesn't support it. That said, I could easily have missed a check for
this, this code is a bit of a maze :/


* [Patch v8 11/23] perf/x86: Enable XMM register sampling for REGS_USER case
  2026-05-29  7:56 [Patch v8 00/23] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (9 preceding siblings ...)
  2026-05-29  7:56 ` [Patch v8 10/23] perf/x86: Enable XMM Register Sampling for Non-PEBS Events Dapeng Mi
@ 2026-05-29  7:56 ` Dapeng Mi
  2026-05-29  9:24   ` sashiko-bot
  2026-05-29 11:42   ` Peter Zijlstra
  2026-05-29  7:56 ` [Patch v8 12/23] perf: Add sampling support for SIMD registers Dapeng Mi
                   ` (12 subsequent siblings)
  23 siblings, 2 replies; 44+ messages in thread
From: Dapeng Mi @ 2026-05-29  7:56 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Dapeng Mi, Kan Liang

Add support for XMM register sampling in the REGS_USER case.

To handle simultaneous sampling of XMM registers for both REGS_INTR and
REGS_USER cases, a per-CPU `x86_user_regs` is introduced to store
REGS_USER-specific XMM registers. This prevents REGS_USER-specific XMM
register data from being overwritten by REGS_INTR-specific data if they
share the same `x86_perf_regs` structure.

To sample user-space XMM registers, the `x86_pmu_update_user_xregs()`
helper function is added. It checks if the `TIF_NEED_FPU_LOAD` flag is
set. If so, the user-space XMM register data can be directly retrieved
from the cached task FPU state, as the corresponding hardware registers
have been cleared or switched to kernel-space data. Otherwise, the data
must be read from the hardware registers using the `xsaves` instruction.

For PEBS events, `x86_pmu_update_user_xregs()` checks if the PEBS-sampled
XMM register data belongs to user-space. If so, no further action is
needed. Otherwise, the user-space XMM register data needs to be
re-sampled using the same method as for non-PEBS events.
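
The source selection described above can be summarised as a small
decision function. This is a user-space sketch only: enum xmm_source,
struct sample_ctx and pick_user_xmm_source() are hypothetical names
that model the description in this message, not the code in the diff.

```c
#include <assert.h>
#include <stdbool.h>

enum xmm_source {
	XMM_NONE,		/* no user registers to report */
	XMM_FROM_PEBS,		/* PEBS record already holds user data */
	XMM_FROM_FPSTATE,	/* cached task FPU state */
	XMM_FROM_XSAVES,	/* read live hardware registers */
};

struct sample_ctx {
	bool has_user_regs;	/* regs_user ABI is not ABI_NONE */
	bool pebs_has_xmm;	/* PEBS record carried XMM registers */
	bool pebs_hit_user;	/* PEBS sample was taken in user space */
	bool need_fpu_load;	/* models TIF_NEED_FPU_LOAD */
};

static inline enum xmm_source
pick_user_xmm_source(const struct sample_ctx *ctx)
{
	if (!ctx->has_user_regs)
		return XMM_NONE;
	/* PEBS data that belongs to user space needs no further action. */
	if (ctx->pebs_has_xmm && ctx->pebs_hit_user)
		return XMM_FROM_PEBS;
	/* Hardware registers no longer hold user data; use the cache. */
	if (ctx->need_fpu_load)
		return XMM_FROM_FPSTATE;
	return XMM_FROM_XSAVES;
}

/* A PEBS sample that hit kernel space falls back to the non-PEBS path. */
static inline void check_pebs_kernel_hit(void)
{
	struct sample_ctx ctx = { true, true, false, true };

	assert(pick_user_xmm_source(&ctx) == XMM_FROM_FPSTATE);
}
```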

Co-developed-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/core.c       | 150 ++++++++++++++++++++++++++++++-----
 arch/x86/events/intel/core.c |   6 +-
 arch/x86/events/intel/ds.c   |   5 +-
 3 files changed, 138 insertions(+), 23 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index c219a563434d..f9e3f349b69a 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -707,12 +707,12 @@ int x86_pmu_hw_config(struct perf_event *event)
 			return -EINVAL;
 	}
 
-	if (event->attr.sample_type & PERF_SAMPLE_REGS_INTR) {
+	if (event->attr.sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)) {
 		/*
 		 * Besides the general purpose registers, XMM registers may
 		 * be collected as well.
 		 */
-		if (event->attr.sample_regs_intr & PERF_REG_EXTENDED_MASK) {
+		if (event_has_extended_regs(event)) {
 			if (!(event->pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS))
 				return -EINVAL;
 			if (is_sampling_event(event) && !event->attr.precise_ip &&
@@ -721,15 +721,6 @@ int x86_pmu_hw_config(struct perf_event *event)
 		}
 	}
 
-	if (event->attr.sample_type & PERF_SAMPLE_REGS_USER) {
-		/*
-		 * Currently XMM registers sampling for REGS_USER is not
-		 * supported yet.
-		 */
-		if (event->attr.sample_regs_user & PERF_REG_EXTENDED_MASK)
-			return -EINVAL;
-	}
-
 	return x86_setup_perfctr(event);
 }
 
@@ -1812,33 +1803,155 @@ static void x86_pmu_update_regs_intr(struct perf_event *event,
 	data->sample_flags |= PERF_SAMPLE_REGS_INTR;
 }
 
+/*
+ * When both PERF_SAMPLE_REGS_INTR and PERF_SAMPLE_REGS_USER are set,
+ * an additional x86_perf_regs is required to save user-space registers.
+ * Without this, user-space register data may be overwritten by kernel-space
+ * registers.
+ */
+static DEFINE_PER_CPU(struct x86_perf_regs, x86_user_regs);
+static void x86_pmu_get_regs_user(struct perf_sample_data *data,
+				  struct pt_regs *regs)
+{
+	struct x86_perf_regs *x86_regs_user = this_cpu_ptr(&x86_user_regs);
+	struct perf_regs regs_user;
+
+	x86_pmu_clear_perf_regs(&x86_regs_user->regs);
+
+	perf_get_regs_user(&regs_user, regs);
+	data->regs_user.abi = regs_user.abi;
+	if (regs_user.regs) {
+		x86_regs_user->regs = *regs_user.regs;
+		data->regs_user.regs = &x86_regs_user->regs;
+	} else
+		data->regs_user.regs = NULL;
+}
+
+/*
+ * The x86 specific variant of perf_sample_regs_user().
+ * Update data->regs_user fields for extended registers (e.g., SIMD).
+ */
+static void x86_pmu_update_regs_user(struct perf_event *event,
+				     struct perf_sample_data *data,
+				     struct pt_regs *regs)
+{
+	struct perf_event_attr *attr = &event->attr;
+
+	if (user_mode(regs)) {
+		data->regs_user.abi = perf_reg_abi(current);
+		data->regs_user.regs = regs;
+	} else if (is_user_task(current)) {
+		/*
+		 * There is no guarantee that the kernel never
+		 * touches registers outside of pt_regs,
+		 * especially as more and more registers
+		 * (e.g., SIMD, eGPR) are added. The live data
+		 * cannot be used.
+		 */
+		x86_pmu_get_regs_user(data, regs);
+	} else {
+		data->regs_user.abi = PERF_SAMPLE_REGS_ABI_NONE;
+		data->regs_user.regs = NULL;
+	}
+
+	data->dyn_size += sizeof(u64);
+	if (data->regs_user.regs)
+		data->dyn_size += hweight64(attr->sample_regs_user) * sizeof(u64);
+
+	/*
+	 * Set PERF_SAMPLE_REGS_USER to bypass perf_sample_regs_user() call
+	 * in perf_prepare_sample() function.
+	 */
+	data->sample_flags |= PERF_SAMPLE_REGS_USER;
+}
+
+/*
+ * This function retrieves cached user-space fpu registers (XMM/YMM/ZMM).
+ * If TIF_NEED_FPU_LOAD is set, it indicates that the user-space FPU state
+ * is cached. Otherwise, the data should be read directly from the hardware
+ * registers.
+ */
+static inline u64 x86_pmu_update_user_xregs(struct perf_sample_data *data,
+					    struct pt_regs *regs,
+					    u64 mask, u64 ignore_mask)
+{
+	struct x86_perf_regs *perf_regs;
+	struct xregs_state *xsave;
+	struct fpu *fpu;
+	struct fpstate *fps;
+	u64 user_mask = mask;
+
+	if (data->regs_user.abi == PERF_SAMPLE_REGS_ABI_NONE)
+		return 0;
+
+	/*
+	 * If PEBS hits kernel space, need to re-sample extended
+	 * registers for user space.
+	 */
+	if (user_mode(regs))
+		user_mask &= ~ignore_mask;
+
+	if (user_mask && test_thread_flag(TIF_NEED_FPU_LOAD)) {
+		perf_regs = container_of(data->regs_user.regs,
+				 struct x86_perf_regs, regs);
+		fpu = x86_task_fpu(current);
+		/*
+		 * If __task_fpstate is set, it holds the right pointer,
+		 * otherwise fpstate will.
+		 */
+		fps = READ_ONCE(fpu->__task_fpstate);
+		if (!fps)
+			fps = fpu->fpstate;
+		xsave = &fps->regs.xsave;
+
+		update_perf_regs(perf_regs, xsave, user_mask);
+		return 0;
+	}
+
+	return user_mask;
+}
+
 static void x86_pmu_sample_xregs(struct perf_event *event,
 				 struct perf_sample_data *data,
+				 struct pt_regs *regs,
 				 u64 ignore_mask)
 {
 	struct xregs_state *xsave = get_ext_regs_buf(smp_processor_id());
 	u64 sample_type = event->attr.sample_type;
 	struct x86_perf_regs *perf_regs;
+	u64 user_mask = 0;
 	u64 intr_mask = 0;
 	u64 mask = 0;
 
 	if (WARN_ON_ONCE(!xsave) || !in_nmi())
 		return;
 
-	if ((sample_type & PERF_SAMPLE_REGS_INTR) &&
-	    (event->attr.sample_regs_intr & PERF_REG_EXTENDED_MASK))
+	if (event_has_extended_regs(event))
 		mask |= XFEATURE_MASK_SSE;
 
 	mask &= x86_pmu.ext_regs_mask;
+	if (sample_type & PERF_SAMPLE_REGS_USER) {
+		user_mask = x86_pmu_update_user_xregs(data, regs,
+						      mask, ignore_mask);
+	}
 
 	if (sample_type & PERF_SAMPLE_REGS_INTR)
 		intr_mask = mask & ~ignore_mask;
 
+	if (user_mask | intr_mask) {
+		xsave->header.xfeatures = 0;
+		xsaves_nmi(xsave, user_mask | intr_mask);
+	}
+
+	if (user_mask) {
+		perf_regs = container_of(data->regs_user.regs,
+					 struct x86_perf_regs, regs);
+		update_perf_regs(perf_regs, xsave, user_mask);
+	}
+
 	if (intr_mask) {
 		perf_regs = container_of(data->regs_intr.regs,
 					 struct x86_perf_regs, regs);
-		xsave->header.xfeatures = 0;
-		xsaves_nmi(xsave, mask);
 		update_perf_regs(perf_regs, xsave, intr_mask);
 	}
 }
@@ -1850,18 +1963,19 @@ void x86_pmu_update_perf_regs(struct perf_event *event,
 {
 	u64 sample_type = event->attr.sample_type;
 
-	if (!((sample_type & PERF_SAMPLE_REGS_INTR) &&
-	      (event->attr.sample_regs_intr & PERF_REG_EXTENDED_MASK)))
+	if (!event_has_extended_regs(event))
 		return;
 
 	if (sample_type & PERF_SAMPLE_REGS_INTR)
 		x86_pmu_update_regs_intr(event, data, regs);
+	if (sample_type & PERF_SAMPLE_REGS_USER)
+		x86_pmu_update_regs_user(event, data, regs);
 
 	/*
 	 * ignore_mask indicates the PEBS sampled extended regs
 	 * which are unnecessary to sample again.
 	 */
-	x86_pmu_sample_xregs(event, data, ignore_mask);
+	x86_pmu_sample_xregs(event, data, regs, ignore_mask);
 }
 
 int x86_pmu_handle_irq(struct pt_regs *regs)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index f5d458e3ba3f..6c06558c416f 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4698,15 +4698,15 @@ static void intel_pebs_aliases_skl(struct perf_event *event)
 static unsigned long intel_pmu_large_pebs_flags(struct perf_event *event)
 {
 	unsigned long flags = x86_pmu.large_pebs_flags;
+	u64 gprs_mask = PEBS_GP_REGS | PERF_REG_EXTENDED_MASK;
 
 	if (event->attr.use_clockid)
 		flags &= ~PERF_SAMPLE_TIME;
 	if (!event->attr.exclude_kernel)
 		flags &= ~PERF_SAMPLE_REGS_USER;
-	if (event->attr.sample_regs_user & ~PEBS_GP_REGS)
+	if (event->attr.sample_regs_user & ~gprs_mask)
 		flags &= ~PERF_SAMPLE_REGS_USER;
-	if (event->attr.sample_regs_intr &
-	    ~(PEBS_GP_REGS | PERF_REG_EXTENDED_MASK))
+	if (event->attr.sample_regs_intr & ~gprs_mask)
 		flags &= ~PERF_SAMPLE_REGS_INTR;
 	return flags;
 }
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 4f72ce6a9585..bd43bf26e6bf 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1749,8 +1749,7 @@ static u64 pebs_update_adaptive_cfg(struct perf_event *event)
 	if (gprs || (attr->precise_ip < 2) || tsx_weight)
 		pebs_data_cfg |= PEBS_DATACFG_GP;
 
-	if ((sample_type & PERF_SAMPLE_REGS_INTR) &&
-	    (attr->sample_regs_intr & PERF_REG_EXTENDED_MASK))
+	if (event_has_extended_regs(event))
 		pebs_data_cfg |= PEBS_DATACFG_XMMS;
 
 	if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
@@ -2957,6 +2956,8 @@ __intel_pmu_pebs_events(struct perf_event *event,
 	void *at = get_next_pebs_record_by_bit(base, top, bit);
 	int cnt = count;
 
+	x86_pmu_clear_perf_regs(regs);
+
 	if (!iregs)
 		iregs = &dummy_iregs;
 
-- 
2.34.1



* Re: [Patch v8 11/23] perf/x86: Enable XMM register sampling for REGS_USER case
  2026-05-29  7:56 ` [Patch v8 11/23] perf/x86: Enable XMM register sampling for REGS_USER case Dapeng Mi
@ 2026-05-29  9:24   ` sashiko-bot
  2026-05-29 11:42   ` Peter Zijlstra
  1 sibling, 0 replies; 44+ messages in thread
From: sashiko-bot @ 2026-05-29  9:24 UTC (permalink / raw)
  To: Dapeng Mi; +Cc: linux-perf-users

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [High] Calling `xsaves_nmi()` for PEBS events on CPUs lacking `X86_FEATURE_XSAVES` causes an invalid opcode exception (`#UD`), triggering warning loops and overhead.
- [High] Guest XMM registers are incorrectly sampled as host REGS_USER XMM registers when a PMU interrupt occurs during KVM guest execution.
--

commit feabaffcb00c51c951e3f0889a9ad615fe731d5e
Author: Dapeng Mi <dapeng1.mi@linux.intel.com>

perf/x86: Enable XMM register sampling for REGS_USER case

This patch adds support for XMM register sampling in the REGS_USER case.
To handle simultaneous sampling of XMM registers, a per-CPU x86_user_regs
is introduced to store REGS_USER-specific XMM registers. The
x86_pmu_update_user_xregs helper function is added to retrieve user-space
XMM registers.

> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -707,12 +707,12 @@ int x86_pmu_hw_config(struct perf_event *event)
>  			return -EINVAL;
>  	}
>  
> -	if (event->attr.sample_type & PERF_SAMPLE_REGS_INTR) {
> +	if (event->attr.sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)) {
>  		/*
>  		 * Besides the general purpose registers, XMM registers may
>  		 * be collected as well.
>  		 */
> -		if (event->attr.sample_regs_intr & PERF_REG_EXTENDED_MASK) {
> +		if (event_has_extended_regs(event)) {
>  			if (!(event->pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS))
>  				return -EINVAL;
>  			if (is_sampling_event(event) && !event->attr.precise_ip &&
>  			    !this_cpu_has(X86_FEATURE_XSAVES))
>  				return -EINVAL;
>  		}
>  	}
[ ... ]
> @@ -1812,33 +1803,155 @@ static void x86_pmu_update_regs_intr(struct perf_event *event,
[ ... ]
> +static inline u64 x86_pmu_update_user_xregs(struct perf_sample_data *data,
> +					    struct pt_regs *regs,
> +					    u64 mask, u64 ignore_mask)
> +{
> +	struct x86_perf_regs *perf_regs;
> +	struct xregs_state *xsave;
> +	struct fpu *fpu;
> +	struct fpstate *fps;
> +	u64 user_mask = mask;
> +
> +	if (data->regs_user.abi == PERF_SAMPLE_REGS_ABI_NONE)
> +		return 0;
> +
> +	/*
> +	 * If PEBS hits kernel space, need to re-sample extended
> +	 * registers for user space.
> +	 */
> +	if (user_mode(regs))
> +		user_mask &= ~ignore_mask;
> +
> +	if (user_mask && test_thread_flag(TIF_NEED_FPU_LOAD)) {

[Severity: High]
When KVM loads a guest's FPU state, fpu_swap_kvm_fpstate() caches the host's
state in fpu->__task_fpstate and explicitly clears the TIF_NEED_FPU_LOAD flag
via fpregs_mark_activate().

If a perf NMI interrupts the guest, this check for TIF_NEED_FPU_LOAD will
evaluate to false, causing x86_pmu_update_user_xregs() to return the mask
instead of reading the cached host state from __task_fpstate.

Does this cause xsaves_nmi() to sample the hardware registers directly, which
currently hold the KVM guest's XMM registers, and expose them to the host's
user-space perf record, leading to cross-VM data leaks?

> +		perf_regs = container_of(data->regs_user.regs,
> +				 struct x86_perf_regs, regs);
> +		fpu = x86_task_fpu(current);
> +		/*
> +		 * If __task_fpstate is set, it holds the right pointer,
> +		 * otherwise fpstate will.
> +		 */
> +		fps = READ_ONCE(fpu->__task_fpstate);
> +		if (!fps)
> +			fps = fpu->fpstate;
> +		xsave = &fps->regs.xsave;
> +
> +		update_perf_regs(perf_regs, xsave, user_mask);
> +		return 0;
> +	}
> +
> +	return user_mask;
> +}
[ ... ]
>  static void x86_pmu_sample_xregs(struct perf_event *event,
>  				 struct perf_sample_data *data,
> +				 struct pt_regs *regs,
>  				 u64 ignore_mask)
>  {
>  	struct xregs_state *xsave = get_ext_regs_buf(smp_processor_id());
>  	u64 sample_type = event->attr.sample_type;
>  	struct x86_perf_regs *perf_regs;
> +	u64 user_mask = 0;
>  	u64 intr_mask = 0;
>  	u64 mask = 0;
>  
>  	if (WARN_ON_ONCE(!xsave) || !in_nmi())
>  		return;
>  
> -	if ((sample_type & PERF_SAMPLE_REGS_INTR) &&
> -	    (event->attr.sample_regs_intr & PERF_REG_EXTENDED_MASK))
> +	if (event_has_extended_regs(event))
>  		mask |= XFEATURE_MASK_SSE;
>  
>  	mask &= x86_pmu.ext_regs_mask;
> +	if (sample_type & PERF_SAMPLE_REGS_USER) {
> +		user_mask = x86_pmu_update_user_xregs(data, regs,
> +						      mask, ignore_mask);
> +	}
>  
>  	if (sample_type & PERF_SAMPLE_REGS_INTR)
>  		intr_mask = mask & ~ignore_mask;
>  
> +	if (user_mask | intr_mask) {
> +		xsave->header.xfeatures = 0;
> +		xsaves_nmi(xsave, user_mask | intr_mask);
> +	}

[Severity: High]
For PEBS events with PERF_SAMPLE_REGS_USER, the X86_FEATURE_XSAVES check in
x86_pmu_hw_config() is bypassed due to event->attr.precise_ip being set.

If this event triggers in kernel space on a CPU lacking XSAVES (for example,
booted with noxsaves or masked in a VM), user_mode(regs) is false, so
user_mask retains the extended bits from x86_pmu_update_user_xregs().

Since xsaves_nmi() unconditionally emits the xsaves instruction, will this
trigger an Invalid Opcode exception on CPUs without XSAVES support and add
overhead to every PEBS interrupt in kernel space?

> +
> +	if (user_mask) {
> +		perf_regs = container_of(data->regs_user.regs,
> +					 struct x86_perf_regs, regs);
> +		update_perf_regs(perf_regs, xsave, user_mask);
> +	}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260529075645.580362-1-dapeng1.mi@linux.intel.com?part=11


* Re: [Patch v8 11/23] perf/x86: Enable XMM register sampling for REGS_USER case
  2026-05-29  7:56 ` [Patch v8 11/23] perf/x86: Enable XMM register sampling for REGS_USER case Dapeng Mi
  2026-05-29  9:24   ` sashiko-bot
@ 2026-05-29 11:42   ` Peter Zijlstra
  1 sibling, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2026-05-29 11:42 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang

On Fri, May 29, 2026 at 03:56:33PM +0800, Dapeng Mi wrote:
> This patch adds support for XMM register sampling in the REGS_USER case.
> 
> To handle simultaneous sampling of XMM registers for both REGS_INTR and
> REGS_USER cases, a per-CPU `x86_user_regs` is introduced to store
> REGS_USER-specific XMM registers. This prevents REGS_USER-specific XMM
> register data from being overwritten by REGS_INTR-specific data if they
> share the same `x86_perf_regs` structure.
> 
> To sample user-space XMM registers, the `x86_pmu_update_user_xregs()`
> helper function is added. It checks if the `TIF_NEED_FPU_LOAD` flag is
> set. If so, the user-space XMM register data can be directly retrieved
> from the cached task FPU state, as the corresponding hardware registers
> have been cleared or switched to kernel-space data. Otherwise, the data
> must be read from the hardware registers using the `xsaves` instruction.
> 
> For PEBS events, `x86_pmu_update_user_xregs()` checks if the PEBS-sampled
> XMM register data belongs to user-space. If so, no further action is
> needed. Otherwise, the user-space XMM register data needs to be
> re-sampled using the same method as for non-PEBS events.
> 
> Co-developed-by: Kan Liang <kan.liang@linux.intel.com>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>

Sashiko has fun comments; I don't think we care about the cross-VM data
leak, since that's no worse than what we already have with the regular
regs. In fact, it might be considered correct behaviour ;-)

It does have a point about noxsaves, or XSAVES being masked by a VM.


* [Patch v8 12/23] perf: Add sampling support for SIMD registers
  2026-05-29  7:56 [Patch v8 00/23] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (10 preceding siblings ...)
  2026-05-29  7:56 ` [Patch v8 11/23] perf/x86: Enable XMM register sampling for REGS_USER case Dapeng Mi
@ 2026-05-29  7:56 ` Dapeng Mi
  2026-05-29  8:36   ` sashiko-bot
  2026-05-29  7:56 ` [Patch v8 13/23] perf/x86: Support XMM sampling using sample_simd_vec_reg_* fields Dapeng Mi
                   ` (11 subsequent siblings)
  23 siblings, 1 reply; 44+ messages in thread
From: Dapeng Mi @ 2026-05-29  7:56 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

Users may be interested in sampling SIMD registers during profiling.
The current sample_regs_* fields do not have sufficient space for all
SIMD registers.

To address this, new attribute fields sample_simd_{pred|vec}_reg_* are
added to struct perf_event_attr to represent the SIMD registers that are
expected to be sampled.

Currently, the perf/x86 code supports XMM registers in sample_regs_*.
To unify the configuration of SIMD registers and ensure a consistent
method for configuring XMM and other SIMD registers, a new event
attribute field, sample_simd_regs_enabled, is introduced. When
sample_simd_regs_enabled is set, it indicates that all SIMD registers,
including XMM, will be represented by the newly introduced
sample_simd_{pred|vec}_reg_* fields. The original XMM space in
sample_regs_* is reserved for future use.

Since SIMD registers are wider than 64 bits, a new output format is
introduced. The number and width of SIMD registers are dumped first,
followed by the register values. The number and width are based on the
user's configuration.
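
As a rough illustration of the layout described above, the sketch below
walks the SIMD block from the consumer side. It is not part of this patch.
The names simd_block, parse_simd_block() and simd_block_example() are
hypothetical, and the code assumes the block is exactly four u64 counters
followed by the vector data and then the predicate data, as documented in
the uapi comment added by this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parsed view of the SIMD block; not a kernel or perf tool type. */
struct simd_block {
	uint64_t nr_vectors;
	uint64_t vector_qwords;
	uint64_t nr_pred;
	uint64_t pred_qwords;
	const uint64_t *vec_data;	/* nr_vectors * vector_qwords entries */
	const uint64_t *pred_data;	/* nr_pred * pred_qwords entries */
};

/*
 * Parse a SIMD block from buf, which holds len u64 words.
 * Returns the number of u64 words consumed, or 0 if the block is
 * truncated or the counters are implausible.
 */
static size_t parse_simd_block(const uint64_t *buf, size_t len,
			       struct simd_block *out)
{
	uint64_t vec_words, pred_words;

	if (len < 4)
		return 0;

	out->nr_vectors = buf[0];
	out->vector_qwords = buf[1];
	out->nr_pred = buf[2];
	out->pred_qwords = buf[3];

	/* Bounds implied by the attr field widths: u64/u32 masks, u16 qwords. */
	if (out->nr_vectors > 64 || out->nr_pred > 32 ||
	    out->vector_qwords > UINT16_MAX || out->pred_qwords > UINT16_MAX)
		return 0;

	vec_words = out->nr_vectors * out->vector_qwords;
	pred_words = out->nr_pred * out->pred_qwords;
	if (vec_words + pred_words > len - 4)
		return 0;

	out->vec_data = buf + 4;
	out->pred_data = out->vec_data + vec_words;
	return 4 + (size_t)(vec_words + pred_words);
}

/* Self-check: two 2-qword vectors followed by one 1-qword predicate. */
static void simd_block_example(void)
{
	const uint64_t buf[] = { 2, 2, 1, 1, 10, 11, 20, 21, 99 };
	struct simd_block b;

	assert(parse_simd_block(buf, 9, &b) == 9);
	assert(b.vec_data[2] == 20);	/* first qword of the second vector */
	assert(b.pred_data[0] == 99);
}
```

Because the counts and widths travel with each sample, a reader following
this layout does not need to consult the event attributes to skip over or
decode the block.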

A new ABI flag, PERF_SAMPLE_REGS_ABI_SIMD, is added to indicate the new
format. The enum perf_sample_regs_abi is now a bitmap. This change should
not impact existing tools, as PERF_SAMPLE_REGS_ABI_32 and
PERF_SAMPLE_REGS_ABI_64 keep their existing values of 1 and 2.
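
The compatibility argument can be checked with a small sketch. The
SKETCH_ABI_* constants below mirror the enum values from this patch under
local names, and abi_base(), abi_has_simd() and abi_example() are
hypothetical helpers, not existing perf tool functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local mirror of enum perf_sample_regs_abi as changed by this patch. */
enum {
	SKETCH_ABI_NONE	= 0,
	SKETCH_ABI_32	= 1 << 0,
	SKETCH_ABI_64	= 1 << 1,
	SKETCH_ABI_SIMD	= 1 << 2,
};

/* Hypothetical helper: strip the SIMD flag to recover the 32/64-bit ABI. */
static uint64_t abi_base(uint64_t abi)
{
	return abi & (SKETCH_ABI_32 | SKETCH_ABI_64);
}

/* Hypothetical helper: true if a SIMD block follows regs[]. */
static bool abi_has_simd(uint64_t abi)
{
	return (abi & SKETCH_ABI_SIMD) != 0;
}

/* Self-check of the encoding. */
static void abi_example(void)
{
	/* The bit-shift spelling yields the same numbers as before. */
	assert(SKETCH_ABI_32 == 1 && SKETCH_ABI_64 == 2);

	/* A 64-bit sample with SIMD data carries abi == 6. */
	assert(abi_base(SKETCH_ABI_64 | SKETCH_ABI_SIMD) == SKETCH_ABI_64);
	assert(abi_has_simd(SKETCH_ABI_64 | SKETCH_ABI_SIMD));
	assert(!abi_has_simd(SKETCH_ABI_64));
}
```

A tool that compares abi for equality against 1 or 2 would only see a
different value when the SIMD flag is set, which in this patch happens
only for events that set sample_simd_regs_enabled.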

Additionally, two new __weak functions are introduced:
- perf_simd_reg_value(): Retrieves the value of the requested SIMD
  register.
- perf_simd_reg_validate(): Validates the configuration of the SIMD
  registers.

A new flag, PERF_PMU_CAP_SIMD_REGS, is added to indicate that the PMU
supports SIMD register dumping. An error is generated if
sample_simd_{pred|vec}_reg_* is mistakenly set for a PMU that does not
support this capability.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 include/linux/perf_event.h      |   8 ++
 include/linux/perf_regs.h       |   6 ++
 include/uapi/linux/perf_event.h |  49 +++++++++-
 kernel/events/core.c            | 153 +++++++++++++++++++++++++++++---
 4 files changed, 202 insertions(+), 14 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index fb38affa7352..5f0642ef4fd2 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -306,6 +306,7 @@ struct perf_event_pmu_context;
 #define PERF_PMU_CAP_AUX_PAUSE		0x0200
 #define PERF_PMU_CAP_AUX_PREFER_LARGE	0x0400
 #define PERF_PMU_CAP_MEDIATED_VPMU	0x0800
+#define PERF_PMU_CAP_SIMD_REGS		0x1000
 
 /**
  * pmu::scope
@@ -1534,6 +1535,13 @@ perf_event__output_id_sample(struct perf_event *event,
 extern void
 perf_log_lost_samples(struct perf_event *event, u64 lost);
 
+static inline bool event_has_simd_regs(struct perf_event *event)
+{
+	struct perf_event_attr *attr = &event->attr;
+
+	return attr->sample_simd_regs_enabled != 0;
+}
+
 static inline bool event_has_extended_regs(struct perf_event *event)
 {
 	struct perf_event_attr *attr = &event->attr;
diff --git a/include/linux/perf_regs.h b/include/linux/perf_regs.h
index 144bcc3ff19f..8fa3eeb14953 100644
--- a/include/linux/perf_regs.h
+++ b/include/linux/perf_regs.h
@@ -14,6 +14,12 @@ int perf_reg_validate(u64 mask);
 u64 perf_reg_abi(struct task_struct *task);
 void perf_get_regs_user(struct perf_regs *regs_user,
 			struct pt_regs *regs);
+int perf_simd_reg_validate(u16 simd_enabled, u16 vec_qwords,
+			   u64 vec_mask_intr, u64 vec_mask_user,
+			   u16 pred_qwords, u32 pred_mask_intr,
+			   u32 pred_mask_user);
+u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
+			u16 qwords_idx, bool pred);
 
 #ifdef CONFIG_HAVE_PERF_REGS
 #include <asm/perf_regs.h>
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index fd10aa8d697f..c49fc76292f7 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -314,8 +314,9 @@ enum {
  */
 enum perf_sample_regs_abi {
 	PERF_SAMPLE_REGS_ABI_NONE		= 0,
-	PERF_SAMPLE_REGS_ABI_32			= 1,
-	PERF_SAMPLE_REGS_ABI_64			= 2,
+	PERF_SAMPLE_REGS_ABI_32			= (1 << 0),
+	PERF_SAMPLE_REGS_ABI_64			= (1 << 1),
+	PERF_SAMPLE_REGS_ABI_SIMD		= (1 << 2),
 };
 
 /*
@@ -383,6 +384,7 @@ enum perf_event_read_format {
 #define PERF_ATTR_SIZE_VER7			128	/* Add: sig_data */
 #define PERF_ATTR_SIZE_VER8			136	/* Add: config3 */
 #define PERF_ATTR_SIZE_VER9			144	/* add: config4 */
+#define PERF_ATTR_SIZE_VER10			176	/* Add: sample_simd_{vec|pred}_reg_* */
 
 /*
  * 'struct perf_event_attr' contains various attributes that define
@@ -547,6 +549,29 @@ struct perf_event_attr {
 
 	__u64	config3; /* extension of config2 */
 	__u64	config4; /* extension of config3 */
+
+	/*
+	 * Defines the sampling SIMD/PRED(predicate) registers bitmap and
+	 * qwords (8 bytes) length.
+	 *
+	 * sample_simd_regs_enabled != 0 indicates there are SIMD/PRED
+	 * registers to be sampled, the SIMD/PRED registers bitmap and
+	 * qwords length are represented in
+	 * sample_simd_{vec|pred}_reg_{intr|user} and
+	 * sample_simd_{vec|pred}_reg_qwords fields separately.
+	 *
+	 * sample_simd_regs_enabled == 0 indicates no SIMD/PRED registers
+	 * are sampled.
+	 */
+	__u16	sample_simd_regs_enabled;
+	__u16	sample_simd_pred_reg_qwords;
+	__u16	sample_simd_vec_reg_qwords;
+	__u16	__reserved_4;
+
+	__u32	sample_simd_pred_reg_intr;
+	__u32	sample_simd_pred_reg_user;
+	__u64	sample_simd_vec_reg_intr;
+	__u64	sample_simd_vec_reg_user;
 };
 
 /*
@@ -1020,7 +1045,15 @@ enum perf_event_type {
 	 *      } && PERF_SAMPLE_BRANCH_STACK
 	 *
 	 *	{ u64			abi; # enum perf_sample_regs_abi
-	 *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
+	 *	  u64			regs[weight(mask)];
+	 *	  struct {
+	 *		u64 nr_vectors;		# 0 ... weight(sample_simd_vec_reg_user)
+	 *		u64 vector_qwords;	# 0 ... sample_simd_vec_reg_qwords
+	 *		u64 nr_pred;		# 0 ... weight(sample_simd_pred_reg_user)
+	 *		u64 pred_qwords;	# 0 ... sample_simd_pred_reg_qwords
+	 *		u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
+	 *	  } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
+	 *	} && PERF_SAMPLE_REGS_USER
 	 *
 	 *	{ u64			size;
 	 *	  char			data[size];
@@ -1047,7 +1080,15 @@ enum perf_event_type {
 	 *	{ u64			data_src; } && PERF_SAMPLE_DATA_SRC
 	 *	{ u64			transaction; } && PERF_SAMPLE_TRANSACTION
 	 *	{ u64			abi; # enum perf_sample_regs_abi
-	 *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
+	 *	  u64			regs[weight(mask)];
+	 *	  struct {
+	 *		u64 nr_vectors;		# 0 ... weight(sample_simd_vec_reg_intr)
+	 *		u64 vector_qwords;	# 0 ... sample_simd_vec_reg_qwords
+	 *		u64 nr_pred;		# 0 ... weight(sample_simd_pred_reg_intr)
+	 *		u64 pred_qwords;	# 0 ... sample_simd_pred_reg_qwords
+	 *		u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
+	 *	  } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
+	 *	} && PERF_SAMPLE_REGS_INTR
 	 *	{ u64			phys_addr;} && PERF_SAMPLE_PHYS_ADDR
 	 *	{ u64			cgroup;} && PERF_SAMPLE_CGROUP
 	 *	{ u64			data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2ce553db4dcb..94bb034da9b9 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7791,22 +7791,60 @@ void __weak perf_get_regs_user(struct perf_regs *regs_user,
 	regs_user->abi = perf_reg_abi(current);
 }
 
+#define word_for_each_set_bit(bit, val)			\
+	for (unsigned long long __v = (val);		\
+	     __v && ((bit = __builtin_ctzll(__v)), 1);	\
+	     __v &= __v - 1)
+
 static void
 perf_output_sample_regs(struct perf_output_handle *handle,
 			struct pt_regs *regs, u64 mask)
 {
 	int bit;
-	DECLARE_BITMAP(_mask, 64);
-
-	bitmap_from_u64(_mask, mask);
-	for_each_set_bit(bit, _mask, sizeof(mask) * BITS_PER_BYTE) {
-		u64 val;
 
-		val = perf_reg_value(regs, bit);
+	word_for_each_set_bit(bit, mask) {
+		u64 val = perf_reg_value(regs, bit);
 		perf_output_put(handle, val);
 	}
 }
 
+static void
+perf_output_sample_simd_regs(struct perf_output_handle *handle,
+			     struct perf_event *event,
+			     struct pt_regs *regs,
+			     u64 mask, u32 pred_mask)
+{
+	u64 pred_qwords = event->attr.sample_simd_pred_reg_qwords;
+	u64 vec_qwords = event->attr.sample_simd_vec_reg_qwords;
+	u64 nr_vectors = hweight64(mask);
+	u64 nr_pred = hweight32(pred_mask);
+	int bit;
+
+	perf_output_put(handle, nr_vectors);
+	perf_output_put(handle, vec_qwords);
+	perf_output_put(handle, nr_pred);
+	perf_output_put(handle, pred_qwords);
+
+	if (nr_vectors) {
+		word_for_each_set_bit(bit, mask) {
+			for (int i = 0; i < vec_qwords; i++) {
+				u64 val = perf_simd_reg_value(regs, bit,
+							      i, false);
+				perf_output_put(handle, val);
+			}
+		}
+	}
+	if (nr_pred) {
+		word_for_each_set_bit(bit, pred_mask) {
+			for (int i = 0; i < pred_qwords; i++) {
+				u64 val = perf_simd_reg_value(regs, bit,
+							      i, true);
+				perf_output_put(handle, val);
+			}
+		}
+	}
+}
+
 static void perf_sample_regs_user(struct perf_regs *regs_user,
 				  struct pt_regs *regs)
 {
@@ -7828,6 +7866,22 @@ static void perf_sample_regs_intr(struct perf_regs *regs_intr,
 	regs_intr->abi  = perf_reg_abi(current);
 }
 
+int __weak perf_simd_reg_validate(u16 simd_enabled, u16 vec_qwords,
+				  u64 vec_mask_intr, u64 vec_mask_user,
+				  u16 pred_qwords, u32 pred_mask_intr,
+				  u32 pred_mask_user)
+{
+	return simd_enabled ||
+	       vec_qwords || vec_mask_intr || vec_mask_user ||
+	       pred_qwords || pred_mask_intr || pred_mask_user ?
+	       -EINVAL : 0;
+}
+
+u64 __weak perf_simd_reg_value(struct pt_regs *regs, int idx,
+			       u16 qwords_idx, bool pred)
+{
+	return 0;
+}
 
 /*
  * Get remaining task size from user stack pointer.
@@ -8358,10 +8412,17 @@ void perf_output_sample(struct perf_output_handle *handle,
 		perf_output_put(handle, abi);
 
 		if (abi) {
-			u64 mask = event->attr.sample_regs_user;
+			struct perf_event_attr *attr = &event->attr;
+			u64 mask = attr->sample_regs_user;
 			perf_output_sample_regs(handle,
 						data->regs_user.regs,
 						mask);
+			if (abi & PERF_SAMPLE_REGS_ABI_SIMD) {
+				perf_output_sample_simd_regs(handle, event,
+							     data->regs_user.regs,
+							     attr->sample_simd_vec_reg_user,
+							     attr->sample_simd_pred_reg_user);
+			}
 		}
 	}
 
@@ -8389,11 +8450,18 @@ void perf_output_sample(struct perf_output_handle *handle,
 		perf_output_put(handle, abi);
 
 		if (abi) {
-			u64 mask = event->attr.sample_regs_intr;
+			struct perf_event_attr *attr = &event->attr;
+			u64 mask = attr->sample_regs_intr;
 
 			perf_output_sample_regs(handle,
 						data->regs_intr.regs,
 						mask);
+			if (abi & PERF_SAMPLE_REGS_ABI_SIMD) {
+				perf_output_sample_simd_regs(handle, event,
+							     data->regs_intr.regs,
+							     attr->sample_simd_vec_reg_intr,
+							     attr->sample_simd_pred_reg_intr);
+			}
 		}
 	}
 
@@ -8596,6 +8664,33 @@ static __always_inline u64 __cond_set(u64 flags, u64 s, u64 d)
 	return d * !!(flags & s);
 }
 
+static u64 perf_update_xregs_size(struct perf_event *event, bool intr)
+{
+	u16 pred_qwords = event->attr.sample_simd_pred_reg_qwords;
+	u16 vec_qwords = event->attr.sample_simd_vec_reg_qwords;
+	u64 pred_mask;
+	u64 mask;
+	int size;
+
+	if (intr) {
+		mask = event->attr.sample_simd_vec_reg_intr;
+		pred_mask = event->attr.sample_simd_pred_reg_intr;
+	} else {
+		mask = event->attr.sample_simd_vec_reg_user;
+		pred_mask = event->attr.sample_simd_pred_reg_user;
+	}
+
+	/* nr_vectors, vector_qwords, nr_pred, pred_qwords */
+	size = sizeof(u64) * 4;
+	size += (hweight64(mask) * vec_qwords +
+		 hweight64(pred_mask) * pred_qwords) * sizeof(u64);
+
+	/* Warn if exceeding perf_event_header.size (u16). */
+	WARN_ON_ONCE(size > U16_MAX);
+
+	return size;
+}
+
 void perf_prepare_sample(struct perf_sample_data *data,
 			 struct perf_event *event,
 			 struct pt_regs *regs)
@@ -8661,6 +8756,11 @@ void perf_prepare_sample(struct perf_sample_data *data,
 			size += hweight64(mask) * sizeof(u64);
 		}
 
+		if (data->regs_user.abi && event_has_simd_regs(event)) {
+			size += perf_update_xregs_size(event, false);
+			data->regs_user.abi |= PERF_SAMPLE_REGS_ABI_SIMD;
+		}
+
 		data->dyn_size += size;
 		data->sample_flags |= PERF_SAMPLE_REGS_USER;
 	}
@@ -8724,6 +8824,11 @@ void perf_prepare_sample(struct perf_sample_data *data,
 			size += hweight64(mask) * sizeof(u64);
 		}
 
+		if (data->regs_intr.abi && event_has_simd_regs(event)) {
+			size += perf_update_xregs_size(event, true);
+			data->regs_intr.abi |= PERF_SAMPLE_REGS_ABI_SIMD;
+		}
+
 		data->dyn_size += size;
 		data->sample_flags |= PERF_SAMPLE_REGS_INTR;
 	}
@@ -13089,6 +13194,12 @@ static int perf_try_init_event(struct pmu *pmu, struct perf_event *event)
 	if (ret)
 		goto err_pmu;
 
+	if (!(pmu->capabilities & PERF_PMU_CAP_SIMD_REGS) &&
+	    event_has_simd_regs(event)) {
+		ret = -EOPNOTSUPP;
+		goto err_destroy;
+	}
+
 	if (!(pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS) &&
 	    event_has_extended_regs(event)) {
 		ret = -EOPNOTSUPP;
@@ -13585,7 +13696,8 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
 
 	attr->size = size;
 
-	if (attr->__reserved_1 || attr->__reserved_2 || attr->__reserved_3)
+	if (attr->__reserved_1 || attr->__reserved_2 ||
+	    attr->__reserved_3 || attr->__reserved_4)
 		return -EINVAL;
 
 	if (attr->sample_type & ~(PERF_SAMPLE_MAX-1))
@@ -13634,6 +13746,15 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
 		ret = perf_reg_validate(attr->sample_regs_user);
 		if (ret)
 			return ret;
+		ret = perf_simd_reg_validate(attr->sample_simd_regs_enabled,
+					     attr->sample_simd_vec_reg_qwords,
+					     attr->sample_simd_vec_reg_intr,
+					     attr->sample_simd_vec_reg_user,
+					     attr->sample_simd_pred_reg_qwords,
+					     attr->sample_simd_pred_reg_intr,
+					     attr->sample_simd_pred_reg_user);
+		if (ret)
+			return ret;
 	}
 
 	if (attr->sample_type & PERF_SAMPLE_STACK_USER) {
@@ -13654,8 +13775,20 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
 	if (!attr->sample_max_stack)
 		attr->sample_max_stack = sysctl_perf_event_max_stack;
 
-	if (attr->sample_type & PERF_SAMPLE_REGS_INTR)
+	if (attr->sample_type & PERF_SAMPLE_REGS_INTR) {
 		ret = perf_reg_validate(attr->sample_regs_intr);
+		if (ret)
+			return ret;
+		ret = perf_simd_reg_validate(attr->sample_simd_regs_enabled,
+					     attr->sample_simd_vec_reg_qwords,
+					     attr->sample_simd_vec_reg_intr,
+					     attr->sample_simd_vec_reg_user,
+					     attr->sample_simd_pred_reg_qwords,
+					     attr->sample_simd_pred_reg_intr,
+					     attr->sample_simd_pred_reg_user);
+		if (ret)
+			return ret;
+	}
 
 #ifndef CONFIG_CGROUP_PERF
 	if (attr->sample_type & PERF_SAMPLE_CGROUP)
-- 
2.34.1



* Re: [Patch v8 12/23] perf: Add sampling support for SIMD registers
  2026-05-29  7:56 ` [Patch v8 12/23] perf: Add sampling support for SIMD registers Dapeng Mi
@ 2026-05-29  8:36   ` sashiko-bot
  0 siblings, 0 replies; 44+ messages in thread
From: sashiko-bot @ 2026-05-29  8:36 UTC (permalink / raw)
  To: Dapeng Mi; +Cc: linux-perf-users

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [Critical] Missing bounds checking on user-provided SIMD `qwords` fields allows an integer overflow of the 16-bit perf event header size, leading to ring buffer corruption and potential NMI lockups.
- [High] SIMD register parameters are evaluated and validated outside of their respective flag gates, which can falsely reject valid syscalls due to uninitialized memory.
--

commit eeb01562d09b920ab2f41841dd6b7a879ed10465
Author: Kan Liang <kan.liang@linux.intel.com>

perf: Add sampling support for SIMD registers

This commit introduces support for sampling SIMD registers during profiling
by adding new attribute fields to struct perf_event_attr. It adds a new ABI
format to dump the width and values of the requested SIMD registers, using
bitmasks and qword sizes configured by the user.

> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 2ce553db4dcb8..94bb034da9b9d 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c

[ ... ]

> @@ -8596,6 +8664,33 @@ static __always_inline u64 __cond_set(u64 flags, u64 s, u64 d)
>  	return d * !!(flags & s);
>  }
>  
> +static u64 perf_update_xregs_size(struct perf_event *event, bool intr)
> +{
> +	u16 pred_qwords = event->attr.sample_simd_pred_reg_qwords;
> +	u16 vec_qwords = event->attr.sample_simd_vec_reg_qwords;
> +	u64 pred_mask;
> +	u64 mask;
> +	int size;
> +
> +	if (intr) {
> +		mask = event->attr.sample_simd_vec_reg_intr;
> +		pred_mask = event->attr.sample_simd_pred_reg_intr;
> +	} else {
> +		mask = event->attr.sample_simd_vec_reg_user;
> +		pred_mask = event->attr.sample_simd_pred_reg_user;
> +	}
> +
> +	/* nr_vectors, vector_qwords, nr_pred, pred_qwords */
> +	size = sizeof(u64) * 4;
> +	size += (hweight64(mask) * vec_qwords +
> +		 hweight64(pred_mask) * pred_qwords) * sizeof(u64);
> +

[Severity: Critical]
Is there any bounds checking on the user-provided vec_qwords and pred_qwords
fields here?

If userspace provides large values, could the multiplication with the bitmask
weights overflow the 16-bit perf event header size? 

If the size overflows, would perf_output_begin() reserve less ring buffer
space than perf_output_sample_simd_regs() subsequently writes, leading to
an out-of-bounds write in NMI context?
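
As a rough illustration of the truncation in question, here is a standalone
userspace sketch (not kernel code). xregs_size() and header_size() are
hypothetical stand-ins for the quoted computation and for the u16 header
field, and the register counts and qword values are arbitrary:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the size computation quoted above:
 * four u64 header words plus the selected register payload.
 */
uint64_t xregs_size(unsigned int nr_vec, uint16_t vec_qwords,
		    unsigned int nr_pred, uint16_t pred_qwords)
{
	uint64_t size = sizeof(uint64_t) * 4;

	/* A 64-bit mask can select at most 64 registers. */
	assert(nr_vec <= 64 && nr_pred <= 64);

	size += ((uint64_t)nr_vec * vec_qwords +
		 (uint64_t)nr_pred * pred_qwords) * sizeof(uint64_t);
	return size;
}

/* Models storing the size into a 16-bit field: the high bits are lost. */
uint16_t header_size(uint64_t size)
{
	return (uint16_t)size;
}
```

With 64 vector registers of 200 qwords each, xregs_size() yields 102432
bytes, which a 16-bit field would store as 36896.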

> +	/* Warn if exceeding perf_event_header.size (u16). */
> +	WARN_ON_ONCE(size > U16_MAX);
> +
> +	return size;
> +}

[ ... ]

> @@ -13634,6 +13746,15 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
>  		ret = perf_reg_validate(attr->sample_regs_user);
>  		if (ret)
>  			return ret;
> +		ret = perf_simd_reg_validate(attr->sample_simd_regs_enabled,
> +					     attr->sample_simd_vec_reg_qwords,
> +					     attr->sample_simd_vec_reg_intr,
> +					     attr->sample_simd_vec_reg_user,
> +					     attr->sample_simd_pred_reg_qwords,
> +					     attr->sample_simd_pred_reg_intr,
> +					     attr->sample_simd_pred_reg_user);
> +		if (ret)
> +			return ret;
>  	}

[Severity: High]
When attr->sample_type & PERF_SAMPLE_REGS_USER is true but
PERF_SAMPLE_REGS_INTR is not set, does this validate the _intr fields that
might contain uninitialized garbage from userspace?

Since flag-gated syscall parameters are not required to be zero-filled when
the flag is absent, could evaluating these parameters outside their respective
flag gates falsely reject valid syscalls due to uninitialized memory?
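
One possible shape of such gating, sketched in standalone C. The DEMO_*
values and validate_gated() are made up for illustration and are not the
kernel's definitions; DEMO_VEC_MASK is an arbitrary 16-register mask:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Illustrative flag bits only; not the real PERF_SAMPLE_* values. */
#define DEMO_REGS_USER	(1ULL << 0)
#define DEMO_REGS_INTR	(1ULL << 1)
/* Assumed set of valid vector registers for this sketch. */
#define DEMO_VEC_MASK	0xffffULL

/*
 * Only inspect a mask when the sample_type flag that gates it is set,
 * so stale bits in an unused field cannot cause a rejection.
 */
int validate_gated(uint64_t sample_type, uint64_t vec_mask_intr,
		   uint64_t vec_mask_user)
{
	if (!(sample_type & DEMO_REGS_INTR))
		vec_mask_intr = 0;
	if (!(sample_type & DEMO_REGS_USER))
		vec_mask_user = 0;

	if ((vec_mask_intr | vec_mask_user) & ~DEMO_VEC_MASK)
		return -EINVAL;

	return 0;
}
```

In this sketch a request with only DEMO_REGS_USER set passes even if the
_intr mask holds out-of-range bits, while the same bits are rejected once
DEMO_REGS_INTR is also set.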

[ ... ]

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260529075645.580362-1-dapeng1.mi@linux.intel.com?part=12


* [Patch v8 13/23] perf/x86: Support XMM sampling using sample_simd_vec_reg_* fields
  2026-05-29  7:56 [Patch v8 00/23] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (11 preceding siblings ...)
  2026-05-29  7:56 ` [Patch v8 12/23] perf: Add sampling support for SIMD registers Dapeng Mi
@ 2026-05-29  7:56 ` Dapeng Mi
  2026-05-29  8:49   ` sashiko-bot
  2026-05-29  7:56 ` [Patch v8 14/23] perf/x86: Support YMM " Dapeng Mi
                   ` (10 subsequent siblings)
  23 siblings, 1 reply; 44+ messages in thread
From: Dapeng Mi @ 2026-05-29  7:56 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Dapeng Mi, Kan Liang

This patch adds support for sampling XMM registers using the
sample_simd_vec_reg_* fields.

When sample_simd_regs_enabled is set, the original XMM space in the
sample_regs_* fields is treated as reserved. -EINVAL is returned to user
space if any bit is set in the original XMM space while
sample_simd_regs_enabled is set.
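
A minimal sketch of that rule in standalone C. check_legacy_xmm_bits() is a
hypothetical helper, and DEMO_XMM0 assumes the legacy XMM space starts at
bit 32 (PERF_REG_X86_XMM0), matching how PERF_REG_EXTENDED_MASK is built in
the uapi header; in this patch the corresponding check is added to
x86_pmu_hw_config():

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed first bit of the legacy XMM space in sample_regs_*. */
#define DEMO_XMM0		32
/* Same construction as PERF_REG_EXTENDED_MASK. */
#define DEMO_EXTENDED_MASK	(~((1ULL << DEMO_XMM0) - 1))

/*
 * With the SIMD interface enabled, the legacy XMM bits are reserved,
 * so any of them being set is an error.
 */
int check_legacy_xmm_bits(uint64_t sample_regs, int simd_enabled)
{
	if (simd_enabled && (sample_regs & DEMO_EXTENDED_MASK))
		return -EINVAL;

	return 0;
}
```

General-purpose register bits below DEMO_XMM0 are unaffected; only the
legacy XMM range is rejected when the new interface is in use.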

The perf_reg_value() function requires ABI information to understand the
layout of sample_regs. To accommodate this, a new abi field is introduced
in struct x86_perf_regs.

Additionally, the x86-specific perf_simd_reg_value() function is
implemented to retrieve the XMM register values.

Please note that XMM sampling is not enabled yet; it will be enabled in a
later patch when PERF_PMU_CAP_SIMD_REGS is set.

Co-developed-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/core.c                | 48 +++++++++++++++--
 arch/x86/events/intel/ds.c            |  2 +-
 arch/x86/events/perf_event.h          | 12 +++++
 arch/x86/include/asm/perf_event.h     |  1 +
 arch/x86/include/uapi/asm/perf_regs.h | 13 +++++
 arch/x86/kernel/perf_regs.c           | 74 ++++++++++++++++++++++++++-
 include/linux/perf_event.h            |  1 +
 kernel/events/core.c                  |  2 +-
 8 files changed, 145 insertions(+), 8 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index f9e3f349b69a..5a4760a1716b 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -718,6 +718,20 @@ int x86_pmu_hw_config(struct perf_event *event)
 			if (is_sampling_event(event) && !event->attr.precise_ip &&
 			    !this_cpu_has(X86_FEATURE_XSAVES))
 				return -EINVAL;
+			if (event->attr.sample_simd_regs_enabled)
+				return -EINVAL;
+		}
+
+		if (event_has_simd_regs(event)) {
+			if (!(event->pmu->capabilities & PERF_PMU_CAP_SIMD_REGS))
+				return -EINVAL;
+			if (is_sampling_event(event) && !event->attr.precise_ip &&
+			    !this_cpu_has(X86_FEATURE_XSAVES))
+				return -EINVAL;
+			/* The vector registers set is not supported */
+			if (event_needs_xmm(event) &&
+			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_SSE))
+				return -EINVAL;
 		}
 	}
 
@@ -1760,6 +1774,7 @@ void x86_pmu_clear_perf_regs(struct pt_regs *regs)
 {
 	struct x86_perf_regs *perf_regs = container_of(regs, struct x86_perf_regs, regs);
 
+	perf_regs->abi = PERF_SAMPLE_REGS_ABI_NONE;
 	perf_regs->xmm_regs = NULL;
 }
 
@@ -1780,13 +1795,14 @@ static void update_perf_regs(struct x86_perf_regs *perf_regs,
 
 /*
  * The x86 specific variant of perf_sample_regs_intr().
- * It would be extended to add more SIMD registers sampling support
- * in later patches.
+ * Update data->regs_intr fields for extended registers (e.g., SIMD).
  */
 static void x86_pmu_update_regs_intr(struct perf_event *event,
 				     struct perf_sample_data *data,
 				     struct pt_regs *regs)
 {
+	struct x86_perf_regs *perf_regs;
+
 	data->regs_intr.regs = regs;
 	data->regs_intr.abi  = perf_reg_abi(current);
 
@@ -1796,6 +1812,17 @@ static void x86_pmu_update_regs_intr(struct perf_event *event,
 				  sizeof(u64);
 	}
 
+	if (data->regs_intr.abi && event_has_simd_regs(event)) {
+		data->dyn_size += perf_update_xregs_size(event, true);
+		data->regs_intr.abi |= PERF_SAMPLE_REGS_ABI_SIMD;
+	}
+
+	if (data->regs_intr.abi) {
+		perf_regs = container_of(data->regs_intr.regs,
+					 struct x86_perf_regs, regs);
+		perf_regs->abi = data->regs_intr.abi;
+	}
+
 	/*
 	 * Set PERF_SAMPLE_REGS_INTR to bypass perf_sample_regs_intr() call
 	 * in perf_prepare_sample() function.
@@ -1836,6 +1863,7 @@ static void x86_pmu_update_regs_user(struct perf_event *event,
 				     struct pt_regs *regs)
 {
 	struct perf_event_attr *attr = &event->attr;
+	struct x86_perf_regs *perf_regs;
 
 	if (user_mode(regs)) {
 		data->regs_user.abi = perf_reg_abi(current);
@@ -1858,6 +1886,17 @@ static void x86_pmu_update_regs_user(struct perf_event *event,
 	if (data->regs_user.regs)
 		data->dyn_size += hweight64(attr->sample_regs_user) * sizeof(u64);
 
+	if (data->regs_user.abi && event_has_simd_regs(event)) {
+		data->dyn_size += perf_update_xregs_size(event, false);
+		data->regs_user.abi |= PERF_SAMPLE_REGS_ABI_SIMD;
+	}
+
+	if (data->regs_user.abi) {
+		perf_regs = container_of(data->regs_user.regs,
+					 struct x86_perf_regs, regs);
+		perf_regs->abi = data->regs_user.abi;
+	}
+
 	/*
 	 * Set PERF_SAMPLE_REGS_USER to bypass perf_sample_regs_user() call
 	 * in perf_prepare_sample() function.
@@ -1926,7 +1965,7 @@ static void x86_pmu_sample_xregs(struct perf_event *event,
 	if (WARN_ON_ONCE(!xsave) || !in_nmi())
 		return;
 
-	if (event_has_extended_regs(event))
+	if (event_needs_xmm(event))
 		mask |= XFEATURE_MASK_SSE;
 
 	mask &= x86_pmu.ext_regs_mask;
@@ -1963,7 +2002,8 @@ void x86_pmu_update_perf_regs(struct perf_event *event,
 {
 	u64 sample_type = event->attr.sample_type;
 
-	if (!event_has_extended_regs(event))
+	if (!event_needs_xmm(event) &&
+	    !event_has_simd_regs(event))
 		return;
 
 	if (sample_type & PERF_SAMPLE_REGS_INTR)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index bd43bf26e6bf..609d4a83115d 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1749,7 +1749,7 @@ static u64 pebs_update_adaptive_cfg(struct perf_event *event)
 	if (gprs || (attr->precise_ip < 2) || tsx_weight)
 		pebs_data_cfg |= PEBS_DATACFG_GP;
 
-	if (event_has_extended_regs(event))
+	if (event_needs_xmm(event))
 		pebs_data_cfg |= PEBS_DATACFG_XMMS;
 
 	if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index cff5fbac000b..b04f5ba3294a 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -147,6 +147,18 @@ static inline bool is_acr_self_reload_event(struct perf_event *event)
 	return test_bit(hwc->idx, (unsigned long *)&hwc->config1);
 }
 
+static inline bool event_needs_xmm(struct perf_event *event)
+{
+	if (event->attr.sample_simd_regs_enabled &&
+	    event->attr.sample_simd_vec_reg_qwords >= PERF_X86_XMM_QWORDS)
+		return true;
+
+	if (!event->attr.sample_simd_regs_enabled &&
+	    event_has_extended_regs(event))
+		return true;
+	return false;
+}
+
 struct amd_nb {
 	int nb_id;  /* NorthBridge id */
 	int refcnt; /* reference count */
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index e47a963a7cf0..e54d21c13494 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -726,6 +726,7 @@ extern void perf_events_lapic_init(void);
 struct pt_regs;
 struct x86_perf_regs {
 	struct pt_regs	regs;
+	u64		abi;
 	union {
 		u64	*xmm_regs;
 		u32	*xmm_space;	/* for xsaves */
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index 7c9d2bb3833b..5b7d5216f0bd 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -55,4 +55,17 @@ enum perf_event_x86_regs {
 
 #define PERF_REG_EXTENDED_MASK	(~((1ULL << PERF_REG_X86_XMM0) - 1))
 
+enum {
+	PERF_X86_SIMD_XMM_REGS      = 16,
+	PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_XMM_REGS,
+};
+
+#define PERF_X86_SIMD_VEC_MASK	__GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
+
+enum {
+	/* 1 qword = 8 bytes */
+	PERF_X86_XMM_QWORDS      = 2,
+	PERF_X86_SIMD_QWORDS_MAX = PERF_X86_XMM_QWORDS,
+};
+
 #endif /* _ASM_X86_PERF_REGS_H */
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 81204cb7f723..7b9b38c189de 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -63,6 +63,9 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 
 	if (idx >= PERF_REG_X86_XMM0 && idx < PERF_REG_X86_XMM_MAX) {
 		perf_regs = container_of(regs, struct x86_perf_regs, regs);
+		/* SIMD registers are moved to dedicated sample_simd_vec_reg */
+		if (perf_regs->abi & PERF_SAMPLE_REGS_ABI_SIMD)
+			return 0;
 		if (!perf_regs->xmm_regs)
 			return 0;
 		return perf_regs->xmm_regs[idx - PERF_REG_X86_XMM0];
@@ -74,6 +77,71 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 	return regs_get_register(regs, pt_regs_offset[idx]);
 }
 
+u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
+			u16 qwords_idx, bool pred)
+{
+	struct x86_perf_regs *perf_regs =
+			container_of(regs, struct x86_perf_regs, regs);
+
+	if (!(perf_regs->abi & PERF_SAMPLE_REGS_ABI_SIMD))
+		return 0;
+
+	if (pred)
+		return 0;
+
+	if (WARN_ON_ONCE(idx >= PERF_X86_SIMD_VEC_REGS_MAX ||
+			 qwords_idx >= PERF_X86_SIMD_QWORDS_MAX))
+		return 0;
+
+	if (qwords_idx < PERF_X86_XMM_QWORDS) {
+		if (!perf_regs->xmm_regs)
+			return 0;
+		return perf_regs->xmm_regs[idx * PERF_X86_XMM_QWORDS +
+					   qwords_idx];
+	}
+
+	return 0;
+}
+
+int perf_simd_reg_validate(u16 simd_enabled, u16 vec_qwords,
+			   u64 vec_mask_intr, u64 vec_mask_user,
+			   u16 pred_qwords, u32 pred_mask_intr,
+			   u32 pred_mask_user)
+{
+	u64 size;
+
+	if (!simd_enabled) {
+		if (vec_qwords || vec_mask_intr || vec_mask_user ||
+		    pred_qwords || pred_mask_intr || pred_mask_user)
+			return -EINVAL;
+		return 0;
+	}
+
+	if (!vec_qwords) {
+		if (vec_mask_intr || vec_mask_user)
+			return -EINVAL;
+	} else {
+		if (vec_qwords != PERF_X86_XMM_QWORDS)
+			return -EINVAL;
+		if ((!vec_mask_intr && !vec_mask_user) ||
+		    (vec_mask_intr & ~PERF_X86_SIMD_VEC_MASK) ||
+		    (vec_mask_user & ~PERF_X86_SIMD_VEC_MASK))
+			return -EINVAL;
+	}
+
+	if (pred_qwords || pred_mask_intr || pred_mask_user)
+		return -EINVAL;
+
+	size = ((vec_qwords * hweight64(vec_mask_intr)) +
+		(vec_qwords * hweight64(vec_mask_user)) +
+		(pred_qwords * hweight32(pred_mask_intr)) +
+		(pred_qwords * hweight32(pred_mask_user))) * sizeof(u64);
+	if (size > U16_MAX)
+		return -EINVAL;
+
+	return 0;
+}
+
 #define PERF_REG_X86_RESERVED	(((1ULL << PERF_REG_X86_XMM0) - 1) & \
 				 ~((1ULL << PERF_REG_X86_MAX) - 1))
 
@@ -89,7 +157,8 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 
 int perf_reg_validate(u64 mask)
 {
-	if (!mask || (mask & (REG_NOSUPPORT | PERF_REG_X86_RESERVED)))
+	/* The mask can be 0 when only SIMD registers are requested */
+	if (mask & (REG_NOSUPPORT | PERF_REG_X86_RESERVED))
 		return -EINVAL;
 
 	return 0;
@@ -108,7 +177,8 @@ u64 perf_reg_abi(struct task_struct *task)
 
 int perf_reg_validate(u64 mask)
 {
-	if (!mask || (mask & (REG_NOSUPPORT | PERF_REG_X86_RESERVED)))
+	/* The mask can be 0 when only SIMD registers are requested */
+	if (mask & (REG_NOSUPPORT | PERF_REG_X86_RESERVED))
 		return -EINVAL;
 
 	return 0;
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 5f0642ef4fd2..baf694203d23 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1485,6 +1485,7 @@ static inline void perf_clear_branch_entry_bitfields(struct perf_branch_entry *b
 	br->reserved	= 0;
 }
 
+extern u64 perf_update_xregs_size(struct perf_event *event, bool intr);
 extern void perf_output_sample(struct perf_output_handle *handle,
 			       struct perf_event_header *header,
 			       struct perf_sample_data *data,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 94bb034da9b9..afd5b1408231 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8664,7 +8664,7 @@ static __always_inline u64 __cond_set(u64 flags, u64 s, u64 d)
 	return d * !!(flags & s);
 }
 
-static u64 perf_update_xregs_size(struct perf_event *event, bool intr)
+u64 perf_update_xregs_size(struct perf_event *event, bool intr)
 {
 	u16 pred_qwords = event->attr.sample_simd_pred_reg_qwords;
 	u16 vec_qwords = event->attr.sample_simd_vec_reg_qwords;
-- 
2.34.1



* Re: [Patch v8 13/23] perf/x86: Support XMM sampling using sample_simd_vec_reg_* fields
  2026-05-29  7:56 ` [Patch v8 13/23] perf/x86: Support XMM sampling using sample_simd_vec_reg_* fields Dapeng Mi
@ 2026-05-29  8:49   ` sashiko-bot
  0 siblings, 0 replies; 44+ messages in thread
From: sashiko-bot @ 2026-05-29  8:49 UTC (permalink / raw)
  To: Dapeng Mi; +Cc: linux-perf-users

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Critical] The validation logic in `perf_simd_reg_validate()` strictly checks unused, flag-gated syscall parameters outside of their respective flag scopes, breaking ABI compatibility.
--

commit aeb4cf15401ade404bfd333c1c769dd23251abf8
Author: Dapeng Mi <dapeng1.mi@linux.intel.com>

perf/x86: Support XMM sampling using sample_simd_vec_reg_* fields

This patch adds support for sampling XMM registers using the new
sample_simd_vec_reg_* fields. It introduces ABI tracking for perf_regs
on x86 and implements validation logic for SIMD registers.

> diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
> index 81204cb7f7239..7b9b38c189de2 100644
> --- a/arch/x86/kernel/perf_regs.c
> +++ b/arch/x86/kernel/perf_regs.c
[ ... ]
> @@ -74,6 +77,71 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
[ ... ]
> +int perf_simd_reg_validate(u16 simd_enabled, u16 vec_qwords,
> +			   u64 vec_mask_intr, u64 vec_mask_user,
> +			   u16 pred_qwords, u32 pred_mask_intr,
> +			   u32 pred_mask_user)
> +{
> +	u64 size;
> +
> +	if (!simd_enabled) {
> +		if (vec_qwords || vec_mask_intr || vec_mask_user ||
> +		    pred_qwords || pred_mask_intr || pred_mask_user)
> +			return -EINVAL;

[Severity: Critical]
Are we unconditionally validating syscall parameters outside their flag scopes
here?

If userspace sets PERF_SAMPLE_REGS_USER but does not set
PERF_SAMPLE_REGS_INTR, the vec_mask_intr and pred_mask_intr parameters
could contain uninitialized stack garbage. Because userspace is not required
to zero-fill unused arguments, returning -EINVAL for garbage in these unused
masks could break ABI compatibility for programs recompiled against newer
kernel headers.

> +		return 0;
> +	}
> +
> +	if (!vec_qwords) {
> +		if (vec_mask_intr || vec_mask_user)
> +			return -EINVAL;
> +	} else {
> +		if (vec_qwords != PERF_X86_XMM_QWORDS)
> +			return -EINVAL;
> +		if ((!vec_mask_intr && !vec_mask_user) ||
> +		    (vec_mask_intr & ~PERF_X86_SIMD_VEC_MASK) ||
> +		    (vec_mask_user & ~PERF_X86_SIMD_VEC_MASK))
> +			return -EINVAL;

[Severity: Critical]
Similarly, could checking these masks cause false -EINVAL errors
if only one of the sampling flags (INTR or USER) is enabled?

If PERF_SAMPLE_REGS_USER is enabled but PERF_SAMPLE_REGS_INTR is disabled,
vec_mask_intr might contain random bits that fail the PERF_X86_SIMD_VEC_MASK
check. Should these validation checks be gated behind the presence of their
respective flags in the sample_type to prevent regressions?

> +	}
> +
> +	if (pred_qwords || pred_mask_intr || pred_mask_user)
> +		return -EINVAL;
> +
> +	size = ((vec_qwords * hweight64(vec_mask_intr)) +
> +		(vec_qwords * hweight64(vec_mask_user)) +
> +		(pred_qwords * hweight32(pred_mask_intr)) +
> +		(pred_qwords * hweight32(pred_mask_user))) * sizeof(u64);
> +	if (size > U16_MAX)
> +		return -EINVAL;
> +
> +	return 0;
> +}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260529075645.580362-1-dapeng1.mi@linux.intel.com?part=13


* [Patch v8 14/23] perf/x86: Support YMM sampling using sample_simd_vec_reg_* fields
  2026-05-29  7:56 [Patch v8 00/23] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (12 preceding siblings ...)
  2026-05-29  7:56 ` [Patch v8 13/23] perf/x86: Support XMM sampling using sample_simd_vec_reg_* fields Dapeng Mi
@ 2026-05-29  7:56 ` Dapeng Mi
  2026-05-29  8:47   ` sashiko-bot
  2026-05-29  7:56 ` [Patch v8 15/23] perf/x86: Support ZMM " Dapeng Mi
                   ` (9 subsequent siblings)
  23 siblings, 1 reply; 44+ messages in thread
From: Dapeng Mi @ 2026-05-29  7:56 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Dapeng Mi, Kan Liang

This patch introduces support for sampling YMM registers via the
sample_simd_vec_reg_* fields.

Each YMM register consists of 4 u64 words, assembled from two halves:
XMM (the lower 2 u64 words) and YMMH (the upper 2 u64 words). Although
both XMM and YMMH data can be retrieved with a single xsaves instruction,
they are stored in separate locations. The perf_simd_reg_value() function
is responsible for assembling these halves into a complete YMM register
for output to userspace.
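
The assembly described above can be sketched in standalone C as follows.
ymm_qword() is a hypothetical name; it follows the indexing used by
perf_simd_reg_value() in this patch and assumes both save areas are flat
arrays holding 2 qwords per register:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_XMM_QWORDS		2
#define DEMO_YMM_QWORDS		4
#define DEMO_YMMH_QWORDS	(DEMO_YMM_QWORDS / 2)
#define DEMO_NR_REGS		16

/*
 * Return qword q (0..3) of YMM register idx. Qwords 0-1 come from the
 * XMM area, qwords 2-3 from the separate YMMH area. A missing area
 * reads as zero.
 */
uint64_t ymm_qword(const uint64_t *xmm, const uint64_t *ymmh, int idx, int q)
{
	assert(idx >= 0 && idx < DEMO_NR_REGS);
	assert(q >= 0 && q < DEMO_YMM_QWORDS);

	if (q < DEMO_XMM_QWORDS)
		return xmm ? xmm[idx * DEMO_XMM_QWORDS + q] : 0;

	return ymmh ? ymmh[idx * DEMO_YMMH_QWORDS + (q - DEMO_XMM_QWORDS)] : 0;
}
```

Calling ymm_qword() for q = 0..3 in order yields the four u64 words of one
YMM register as they would appear consecutively in the sample.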

Additionally, sample_simd_vec_reg_qwords should be set to 4 to indicate
YMM sampling.

Please note that YMM sampling is not enabled yet; it will be enabled in a
later patch when PERF_PMU_CAP_SIMD_REGS is set.

Co-developed-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/core.c                |  8 ++++++++
 arch/x86/events/perf_event.h          |  9 +++++++++
 arch/x86/include/asm/perf_event.h     |  4 ++++
 arch/x86/include/uapi/asm/perf_regs.h |  6 ++++--
 arch/x86/kernel/perf_regs.c           | 10 +++++++++-
 5 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 5a4760a1716b..d39710f42ca0 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -732,6 +732,9 @@ int x86_pmu_hw_config(struct perf_event *event)
 			if (event_needs_xmm(event) &&
 			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_SSE))
 				return -EINVAL;
+			if (event_needs_ymm(event) &&
+			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_YMM))
+				return -EINVAL;
 		}
 	}
 
@@ -1776,6 +1779,7 @@ void x86_pmu_clear_perf_regs(struct pt_regs *regs)
 
 	perf_regs->abi = PERF_SAMPLE_REGS_ABI_NONE;
 	perf_regs->xmm_regs = NULL;
+	perf_regs->ymmh_regs = NULL;
 }
 
 static void update_perf_regs(struct x86_perf_regs *perf_regs,
@@ -1791,6 +1795,8 @@ static void update_perf_regs(struct x86_perf_regs *perf_regs,
 
 	if (mask & XFEATURE_MASK_SSE)
 		perf_regs->xmm_space = xsave->i387.xmm_space;
+	if (mask & XFEATURE_MASK_YMM)
+		perf_regs->ymmh = get_xsave_addr(xsave, XFEATURE_YMM);
 }
 
 /*
@@ -1967,6 +1973,8 @@ static void x86_pmu_sample_xregs(struct perf_event *event,
 
 	if (event_needs_xmm(event))
 		mask |= XFEATURE_MASK_SSE;
+	if (event_needs_ymm(event))
+		mask |= XFEATURE_MASK_YMM;
 
 	mask &= x86_pmu.ext_regs_mask;
 	if (sample_type & PERF_SAMPLE_REGS_USER) {
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index b04f5ba3294a..5111eaf8b12a 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -159,6 +159,15 @@ static inline bool event_needs_xmm(struct perf_event *event)
 	return false;
 }
 
+static inline bool event_needs_ymm(struct perf_event *event)
+{
+	if (event->attr.sample_simd_regs_enabled &&
+	    event->attr.sample_simd_vec_reg_qwords >= PERF_X86_YMM_QWORDS)
+		return true;
+
+	return false;
+}
+
 struct amd_nb {
 	int nb_id;  /* NorthBridge id */
 	int refcnt; /* reference count */
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index e54d21c13494..1d03b86be65d 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -731,6 +731,10 @@ struct x86_perf_regs {
 		u64	*xmm_regs;
 		u32	*xmm_space;	/* for xsaves */
 	};
+	union {
+		u64	*ymmh_regs;
+		struct ymmh_struct *ymmh;
+	};
 };
 
 extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index 5b7d5216f0bd..8f513229fbb8 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -57,7 +57,8 @@ enum perf_event_x86_regs {
 
 enum {
 	PERF_X86_SIMD_XMM_REGS      = 16,
-	PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_XMM_REGS,
+	PERF_X86_SIMD_YMM_REGS      = 16,
+	PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_YMM_REGS,
 };
 
 #define PERF_X86_SIMD_VEC_MASK	__GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
@@ -65,7 +66,8 @@ enum {
 enum {
 	/* 1 qword = 8 bytes */
 	PERF_X86_XMM_QWORDS      = 2,
-	PERF_X86_SIMD_QWORDS_MAX = PERF_X86_XMM_QWORDS,
+	PERF_X86_YMM_QWORDS      = 4,
+	PERF_X86_SIMD_QWORDS_MAX = PERF_X86_YMM_QWORDS,
 };
 
 #endif /* _ASM_X86_PERF_REGS_H */
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 7b9b38c189de..9792483360c7 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -77,6 +77,8 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 	return regs_get_register(regs, pt_regs_offset[idx]);
 }
 
+#define PERF_X86_YMMH_QWORDS	(PERF_X86_YMM_QWORDS / 2)
+
 u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
 			u16 qwords_idx, bool pred)
 {
@@ -98,6 +100,11 @@ u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
 			return 0;
 		return perf_regs->xmm_regs[idx * PERF_X86_XMM_QWORDS +
 					   qwords_idx];
+	} else if (qwords_idx < PERF_X86_YMM_QWORDS) {
+		if (!perf_regs->ymmh_regs)
+			return 0;
+		return perf_regs->ymmh_regs[idx * PERF_X86_YMMH_QWORDS +
+					    qwords_idx - PERF_X86_XMM_QWORDS];
 	}
 
 	return 0;
@@ -121,7 +128,8 @@ int perf_simd_reg_validate(u16 simd_enabled, u16 vec_qwords,
 		if (vec_mask_intr || vec_mask_user)
 			return -EINVAL;
 	} else {
-		if (vec_qwords != PERF_X86_XMM_QWORDS)
+		if (vec_qwords != PERF_X86_XMM_QWORDS &&
+		    vec_qwords != PERF_X86_YMM_QWORDS)
 			return -EINVAL;
 		if ((!vec_mask_intr && !vec_mask_user) ||
 		    (vec_mask_intr & ~PERF_X86_SIMD_VEC_MASK) ||
-- 
2.34.1



* Re: [Patch v8 14/23] perf/x86: Support YMM sampling using sample_simd_vec_reg_* fields
  2026-05-29  7:56 ` [Patch v8 14/23] perf/x86: Support YMM " Dapeng Mi
@ 2026-05-29  8:47   ` sashiko-bot
  0 siblings, 0 replies; 44+ messages in thread
From: sashiko-bot @ 2026-05-29  8:47 UTC (permalink / raw)
  To: Dapeng Mi; +Cc: linux-perf-users

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:

New issues:
- [Medium] The `event_needs_ymm()` function evaluates flag-gated syscall parameters outside their corresponding flag gate (`PERF_SAMPLE_REGS_USER` / `PERF_SAMPLE_REGS_INTR`).

Pre-existing issues:
- [High] The function `perf_simd_reg_validate` improperly returns an error when unused, flag-gated parameters are not zero-filled, breaking ABI compatibility.
--

commit f738764532819a2044a1239c400d83c8404d4843
Author: Dapeng Mi <dapeng1.mi@linux.intel.com>

perf/x86: Support YMM sampling using sample_simd_vec_reg_* fields

This patch introduces support for sampling YMM registers via the
sample_simd_vec_reg_* fields. It updates the perf_simd_reg_value function
to assemble the XMM and YMMH halves into a complete YMM register and
sets the sample_simd_vec_reg_qwords to 4 for YMM sampling.

> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
[ ... ]
> @@ -1967,6 +1973,8 @@ static void x86_pmu_sample_xregs(struct perf_event *event,
>  
>  	if (event_needs_xmm(event))
>  		mask |= XFEATURE_MASK_SSE;
> +	if (event_needs_ymm(event))
> +		mask |= XFEATURE_MASK_YMM;
>  
>  	mask &= x86_pmu.ext_regs_mask;
>  	if (sample_type & PERF_SAMPLE_REGS_USER) {

[Severity: Medium]
Does this evaluate flag-gated syscall parameters outside their flag gate?

According to the syscall parameter trust boundaries, userspace is not
required to zero-fill unused fields, so they may contain arbitrary garbage
when their flag is absent.

The event_needs_ymm() helper checks event->attr.sample_simd_regs_enabled
and event->attr.sample_simd_vec_reg_qwords.

However, in x86_pmu_sample_xregs(), this check happens before we verify if
the sample type actually includes PERF_SAMPLE_REGS_USER or
PERF_SAMPLE_REGS_INTR.

Could this cause the kernel to read uninitialized userspace values when
those sample_type flags are not set?

> diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
> --- a/arch/x86/kernel/perf_regs.c
> +++ b/arch/x86/kernel/perf_regs.c
[ ... ]
> @@ -121,7 +128,8 @@ int perf_simd_reg_validate(u16 simd_enabled, u16 vec_qwords,
>  		if (vec_mask_intr || vec_mask_user)
>  			return -EINVAL;
>  	} else {
> -		if (vec_qwords != PERF_X86_XMM_QWORDS)
> +		if (vec_qwords != PERF_X86_XMM_QWORDS &&
> +		    vec_qwords != PERF_X86_YMM_QWORDS)
>  			return -EINVAL;

[Severity: High]
This isn't a bug introduced by this patch, but does the surrounding
validation code in perf_simd_reg_validate() inappropriately reject legal
syscalls?

When simd_enabled is 0, the function checks if vec_qwords, vec_mask_intr,
and other parameters are zero, returning -EINVAL if they are not.

Since userspace isn't required to zero-fill unused arguments outside their
flag gate, these parameters could contain garbage.

Is it possible that enforcing zero-fill on these unused arguments
functionally breaks ABI compatibility by failing requests that should be
valid under the trust boundary rules?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260529075645.580362-1-dapeng1.mi@linux.intel.com?part=14


* [Patch v8 15/23] perf/x86: Support ZMM sampling using sample_simd_vec_reg_* fields
  2026-05-29  7:56 [Patch v8 00/23] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (13 preceding siblings ...)
  2026-05-29  7:56 ` [Patch v8 14/23] perf/x86: Support YMM " Dapeng Mi
@ 2026-05-29  7:56 ` Dapeng Mi
  2026-05-29  7:56 ` [Patch v8 16/23] perf/x86: Support OPMASK sampling using sample_simd_pred_reg_* fields Dapeng Mi
                   ` (8 subsequent siblings)
  23 siblings, 0 replies; 44+ messages in thread
From: Dapeng Mi @ 2026-05-29  7:56 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Dapeng Mi, Kan Liang

This patch adds support for sampling ZMM registers via the
sample_simd_vec_reg_* fields.

Each ZMM register consists of 8 u64 words. Current x86 hardware supports
up to 32 ZMM registers. Registers ZMM0 to ZMM15 are assembled from three
parts: XMM (the lower 2 u64 words), YMMH (the middle 2 u64 words), and
ZMMH (the upper 4 u64 words). The perf_simd_reg_value() function is
responsible for assembling these three parts into a complete ZMM register
for output to userspace.

Registers ZMM16 to ZMM31 can each be read as a whole and output directly
to userspace.
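
A standalone C sketch of the lookup described above. zmm_qword() is a
hypothetical name, and the flat-array layouts (2 qwords per register for
XMM and YMMH, 4 for ZMMH, 8 for the ZMM16 to ZMM31 area) are assumptions
based on this description rather than a copy of perf_simd_reg_value():

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_XMM_QWORDS		2
#define DEMO_YMM_QWORDS		4
#define DEMO_ZMM_QWORDS		8
#define DEMO_H16ZMM_BASE	16	/* assumed index of ZMM16 */
#define DEMO_NR_ZMM		32

/*
 * Return qword q (0..7) of ZMM register idx (0..31).
 * ZMM16-31 are stored whole; ZMM0-15 are split across three areas.
 * A missing area reads as zero.
 */
uint64_t zmm_qword(const uint64_t *xmm, const uint64_t *ymmh,
		   const uint64_t *zmmh, const uint64_t *h16zmm,
		   int idx, int q)
{
	assert(idx >= 0 && idx < DEMO_NR_ZMM);
	assert(q >= 0 && q < DEMO_ZMM_QWORDS);

	if (idx >= DEMO_H16ZMM_BASE)
		return h16zmm ?
		       h16zmm[(idx - DEMO_H16ZMM_BASE) * DEMO_ZMM_QWORDS + q] : 0;

	if (q < DEMO_XMM_QWORDS)
		return xmm ? xmm[idx * 2 + q] : 0;

	if (q < DEMO_YMM_QWORDS)
		return ymmh ? ymmh[idx * 2 + (q - DEMO_XMM_QWORDS)] : 0;

	return zmmh ? zmmh[idx * 4 + (q - DEMO_YMM_QWORDS)] : 0;
}
```

The split at register 16 is why the low and high halves of the ZMM file
are tracked as separate XSAVE features in the diff below.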

Additionally, sample_simd_vec_reg_qwords should be set to 8 to indicate
ZMM sampling.

Please note that ZMM sampling is not enabled yet; it will be enabled in a
later patch when PERF_PMU_CAP_SIMD_REGS is set.

Co-developed-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/core.c                | 16 ++++++++++++++++
 arch/x86/events/perf_event.h          | 19 +++++++++++++++++++
 arch/x86/include/asm/perf_event.h     |  8 ++++++++
 arch/x86/include/uapi/asm/perf_regs.h |  8 ++++++--
 arch/x86/kernel/perf_regs.c           | 16 +++++++++++++++-
 5 files changed, 64 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index d39710f42ca0..3051a53232c8 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -735,6 +735,12 @@ int x86_pmu_hw_config(struct perf_event *event)
 			if (event_needs_ymm(event) &&
 			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_YMM))
 				return -EINVAL;
+			if (event_needs_low16_zmm(event) &&
+			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_ZMM_Hi256))
+				return -EINVAL;
+			if (event_needs_high16_zmm(event) &&
+			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_Hi16_ZMM))
+				return -EINVAL;
 		}
 	}
 
@@ -1780,6 +1786,8 @@ void x86_pmu_clear_perf_regs(struct pt_regs *regs)
 	perf_regs->abi = PERF_SAMPLE_REGS_ABI_NONE;
 	perf_regs->xmm_regs = NULL;
 	perf_regs->ymmh_regs = NULL;
+	perf_regs->zmmh_regs = NULL;
+	perf_regs->h16zmm_regs = NULL;
 }
 
 static void update_perf_regs(struct x86_perf_regs *perf_regs,
@@ -1797,6 +1805,10 @@ static void update_perf_regs(struct x86_perf_regs *perf_regs,
 		perf_regs->xmm_space = xsave->i387.xmm_space;
 	if (mask & XFEATURE_MASK_YMM)
 		perf_regs->ymmh = get_xsave_addr(xsave, XFEATURE_YMM);
+	if (mask & XFEATURE_MASK_ZMM_Hi256)
+		perf_regs->zmmh = get_xsave_addr(xsave, XFEATURE_ZMM_Hi256);
+	if (mask & XFEATURE_MASK_Hi16_ZMM)
+		perf_regs->h16zmm = get_xsave_addr(xsave, XFEATURE_Hi16_ZMM);
 }
 
 /*
@@ -1975,6 +1987,10 @@ static void x86_pmu_sample_xregs(struct perf_event *event,
 		mask |= XFEATURE_MASK_SSE;
 	if (event_needs_ymm(event))
 		mask |= XFEATURE_MASK_YMM;
+	if (event_needs_low16_zmm(event))
+		mask |= XFEATURE_MASK_ZMM_Hi256;
+	if (event_needs_high16_zmm(event))
+		mask |= XFEATURE_MASK_Hi16_ZMM;
 
 	mask &= x86_pmu.ext_regs_mask;
 	if (sample_type & PERF_SAMPLE_REGS_USER) {
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 5111eaf8b12a..53c5802317bb 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -168,6 +168,25 @@ static inline bool event_needs_ymm(struct perf_event *event)
 	return false;
 }
 
+static inline bool event_needs_low16_zmm(struct perf_event *event)
+{
+	if (event->attr.sample_simd_regs_enabled &&
+	    event->attr.sample_simd_vec_reg_qwords >= PERF_X86_ZMM_QWORDS)
+		return true;
+
+	return false;
+}
+
+static inline bool event_needs_high16_zmm(struct perf_event *event)
+{
+	if (event->attr.sample_simd_regs_enabled &&
+	    (fls64(event->attr.sample_simd_vec_reg_intr) > PERF_X86_H16ZMM_BASE ||
+	     fls64(event->attr.sample_simd_vec_reg_user) > PERF_X86_H16ZMM_BASE))
+		return true;
+
+	return false;
+}
+
 struct amd_nb {
 	int nb_id;  /* NorthBridge id */
 	int refcnt; /* reference count */
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 1d03b86be65d..273840bd7b33 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -735,6 +735,14 @@ struct x86_perf_regs {
 		u64	*ymmh_regs;
 		struct ymmh_struct *ymmh;
 	};
+	union {
+		u64	*zmmh_regs;
+		struct avx_512_zmm_uppers_state *zmmh;
+	};
+	union {
+		u64	*h16zmm_regs;
+		struct avx_512_hi16_state *h16zmm;
+	};
 };
 
 extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index 8f513229fbb8..3aacdd4e2764 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -58,16 +58,20 @@ enum perf_event_x86_regs {
 enum {
 	PERF_X86_SIMD_XMM_REGS      = 16,
 	PERF_X86_SIMD_YMM_REGS      = 16,
-	PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_YMM_REGS,
+	PERF_X86_SIMD_ZMM_REGS      = 32,
+	PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_ZMM_REGS,
 };
 
 #define PERF_X86_SIMD_VEC_MASK	__GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
 
+#define PERF_X86_H16ZMM_BASE		16
+
 enum {
 	/* 1 qword = 8 bytes */
 	PERF_X86_XMM_QWORDS      = 2,
 	PERF_X86_YMM_QWORDS      = 4,
-	PERF_X86_SIMD_QWORDS_MAX = PERF_X86_YMM_QWORDS,
+	PERF_X86_ZMM_QWORDS      = 8,
+	PERF_X86_SIMD_QWORDS_MAX = PERF_X86_ZMM_QWORDS,
 };
 
 #endif /* _ASM_X86_PERF_REGS_H */
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 9792483360c7..3c28f28de1e6 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -78,6 +78,7 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 }
 
 #define PERF_X86_YMMH_QWORDS	(PERF_X86_YMM_QWORDS / 2)
+#define PERF_X86_ZMMH_QWORDS	(PERF_X86_ZMM_QWORDS / 2)
 
 u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
 			u16 qwords_idx, bool pred)
@@ -95,6 +96,13 @@ u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
 			 qwords_idx >= PERF_X86_SIMD_QWORDS_MAX))
 		return 0;
 
+	if (idx >= PERF_X86_H16ZMM_BASE) {
+		if (!perf_regs->h16zmm_regs)
+			return 0;
+		return perf_regs->h16zmm_regs[(idx - PERF_X86_H16ZMM_BASE) *
+					PERF_X86_ZMM_QWORDS + qwords_idx];
+	}
+
 	if (qwords_idx < PERF_X86_XMM_QWORDS) {
 		if (!perf_regs->xmm_regs)
 			return 0;
@@ -105,6 +113,11 @@ u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
 			return 0;
 		return perf_regs->ymmh_regs[idx * PERF_X86_YMMH_QWORDS +
 					    qwords_idx - PERF_X86_XMM_QWORDS];
+	} else if (qwords_idx < PERF_X86_ZMM_QWORDS) {
+		if (!perf_regs->zmmh_regs)
+			return 0;
+		return perf_regs->zmmh_regs[idx * PERF_X86_ZMMH_QWORDS +
+					    qwords_idx - PERF_X86_YMM_QWORDS];
 	}
 
 	return 0;
@@ -129,7 +142,8 @@ int perf_simd_reg_validate(u16 simd_enabled, u16 vec_qwords,
 			return -EINVAL;
 	} else {
 		if (vec_qwords != PERF_X86_XMM_QWORDS &&
-		    vec_qwords != PERF_X86_YMM_QWORDS)
+		    vec_qwords != PERF_X86_YMM_QWORDS &&
+		    vec_qwords != PERF_X86_ZMM_QWORDS)
 			return -EINVAL;
 		if ((!vec_mask_intr && !vec_mask_user) ||
 		    (vec_mask_intr & ~PERF_X86_SIMD_VEC_MASK) ||
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [Patch v8 16/23] perf/x86: Support OPMASK sampling using sample_simd_pred_reg_* fields
  2026-05-29  7:56 [Patch v8 00/23] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (14 preceding siblings ...)
  2026-05-29  7:56 ` [Patch v8 15/23] perf/x86: Support ZMM " Dapeng Mi
@ 2026-05-29  7:56 ` Dapeng Mi
  2026-05-29  9:21   ` sashiko-bot
  2026-05-29  7:56 ` [Patch v8 17/23] perf: Enhance perf_reg_validate() with simd_enabled argument Dapeng Mi
                   ` (7 subsequent siblings)
  23 siblings, 1 reply; 44+ messages in thread
From: Dapeng Mi @ 2026-05-29  7:56 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Dapeng Mi, Kan Liang

This patch adds support for sampling OPMASK registers via the
sample_simd_pred_reg_* fields.

Each OPMASK register consists of 1 u64 word. Current x86 hardware
supports 8 OPMASK registers. The perf_simd_reg_value() function is
responsible for outputting the OPMASK values to userspace.

Additionally, sample_simd_pred_reg_qwords should be set to 1 to indicate
OPMASK sampling.
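
The predicate-register constraints can be summarized with a small
user-space model. This is a sketch only: pred_validate() and the constant
names are hypothetical, and the logic is a simplified reading of the
predicate part of perf_simd_reg_validate() in this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define OPMASK_REGS	8	/* k0-k7 */
#define OPMASK_QWORDS	1	/* each OPMASK register is one u64 */
#define PRED_MASK	((1u << OPMASK_REGS) - 1)

static_assert(OPMASK_REGS <= 32, "predicate bitmap must fit in 32 bits");

/* Hypothetical helper: 0 if the predicate request is acceptable. */
static int pred_validate(uint16_t pred_qwords, uint32_t mask_intr,
			 uint32_t mask_user)
{
	/* No predicate sampling requested: both bitmaps must be empty. */
	if (!pred_qwords)
		return (mask_intr || mask_user) ? -EINVAL : 0;

	/* Only the OPMASK width is accepted. */
	if (pred_qwords != OPMASK_QWORDS)
		return -EINVAL;

	/* At least one register must be selected, and only k0-k7. */
	if (!mask_intr && !mask_user)
		return -EINVAL;
	if ((mask_intr | mask_user) & ~PRED_MASK)
		return -EINVAL;

	return 0;
}
```

In this model a request for all eight registers in interrupt context
would be pred_validate(1, 0xff, 0).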

Note that OPMASK sampling is not enabled yet; it will be enabled in a
later patch when PERF_PMU_CAP_SIMD_REGS is set.

Co-developed-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/core.c                |  8 ++++++++
 arch/x86/events/perf_event.h          | 10 ++++++++++
 arch/x86/include/asm/perf_event.h     |  4 ++++
 arch/x86/include/uapi/asm/perf_regs.h |  5 +++++
 arch/x86/kernel/perf_regs.c           | 23 +++++++++++++++++++----
 5 files changed, 46 insertions(+), 4 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 3051a53232c8..d4516d3b5d5a 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -741,6 +741,9 @@ int x86_pmu_hw_config(struct perf_event *event)
 			if (event_needs_high16_zmm(event) &&
 			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_Hi16_ZMM))
 				return -EINVAL;
+			if (event_needs_opmask(event) &&
+			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_OPMASK))
+				return -EINVAL;
 		}
 	}
 
@@ -1788,6 +1791,7 @@ void x86_pmu_clear_perf_regs(struct pt_regs *regs)
 	perf_regs->ymmh_regs = NULL;
 	perf_regs->zmmh_regs = NULL;
 	perf_regs->h16zmm_regs = NULL;
+	perf_regs->opmask_regs = NULL;
 }
 
 static void update_perf_regs(struct x86_perf_regs *perf_regs,
@@ -1809,6 +1813,8 @@ static void update_perf_regs(struct x86_perf_regs *perf_regs,
 		perf_regs->zmmh = get_xsave_addr(xsave, XFEATURE_ZMM_Hi256);
 	if (mask & XFEATURE_MASK_Hi16_ZMM)
 		perf_regs->h16zmm = get_xsave_addr(xsave, XFEATURE_Hi16_ZMM);
+	if (mask & XFEATURE_MASK_OPMASK)
+		perf_regs->opmask = get_xsave_addr(xsave, XFEATURE_OPMASK);
 }
 
 /*
@@ -1991,6 +1997,8 @@ static void x86_pmu_sample_xregs(struct perf_event *event,
 		mask |= XFEATURE_MASK_ZMM_Hi256;
 	if (event_needs_high16_zmm(event))
 		mask |= XFEATURE_MASK_Hi16_ZMM;
+	if (event_needs_opmask(event))
+		mask |= XFEATURE_MASK_OPMASK;
 
 	mask &= x86_pmu.ext_regs_mask;
 	if (sample_type & PERF_SAMPLE_REGS_USER) {
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 53c5802317bb..22b846999cfa 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -187,6 +187,16 @@ static inline bool event_needs_high16_zmm(struct perf_event *event)
 	return false;
 }
 
+static inline bool event_needs_opmask(struct perf_event *event)
+{
+	if (event->attr.sample_simd_regs_enabled &&
+	    (event->attr.sample_simd_pred_reg_intr ||
+	     event->attr.sample_simd_pred_reg_user))
+		return true;
+
+	return false;
+}
+
 struct amd_nb {
 	int nb_id;  /* NorthBridge id */
 	int refcnt; /* reference count */
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 273840bd7b33..7e8b60bddd5a 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -743,6 +743,10 @@ struct x86_perf_regs {
 		u64	*h16zmm_regs;
 		struct avx_512_hi16_state *h16zmm;
 	};
+	union {
+		u64	*opmask_regs;
+		struct avx_512_opmask_state *opmask;
+	};
 };
 
 extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index 3aacdd4e2764..24c981ba8bae 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -60,14 +60,19 @@ enum {
 	PERF_X86_SIMD_YMM_REGS      = 16,
 	PERF_X86_SIMD_ZMM_REGS      = 32,
 	PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_ZMM_REGS,
+
+	PERF_X86_SIMD_OPMASK_REGS   = 8,
+	PERF_X86_SIMD_PRED_REGS_MAX = PERF_X86_SIMD_OPMASK_REGS,
 };
 
+#define PERF_X86_SIMD_PRED_MASK	__GENMASK(PERF_X86_SIMD_PRED_REGS_MAX - 1, 0)
 #define PERF_X86_SIMD_VEC_MASK	__GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
 
 #define PERF_X86_H16ZMM_BASE		16
 
 enum {
 	/* 1 qword = 8 bytes */
+	PERF_X86_OPMASK_QWORDS   = 1,
 	PERF_X86_XMM_QWORDS      = 2,
 	PERF_X86_YMM_QWORDS      = 4,
 	PERF_X86_ZMM_QWORDS      = 8,
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 3c28f28de1e6..21b282be8ab9 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -89,8 +89,14 @@ u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
 	if (!(perf_regs->abi & PERF_SAMPLE_REGS_ABI_SIMD))
 		return 0;
 
-	if (pred)
-		return 0;
+	if (pred) {
+		if (WARN_ON_ONCE(idx >= PERF_X86_SIMD_PRED_REGS_MAX ||
+				 qwords_idx >= PERF_X86_OPMASK_QWORDS))
+			return 0;
+		if (!perf_regs->opmask_regs)
+			return 0;
+		return perf_regs->opmask_regs[idx];
+	}
 
 	if (WARN_ON_ONCE(idx >= PERF_X86_SIMD_VEC_REGS_MAX ||
 			 qwords_idx >= PERF_X86_SIMD_QWORDS_MAX))
@@ -151,8 +157,17 @@ int perf_simd_reg_validate(u16 simd_enabled, u16 vec_qwords,
 			return -EINVAL;
 	}
 
-	if (pred_qwords || pred_mask_intr || pred_mask_user)
-		return -EINVAL;
+	if (!pred_qwords) {
+		if (pred_mask_intr || pred_mask_user)
+			return -EINVAL;
+	} else {
+		if (pred_qwords != PERF_X86_OPMASK_QWORDS)
+			return -EINVAL;
+		if ((!pred_mask_intr && !pred_mask_user) ||
+		    (pred_mask_intr & ~PERF_X86_SIMD_PRED_MASK) ||
+		    (pred_mask_user & ~PERF_X86_SIMD_PRED_MASK))
+			return -EINVAL;
+	}
 
 	size = ((vec_qwords * hweight64(vec_mask_intr)) +
 		(vec_qwords * hweight64(vec_mask_user)) +
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [Patch v8 16/23] perf/x86: Support OPMASK sampling using sample_simd_pred_reg_* fields
  2026-05-29  7:56 ` [Patch v8 16/23] perf/x86: Support OPMASK sampling using sample_simd_pred_reg_* fields Dapeng Mi
@ 2026-05-29  9:21   ` sashiko-bot
  0 siblings, 0 replies; 44+ messages in thread
From: sashiko-bot @ 2026-05-29  9:21 UTC (permalink / raw)
  To: Dapeng Mi; +Cc: linux-perf-users

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Low] The UAPI header `arch/x86/include/uapi/asm/perf_regs.h` uses the `__GENMASK` macro without including `<linux/bits.h>`.
--

commit 105b880ba965bb46db4e0202af5a989492cd6220
Author: Dapeng Mi <dapeng1.mi@linux.intel.com>

perf/x86: Support OPMASK sampling using sample_simd_pred_reg_* fields

This commit adds support for sampling OPMASK registers via the
sample_simd_pred_reg_* fields. It updates x86_pmu_hw_config to check for
OPMASK capabilities and modifies perf_simd_reg_value and
perf_simd_reg_validate to handle the new registers.

> diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
> index 3aacdd4e2764c..24c981ba8bae2 100644
> --- a/arch/x86/include/uapi/asm/perf_regs.h
> +++ b/arch/x86/include/uapi/asm/perf_regs.h
[ ... ]
> @@ -60,14 +60,19 @@ enum {
>  	PERF_X86_SIMD_YMM_REGS      = 16,
>  	PERF_X86_SIMD_ZMM_REGS      = 32,
>  	PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_ZMM_REGS,
> +
> +	PERF_X86_SIMD_OPMASK_REGS   = 8,
> +	PERF_X86_SIMD_PRED_REGS_MAX = PERF_X86_SIMD_OPMASK_REGS,
>  };
>  
> +#define PERF_X86_SIMD_PRED_MASK	__GENMASK(PERF_X86_SIMD_PRED_REGS_MAX - 1, 0)

[Severity: Low]
Does this uapi header need to include <linux/bits.h> for the __GENMASK
macro?

Without the include, userspace applications directly including
<asm/perf_regs.h> might fail to compile because the macro is undefined.

>  #define PERF_X86_SIMD_VEC_MASK	__GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260529075645.580362-1-dapeng1.mi@linux.intel.com?part=16

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [Patch v8 17/23] perf: Enhance perf_reg_validate() with simd_enabled argument
  2026-05-29  7:56 [Patch v8 00/23] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (15 preceding siblings ...)
  2026-05-29  7:56 ` [Patch v8 16/23] perf/x86: Support OPMASK sampling using sample_simd_pred_reg_* fields Dapeng Mi
@ 2026-05-29  7:56 ` Dapeng Mi
  2026-05-29  7:56 ` [Patch v8 18/23] perf/x86: Support eGPRs sampling using sample_regs_* fields Dapeng Mi
                   ` (6 subsequent siblings)
  23 siblings, 0 replies; 44+ messages in thread
From: Dapeng Mi @ 2026-05-29  7:56 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Dapeng Mi

The upcoming patch will support x86 APX eGPRs sampling by using the
reclaimed XMM register space to represent eGPRs in sample_regs_* fields.

To differentiate between XMM and eGPRs in sample_regs_* fields, an
additional argument, simd_enabled, is introduced to the
perf_reg_validate() helper. If simd_enabled is set to 1, it indicates
that eGPRs are represented in sample_regs_* fields for the x86 platform;
otherwise, XMM registers are represented.
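
To illustrate why the extra argument is needed, the following user-space
sketch classifies a single sample_regs_* bit under both interpretations.
It is not kernel code: reg_bit_class() and the enum are hypothetical, and
the bit positions assume the perf_event_x86_regs layout as extended by
patch 18/23 (R16-R31 at bits 24-39, XMM0 starting at bit 32).

```c
#include <assert.h>
#include <stdbool.h>

enum reg_class {
	CLASS_GPR,	/* AX..R15, bits 0-23 */
	CLASS_EGPR,	/* R16..R31, bits 24-39, SIMD ABI only */
	CLASS_XMM,	/* XMM0..XMM15, bits 32-63, legacy ABI only */
	CLASS_RESERVED,
};

/* Hypothetical helper: what does bit 'bit' of sample_regs_* select? */
static enum reg_class reg_bit_class(unsigned int bit, bool simd_enabled)
{
	assert(bit < 64);

	if (bit <= 23)
		return CLASS_GPR;
	if (simd_enabled)
		return bit <= 39 ? CLASS_EGPR : CLASS_RESERVED;
	return bit >= 32 ? CLASS_XMM : CLASS_RESERVED;
}
```

Bits 32-39 are the ambiguous ones: they select XMM qwords in the legacy
layout and R24-R31 when simd_enabled is set.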

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/arm/kernel/perf_regs.c       | 2 +-
 arch/arm64/kernel/perf_regs.c     | 2 +-
 arch/csky/kernel/perf_regs.c      | 2 +-
 arch/loongarch/kernel/perf_regs.c | 2 +-
 arch/mips/kernel/perf_regs.c      | 2 +-
 arch/parisc/kernel/perf_regs.c    | 2 +-
 arch/powerpc/perf/perf_regs.c     | 2 +-
 arch/riscv/kernel/perf_regs.c     | 2 +-
 arch/s390/kernel/perf_regs.c      | 2 +-
 arch/x86/kernel/perf_regs.c       | 4 ++--
 include/linux/perf_regs.h         | 2 +-
 kernel/events/core.c              | 8 +++++---
 12 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/arch/arm/kernel/perf_regs.c b/arch/arm/kernel/perf_regs.c
index d575a4c3ca56..838d701adf4d 100644
--- a/arch/arm/kernel/perf_regs.c
+++ b/arch/arm/kernel/perf_regs.c
@@ -18,7 +18,7 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 
 #define REG_RESERVED (~((1ULL << PERF_REG_ARM_MAX) - 1))
 
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
 {
 	if (!mask || mask & REG_RESERVED)
 		return -EINVAL;
diff --git a/arch/arm64/kernel/perf_regs.c b/arch/arm64/kernel/perf_regs.c
index 70e2f13f587f..71a3e0238de4 100644
--- a/arch/arm64/kernel/perf_regs.c
+++ b/arch/arm64/kernel/perf_regs.c
@@ -77,7 +77,7 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 
 #define REG_RESERVED (~((1ULL << PERF_REG_ARM64_MAX) - 1))
 
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
 {
 	u64 reserved_mask = REG_RESERVED;
 
diff --git a/arch/csky/kernel/perf_regs.c b/arch/csky/kernel/perf_regs.c
index 94601f37b596..c932a96afc56 100644
--- a/arch/csky/kernel/perf_regs.c
+++ b/arch/csky/kernel/perf_regs.c
@@ -18,7 +18,7 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 
 #define REG_RESERVED (~((1ULL << PERF_REG_CSKY_MAX) - 1))
 
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
 {
 	if (!mask || mask & REG_RESERVED)
 		return -EINVAL;
diff --git a/arch/loongarch/kernel/perf_regs.c b/arch/loongarch/kernel/perf_regs.c
index 8dd604f01745..164514f40ae0 100644
--- a/arch/loongarch/kernel/perf_regs.c
+++ b/arch/loongarch/kernel/perf_regs.c
@@ -25,7 +25,7 @@ u64 perf_reg_abi(struct task_struct *tsk)
 }
 #endif /* CONFIG_32BIT */
 
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
 {
 	if (!mask)
 		return -EINVAL;
diff --git a/arch/mips/kernel/perf_regs.c b/arch/mips/kernel/perf_regs.c
index 7736d3c5ebd2..00a5201dbd5d 100644
--- a/arch/mips/kernel/perf_regs.c
+++ b/arch/mips/kernel/perf_regs.c
@@ -28,7 +28,7 @@ u64 perf_reg_abi(struct task_struct *tsk)
 }
 #endif /* CONFIG_32BIT */
 
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
 {
 	if (!mask)
 		return -EINVAL;
diff --git a/arch/parisc/kernel/perf_regs.c b/arch/parisc/kernel/perf_regs.c
index b9fe1f2fcb9b..4f21aab5405c 100644
--- a/arch/parisc/kernel/perf_regs.c
+++ b/arch/parisc/kernel/perf_regs.c
@@ -34,7 +34,7 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 
 #define REG_RESERVED (~((1ULL << PERF_REG_PARISC_MAX) - 1))
 
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
 {
 	if (!mask || mask & REG_RESERVED)
 		return -EINVAL;
diff --git a/arch/powerpc/perf/perf_regs.c b/arch/powerpc/perf/perf_regs.c
index 350dccb0143c..a01d8a903640 100644
--- a/arch/powerpc/perf/perf_regs.c
+++ b/arch/powerpc/perf/perf_regs.c
@@ -125,7 +125,7 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 	return regs_get_register(regs, pt_regs_offset[idx]);
 }
 
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
 {
 	if (!mask || mask & REG_RESERVED)
 		return -EINVAL;
diff --git a/arch/riscv/kernel/perf_regs.c b/arch/riscv/kernel/perf_regs.c
index 3bba8deababb..1ecc8760b88b 100644
--- a/arch/riscv/kernel/perf_regs.c
+++ b/arch/riscv/kernel/perf_regs.c
@@ -18,7 +18,7 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 
 #define REG_RESERVED (~((1ULL << PERF_REG_RISCV_MAX) - 1))
 
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
 {
 	if (!mask || mask & REG_RESERVED)
 		return -EINVAL;
diff --git a/arch/s390/kernel/perf_regs.c b/arch/s390/kernel/perf_regs.c
index 7b305f1456f8..6496fd23c540 100644
--- a/arch/s390/kernel/perf_regs.c
+++ b/arch/s390/kernel/perf_regs.c
@@ -34,7 +34,7 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 
 #define REG_RESERVED (~((1UL << PERF_REG_S390_MAX) - 1))
 
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
 {
 	if (!mask || mask & REG_RESERVED)
 		return -EINVAL;
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 21b282be8ab9..79803b3b6d6b 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -192,7 +192,7 @@ int perf_simd_reg_validate(u16 simd_enabled, u16 vec_qwords,
 		       (1ULL << PERF_REG_X86_R14) | \
 		       (1ULL << PERF_REG_X86_R15))
 
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
 {
 	/* The mask could be 0 if only the SIMD registers are interested */
 	if (mask & (REG_NOSUPPORT | PERF_REG_X86_RESERVED))
@@ -212,7 +212,7 @@ u64 perf_reg_abi(struct task_struct *task)
 		       (1ULL << PERF_REG_X86_FS) | \
 		       (1ULL << PERF_REG_X86_GS))
 
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
 {
 	/* The mask could be 0 if only the SIMD registers are interested */
 	if (mask & (REG_NOSUPPORT | PERF_REG_X86_RESERVED))
diff --git a/include/linux/perf_regs.h b/include/linux/perf_regs.h
index 8fa3eeb14953..496080e12141 100644
--- a/include/linux/perf_regs.h
+++ b/include/linux/perf_regs.h
@@ -10,7 +10,7 @@ struct perf_regs {
 };
 
 u64 perf_reg_value(struct pt_regs *regs, int idx);
-int perf_reg_validate(u64 mask);
+int perf_reg_validate(u64 mask, bool simd_enabled);
 u64 perf_reg_abi(struct task_struct *task);
 void perf_get_regs_user(struct perf_regs *regs_user,
 			struct pt_regs *regs);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index afd5b1408231..bf88d9a61f0b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7774,7 +7774,7 @@ u64 __weak perf_reg_value(struct pt_regs *regs, int idx)
 	return 0;
 }
 
-int __weak perf_reg_validate(u64 mask)
+int __weak perf_reg_validate(u64 mask, bool simd_enabled)
 {
 	return mask ? -ENOSYS : 0;
 }
@@ -13743,7 +13743,8 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
 	}
 
 	if (attr->sample_type & PERF_SAMPLE_REGS_USER) {
-		ret = perf_reg_validate(attr->sample_regs_user);
+		ret = perf_reg_validate(attr->sample_regs_user,
+					attr->sample_simd_regs_enabled);
 		if (ret)
 			return ret;
 		ret = perf_simd_reg_validate(attr->sample_simd_regs_enabled,
@@ -13776,7 +13777,8 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
 		attr->sample_max_stack = sysctl_perf_event_max_stack;
 
 	if (attr->sample_type & PERF_SAMPLE_REGS_INTR) {
-		ret = perf_reg_validate(attr->sample_regs_intr);
+		ret = perf_reg_validate(attr->sample_regs_intr,
+					attr->sample_simd_regs_enabled);
 		if (ret)
 			return ret;
 		ret = perf_simd_reg_validate(attr->sample_simd_regs_enabled,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [Patch v8 18/23] perf/x86: Support eGPRs sampling using sample_regs_* fields
  2026-05-29  7:56 [Patch v8 00/23] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (16 preceding siblings ...)
  2026-05-29  7:56 ` [Patch v8 17/23] perf: Enhance perf_reg_validate() with simd_enabled argument Dapeng Mi
@ 2026-05-29  7:56 ` Dapeng Mi
  2026-05-29  9:31   ` sashiko-bot
  2026-05-29  7:56 ` [Patch v8 19/23] perf/x86: Support SSP " Dapeng Mi
                   ` (5 subsequent siblings)
  23 siblings, 1 reply; 44+ messages in thread
From: Dapeng Mi @ 2026-05-29  7:56 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Dapeng Mi, Kan Liang

This patch adds support for sampling APX eGPRs (R16 ~ R31) via the
sample_regs_* fields.

To sample eGPRs, the sample_simd_regs_enabled field must be set. This
allows the spare space (reclaimed from the original XMM space) in the
sample_regs_* fields to be used for representing eGPRs.

The perf_reg_value() function first checks whether the
PERF_SAMPLE_REGS_ABI_SIMD flag is set, and then determines whether
to output eGPRs or legacy XMM registers to userspace.

The perf_reg_validate() function first checks the simd_enabled argument
to determine if the eGPRs bitmap is represented in sample_regs_* fields.
It then validates the eGPRs bitmap accordingly.
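
As a rough user-space model of the x86_64 validation rule described
above, consider the sketch below. The names (reg_validate() and the
REG_* constants) are hypothetical; the bit numbers assume the existing
perf_event_x86_regs numbering (DS..GS at bits 12-15, R15 at bit 23, XMM0
at bit 32) plus the R16-R31 entries added by this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define BIT_ULL(n)	(1ULL << (n))
#define GENMASK_ULL(h, l) (((~0ULL) << (l)) & (~0ULL >> (63 - (h))))

enum {
	REG_DS = 12, REG_ES, REG_FS, REG_GS,
	REG_R15 = 23, REG_R16 = 24, REG_R31 = 39,
	REG_XMM0 = 32, REG_MISC_MAX = REG_R31 + 1,
};

static_assert(REG_R31 - REG_R16 + 1 == 16, "APX adds 16 eGPRs");

/* Segment registers are not sampled on x86_64. */
#define REG_NOSUPPORT	(BIT_ULL(REG_DS) | BIT_ULL(REG_ES) | \
			 BIT_ULL(REG_FS) | BIT_ULL(REG_GS))
/* Legacy layout: the gap between R15 and XMM0 is reserved. */
#define REG_RESERVED	(GENMASK_ULL(REG_XMM0 - 1, 0) & \
			 ~GENMASK_ULL(REG_R15, 0))
/* SIMD layout: everything above R31 is reserved. */
#define REG_EXT_RESERVED (~GENMASK_ULL(REG_MISC_MAX - 1, 0))

static int reg_validate(uint64_t mask, bool simd_enabled)
{
	if (!simd_enabled &&
	    (!mask || (mask & (REG_NOSUPPORT | REG_RESERVED))))
		return -EINVAL;

	/* With SIMD enabled, an empty GPR mask is allowed. */
	if (simd_enabled && (mask & (REG_NOSUPPORT | REG_EXT_RESERVED)))
		return -EINVAL;

	return 0;
}
```

Under these assumptions a mask selecting only R16 (bit 24) is accepted
with simd_enabled set and rejected without it, while a mask selecting
XMM0 (bit 32) is accepted only in the legacy layout as an XMM request.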

Currently, eGPRs sampling is only supported on the x86_64 architecture, as
APX is only available on x86_64 platforms.

Note that eGPRs sampling is not enabled yet; it will be enabled in a
later patch when PERF_PMU_CAP_SIMD_REGS is set.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Co-developed-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/core.c                | 43 ++++++++++++++++++---------
 arch/x86/events/intel/core.c          |  4 ++-
 arch/x86/events/perf_event.h          | 10 +++++++
 arch/x86/include/asm/perf_event.h     |  4 +++
 arch/x86/include/uapi/asm/perf_regs.h | 26 ++++++++++++++++
 arch/x86/kernel/perf_regs.c           | 43 ++++++++++++++++-----------
 6 files changed, 98 insertions(+), 32 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index d4516d3b5d5a..af874ff3d048 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -708,26 +708,24 @@ int x86_pmu_hw_config(struct perf_event *event)
 	}
 
 	if (event->attr.sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)) {
-		/*
-		 * Besides the general purpose registers, XMM registers may
-		 * be collected as well.
-		 */
-		if (event_has_extended_regs(event)) {
-			if (!(event->pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS))
-				return -EINVAL;
-			if (is_sampling_event(event) && !event->attr.precise_ip &&
-			    !this_cpu_has(X86_FEATURE_XSAVES))
-				return -EINVAL;
-			if (event->attr.sample_simd_regs_enabled)
-				return -EINVAL;
-		}
-
 		if (event_has_simd_regs(event)) {
+			u64 reserved = ~GENMASK_ULL(PERF_REG_MISC_MAX - 1, 0);
+
 			if (!(event->pmu->capabilities & PERF_PMU_CAP_SIMD_REGS))
 				return -EINVAL;
 			if (is_sampling_event(event) && !event->attr.precise_ip &&
 			    !this_cpu_has(X86_FEATURE_XSAVES))
 				return -EINVAL;
+			/*
+			 * The XMM space in the perf_event_x86_regs is reclaimed
+			 * for eGPRs and other general registers.
+			 */
+			if ((event->attr.sample_regs_user & reserved) ||
+			    (event->attr.sample_regs_intr & reserved))
+				return -EINVAL;
+			if (event_needs_egprs(event) &&
+			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_APX))
+				return -EINVAL;
 			/* The vector registers set is not supported */
 			if (event_needs_xmm(event) &&
 			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_SSE))
@@ -744,6 +742,18 @@ int x86_pmu_hw_config(struct perf_event *event)
 			if (event_needs_opmask(event) &&
 			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_OPMASK))
 				return -EINVAL;
+		} else {
+			/*
+			 * Besides the general purpose registers, XMM registers may
+			 * be collected as well.
+			 */
+			if (event_has_extended_regs(event)) {
+				if (!(event->pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS))
+					return -EINVAL;
+				if (is_sampling_event(event) && !event->attr.precise_ip &&
+				    !this_cpu_has(X86_FEATURE_XSAVES))
+					return -EINVAL;
+			}
 		}
 	}
 
@@ -1792,6 +1802,7 @@ void x86_pmu_clear_perf_regs(struct pt_regs *regs)
 	perf_regs->zmmh_regs = NULL;
 	perf_regs->h16zmm_regs = NULL;
 	perf_regs->opmask_regs = NULL;
+	perf_regs->egpr_regs = NULL;
 }
 
 static void update_perf_regs(struct x86_perf_regs *perf_regs,
@@ -1815,6 +1826,8 @@ static void update_perf_regs(struct x86_perf_regs *perf_regs,
 		perf_regs->h16zmm = get_xsave_addr(xsave, XFEATURE_Hi16_ZMM);
 	if (mask & XFEATURE_MASK_OPMASK)
 		perf_regs->opmask = get_xsave_addr(xsave, XFEATURE_OPMASK);
+	if (mask & XFEATURE_MASK_APX)
+		perf_regs->egpr = get_xsave_addr(xsave, XFEATURE_APX);
 }
 
 /*
@@ -1999,6 +2012,8 @@ static void x86_pmu_sample_xregs(struct perf_event *event,
 		mask |= XFEATURE_MASK_Hi16_ZMM;
 	if (event_needs_opmask(event))
 		mask |= XFEATURE_MASK_OPMASK;
+	if (event_needs_egprs(event))
+		mask |= XFEATURE_MASK_APX;
 
 	mask &= x86_pmu.ext_regs_mask;
 	if (sample_type & PERF_SAMPLE_REGS_USER) {
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 6c06558c416f..a2473f962681 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4698,7 +4698,9 @@ static void intel_pebs_aliases_skl(struct perf_event *event)
 static unsigned long intel_pmu_large_pebs_flags(struct perf_event *event)
 {
 	unsigned long flags = x86_pmu.large_pebs_flags;
-	u64 gprs_mask = PEBS_GP_REGS | PERF_REG_EXTENDED_MASK;
+	u64 gprs_mask = event->attr.sample_simd_regs_enabled ?
+			PEBS_GP_REGS :
+			PEBS_GP_REGS | PERF_REG_EXTENDED_MASK;
 
 	if (event->attr.use_clockid)
 		flags &= ~PERF_SAMPLE_TIME;
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 22b846999cfa..4cc490aa04fc 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -197,6 +197,16 @@ static inline bool event_needs_opmask(struct perf_event *event)
 	return false;
 }
 
+static inline bool event_needs_egprs(struct perf_event *event)
+{
+	if (event->attr.sample_simd_regs_enabled &&
+	    (event->attr.sample_regs_user & PERF_X86_EGPRS_MASK ||
+	     event->attr.sample_regs_intr & PERF_X86_EGPRS_MASK))
+		return true;
+
+	return false;
+}
+
 struct amd_nb {
 	int nb_id;  /* NorthBridge id */
 	int refcnt; /* reference count */
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 7e8b60bddd5a..a54ea8fa6a04 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -747,6 +747,10 @@ struct x86_perf_regs {
 		u64	*opmask_regs;
 		struct avx_512_opmask_state *opmask;
 	};
+	union {
+		u64	*egpr_regs;
+		struct apx_state *egpr;
+	};
 };
 
 extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index 24c981ba8bae..8774a1290fbe 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -27,9 +27,34 @@ enum perf_event_x86_regs {
 	PERF_REG_X86_R13,
 	PERF_REG_X86_R14,
 	PERF_REG_X86_R15,
+	/*
+	 * The eGPRs and XMM have overlaps. Only one can be used
+	 * at a time. The ABI PERF_SAMPLE_REGS_ABI_SIMD is used to
+	 * distinguish which one is used. If PERF_SAMPLE_REGS_ABI_SIMD
+	 * is set, then eGPRs is used, otherwise, XMM is used.
+	 *
+	 * Extended GPRs (eGPRs)
+	 */
+	PERF_REG_X86_R16,
+	PERF_REG_X86_R17,
+	PERF_REG_X86_R18,
+	PERF_REG_X86_R19,
+	PERF_REG_X86_R20,
+	PERF_REG_X86_R21,
+	PERF_REG_X86_R22,
+	PERF_REG_X86_R23,
+	PERF_REG_X86_R24,
+	PERF_REG_X86_R25,
+	PERF_REG_X86_R26,
+	PERF_REG_X86_R27,
+	PERF_REG_X86_R28,
+	PERF_REG_X86_R29,
+	PERF_REG_X86_R30,
+	PERF_REG_X86_R31,
 	/* These are the limits for the GPRs. */
 	PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
 	PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1,
+	PERF_REG_MISC_MAX = PERF_REG_X86_R31 + 1,
 
 	/* These all need two bits set because they are 128bit */
 	PERF_REG_X86_XMM0  = 32,
@@ -54,6 +79,7 @@ enum perf_event_x86_regs {
 };
 
 #define PERF_REG_EXTENDED_MASK	(~((1ULL << PERF_REG_X86_XMM0) - 1))
+#define PERF_X86_EGPRS_MASK	__GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16)
 
 enum {
 	PERF_X86_SIMD_XMM_REGS      = 16,
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 79803b3b6d6b..006883ad443d 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -61,14 +61,24 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 {
 	struct x86_perf_regs *perf_regs;
 
-	if (idx >= PERF_REG_X86_XMM0 && idx < PERF_REG_X86_XMM_MAX) {
+	if (idx > PERF_REG_X86_R15) {
 		perf_regs = container_of(regs, struct x86_perf_regs, regs);
-		/* SIMD registers are moved to dedicated sample_simd_vec_reg */
-		if (perf_regs->abi & PERF_SAMPLE_REGS_ABI_SIMD)
+		if (perf_regs->abi == PERF_SAMPLE_REGS_ABI_NONE)
 			return 0;
-		if (!perf_regs->xmm_regs)
-			return 0;
-		return perf_regs->xmm_regs[idx - PERF_REG_X86_XMM0];
+
+		if (perf_regs->abi & PERF_SAMPLE_REGS_ABI_SIMD) {
+			if (idx <= PERF_REG_X86_R31) {
+				if (!perf_regs->egpr_regs)
+					return 0;
+				return perf_regs->egpr_regs[idx - PERF_REG_X86_R16];
+			}
+		} else {
+			if (idx >= PERF_REG_X86_XMM0 && idx < PERF_REG_X86_XMM_MAX) {
+				if (!perf_regs->xmm_regs)
+					return 0;
+				return perf_regs->xmm_regs[idx - PERF_REG_X86_XMM0];
+			}
+		}
 	}
 
 	if (WARN_ON_ONCE(idx >= ARRAY_SIZE(pt_regs_offset)))
@@ -179,18 +189,12 @@ int perf_simd_reg_validate(u16 simd_enabled, u16 vec_qwords,
 	return 0;
 }
 
-#define PERF_REG_X86_RESERVED	(((1ULL << PERF_REG_X86_XMM0) - 1) & \
-				 ~((1ULL << PERF_REG_X86_MAX) - 1))
+#define PERF_REG_X86_RESERVED	(GENMASK_ULL(PERF_REG_X86_XMM0 - 1, PERF_REG_X86_AX) & \
+				 ~GENMASK_ULL(PERF_REG_X86_R15, PERF_REG_X86_AX))
+#define PERF_REG_X86_EXT_RESERVED	(~GENMASK_ULL(PERF_REG_MISC_MAX - 1, PERF_REG_X86_AX))
 
 #ifdef CONFIG_X86_32
-#define REG_NOSUPPORT ((1ULL << PERF_REG_X86_R8) | \
-		       (1ULL << PERF_REG_X86_R9) | \
-		       (1ULL << PERF_REG_X86_R10) | \
-		       (1ULL << PERF_REG_X86_R11) | \
-		       (1ULL << PERF_REG_X86_R12) | \
-		       (1ULL << PERF_REG_X86_R13) | \
-		       (1ULL << PERF_REG_X86_R14) | \
-		       (1ULL << PERF_REG_X86_R15))
+#define REG_NOSUPPORT GENMASK_ULL(PERF_REG_X86_R15, PERF_REG_X86_R8)
 
 int perf_reg_validate(u64 mask, bool simd_enabled)
 {
@@ -214,8 +218,13 @@ u64 perf_reg_abi(struct task_struct *task)
 
 int perf_reg_validate(u64 mask, bool simd_enabled)
 {
+	if (!simd_enabled &&
+	    (!mask || (mask & (REG_NOSUPPORT | PERF_REG_X86_RESERVED))))
+		return -EINVAL;
+
 	/* The mask could be 0 if only the SIMD registers are interested */
-	if (mask & (REG_NOSUPPORT | PERF_REG_X86_RESERVED))
+	if (simd_enabled &&
+	    (mask & (REG_NOSUPPORT | PERF_REG_X86_EXT_RESERVED)))
 		return -EINVAL;
 
 	return 0;
-- 
2.34.1



* Re: [Patch v8 18/23] perf/x86: Support eGPRs sampling using sample_regs_* fields
  2026-05-29  7:56 ` [Patch v8 18/23] perf/x86: Support eGPRs sampling using sample_regs_* fields Dapeng Mi
@ 2026-05-29  9:31   ` sashiko-bot
  0 siblings, 0 replies; 44+ messages in thread
From: sashiko-bot @ 2026-05-29  9:31 UTC (permalink / raw)
  To: Dapeng Mi; +Cc: linux-perf-users

Thank you for your contribution! Sashiko AI review found 5 potential issue(s) to consider:

New issues:
- [Critical] The patch unconditionally rejects a mask of 0 when simd_enabled is false, breaking userspace ABI.
- [High] The patch evaluates sample_regs_user and sample_regs_intr without verifying their respective flag gates are set.
- [High] The 32-bit implementation of perf_reg_validate() is not updated to handle extended register masks, allowing a DoS via WARN_ON_ONCE.

Pre-existing issues:
- [Critical] perf_reg_value() blindly casts the raw pt_regs pointer to struct x86_perf_regs, leading to an out-of-bounds read for software events.
- [Critical] perf_reg_validate() fails to mask out bits >= 32 when simd_enabled is false, allowing out-of-bounds register indices.
--

commit bffe5fdaa4de39f9a6532b899d96502d48077d9a
Author: Dapeng Mi <dapeng1.mi@linux.intel.com>

perf/x86: Support eGPRs sampling using sample_regs_* fields

This patch adds support for sampling APX eGPRs (R16 ~ R31) by reclaiming
spare space in the sample_regs_* fields when sample_simd_regs_enabled is set.
It updates perf_reg_value() and perf_reg_validate() to handle the new eGPRs
bitmap alongside legacy XMM registers.

> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
[ ... ]
> @@ -708,26 +708,24 @@ int x86_pmu_hw_config(struct perf_event *event)
[ ... ]
>  		if (event_has_simd_regs(event)) {
> +			u64 reserved = ~GENMASK_ULL(PERF_REG_MISC_MAX - 1, 0);
> +
>  			if (!(event->pmu->capabilities & PERF_PMU_CAP_SIMD_REGS))
>  				return -EINVAL;
>  			if (is_sampling_event(event) && !event->attr.precise_ip &&
>  			    !this_cpu_has(X86_FEATURE_XSAVES))
>  				return -EINVAL;
> +			/*
> +			 * The XMM space in the perf_event_x86_regs is reclaimed
> +			 * for eGPRs and other general registers.
> +			 */
> +			if ((event->attr.sample_regs_user & reserved) ||
> +			    (event->attr.sample_regs_intr & reserved))
> +				return -EINVAL;

[Severity: High]
Is it safe to evaluate both sample_regs_user and sample_regs_intr here without
first verifying their respective flag gates are set?

If a user only sets one of the flags (e.g., PERF_SAMPLE_REGS_USER), does the
other field contain uninitialized garbage from userspace? If that garbage
happens to overlap with the reserved mask, it seems this could trigger
validation failures and cause valid syscalls to randomly fail.
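
If the concern holds, one possible shape for a gated check is sketched
below as a userspace model. struct attr_model and check_reserved_gated()
are hypothetical names used only for illustration; the sample_type bit
values are the ones from the perf_event UAPI header. This is a sketch
under those assumptions, not the code in the patch.

```c
#include <errno.h>
#include <stdint.h>

/* sample_type bits as defined in the perf_event UAPI header. */
#define PERF_SAMPLE_REGS_USER	(1ULL << 12)
#define PERF_SAMPLE_REGS_INTR	(1ULL << 18)

/* Hypothetical, trimmed-down stand-in for struct perf_event_attr. */
struct attr_model {
	uint64_t sample_type;
	uint64_t sample_regs_user;
	uint64_t sample_regs_intr;
};

/* Reject reserved bits only in a field whose sample_type flag is set. */
static int check_reserved_gated(const struct attr_model *attr,
				uint64_t reserved)
{
	if ((attr->sample_type & PERF_SAMPLE_REGS_USER) &&
	    (attr->sample_regs_user & reserved))
		return -EINVAL;

	if ((attr->sample_type & PERF_SAMPLE_REGS_INTR) &&
	    (attr->sample_regs_intr & reserved))
		return -EINVAL;

	return 0;
}
```

With this shape, a sample_regs_intr value is ignored unless
PERF_SAMPLE_REGS_INTR is requested, and likewise for the user field.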

[ ... ]
> diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
> --- a/arch/x86/events/perf_event.h
> +++ b/arch/x86/events/perf_event.h
> @@ -197,6 +197,16 @@ static inline bool event_needs_opmask(struct perf_event *event)
>  	return false;
>  }
>  
> +static inline bool event_needs_egprs(struct perf_event *event)
> +{
> +	if (event->attr.sample_simd_regs_enabled &&
> +	    (event->attr.sample_regs_user & PERF_X86_EGPRS_MASK ||
> +	     event->attr.sample_regs_intr & PERF_X86_EGPRS_MASK))
> +		return true;
> +
> +	return false;
> +}

[Severity: High]
Similarly to x86_pmu_hw_config(), does this evaluate sample_regs_user and
sample_regs_intr without verifying their respective flag gates are set?

Could uninitialized data from userspace falsely trigger this condition?

[ ... ]
> diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
> --- a/arch/x86/kernel/perf_regs.c
> +++ b/arch/x86/kernel/perf_regs.c
> @@ -61,14 +61,24 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
>  {
>  	struct x86_perf_regs *perf_regs;
>  
> -	if (idx >= PERF_REG_X86_XMM0 && idx < PERF_REG_X86_XMM_MAX) {
> +	if (idx > PERF_REG_X86_R15) {
>  		perf_regs = container_of(regs, struct x86_perf_regs, regs);
> -		/* SIMD registers are moved to dedicated sample_simd_vec_reg */
> -		if (perf_regs->abi & PERF_SAMPLE_REGS_ABI_SIMD)
> +		if (perf_regs->abi == PERF_SAMPLE_REGS_ABI_NONE)
>  			return 0;
> -		if (!perf_regs->xmm_regs)
> -			return 0;
> -		return perf_regs->xmm_regs[idx - PERF_REG_X86_XMM0];
> +
> +		if (perf_regs->abi & PERF_SAMPLE_REGS_ABI_SIMD) {

[Severity: Critical]
This is a pre-existing issue, but does casting the raw pt_regs pointer to
struct x86_perf_regs lead to an out-of-bounds read for software events?

For software events, the generic perf code passes a standard pt_regs pointer.
Because the x86_perf_regs wrapper is not allocated for software events, using
container_of() and reading perf_regs->abi appears to access memory past the
end of the pt_regs struct.
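
The pointer arithmetic behind this concern can be illustrated with a
small userspace model. The struct names and the container_of_model()
macro below are hypothetical stand-ins, not the kernel definitions; the
only point is that a field placed after the embedded regs member lives
at or beyond sizeof(regs), so it is only valid memory when the caller
really passed a pointer into the wrapper.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct pt_regs. */
struct regs_model {
	uint64_t gpr[21];
};

/* Hypothetical stand-in for struct x86_perf_regs. */
struct perf_regs_model {
	struct regs_model regs;	/* embedded, first member */
	uint64_t abi;		/* lives after regs */
};

/* Same arithmetic as the kernel's container_of(), without type checks. */
#define container_of_model(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Distance in bytes from the start of regs to the abi field. */
static size_t abi_offset_from_regs(void)
{
	return offsetof(struct perf_regs_model, abi) -
	       offsetof(struct perf_regs_model, regs);
}
```

In this model abi_offset_from_regs() is never smaller than
sizeof(struct regs_model), so reading abi through a pointer to a bare
struct regs_model would touch memory outside that object.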

[ ... ]
>  #ifdef CONFIG_X86_32
> -#define REG_NOSUPPORT ((1ULL << PERF_REG_X86_R8) | \
> -		       (1ULL << PERF_REG_X86_R9) | \
> -		       (1ULL << PERF_REG_X86_R10) | \
> -		       (1ULL << PERF_REG_X86_R11) | \
> -		       (1ULL << PERF_REG_X86_R12) | \
> -		       (1ULL << PERF_REG_X86_R13) | \
> -		       (1ULL << PERF_REG_X86_R14) | \
> -		       (1ULL << PERF_REG_X86_R15))
> +#define REG_NOSUPPORT GENMASK_ULL(PERF_REG_X86_R15, PERF_REG_X86_R8)
>  
>  int perf_reg_validate(u64 mask, bool simd_enabled)

[Severity: High]
Does the 32-bit implementation of perf_reg_validate() need to be updated to
handle extended register masks?

It looks like the 64-bit version was updated to reject bits at or above
PERF_REG_MISC_MAX when simd_enabled is true, but the CONFIG_X86_32 block was
missed. If a 32-bit
user sets sample_simd_regs_enabled and requests bit 32, could this fall
through to the WARN_ON_ONCE() in perf_reg_value() and allow a denial of
service?

[ ... ]
> @@ -214,8 +218,13 @@ u64 perf_reg_abi(struct task_struct *task)
>  
>  int perf_reg_validate(u64 mask, bool simd_enabled)
>  {
> +	if (!simd_enabled &&
> +	    (!mask || (mask & (REG_NOSUPPORT | PERF_REG_X86_RESERVED))))
> +		return -EINVAL;

[Severity: Critical]
Does rejecting a mask of 0 when simd_enabled is false break the userspace ABI?

Profiling tools sometimes set sample_regs_user = 0 to obtain the abi indicator
field in the perf ring buffer without dumping any specific registers.

[Severity: Critical]
This is a pre-existing issue, but does this check fail to mask out bits >= 32
when simd_enabled is false?

PERF_REG_X86_RESERVED only covers bits 24-31 (R15 is index 23 and XMM0 is
index 32). It appears to ignore bits 32-63.
Could an unprivileged user request bit 63 for a software event, bypassing
validation, and eventually hitting the WARN_ON_ONCE() and reading out of
bounds in perf_reg_value()?
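
For reference, the bit layout and the 64-bit check as posted in this
patch can be modeled in userspace. The indices follow from the enum
(AX..GS are 0..15, R8..R15 are 16..23, so R16..R31 land on 24..39 and
overlap XMM0 at 32). GENMASK_U64() and validate_model() are hypothetical
names; REG_NOSUPPORT is not shown in the quoted hunk, so it is passed in
as a parameter rather than assumed.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace equivalent of the kernel's GENMASK_ULL(h, l): bits l..h set. */
#define GENMASK_U64(h, l) \
	((~0ULL << (l)) & (~0ULL >> (63 - (h))))

/* Register indices implied by enum perf_event_x86_regs as of patch 18. */
enum {
	REG_AX       = 0,
	REG_R15      = 23,
	REG_R16      = 24,
	REG_R31      = 39,
	REG_MISC_MAX = REG_R31 + 1,
	REG_XMM0     = 32,
};

#define EGPRS_MASK	GENMASK_U64(REG_R31, REG_R16)
#define RESERVED	(GENMASK_U64(REG_XMM0 - 1, REG_AX) & \
			 ~GENMASK_U64(REG_R15, REG_AX))
#define EXT_RESERVED	(~GENMASK_U64(REG_MISC_MAX - 1, REG_AX))

/* Model of the 64-bit perf_reg_validate() hunk as posted. */
static int validate_model(uint64_t mask, bool simd_enabled,
			  uint64_t nosupport)
{
	if (!simd_enabled && (!mask || (mask & (nosupport | RESERVED))))
		return -EINVAL;

	/* The mask may be 0 when only SIMD registers are requested. */
	if (simd_enabled && (mask & (nosupport | EXT_RESERVED)))
		return -EINVAL;

	return 0;
}
```

In this model EGPRS_MASK is 0xFFFF000000 (bits 24-39), RESERVED is
0xFF000000 (bits 24-31) and EXT_RESERVED is 0xFFFFFF0000000000 (bits
40-63). A zero mask is rejected only when simd_enabled is false, and a
legacy mask with bit 63 set is not caught by RESERVED; whether another
check rejects it for a given PMU is outside the model.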

>  	/* The mask could be 0 if only the SIMD registers are interested */
> -	if (mask & (REG_NOSUPPORT | PERF_REG_X86_RESERVED))
> +	if (simd_enabled &&
> +	    (mask & (REG_NOSUPPORT | PERF_REG_X86_EXT_RESERVED)))
>  		return -EINVAL;
>  
>  	return 0;

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260529075645.580362-1-dapeng1.mi@linux.intel.com?part=18


* [Patch v8 19/23] perf/x86: Support SSP sampling using sample_regs_* fields
  2026-05-29  7:56 [Patch v8 00/23] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (17 preceding siblings ...)
  2026-05-29  7:56 ` [Patch v8 18/23] perf/x86: Support eGPRs sampling using sample_regs_* fields Dapeng Mi
@ 2026-05-29  7:56 ` Dapeng Mi
  2026-05-29 10:03   ` sashiko-bot
  2026-05-29  7:56 ` [Patch v8 20/23] perf/x86/intel: Support arch-PEBS based SIMD/eGPRs/SSP sampling Dapeng Mi
                   ` (4 subsequent siblings)
  23 siblings, 1 reply; 44+ messages in thread
From: Dapeng Mi @ 2026-05-29  7:56 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Dapeng Mi, Kan Liang

This patch supports sampling of the CET SSP register via the sample_regs_*
fields.

To sample SSP, the sample_simd_regs_enabled field must be set. This
allows the spare space (reclaimed from the original XMM space) in the
sample_regs_* fields to be used for representing SSP.

As with eGPRs sampling, perf_reg_value() must first check whether the
PERF_SAMPLE_REGS_ABI_SIMD flag is set, and then decide whether to output
SSP or the legacy XMM registers to userspace.
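
The dispatch described above can be sketched as a userspace model.
struct perf_regs_model and ext_reg_value() are hypothetical names, and
the ABI_SIMD value is an assumption made for illustration; the register
indices follow from the enum (R16 = 24, R31 = 39, SSP = 40, XMM0 = 32).
The model only covers indices above R15; the kernel function handles the
remaining indices through its pt_regs offset table.

```c
#include <stddef.h>
#include <stdint.h>

enum {
	REG_R15 = 23, REG_R16 = 24, REG_R31 = 39, REG_SSP = 40,
	REG_XMM0 = 32, REG_XMM_MAX = 64,
};

#define ABI_NONE	0u
#define ABI_SIMD	(1u << 2)	/* assumed value, for illustration */

/* Hypothetical stand-in for the extended part of struct x86_perf_regs. */
struct perf_regs_model {
	uint64_t abi;
	const uint64_t *egpr_regs;	/* 16 qwords, or NULL */
	const uint64_t *ssp;		/* 1 qword, or NULL */
	const uint64_t *xmm_regs;	/* 32 qwords, or NULL */
};

/* Value of an extended register index, or 0 when it is unavailable. */
static uint64_t ext_reg_value(const struct perf_regs_model *r, int idx)
{
	if (idx <= REG_R15 || r->abi == ABI_NONE)
		return 0;

	if (r->abi & ABI_SIMD) {
		/* SIMD ABI: the index space holds eGPRs and SSP. */
		if (idx <= REG_R31)
			return r->egpr_regs ? r->egpr_regs[idx - REG_R16] : 0;
		if (idx == REG_SSP)
			return r->ssp ? *r->ssp : 0;
	} else if (idx >= REG_XMM0 && idx < REG_XMM_MAX) {
		/* Legacy ABI: the same indices mean XMM qwords. */
		return r->xmm_regs ? r->xmm_regs[idx - REG_XMM0] : 0;
	}

	return 0;
}
```

The same index therefore resolves to different registers depending on
the ABI: index 32 is R24 under the SIMD ABI and the low qword of XMM0
otherwise.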

Additionally, arch-PEBS supports sampling SSP, which is placed into the
GPRs group. This patch also enables arch-PEBS-based SSP sampling.

Currently, SSP sampling is only supported on the x86_64 architecture, as
CET is only available on x86_64 platforms.

Please note that SSP sampling is not enabled yet; it will be enabled in a
later patch when PERF_PMU_CAP_SIMD_REGS is set.

Co-developed-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/core.c                | 11 +++++++++++
 arch/x86/events/intel/ds.c            | 15 +++++++++++++--
 arch/x86/events/perf_event.h          | 10 ++++++++++
 arch/x86/include/asm/perf_event.h     |  1 +
 arch/x86/include/uapi/asm/perf_regs.h |  7 ++++---
 arch/x86/kernel/perf_regs.c           |  5 +++++
 6 files changed, 44 insertions(+), 5 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index af874ff3d048..f990256fb2ff 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -726,6 +726,9 @@ int x86_pmu_hw_config(struct perf_event *event)
 			if (event_needs_egprs(event) &&
 			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_APX))
 				return -EINVAL;
+			if (event_needs_ssp(event) &&
+			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_CET_USER))
+				return -EINVAL;
 			/* The vector registers set is not supported */
 			if (event_needs_xmm(event) &&
 			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_SSE))
@@ -1803,11 +1806,13 @@ void x86_pmu_clear_perf_regs(struct pt_regs *regs)
 	perf_regs->h16zmm_regs = NULL;
 	perf_regs->opmask_regs = NULL;
 	perf_regs->egpr_regs = NULL;
+	perf_regs->ssp = NULL;
 }
 
 static void update_perf_regs(struct x86_perf_regs *perf_regs,
 			     struct xregs_state *xsave, u64 bitmap)
 {
+	struct cet_user_state *cet;
 	u64 mask;
 
 	if (!xsave)
@@ -1828,6 +1833,10 @@ static void update_perf_regs(struct x86_perf_regs *perf_regs,
 		perf_regs->opmask = get_xsave_addr(xsave, XFEATURE_OPMASK);
 	if (mask & XFEATURE_MASK_APX)
 		perf_regs->egpr = get_xsave_addr(xsave, XFEATURE_APX);
+	if (mask & XFEATURE_MASK_CET_USER) {
+		cet = get_xsave_addr(xsave, XFEATURE_CET_USER);
+		perf_regs->ssp = cet ? &cet->user_ssp : NULL;
+	}
 }
 
 /*
@@ -2014,6 +2023,8 @@ static void x86_pmu_sample_xregs(struct perf_event *event,
 		mask |= XFEATURE_MASK_OPMASK;
 	if (event_needs_egprs(event))
 		mask |= XFEATURE_MASK_APX;
+	if (event_needs_ssp(event))
+		mask |= XFEATURE_MASK_CET_USER;
 
 	mask &= x86_pmu.ext_regs_mask;
 	if (sample_type & PERF_SAMPLE_REGS_USER) {
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 609d4a83115d..fb393be13fcb 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1723,6 +1723,7 @@ static u64 pebs_update_adaptive_cfg(struct perf_event *event)
 	u64 sample_type = attr->sample_type;
 	u64 pebs_data_cfg = 0;
 	bool gprs, tsx_weight;
+	u64 xgprs_mask;
 
 	if (!(sample_type & ~(PERF_SAMPLE_IP|PERF_SAMPLE_TIME)) &&
 	    attr->precise_ip > 1)
@@ -1737,10 +1738,13 @@ static u64 pebs_update_adaptive_cfg(struct perf_event *event)
 	 * + precise_ip < 2 for the non event IP
 	 * + For RTM TSX weight we need GPRs for the abort code.
 	 */
+	xgprs_mask = event->attr.sample_simd_regs_enabled ?
+		     PEBS_GP_REGS | BIT_ULL(PERF_REG_X86_SSP) :
+		     PEBS_GP_REGS;
 	gprs = ((sample_type & PERF_SAMPLE_REGS_INTR) &&
-		(attr->sample_regs_intr & PEBS_GP_REGS)) ||
+		(attr->sample_regs_intr & xgprs_mask)) ||
 	       ((sample_type & PERF_SAMPLE_REGS_USER) &&
-		(attr->sample_regs_user & PEBS_GP_REGS));
+		(attr->sample_regs_user & xgprs_mask));
 
 	tsx_weight = (sample_type & PERF_SAMPLE_WEIGHT_TYPE) &&
 		     ((attr->config & INTEL_ARCH_EVENT_MASK) ==
@@ -2690,6 +2694,13 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
 		__setup_pebs_gpr_group(event, regs,
 				       (struct pebs_gprs *)gprs,
 				       sample_type);
+
+		/* Currently only user space mode enables SSP. */
+		if (user_mode(regs) && (sample_type &
+		    (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER))) {
+			perf_regs->ssp = &gprs->ssp;
+			ignore_mask |= XFEATURE_MASK_CET_USER;
+		}
 	}
 
 	if (header->aux) {
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 4cc490aa04fc..c521a7fbe9c6 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -207,6 +207,16 @@ static inline bool event_needs_egprs(struct perf_event *event)
 	return false;
 }
 
+static inline bool event_needs_ssp(struct perf_event *event)
+{
+	if (event->attr.sample_simd_regs_enabled &&
+	    (event->attr.sample_regs_user & BIT_ULL(PERF_REG_X86_SSP) ||
+	     event->attr.sample_regs_intr & BIT_ULL(PERF_REG_X86_SSP)))
+		return true;
+
+	return false;
+}
+
 struct amd_nb {
 	int nb_id;  /* NorthBridge id */
 	int refcnt; /* reference count */
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index a54ea8fa6a04..2769ec3030e5 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -751,6 +751,7 @@ struct x86_perf_regs {
 		u64	*egpr_regs;
 		struct apx_state *egpr;
 	};
+	u64	*ssp;
 };
 
 extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index 8774a1290fbe..31a025cb9dba 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -28,10 +28,10 @@ enum perf_event_x86_regs {
 	PERF_REG_X86_R14,
 	PERF_REG_X86_R15,
 	/*
-	 * The eGPRs and XMM have overlaps. Only one can be used
+	 * The eGPRs/SSP and XMM have overlaps. Only one can be used
 	 * at a time. The ABI PERF_SAMPLE_REGS_ABI_SIMD is used to
 	 * distinguish which one is used. If PERF_SAMPLE_REGS_ABI_SIMD
-	 * is set, then eGPRs is used, otherwise, XMM is used.
+	 * is set, then eGPRs/SSP is used, otherwise, XMM is used.
 	 *
 	 * Extended GPRs (eGPRs)
 	 */
@@ -51,10 +51,11 @@ enum perf_event_x86_regs {
 	PERF_REG_X86_R29,
 	PERF_REG_X86_R30,
 	PERF_REG_X86_R31,
+	PERF_REG_X86_SSP,
 	/* These are the limits for the GPRs. */
 	PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
 	PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1,
-	PERF_REG_MISC_MAX = PERF_REG_X86_R31 + 1,
+	PERF_REG_MISC_MAX = PERF_REG_X86_SSP + 1,
 
 	/* These all need two bits set because they are 128bit */
 	PERF_REG_X86_XMM0  = 32,
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 006883ad443d..6f0970ed60db 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -72,6 +72,11 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 					return 0;
 				return perf_regs->egpr_regs[idx - PERF_REG_X86_R16];
 			}
+			if (idx == PERF_REG_X86_SSP) {
+				if (!perf_regs->ssp)
+					return 0;
+				return *perf_regs->ssp;
+			}
 		} else {
 			if (idx >= PERF_REG_X86_XMM0 && idx < PERF_REG_X86_XMM_MAX) {
 				if (!perf_regs->xmm_regs)
-- 
2.34.1



* Re: [Patch v8 19/23] perf/x86: Support SSP sampling using sample_regs_* fields
  2026-05-29  7:56 ` [Patch v8 19/23] perf/x86: Support SSP " Dapeng Mi
@ 2026-05-29 10:03   ` sashiko-bot
  0 siblings, 0 replies; 44+ messages in thread
From: sashiko-bot @ 2026-05-29 10:03 UTC (permalink / raw)
  To: Dapeng Mi; +Cc: linux-perf-users

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Medium] The uAPI header addition for PERF_REG_X86_SSP is not synchronized with the perf tool's copy, and the perf tool lacks support for formatting or synthesizing the new register.
--

commit b3c69baea8c8d8c919d96ea56a7caf694d17f469
Author: Dapeng Mi <dapeng1.mi@linux.intel.com>

perf/x86: Support SSP sampling using sample_regs_* fields

This patch supports sampling of the CET SSP register via the sample_regs_*
fields by using spare space reclaimed from the XMM space. It updates the
register extraction logic and enables arch-PEBS-based SSP sampling on
x86_64.

> diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
> index 8774a1290fbef..31a025cb9dbaf 100644
> --- a/arch/x86/include/uapi/asm/perf_regs.h
> +++ b/arch/x86/include/uapi/asm/perf_regs.h

[ ... ]

> @@ -51,10 +51,11 @@ enum perf_event_x86_regs {
>  	PERF_REG_X86_R29,
>  	PERF_REG_X86_R30,
>  	PERF_REG_X86_R31,
> +	PERF_REG_X86_SSP,

[Severity: Medium]
Should the corresponding UAPI header for the perf tool in
tools/arch/x86/include/uapi/asm/perf_regs.h be updated in sync with this
change?

Also, does the perf tool require additional updates to support this new
register? Without adding a string representation mapping for SSP, the
tool might not be able to parse '--user-regs=ssp' or print the register
name.

Furthermore, tools/perf/util/intel-pt.c relies on PERF_REG_X86_64_MAX
and a hardcoded pebs_gp_regs array that does not currently include SSP.
Will this cause the SSP register to be silently dropped from synthesized
PT PEBS samples?

>  	/* These are the limits for the GPRs. */
>  	PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
>  	PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1,
> -	PERF_REG_MISC_MAX = PERF_REG_X86_R31 + 1,
> +	PERF_REG_MISC_MAX = PERF_REG_X86_SSP + 1,
>  
>  	/* These all need two bits set because they are 128bit */
>  	PERF_REG_X86_XMM0  = 32,

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260529075645.580362-1-dapeng1.mi@linux.intel.com?part=19


* [Patch v8 20/23] perf/x86/intel: Support arch-PEBS based SIMD/eGPRs/SSP sampling
  2026-05-29  7:56 [Patch v8 00/23] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (18 preceding siblings ...)
  2026-05-29  7:56 ` [Patch v8 19/23] perf/x86: Support SSP " Dapeng Mi
@ 2026-05-29  7:56 ` Dapeng Mi
  2026-05-29  9:45   ` sashiko-bot
  2026-05-29  7:56 ` [Patch v8 21/23] perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS capability Dapeng Mi
                   ` (3 subsequent siblings)
  23 siblings, 1 reply; 44+ messages in thread
From: Dapeng Mi @ 2026-05-29  7:56 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Dapeng Mi

This patch supports arch-PEBS based SIMD/eGPRs/SSP register sampling.

Arch-PEBS supports sampling of these registers, with all except SSP
placed into the XSAVE-Enabled Registers (XER) group with the layout
described below.

Field Name   Registers Used                 Size

XSTATE_BV    XINUSE for groups              8 B
Reserved     Reserved                       8 B
SSER         XMM0-XMM15                     16 regs * 16 B = 256 B
YMMHIR       Upper 128 bits of YMM0-YMM15   16 regs * 16 B = 256 B
EGPR         R16-R31                        16 regs *  8 B = 128 B
OPMASKR      K0-K7                           8 regs *  8 B = 64 B
ZMMHIR       Upper 256 bits of ZMM0-ZMM15   16 regs * 32 B = 512 B
Hi16ZMMR     ZMM16-ZMM31                    16 regs * 64 B = 1024 B

Memory space in the output buffer is allocated for each of these
sub-groups as long as the corresponding Format.XER[55:49] bit in the PEBS
record header is set. However, the arch-PEBS hardware engine does not
write a sub-group that is not in use (in INIT state). In that case, the
corresponding bit in the XSTATE_BV bitmap is 0. Therefore, the XSTATE_BV
field is checked for each PEBS record to determine whether the register
data was actually written. If it was not, the register data is not output
to userspace.
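
The walk over the XER group described above can be sketched as a
userspace model. The sub-group order and sizes come from the table; the
XSAVE state-component numbers (SSE = 1, YMM = 2, OPMASK = 5,
ZMM_Hi256 = 6, Hi16_ZMM = 7, APX = 19) are assumed to match the kernel's
XFEATURE_* values, and xer_walk() is a hypothetical helper, not a
function from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed XSAVE state-component bit numbers. */
enum {
	XF_SSE = 1, XF_YMM = 2, XF_OPMASK = 5,
	XF_ZMM_HI256 = 6, XF_HI16_ZMM = 7, XF_APX = 19,
};

/* Sub-groups in the order they appear in the record. */
enum {
	SUB_XMM, SUB_YMMH, SUB_EGPR, SUB_OPMASK, SUB_ZMMH, SUB_H16ZMM,
	SUB_COUNT,
};

struct xer_subgroup {
	int xfeature;	/* bit number in XSTATE_BV */
	size_t size;	/* bytes reserved in the record */
};

static const struct xer_subgroup xer_layout[SUB_COUNT] = {
	[SUB_XMM]    = { XF_SSE,        256 },
	[SUB_YMMH]   = { XF_YMM,        256 },
	[SUB_EGPR]   = { XF_APX,        128 },
	[SUB_OPMASK] = { XF_OPMASK,      64 },
	[SUB_ZMMH]   = { XF_ZMM_HI256,  512 },
	[SUB_H16ZMM] = { XF_HI16_ZMM,  1024 },
};

#define XER_HEADER_SIZE	16	/* XSTATE_BV (8 B) + reserved (8 B) */

/*
 * present:   bitmask of (1 << SUB_*) for sub-groups allocated in the record.
 * xstate_bv: XSTATE_BV value from the XER header.
 * offsets:   per sub-group byte offset of valid data, or -1 if the
 *            sub-group is absent or was not written by hardware.
 * Returns the total size of the XER group in bytes.
 */
static size_t xer_walk(unsigned int present, uint64_t xstate_bv,
		       long offsets[SUB_COUNT])
{
	size_t pos = XER_HEADER_SIZE;

	for (int i = 0; i < SUB_COUNT; i++) {
		offsets[i] = -1;
		if (!(present & (1u << i)))
			continue;
		/* Space is always reserved; data is valid only if in use. */
		if (xstate_bv & (1ULL << xer_layout[i].xfeature))
			offsets[i] = (long)pos;
		pos += xer_layout[i].size;
	}

	return pos;
}
```

With all six sub-groups present the group occupies
16 + 256 + 256 + 128 + 64 + 512 + 1024 = 2256 bytes, regardless of how
many XSTATE_BV bits are set.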

The SSP register is sampled and placed into the GPRs group by arch-PEBS.

Additionally, bits [55:49] of the IA32_PMC_{GPn|FXm}_CFG_C MSRs select
which of these register types are sampled.
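
The translation from the software PEBS_DATACFG_* bits to the CFG_C bits
can be sketched as a userspace model. The bit positions are taken from
the defines in this patch, except DATACFG_XMMS, which is assumed to be
bit 2; vecr_bits() is a hypothetical helper that mirrors the pattern
used in intel_pmu_enable_event_ext().

```c
#include <stddef.h>
#include <stdint.h>

#define BIT_U64(n)	(1ULL << (n))

/* Software PEBS_DATACFG_* bits (XMMS assumed to be bit 2). */
#define DATACFG_XMMS	BIT_U64(2)
#define DATACFG_YMMHS	BIT_U64(6)
#define DATACFG_OPMASKS	BIT_U64(7)
#define DATACFG_ZMMHS	BIT_U64(8)
#define DATACFG_H16ZMMS	BIT_U64(9)
#define DATACFG_EGPRS	BIT_U64(10)

/* ARCH_PEBS_VECR_* bits in IA32_PMC_{GPn|FXm}_CFG_C. */
#define VECR_XMM	BIT_U64(49)
#define VECR_YMMH	BIT_U64(50)
#define VECR_EGPRS	BIT_U64(51)
#define VECR_OPMASK	BIT_U64(53)
#define VECR_ZMMH	BIT_U64(54)
#define VECR_H16ZMM	BIT_U64(55)

/* Build the CFG_C register-selection bits, limited to what caps allows. */
static uint64_t vecr_bits(uint64_t pebs_data_cfg, uint64_t caps)
{
	static const struct {
		uint64_t cfg;
		uint64_t vecr;
	} map[] = {
		{ DATACFG_XMMS,    VECR_XMM    },
		{ DATACFG_YMMHS,   VECR_YMMH   },
		{ DATACFG_EGPRS,   VECR_EGPRS  },
		{ DATACFG_OPMASKS, VECR_OPMASK },
		{ DATACFG_ZMMHS,   VECR_ZMMH   },
		{ DATACFG_H16ZMMS, VECR_H16ZMM },
	};
	uint64_t ext = 0;

	for (size_t i = 0; i < sizeof(map) / sizeof(map[0]); i++) {
		if (pebs_data_cfg & map[i].cfg)
			ext |= map[i].vecr & caps;
	}

	return ext;
}
```

A requested register type that the hardware capability mask does not
advertise is silently dropped from the result, which is why the patch
also checks support separately before allowing large PEBS.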

Please note that arch-PEBS based SIMD/eGPRs/SSP sampling is not enabled
yet; it will be enabled in a later patch when PERF_PMU_CAP_SIMD_REGS is set.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/intel/core.c      | 121 ++++++++++++++++++++++++++++--
 arch/x86/events/intel/ds.c        |  77 +++++++++++++++++--
 arch/x86/include/asm/msr-index.h  |   7 ++
 arch/x86/include/asm/perf_event.h |   8 +-
 4 files changed, 199 insertions(+), 14 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index a2473f962681..679781519f8c 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3473,6 +3473,21 @@ static void intel_pmu_enable_event_ext(struct perf_event *event)
 			if (pebs_data_cfg & PEBS_DATACFG_XMMS)
 				ext |= ARCH_PEBS_VECR_XMM & cap.caps;
 
+			if (pebs_data_cfg & PEBS_DATACFG_YMMHS)
+				ext |= ARCH_PEBS_VECR_YMMH & cap.caps;
+
+			if (pebs_data_cfg & PEBS_DATACFG_EGPRS)
+				ext |= ARCH_PEBS_VECR_EGPRS & cap.caps;
+
+			if (pebs_data_cfg & PEBS_DATACFG_OPMASKS)
+				ext |= ARCH_PEBS_VECR_OPMASK & cap.caps;
+
+			if (pebs_data_cfg & PEBS_DATACFG_ZMMHS)
+				ext |= ARCH_PEBS_VECR_ZMMH & cap.caps;
+
+			if (pebs_data_cfg & PEBS_DATACFG_H16ZMMS)
+				ext |= ARCH_PEBS_VECR_H16ZMM & cap.caps;
+
 			if (pebs_data_cfg & PEBS_DATACFG_LBRS)
 				ext |= ARCH_PEBS_LBR & cap.caps;
 
@@ -4695,21 +4710,113 @@ static void intel_pebs_aliases_skl(struct perf_event *event)
 	return intel_pebs_aliases_precdist(event);
 }
 
+static inline bool intel_pebs_support_regs(struct perf_event *event, u64 regs)
+{
+	struct arch_pebs_cap cap = hybrid(event->pmu, arch_pebs_cap);
+	int pebs_format = x86_pmu.intel_cap.pebs_format;
+	bool supported = true;
+
+	if (regs & PEBS_DATACFG_GP) {
+		/* Legacy PEBS always supports GPRs sampling. */
+		supported &= x86_pmu.arch_pebs ?
+			     !!(ARCH_PEBS_GPR & cap.caps) : true;
+	}
+	if (regs & PEBS_DATACFG_XMMS) {
+		supported &= x86_pmu.arch_pebs ?
+			     !!(ARCH_PEBS_VECR_XMM & cap.caps) :
+			     pebs_format > 3 && x86_pmu.intel_cap.pebs_baseline;
+	}
+	/* Legacy PEBS doesn't support OPMASK/YMM+ and eGPRs sampling. */
+	if (regs & PEBS_DATACFG_YMMHS)
+		supported &= x86_pmu.arch_pebs && (ARCH_PEBS_VECR_YMMH & cap.caps);
+	if (regs & PEBS_DATACFG_EGPRS)
+		supported &= x86_pmu.arch_pebs && (ARCH_PEBS_VECR_EGPRS & cap.caps);
+	if (regs & PEBS_DATACFG_OPMASKS)
+		supported &= x86_pmu.arch_pebs && (ARCH_PEBS_VECR_OPMASK & cap.caps);
+	if (regs & PEBS_DATACFG_ZMMHS)
+		supported &= x86_pmu.arch_pebs && (ARCH_PEBS_VECR_ZMMH & cap.caps);
+	if (regs & PEBS_DATACFG_H16ZMMS)
+		supported &= x86_pmu.arch_pebs && (ARCH_PEBS_VECR_H16ZMM & cap.caps);
+
+	return supported;
+}
+
+static bool __regs_support_large_pebs(struct perf_event *event, bool intr)
+{
+	u64 regs = intr ? event->attr.sample_regs_intr :
+			  event->attr.sample_regs_user;
+	u64 vec_regs = intr ? event->attr.sample_simd_vec_reg_intr :
+			      event->attr.sample_simd_vec_reg_user;
+	u64 pred_regs = intr ? event->attr.sample_simd_pred_reg_intr :
+			       event->attr.sample_simd_pred_reg_user;
+	u64 xregs_mask = PEBS_GP_REGS | PERF_X86_EGPRS_MASK |
+			 BIT_ULL(PERF_REG_X86_SSP);
+
+	if (regs & ~xregs_mask)
+		return false;
+
+	if ((regs & (PEBS_GP_REGS | BIT_ULL(PERF_REG_X86_SSP))) &&
+	    !intel_pebs_support_regs(event, PEBS_DATACFG_GP))
+		return false;
+
+	if ((regs & PERF_X86_EGPRS_MASK) &&
+	    !intel_pebs_support_regs(event, PEBS_DATACFG_EGPRS))
+		return false;
+
+	if (event_needs_opmask(event) && pred_regs &&
+	    !intel_pebs_support_regs(event, PEBS_DATACFG_OPMASKS))
+		return false;
+
+	if (event_needs_xmm(event) && vec_regs &&
+	    !intel_pebs_support_regs(event, PEBS_DATACFG_XMMS))
+		return false;
+
+	if (event_needs_ymm(event) && vec_regs &&
+	    !intel_pebs_support_regs(event, PEBS_DATACFG_YMMHS))
+		return false;
+
+	if (event_needs_low16_zmm(event) && vec_regs &&
+	    !intel_pebs_support_regs(event, PEBS_DATACFG_ZMMHS))
+		return false;
+
+	if (event_needs_high16_zmm(event) && vec_regs &&
+	    !intel_pebs_support_regs(event, PEBS_DATACFG_H16ZMMS))
+		return false;
+
+	return true;
+}
+
+static inline bool intr_regs_support_large_pebs(struct perf_event *event)
+{
+	return __regs_support_large_pebs(event, /*intr=*/true);
+}
+
+static inline bool user_regs_support_large_pebs(struct perf_event *event)
+{
+	return __regs_support_large_pebs(event, /*intr=*/false);
+}
+
 static unsigned long intel_pmu_large_pebs_flags(struct perf_event *event)
 {
 	unsigned long flags = x86_pmu.large_pebs_flags;
-	u64 gprs_mask = event->attr.sample_simd_regs_enabled ?
-			PEBS_GP_REGS :
-			PEBS_GP_REGS | PERF_REG_EXTENDED_MASK;
 
 	if (event->attr.use_clockid)
 		flags &= ~PERF_SAMPLE_TIME;
 	if (!event->attr.exclude_kernel)
 		flags &= ~PERF_SAMPLE_REGS_USER;
-	if (event->attr.sample_regs_user & ~gprs_mask)
-		flags &= ~PERF_SAMPLE_REGS_USER;
-	if (event->attr.sample_regs_intr & ~gprs_mask)
-		flags &= ~PERF_SAMPLE_REGS_INTR;
+	if (event->attr.sample_simd_regs_enabled) {
+		if (!user_regs_support_large_pebs(event))
+			flags &= ~PERF_SAMPLE_REGS_USER;
+		if (!intr_regs_support_large_pebs(event))
+			flags &= ~PERF_SAMPLE_REGS_INTR;
+	} else {
+		u64 gprs_mask = PEBS_GP_REGS | PERF_REG_EXTENDED_MASK;
+
+		if (event->attr.sample_regs_user & ~gprs_mask)
+			flags &= ~PERF_SAMPLE_REGS_USER;
+		if (event->attr.sample_regs_intr & ~gprs_mask)
+			flags &= ~PERF_SAMPLE_REGS_INTR;
+	}
 	return flags;
 }
 
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index fb393be13fcb..8a653edce392 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1750,11 +1750,22 @@ static u64 pebs_update_adaptive_cfg(struct perf_event *event)
 		     ((attr->config & INTEL_ARCH_EVENT_MASK) ==
 		      x86_pmu.rtm_abort_event);
 
-	if (gprs || (attr->precise_ip < 2) || tsx_weight)
+	if (gprs || (attr->precise_ip < 2) ||
+	    tsx_weight || event_needs_ssp(event))
 		pebs_data_cfg |= PEBS_DATACFG_GP;
 
 	if (event_needs_xmm(event))
 		pebs_data_cfg |= PEBS_DATACFG_XMMS;
+	if (x86_pmu.arch_pebs && event_needs_ymm(event))
+		pebs_data_cfg |= PEBS_DATACFG_YMMHS;
+	if (x86_pmu.arch_pebs && event_needs_low16_zmm(event))
+		pebs_data_cfg |= PEBS_DATACFG_ZMMHS;
+	if (x86_pmu.arch_pebs && event_needs_high16_zmm(event))
+		pebs_data_cfg |= PEBS_DATACFG_H16ZMMS;
+	if (x86_pmu.arch_pebs && event_needs_opmask(event))
+		pebs_data_cfg |= PEBS_DATACFG_OPMASKS;
+	if (x86_pmu.arch_pebs && event_needs_egprs(event))
+		pebs_data_cfg |= PEBS_DATACFG_EGPRS;
 
 	if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
 		/*
@@ -2713,15 +2724,69 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
 					   meminfo->tsx_tuning, ax);
 	}
 
-	if (header->xmm) {
+	if (header->xmm || header->ymmh || header->egpr ||
+	    header->opmask || header->zmmh || header->h16zmm) {
+		struct arch_pebs_xer_header *xer_header = next_record;
 		struct pebs_xmm *xmm;
+		struct ymmh_struct *ymmh;
+		struct avx_512_zmm_uppers_state *zmmh;
+		struct avx_512_hi16_state *h16zmm;
+		struct avx_512_opmask_state *opmask;
+		struct apx_state *egpr;
 
 		next_record += sizeof(struct arch_pebs_xer_header);
 
-		ignore_mask |= XFEATURE_MASK_SSE;
-		xmm = next_record;
-		perf_regs->xmm_regs = xmm->xmm;
-		next_record = xmm + 1;
+		if (header->xmm) {
+			ignore_mask |= XFEATURE_MASK_SSE;
+			xmm = next_record;
+			/*
+			 * Only output XMM regs to user space when arch-PEBS
+			 * really writes data into xstate area.
+			 */
+			if (xer_header->xstate & XFEATURE_MASK_SSE)
+				perf_regs->xmm_regs = xmm->xmm;
+			next_record = xmm + 1;
+		}
+
+		if (header->ymmh) {
+			ignore_mask |= XFEATURE_MASK_YMM;
+			ymmh = next_record;
+			if (xer_header->xstate & XFEATURE_MASK_YMM)
+				perf_regs->ymmh = ymmh;
+			next_record = ymmh + 1;
+		}
+
+		if (header->egpr) {
+			ignore_mask |= XFEATURE_MASK_APX;
+			egpr = next_record;
+			if (xer_header->xstate & XFEATURE_MASK_APX)
+				perf_regs->egpr = egpr;
+			next_record = egpr + 1;
+		}
+
+		if (header->opmask) {
+			ignore_mask |= XFEATURE_MASK_OPMASK;
+			opmask = next_record;
+			if (xer_header->xstate & XFEATURE_MASK_OPMASK)
+				perf_regs->opmask = opmask;
+			next_record = opmask + 1;
+		}
+
+		if (header->zmmh) {
+			ignore_mask |= XFEATURE_MASK_ZMM_Hi256;
+			zmmh = next_record;
+			if (xer_header->xstate & XFEATURE_MASK_ZMM_Hi256)
+				perf_regs->zmmh = zmmh;
+			next_record = zmmh + 1;
+		}
+
+		if (header->h16zmm) {
+			ignore_mask |= XFEATURE_MASK_Hi16_ZMM;
+			h16zmm = next_record;
+			if (xer_header->xstate & XFEATURE_MASK_Hi16_ZMM)
+				perf_regs->h16zmm = h16zmm;
+			next_record = h16zmm + 1;
+		}
 	}
 
 	if (header->lbr) {
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index a14a0f43e04a..e3b3293aa87f 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -350,6 +350,13 @@
 #define ARCH_PEBS_LBR_SHIFT		40
 #define ARCH_PEBS_LBR			(0x3ull << ARCH_PEBS_LBR_SHIFT)
 #define ARCH_PEBS_VECR_XMM		BIT_ULL(49)
+#define ARCH_PEBS_VECR_YMMH		BIT_ULL(50)
+#define ARCH_PEBS_VECR_EGPRS		BIT_ULL(51)
+#define ARCH_PEBS_VECR_OPMASK		BIT_ULL(53)
+#define ARCH_PEBS_VECR_ZMMH		BIT_ULL(54)
+#define ARCH_PEBS_VECR_H16ZMM		BIT_ULL(55)
+#define ARCH_PEBS_VECR_EXT_SHIFT	50
+#define ARCH_PEBS_VECR_EXT		(0x3full << ARCH_PEBS_VECR_EXT_SHIFT)
 #define ARCH_PEBS_GPR			BIT_ULL(61)
 #define ARCH_PEBS_AUX			BIT_ULL(62)
 #define ARCH_PEBS_EN			BIT_ULL(63)
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 2769ec3030e5..bbbe0835ac55 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -148,6 +148,11 @@
 #define PEBS_DATACFG_LBRS	BIT_ULL(3)
 #define PEBS_DATACFG_CNTR	BIT_ULL(4)
 #define PEBS_DATACFG_METRICS	BIT_ULL(5)
+#define PEBS_DATACFG_YMMHS	BIT_ULL(6)
+#define PEBS_DATACFG_OPMASKS	BIT_ULL(7)
+#define PEBS_DATACFG_ZMMHS	BIT_ULL(8)
+#define PEBS_DATACFG_H16ZMMS	BIT_ULL(9)
+#define PEBS_DATACFG_EGPRS	BIT_ULL(10)
 #define PEBS_DATACFG_LBR_SHIFT	24
 #define PEBS_DATACFG_CNTR_SHIFT	32
 #define PEBS_DATACFG_CNTR_MASK	GENMASK_ULL(15, 0)
@@ -545,7 +550,8 @@ struct arch_pebs_header {
 			    rsvd3:7,
 			    xmm:1,
 			    ymmh:1,
-			    rsvd4:2,
+			    egpr:1,
+			    rsvd4:1,
 			    opmask:1,
 			    zmmh:1,
 			    h16zmm:1,
-- 
2.34.1


* Re: [Patch v8 20/23] perf/x86/intel: Support arch-PEBS based SIMD/eGPRs/SSP sampling
  2026-05-29  7:56 ` [Patch v8 20/23] perf/x86/intel: Support arch-PEBS based SIMD/eGPRs/SSP sampling Dapeng Mi
@ 2026-05-29  9:45   ` sashiko-bot
  0 siblings, 0 replies; 44+ messages in thread
From: sashiko-bot @ 2026-05-29  9:45 UTC (permalink / raw)
  To: Dapeng Mi; +Cc: linux-perf-users

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] Requesting the SSP (Shadow Stack Pointer) register improperly allows large PEBS to be enabled on legacy PEBS hardware, resulting in stale and identical SSP values across all samples in the buffer.
--

commit 2a27d3f31fac15c7d594a6588f8dacfa2939bc9f
Author: Dapeng Mi <dapeng1.mi@linux.intel.com>

perf/x86/intel: Support arch-PEBS based SIMD/eGPRs/SSP sampling

This commit adds support for sampling SIMD, eGPRs, and SSP registers using
arch-PEBS. It defines the layout for these extended registers and manages
their sampling via MSRs and the XSTATE_BV bitmap.

> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c

[ ... ]

> @@ -4695,21 +4710,113 @@ static void intel_pebs_aliases_skl(struct perf_event *event)

[ ... ]

> +static bool __regs_support_large_pebs(struct perf_event *event, bool intr)
> +{
> +	u64 regs = intr ? event->attr.sample_regs_intr :
> +			  event->attr.sample_regs_user;
> +	u64 vec_regs = intr ? event->attr.sample_simd_vec_reg_intr :
> +			      event->attr.sample_simd_vec_reg_user;
> +	u64 pred_regs = intr ? event->attr.sample_simd_pred_reg_intr :
> +			       event->attr.sample_simd_pred_reg_user;
> +	u64 xregs_mask = PEBS_GP_REGS | PERF_X86_EGPRS_MASK |
> +			 BIT_ULL(PERF_REG_X86_SSP);
> +
> +	if (regs & ~xregs_mask)
> +		return false;
> +
> +	if ((regs & (PEBS_GP_REGS | BIT_ULL(PERF_REG_X86_SSP))) &&
> +	    !intel_pebs_support_regs(event, PEBS_DATACFG_GP))
> +		return false;

[Severity: High]
Does this logic unintentionally allow large PEBS to be enabled on legacy PEBS
hardware when the SSP register is requested?

Looking at intel_pebs_support_regs(), it returns true for PEBS_DATACFG_GP
unconditionally on legacy PEBS hardware. If a user requests SSP on legacy
PEBS, __regs_support_large_pebs() will return true and allow large PEBS.

Since legacy PEBS hardware records do not contain the SSP, the perf core
fetches it from the thread's XSAVE buffer during the PMU interrupt via
x86_pmu_update_perf_regs().

Because large PEBS batches multiple samples and delays the interrupt, won't
every sample in the batch end up with the exact same stale SSP value from
the time of the interrupt?

Should large PEBS be explicitly disabled on legacy PEBS when SSP is
requested, perhaps similarly to how eGPRs are handled?
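
For illustration only, the restriction being asked about can be modelled
in a few lines of stand-alone C. Nothing below is kernel code: struct
pmu_model, regs_allow_large_pebs() and the MODEL_* bit values are
hypothetical names invented for this sketch, and the premise that legacy
PEBS records carry no SSP is taken from the review text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit layout, chosen only for this model. */
#define MODEL_GP_REGS	UINT64_C(0x00ffffff)	/* registers a record can carry */
#define MODEL_REG_SSP	(UINT64_C(1) << 24)	/* shadow stack pointer */

struct pmu_model {
	bool arch_pebs;	/* true: the record itself can carry SSP */
};

/*
 * Return true if batching several records per interrupt is safe for the
 * requested register set in this model.
 */
bool regs_allow_large_pebs(const struct pmu_model *pmu, uint64_t regs)
{
	assert(pmu != NULL);

	/* Registers outside the modelled set have to be read at PMI time. */
	if (regs & ~(MODEL_GP_REGS | MODEL_REG_SSP))
		return false;

	/*
	 * Assumption from the review: a legacy record has no SSP field, so
	 * SSP would be read once per interrupt and shared by the whole batch.
	 */
	if ((regs & MODEL_REG_SSP) && !pmu->arch_pebs)
		return false;

	return true;
}
```

In this model a legacy PMU refuses batching as soon as SSP is requested,
while an arch-PEBS PMU still allows it.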

[ ... ]

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260529075645.580362-1-dapeng1.mi@linux.intel.com?part=20

* [Patch v8 21/23] perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS capability
  2026-05-29  7:56 [Patch v8 00/23] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (19 preceding siblings ...)
  2026-05-29  7:56 ` [Patch v8 20/23] perf/x86/intel: Support arch-PEBS based SIMD/eGPRs/SSP sampling Dapeng Mi
@ 2026-05-29  7:56 ` Dapeng Mi
  2026-05-29 10:43   ` sashiko-bot
  2026-05-29  7:56 ` [Patch v8 22/23] perf/x86: Activate back-to-back NMI detection for arch-PEBS induced NMIs Dapeng Mi
                   ` (2 subsequent siblings)
  23 siblings, 1 reply; 44+ messages in thread
From: Dapeng Mi @ 2026-05-29  7:56 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

Enable the PERF_PMU_CAP_SIMD_REGS capability if XSAVES support is
available for YMM, ZMM, OPMASK, eGPRs, or SSP.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/intel/core.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 679781519f8c..eef5d116aa06 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -6316,6 +6316,26 @@ static void intel_extended_regs_init(struct pmu *pmu)
 	 */
 	x86_pmu.ext_regs_mask |= XFEATURE_MASK_SSE;
 	dest_pmu->capabilities |= PERF_PMU_CAP_EXTENDED_REGS;
+
+	if (boot_cpu_has(X86_FEATURE_AVX) &&
+	    cpu_has_xfeatures(XFEATURE_MASK_YMM, NULL))
+		x86_pmu.ext_regs_mask |= XFEATURE_MASK_YMM;
+	if (boot_cpu_has(X86_FEATURE_APX) &&
+	    cpu_has_xfeatures(XFEATURE_MASK_APX, NULL))
+		x86_pmu.ext_regs_mask |= XFEATURE_MASK_APX;
+	if (boot_cpu_has(X86_FEATURE_AVX512F)) {
+		if (cpu_has_xfeatures(XFEATURE_MASK_OPMASK, NULL))
+			x86_pmu.ext_regs_mask |= XFEATURE_MASK_OPMASK;
+		if (cpu_has_xfeatures(XFEATURE_MASK_ZMM_Hi256, NULL))
+			x86_pmu.ext_regs_mask |= XFEATURE_MASK_ZMM_Hi256;
+		if (cpu_has_xfeatures(XFEATURE_MASK_Hi16_ZMM, NULL))
+			x86_pmu.ext_regs_mask |= XFEATURE_MASK_Hi16_ZMM;
+	}
+	if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		x86_pmu.ext_regs_mask |= XFEATURE_MASK_CET_USER;
+
+	if (x86_pmu.ext_regs_mask != XFEATURE_MASK_SSE)
+		dest_pmu->capabilities |= PERF_PMU_CAP_SIMD_REGS;
 }
 
 #define counter_mask(_gp, _fixed) ((_gp) | ((u64)(_fixed) << INTEL_PMC_IDX_FIXED))
-- 
2.34.1


* Re: [Patch v8 21/23] perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS capability
  2026-05-29  7:56 ` [Patch v8 21/23] perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS capability Dapeng Mi
@ 2026-05-29 10:43   ` sashiko-bot
  0 siblings, 0 replies; 44+ messages in thread
From: sashiko-bot @ 2026-05-29 10:43 UTC (permalink / raw)
  To: Dapeng Mi; +Cc: linux-perf-users

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [High] The new SIMD registers capability and mask initialization are skipped for older PEBS V4+ architectures (like Icelake, Sapphire Rapids, Alder Lake, and Raptor Lake), leaving them without support for YMM, ZMM, eGPRs, and SSP sampling.
- [Medium] The logic to enable PERF_PMU_CAP_SIMD_REGS incorrectly couples the capability to the presence of non-SIMD extended features like CET_USER (SSP) and APX (eGPRs).
--

commit f179abf9419fef51aa84e8c413e619d7004c64c2
Author: Kan Liang <kan.liang@linux.intel.com>

perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS capability

Enable the PERF_PMU_CAP_SIMD_REGS capability if XSAVES support is
available for YMM, ZMM, OPMASK, eGPRs, or SSP.

> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index 679781519f8cf..eef5d116aa068 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -6316,6 +6316,26 @@ static void intel_extended_regs_init(struct pmu *pmu)
>  	 */
>  	x86_pmu.ext_regs_mask |= XFEATURE_MASK_SSE;
>  	dest_pmu->capabilities |= PERF_PMU_CAP_EXTENDED_REGS;

[Severity: High]
Does this skip the new SIMD capabilities on older PEBS V4+ architectures?

The initialization for YMM, ZMM, eGPRs, and SSP is only added to
intel_extended_regs_init(). Since intel_extended_regs_init() is called
by update_pmu_cap(), which is guarded by X86_FEATURE_ARCH_PERFMON_EXT,
architectures like Sapphire Rapids and Alder Lake appear to bypass this.

Should this logic also be replicated in intel_ds_pebs_init(), or should
intel_extended_regs_init() be called for those processors so they can
also support the new SIMD capabilities?

> +
> +	if (boot_cpu_has(X86_FEATURE_AVX) &&
> +	    cpu_has_xfeatures(XFEATURE_MASK_YMM, NULL))
> +		x86_pmu.ext_regs_mask |= XFEATURE_MASK_YMM;
> +	if (boot_cpu_has(X86_FEATURE_APX) &&
> +	    cpu_has_xfeatures(XFEATURE_MASK_APX, NULL))
> +		x86_pmu.ext_regs_mask |= XFEATURE_MASK_APX;
> +	if (boot_cpu_has(X86_FEATURE_AVX512F)) {
> +		if (cpu_has_xfeatures(XFEATURE_MASK_OPMASK, NULL))
> +			x86_pmu.ext_regs_mask |= XFEATURE_MASK_OPMASK;
> +		if (cpu_has_xfeatures(XFEATURE_MASK_ZMM_Hi256, NULL))
> +			x86_pmu.ext_regs_mask |= XFEATURE_MASK_ZMM_Hi256;
> +		if (cpu_has_xfeatures(XFEATURE_MASK_Hi16_ZMM, NULL))
> +			x86_pmu.ext_regs_mask |= XFEATURE_MASK_Hi16_ZMM;
> +	}
> +	if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
> +		x86_pmu.ext_regs_mask |= XFEATURE_MASK_CET_USER;
> +
> +	if (x86_pmu.ext_regs_mask != XFEATURE_MASK_SSE)
> +		dest_pmu->capabilities |= PERF_PMU_CAP_SIMD_REGS;

[Severity: Medium]
Does this incorrectly couple the SIMD capability to non-SIMD features here?

If a processor supports SSE and CET_USER (SSP), ext_regs_mask will not equal
XFEATURE_MASK_SSE, and PERF_PMU_CAP_SIMD_REGS will be enabled.

However, if a processor only supports SSE, this condition evaluates to false,
and PERF_PMU_CAP_SIMD_REGS is omitted.

Could this regression force userspace tools to fall back to a separate legacy
path to sample XMM registers on SSE-only processors, rather than using the
unified sample_simd_regs interface?
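
The two cases described above can be checked with a small stand-alone
model of the condition. The MODEL_* bits and simd_cap_as_posted() are
hypothetical names for this sketch; the real XFEATURE_MASK_* values
differ, and only the shape of the comparison is taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bits; the real XFEATURE_MASK_* values differ. */
#define MODEL_SSE	(UINT64_C(1) << 0)
#define MODEL_YMM	(UINT64_C(1) << 1)
#define MODEL_CET_USER	(UINT64_C(1) << 2)

/*
 * The condition as posted: the capability is set whenever the mask holds
 * anything in addition to SSE, whether or not that feature is SIMD.
 */
bool simd_cap_as_posted(uint64_t ext_regs_mask)
{
	/* In the patch, SSE is always ORed in before this check runs. */
	assert(ext_regs_mask & MODEL_SSE);

	return ext_regs_mask != MODEL_SSE;
}
```

With this model, SSE plus CET_USER yields the capability although no new
vector state is present, and an SSE-only mask does not.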

>  }
>  
>  #define counter_mask(_gp, _fixed) ((_gp) | ((u64)(_fixed) << INTEL_PMC_IDX_FIXED))

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260529075645.580362-1-dapeng1.mi@linux.intel.com?part=21

* [Patch v8 22/23] perf/x86: Activate back-to-back NMI detection for arch-PEBS induced NMIs
  2026-05-29  7:56 [Patch v8 00/23] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (20 preceding siblings ...)
  2026-05-29  7:56 ` [Patch v8 21/23] perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS capability Dapeng Mi
@ 2026-05-29  7:56 ` Dapeng Mi
  2026-05-29  9:34   ` sashiko-bot
  2026-05-29  7:56 ` [Patch v8 23/23] perf/x86/intel: Add sanity check for PEBS fragment size Dapeng Mi
  2026-05-29  8:32 ` [Patch v8 00/23] Support SIMD/eGPRs/SSP registers sampling for perf Mi, Dapeng
  23 siblings, 1 reply; 44+ messages in thread
From: Dapeng Mi @ 2026-05-29  7:56 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Dapeng Mi

When two or more identical PEBS events with the same sampling period are
programmed on a mix of PDIST and non-PDIST counters, multiple
back-to-back NMIs can be triggered.

The Linux PMI handler processes the first NMI and clears the
GLOBAL_STATUS MSR. If a second NMI is triggered immediately after
the first, it is recognized as a "suspicious NMI" because no bits are set
in the GLOBAL_STATUS MSR (cleared by the first NMI).

This issue does not lead to PEBS data corruption or data loss, but it
does result in an annoying warning message.

The current NMI handler supports back-to-back NMI detection, but it
requires the PMI handler to return the count of actually processed events,
which the PEBS handler does not currently do.

This patch modifies the PEBS handlers to return the count of actually
processed events, thereby activating back-to-back NMI detection and
avoiding the "suspicious NMI" warning.

Suggested-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/intel/core.c | 29 +++++++++++++++++---------
 arch/x86/events/intel/ds.c   | 40 ++++++++++++++++++++++++------------
 arch/x86/events/perf_event.h |  2 +-
 3 files changed, 47 insertions(+), 24 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index eef5d116aa06..4546b20429ba 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3763,7 +3763,7 @@ static void intel_pmu_reset(void)
  *
  * The contents and other behavior of the guest event do not matter.
  */
-static void x86_pmu_handle_guest_pebs(struct pt_regs *regs,
+static int x86_pmu_handle_guest_pebs(struct pt_regs *regs,
 				      struct perf_sample_data *data)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
@@ -3772,11 +3772,11 @@ static void x86_pmu_handle_guest_pebs(struct pt_regs *regs,
 	int bit;
 
 	if (!unlikely(perf_guest_state()))
-		return;
+		return 0;
 
 	if (!x86_pmu.pebs_ept || !x86_pmu.pebs_active ||
 	    !guest_pebs_idxs)
-		return;
+		return 0;
 
 	for_each_set_bit(bit, (unsigned long *)&guest_pebs_idxs, X86_PMC_IDX_MAX) {
 		event = cpuc->events[bit];
@@ -3786,9 +3786,14 @@ static void x86_pmu_handle_guest_pebs(struct pt_regs *regs,
 		perf_sample_data_init(data, 0, event->hw.last_period);
 		perf_event_overflow(event, data, regs);
 
-		/* Inject one fake event is enough. */
-		break;
+		/*
+		 * Injecting one fake event is enough.
+		 * Return 1 to indicate that the PMI was handled.
+		 */
+		return 1;
 	}
+
+	return 0;
 }
 
 static int handle_pmi_common(struct pt_regs *regs, u64 status)
@@ -3837,9 +3842,11 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
 	if (__test_and_clear_bit(GLOBAL_STATUS_BUFFER_OVF_BIT, (unsigned long *)&status)) {
 		u64 pebs_enabled = cpuc->pebs_enabled;
 
-		handled++;
-		x86_pmu_handle_guest_pebs(regs, &data);
-		static_call(x86_pmu_drain_pebs)(regs, &data);
+		handled += x86_pmu_handle_guest_pebs(regs, &data);
+		handled += static_call(x86_pmu_drain_pebs)(regs, &data);
+		/* Ensure no "suspicious NMI" warning for empty PEBS buffer. */
+		if (!handled)
+			handled++;
 
 		/*
 		 * PMI throttle may be triggered, which stops the PEBS event.
@@ -3866,8 +3873,10 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
 	 */
 	if (__test_and_clear_bit(GLOBAL_STATUS_ARCH_PEBS_THRESHOLD_BIT,
 				 (unsigned long *)&status)) {
-		handled++;
-		static_call(x86_pmu_drain_pebs)(regs, &data);
+		handled += static_call(x86_pmu_drain_pebs)(regs, &data);
+		/* Ensure no "suspicious NMI" warning for empty PEBS buffer. */
+		if (!handled)
+			handled++;
 
 		if (cpuc->events[INTEL_PMC_IDX_FIXED_SLOTS] &&
 		    is_pebs_counter_event_group(cpuc->events[INTEL_PMC_IDX_FIXED_SLOTS]))
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 8a653edce392..e0d307627702 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -3047,7 +3047,7 @@ __intel_pmu_pebs_events(struct perf_event *event,
 	__intel_pmu_pebs_last_event(event, iregs, regs, data, at, count, setup_sample);
 }
 
-static void intel_pmu_drain_pebs_core(struct pt_regs *iregs, struct perf_sample_data *data)
+static int intel_pmu_drain_pebs_core(struct pt_regs *iregs, struct perf_sample_data *data)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	struct debug_store *ds = cpuc->ds;
@@ -3056,7 +3056,7 @@ static void intel_pmu_drain_pebs_core(struct pt_regs *iregs, struct perf_sample_
 	int n;
 
 	if (!x86_pmu.pebs_active)
-		return;
+		return 0;
 
 	at  = (struct pebs_record_core *)(unsigned long)ds->pebs_buffer_base;
 	top = (struct pebs_record_core *)(unsigned long)ds->pebs_index;
@@ -3067,22 +3067,24 @@ static void intel_pmu_drain_pebs_core(struct pt_regs *iregs, struct perf_sample_
 	ds->pebs_index = ds->pebs_buffer_base;
 
 	if (!test_bit(0, cpuc->active_mask))
-		return;
+		return 0;
 
 	WARN_ON_ONCE(!event);
 
 	if (!event->attr.precise_ip)
-		return;
+		return 0;
 
 	n = top - at;
 	if (n <= 0) {
 		if (event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD)
 			intel_pmu_save_and_restart_reload(event, 0);
-		return;
+		return 0;
 	}
 
 	__intel_pmu_pebs_events(event, iregs, data, at, top, 0, n,
 				setup_pebs_fixed_sample_data);
+
+	return 1; /* PMC0 only */
 }
 
 static void intel_pmu_pebs_event_update_no_drain(struct cpu_hw_events *cpuc, u64 mask)
@@ -3105,7 +3107,7 @@ static void intel_pmu_pebs_event_update_no_drain(struct cpu_hw_events *cpuc, u64
 	}
 }
 
-static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_data *data)
+static int intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_data *data)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	struct debug_store *ds = cpuc->ds;
@@ -3114,11 +3116,12 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_d
 	short counts[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS] = {};
 	short error[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS] = {};
 	int max_pebs_events = intel_pmu_max_num_pebs(NULL);
+	u64 events_bitmap = 0;
 	int bit, i, size;
 	u64 mask;
 
 	if (!x86_pmu.pebs_active)
-		return;
+		return 0;
 
 	base = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
 	top = (struct pebs_record_nhm *)(unsigned long)ds->pebs_index;
@@ -3134,7 +3137,7 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_d
 
 	if (unlikely(base >= top)) {
 		intel_pmu_pebs_event_update_no_drain(cpuc, mask);
-		return;
+		return 0;
 	}
 
 	for (at = base; at < top; at += x86_pmu.pebs_record_size) {
@@ -3198,6 +3201,7 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_d
 		if ((counts[bit] == 0) && (error[bit] == 0))
 			continue;
 
+		events_bitmap |= BIT(bit);
 		event = cpuc->events[bit];
 		if (WARN_ON_ONCE(!event))
 			continue;
@@ -3219,6 +3223,8 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_d
 						setup_pebs_fixed_sample_data);
 		}
 	}
+
+	return hweight64(events_bitmap);
 }
 
 static __always_inline void
@@ -3272,7 +3278,7 @@ __intel_pmu_handle_last_pebs_record(struct pt_regs *iregs,
 
 }
 
-static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_data *data)
+static int intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_data *data)
 {
 	short counts[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS] = {};
 	void *last[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS];
@@ -3282,10 +3288,11 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
 	struct pt_regs *regs = &perf_regs->regs;
 	struct pebs_basic *basic;
 	void *base, *at, *top;
+	u64 events_bitmap = 0;
 	u64 mask;
 
 	if (!x86_pmu.pebs_active)
-		return;
+		return 0;
 
 	base = (struct pebs_basic *)(unsigned long)ds->pebs_buffer_base;
 	top = (struct pebs_basic *)(unsigned long)ds->pebs_index;
@@ -3298,7 +3305,7 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
 
 	if (unlikely(base >= top)) {
 		intel_pmu_pebs_event_update_no_drain(cpuc, mask);
-		return;
+		return 0;
 	}
 
 	if (!iregs)
@@ -3313,6 +3320,7 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
 			continue;
 
 		pebs_status = mask & basic->applicable_counters;
+		events_bitmap |= pebs_status;
 		__intel_pmu_handle_pebs_record(iregs, regs, data, at,
 					       pebs_status, counts, last,
 					       setup_pebs_adaptive_sample_data);
@@ -3320,9 +3328,11 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
 
 	__intel_pmu_handle_last_pebs_record(iregs, regs, data, mask, counts, last,
 					    setup_pebs_adaptive_sample_data);
+
+	return hweight64(events_bitmap);
 }
 
-static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
+static int intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
 				      struct perf_sample_data *data)
 {
 	short counts[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS] = {};
@@ -3332,13 +3342,14 @@ static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
 	struct x86_perf_regs *perf_regs = this_cpu_ptr(&x86_pebs_regs);
 	struct pt_regs *regs = &perf_regs->regs;
 	void *base, *at, *top;
+	u64 events_bitmap = 0;
 	u64 mask;
 
 	rdmsrq(MSR_IA32_PEBS_INDEX, index.whole);
 
 	if (unlikely(!index.wr)) {
 		intel_pmu_pebs_event_update_no_drain(cpuc, X86_PMC_IDX_MAX);
-		return;
+		return 0;
 	}
 
 	base = cpuc->pebs_vaddr;
@@ -3377,6 +3388,7 @@ static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
 
 		basic = at + sizeof(struct arch_pebs_header);
 		pebs_status = mask & basic->applicable_counters;
+		events_bitmap |= pebs_status;
 		__intel_pmu_handle_pebs_record(iregs, regs, data, at,
 					       pebs_status, counts, last,
 					       setup_arch_pebs_sample_data);
@@ -3396,6 +3408,8 @@ static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
 	__intel_pmu_handle_last_pebs_record(iregs, regs, data, mask,
 					    counts, last,
 					    setup_arch_pebs_sample_data);
+
+	return hweight64(events_bitmap);
 }
 
 static void __init intel_arch_pebs_init(void)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index c521a7fbe9c6..77bc42f8a070 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1022,7 +1022,7 @@ struct x86_pmu {
 	int		pebs_record_size;
 	int		pebs_buffer_size;
 	u64		pebs_events_mask;
-	void		(*drain_pebs)(struct pt_regs *regs, struct perf_sample_data *data);
+	int		(*drain_pebs)(struct pt_regs *regs, struct perf_sample_data *data);
 	struct event_constraint *pebs_constraints;
 	void		(*pebs_aliases)(struct perf_event *event);
 	u64		(*pebs_latency_data)(struct perf_event *event, u64 status);
-- 
2.34.1


* Re: [Patch v8 22/23] perf/x86: Activate back-to-back NMI detection for arch-PEBS induced NMIs
  2026-05-29  7:56 ` [Patch v8 22/23] perf/x86: Activate back-to-back NMI detection for arch-PEBS induced NMIs Dapeng Mi
@ 2026-05-29  9:34   ` sashiko-bot
  0 siblings, 0 replies; 44+ messages in thread
From: sashiko-bot @ 2026-05-29  9:34 UTC (permalink / raw)
  To: Dapeng Mi; +Cc: linux-perf-users

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Medium] Using the `BIT()` macro for shifts >= 32 on 32-bit architectures causes undefined behavior, defeating the back-to-back NMI detection for fixed counters.
--

commit 67718335973298488df3db65888b9d78109347ad
Author: Dapeng Mi <dapeng1.mi@linux.intel.com>

perf/x86: Activate back-to-back NMI detection for arch-PEBS induced NMIs

This commit modifies the PEBS handlers to return the count of actually processed
events. This activates back-to-back NMI detection in the PMI handler, avoiding
the "suspicious NMI" warning that occurs when multiple back-to-back NMIs are
triggered by PEBS events.
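
Why the returned count matters can be sketched with a deliberately
simplified state machine. This is a model under assumptions, not the
kernel's NMI code: struct nmi_model, model_nmi() and the MODEL_NMI_*
verdicts are hypothetical, and the rule "more than one handled event
arms the swallow flag" is a reading of the commit message rather than a
copy of the real implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum nmi_verdict {
	MODEL_NMI_HANDLED,	/* the handler found work to do */
	MODEL_NMI_SWALLOWED,	/* empty NMI, explained by the previous one */
	MODEL_NMI_SUSPICIOUS,	/* empty NMI with no explanation */
};

struct nmi_model {
	bool swallow_next;	/* the previous NMI handled more than one event */
};

/* Classify one NMI given how many events its PMI handler processed. */
enum nmi_verdict model_nmi(struct nmi_model *m, int handled)
{
	assert(m != NULL && handled >= 0);

	if (handled > 0) {
		/* Several events at once: a second NMI may already be pending. */
		m->swallow_next = handled > 1;
		return MODEL_NMI_HANDLED;
	}

	if (m->swallow_next) {
		m->swallow_next = false;
		return MODEL_NMI_SWALLOWED;
	}

	return MODEL_NMI_SUSPICIOUS;
}
```

A handler that always reports 1 never arms the flag in this model, so the
empty second NMI is classified as suspicious; reporting the true count
lets it be swallowed.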

> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> index 8a653edce392a..e0d3076277026 100644
> --- a/arch/x86/events/intel/ds.c
> +++ b/arch/x86/events/intel/ds.c
[ ... ]
> @@ -3198,6 +3201,7 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_d
>  		if ((counts[bit] == 0) && (error[bit] == 0))
>  			continue;
>  
> +		events_bitmap |= BIT(bit);
>  		event = cpuc->events[bit];

[Severity: Medium]
Will this shift cause undefined behavior on 32-bit architectures?

Since bit can be greater than or equal to 32 for fixed counters (where
INTEL_PMC_IDX_FIXED is 32), using the BIT() macro on 32-bit x86 kernels
results in a shift by 32 or more bits on an unsigned long, which is undefined
behavior in C.

In practice, this typically results in a modulo shift that aliases fixed
counters to general-purpose counters, causing hweight64(events_bitmap) to
under-report the number of unique events handled.

This could defeat the back-to-back NMI detection and fail to suppress the
"suspicious NMI" warning for fixed counters on 32-bit systems.

Should this use BIT_ULL(bit) instead?
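
The under-counting can be reproduced without relying on undefined
behavior by modelling the wrap explicitly. All names below
(MODEL_BIT_ULL, model_bit_wrapped32(), model_hweight64() and the two
count_events_*() helpers) are hypothetical stand-ins written for this
sketch. The "shift count modulo 32" behavior is an assumption about what
32-bit x86 typically does, not something C guarantees.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 64-bit-safe single-bit mask, in the spirit of BIT_ULL(). */
#define MODEL_BIT_ULL(n)	(UINT64_C(1) << (n))

/*
 * Model of an out-of-range shift on a 32-bit unsigned long: only the low
 * five bits of the count are used, so index 32 lands on bit 0.
 */
uint64_t model_bit_wrapped32(unsigned int n)
{
	return (uint64_t)((uint32_t)1 << (n & 31));
}

/* Portable population count, standing in for hweight64(). */
unsigned int model_hweight64(uint64_t w)
{
	unsigned int count = 0;

	while (w) {
		w &= w - 1;	/* clear the lowest set bit */
		count++;
	}
	return count;
}

/* Count distinct counter indices using the 64-bit-safe mask. */
unsigned int count_events_ull(const unsigned int *idx, size_t n)
{
	uint64_t bitmap = 0;

	for (size_t i = 0; i < n; i++) {
		assert(idx[i] < 64);
		bitmap |= MODEL_BIT_ULL(idx[i]);
	}
	return model_hweight64(bitmap);
}

/* Same count, but with the modelled 32-bit wrap applied to each index. */
unsigned int count_events_wrapped32(const unsigned int *idx, size_t n)
{
	uint64_t bitmap = 0;

	for (size_t i = 0; i < n; i++) {
		assert(idx[i] < 64);
		bitmap |= model_bit_wrapped32(idx[i]);
	}
	return model_hweight64(bitmap);
}
```

For the index pair {0, 32}, that is one general-purpose counter and the
first fixed counter, the 64-bit variant counts two events while the
wrapped variant counts one.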

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260529075645.580362-1-dapeng1.mi@linux.intel.com?part=22

* [Patch v8 23/23] perf/x86/intel: Add sanity check for PEBS fragment size
  2026-05-29  7:56 [Patch v8 00/23] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (21 preceding siblings ...)
  2026-05-29  7:56 ` [Patch v8 22/23] perf/x86: Activate back-to-back NMI detection for arch-PEBS induced NMIs Dapeng Mi
@ 2026-05-29  7:56 ` Dapeng Mi
  2026-05-29  9:54   ` sashiko-bot
  2026-05-29  8:32 ` [Patch v8 00/23] Support SIMD/eGPRs/SSP registers sampling for perf Mi, Dapeng
  23 siblings, 1 reply; 44+ messages in thread
From: Dapeng Mi @ 2026-05-29  7:56 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Dapeng Mi

Prevent potential infinite loops by adding a sanity check for corrupted
PEBS fragment sizes, which could occur in theory.

If a corrupted PEBS fragment is detected, the entire PEBS record,
including the corrupted fragment and all subsequent records, is dropped
and a NULL PEBS record is reported to user space. This ensures the
integrity of PEBS data and prevents infinite loops in
setup_arch_pebs_sample_data().

Note that software has no way to determine which events are affected by
the corrupted record. Once a corrupted record is detected, the last
record of every event is therefore discarded, even though it may be
well-formed for some events.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/intel/ds.c | 68 +++++++++++++++++++++++++++-----------
 1 file changed, 49 insertions(+), 19 deletions(-)

diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index e0d307627702..3e100cb206a8 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -2675,6 +2675,9 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
 
 again:
 	header = at;
+	if (!header->size)
+		return;
+
 	next_record = at + sizeof(struct arch_pebs_header);
 	if (header->basic) {
 		struct arch_pebs_basic *basic = next_record;
@@ -2827,7 +2830,7 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
 	}
 
 	/* Parse followed fragments if there are. */
-	if (arch_pebs_record_continued(header)) {
+	if (arch_pebs_record_continued(header) && header->size) {
 		at = at + header->size;
 		goto again;
 	}
@@ -2956,13 +2959,21 @@ __intel_pmu_pebs_last_event(struct perf_event *event,
 			    struct pt_regs *iregs,
 			    struct pt_regs *regs,
 			    struct perf_sample_data *data,
-			    void *at,
-			    int count,
+			    void *at, int count, bool corrupted,
 			    setup_fn setup_sample)
 {
 	struct hw_perf_event *hwc = &event->hw;
 
-	setup_sample(event, iregs, at, data, regs);
+	/* Skip parsing corrupted PEBS record. */
+	if (corrupted) {
+		/* Clear stale register states in previous records. */
+		memset(regs, 0, sizeof(*regs));
+		x86_pmu_clear_perf_regs(regs);
+		perf_sample_data_init(data, 0, event->hw.last_period);
+	} else {
+		setup_sample(event, iregs, at, data, regs);
+	}
+
 	if (iregs == &dummy_iregs) {
 		/*
 		 * The PEBS records may be drained in the non-overflow context,
@@ -2980,12 +2991,16 @@ __intel_pmu_pebs_last_event(struct perf_event *event,
 	}
 
 	if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
-		if ((is_pebs_counter_event_group(event))) {
-			/*
-			 * The value of each sample has been updated when setup
-			 * the corresponding sample data.
-			 */
-			perf_event_update_userpage(event);
+		if (is_pebs_counter_event_group(event)) {
+			if (corrupted) {
+				intel_pmu_save_and_restart_reload(event, 1);
+			} else {
+				/*
+				 * The value of each sample has been updated
+				 * when setup the corresponding sample data.
+				 */
+				perf_event_update_userpage(event);
+			}
 		} else {
 			/*
 			 * Now, auto-reload is only enabled in fixed period mode.
@@ -3009,7 +3024,7 @@ __intel_pmu_pebs_last_event(struct perf_event *event,
 		 * counters-snapshotting record, only needs to set the new
 		 * period for the counter.
 		 */
-		if (is_pebs_counter_event_group(event))
+		if (is_pebs_counter_event_group(event) && !corrupted)
 			static_call(x86_pmu_set_period)(event);
 		else
 			intel_pmu_save_and_restart(event);
@@ -3038,13 +3053,15 @@ __intel_pmu_pebs_events(struct perf_event *event,
 		iregs = &dummy_iregs;
 
 	while (cnt > 1) {
-		__intel_pmu_pebs_event(event, iregs, regs, data, at, setup_sample);
+		__intel_pmu_pebs_event(event, iregs, regs, data,
+				       at, setup_sample);
 		at += cpuc->pebs_record_size;
 		at = get_next_pebs_record_by_bit(at, top, bit);
 		cnt--;
 	}
 
-	__intel_pmu_pebs_last_event(event, iregs, regs, data, at, count, setup_sample);
+	__intel_pmu_pebs_last_event(event, iregs, regs, data, at,
+				    count, false, setup_sample);
 }
 
 static int intel_pmu_drain_pebs_core(struct pt_regs *iregs, struct perf_sample_data *data)
@@ -3259,7 +3276,8 @@ static __always_inline void
 __intel_pmu_handle_last_pebs_record(struct pt_regs *iregs,
 				    struct pt_regs *regs,
 				    struct perf_sample_data *data,
-				    u64 mask, short *counts, void **last,
+				    u64 mask, short *counts,
+				    void **last, bool corrupted,
 				    setup_fn setup_sample)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
@@ -3273,7 +3291,7 @@ __intel_pmu_handle_last_pebs_record(struct pt_regs *iregs,
 		event = cpuc->events[bit];
 
 		__intel_pmu_pebs_last_event(event, iregs, regs, data, last[bit],
-					    counts[bit], setup_sample);
+					    counts[bit], corrupted, setup_sample);
 	}
 
 }
@@ -3327,7 +3345,7 @@ static int intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_da
 	}
 
 	__intel_pmu_handle_last_pebs_record(iregs, regs, data, mask, counts, last,
-					    setup_pebs_adaptive_sample_data);
+					    false, setup_pebs_adaptive_sample_data);
 
 	return hweight64(events_bitmap);
 }
@@ -3343,6 +3361,7 @@ static int intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
 	struct pt_regs *regs = &perf_regs->regs;
 	void *base, *at, *top;
 	u64 events_bitmap = 0;
+	bool corrupted = false;
 	u64 mask;
 
 	rdmsrq(MSR_IA32_PEBS_INDEX, index.whole);
@@ -3377,8 +3396,10 @@ static int intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
 
 		header = at;
 
-		if (WARN_ON_ONCE(!header->size))
-			break;
+		if (WARN_ON_ONCE(!header->size)) {
+			corrupted = true;
+			goto done;
+		}
 
 		/* 1st fragment or single record must have basic group */
 		if (!header->basic) {
@@ -3398,15 +3419,24 @@ static int intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
 			if (!header->size)
 				break;
 			at += header->size;
+			if (WARN_ON_ONCE(at >= top)) {
+				corrupted = true;
+				goto done;
+			}
 			header = at;
 		}
 
 		/* Skip last fragment or the single record */
 		at += header->size;
+		if (WARN_ON_ONCE(at > top)) {
+			corrupted = true;
+			goto done;
+		}
 	}
 
+done:
 	__intel_pmu_handle_last_pebs_record(iregs, regs, data, mask,
-					    counts, last,
+					    counts, last, corrupted,
 					    setup_arch_pebs_sample_data);
 
 	return hweight64(events_bitmap);
-- 
2.34.1


* Re: [Patch v8 23/23] perf/x86/intel: Add sanity check for PEBS fragment size
  2026-05-29  7:56 ` [Patch v8 23/23] perf/x86/intel: Add sanity check for PEBS fragment size Dapeng Mi
@ 2026-05-29  9:54   ` sashiko-bot
  0 siblings, 0 replies; 44+ messages in thread
From: sashiko-bot @ 2026-05-29  9:54 UTC (permalink / raw)
  To: Dapeng Mi; +Cc: linux-perf-users

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:

New issues:
- [High] Skipping setup_sample on corrupted PEBS records leaves fields in perf_sample_data uninitialized, causing a kernel stack memory leak when the dummy sample is emitted.

Pre-existing issues:
- [High] A pre-existing infinite loop vulnerability for corrupted PEBS records with a size of 0 was missed in intel_pmu_drain_pebs_icl().
- [High] Incomplete validation of corrupted PEBS fragment sizes allows out-of-bounds memory accesses.
--

commit 68a663ba619b4d0d98aedabd45c1aa50b4acbe81
Author: Dapeng Mi <dapeng1.mi@linux.intel.com>

perf/x86/intel: Add sanity check for PEBS fragment size

Prevent potential infinite loops by adding a sanity check for corrupted
PEBS fragment sizes, which could occur in theory.

> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> index e0d3076277026..3e100cb206a84 100644
> --- a/arch/x86/events/intel/ds.c
> +++ b/arch/x86/events/intel/ds.c
> @@ -2675,6 +2675,9 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
>  
>  again:
>  	header = at;
> +	if (!header->size)
> +		return;
> +

[Severity: High]
This is a pre-existing issue, but does checking only that header->size is
non-zero leave us vulnerable to out-of-bounds memory accesses?

If a corrupted fragment has a small non-zero size (e.g., 2 bytes) but has
multiple capability flags set (like header->basic, header->lbr), wouldn't
setup_arch_pebs_sample_data() blindly increment its next_record pointer
to parse those structures? Could this allow it to read past the declared
fragment size and potentially beyond the mapped bounds of the PEBS buffer
memory page?
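
As a rough illustration of the kind of check being asked about, here is a
minimal sketch. It is not the kernel's code: struct frag_hdr, the two flag
fields, and the group sizes are hypothetical stand-ins chosen only to show
the idea of comparing what the flags announce against the declared size.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified fragment header; not the arch PEBS layout. */
struct frag_hdr {
	uint16_t size;   /* declared size of this fragment in bytes */
	bool has_basic;  /* stands in for header->basic */
	bool has_lbr;    /* stands in for header->lbr */
};

/* Illustrative group sizes, picked arbitrarily for this sketch. */
#define BASIC_GROUP_BYTES 32u
#define LBR_GROUP_BYTES   96u

/*
 * Accept a fragment only if every group announced by its flags fits inside
 * the declared size, and the declared size fits in the bytes that remain
 * in the buffer.
 */
static bool frag_is_sane(const struct frag_hdr *h, size_t bytes_left)
{
	size_t need = sizeof(*h);

	if (h->has_basic)
		need += BASIC_GROUP_BYTES;
	if (h->has_lbr)
		need += LBR_GROUP_BYTES;

	return h->size != 0 && h->size >= need && h->size <= bytes_left;
}
```

With a check of this shape, a 2-byte fragment that claims both groups would
be rejected before any group is parsed.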

[ ... ]

> @@ -2956,13 +2959,21 @@ __intel_pmu_pebs_last_event(struct perf_event *event,
>  			    struct pt_regs *iregs,
>  			    struct pt_regs *regs,
>  			    struct perf_sample_data *data,
> -			    void *at,
> -			    int count,
> +			    void *at, int count, bool corrupted,
>  			    setup_fn setup_sample)
>  {
>  	struct hw_perf_event *hwc = &event->hw;
>  
> -	setup_sample(event, iregs, at, data, regs);
> +	/* Skip parsing corrupted PEBS record. */
> +	if (corrupted) {
> +		/* Clear stale register states in previous records. */
> +		memset(regs, 0, sizeof(*regs));
> +		x86_pmu_clear_perf_regs(regs);
> +		perf_sample_data_init(data, 0, event->hw.last_period);
> +	} else {
> +		setup_sample(event, iregs, at, data, regs);
> +	}
> +

[Severity: High]
If setup_sample() is skipped when corrupted is true, perf_sample_data_init()
will initialize generic fields like period and sample_flags, but what
about PMU-specific fields such as data->weight, data->data_src, and
data->txn?

Since the perf_sample_data struct is allocated on the kernel stack and not
zeroed upon initialization, could this leave those fields containing stale
kernel stack data? When perf_event_output() or perf_event_overflow() is
subsequently called, would this result in a kernel stack memory leak to
userspace?

[ ... ]

> @@ -3327,7 +3345,7 @@ static int intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_da
>  	}
>  
>  	__intel_pmu_handle_last_pebs_record(iregs, regs, data, mask, counts, last,
> -					    setup_pebs_adaptive_sample_data);
> +					    false, setup_pebs_adaptive_sample_data);
>  
>  	return hweight64(events_bitmap);
>  }

[Severity: High]
This is a pre-existing issue, but does intel_pmu_drain_pebs_icl() have a
similar infinite loop vulnerability for corrupted records with a size of 0?

Looking at the loop just above this code:

    for (at = base; at < top; at += basic->format_size) {
        u64 pebs_status;

        basic = at;
        if (basic->format_size != cpuc->pebs_record_size)
            continue;
    }

If a corrupted hardware record has basic->format_size == 0, the loop
condition evaluates to true, the continue statement executes, and the loop
attempts to advance at by 0. Wouldn't this result in at remaining
unchanged, creating an infinite loop in NMI context?
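
One possible shape for a guard is sketched below. It is a user-space model,
not a proposed patch: count_records() is a hypothetical helper, and each
record is assumed to begin with a u16 that holds its own size, standing in
for basic->format_size.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Walk records in [buf, buf + len). Returns the number of records whose
 * size equals 'expected', or -1 if a size field is corrupted.
 */
static int count_records(const uint8_t *buf, size_t len, uint16_t expected)
{
	size_t off = 0;
	int valid = 0;

	while (len - off >= sizeof(uint16_t)) {
		uint16_t size;

		memcpy(&size, buf + off, sizeof(size));

		/*
		 * A size of zero would leave the cursor stuck forever, and a
		 * size larger than the remaining bytes would overrun the
		 * buffer. Treat both as corruption and stop.
		 */
		if (size < sizeof(size) || size > len - off)
			return -1;

		/* Mismatched but plausible sizes are skipped, as in the loop above. */
		if (size == expected)
			valid++;

		off += size;
	}

	return valid;
}
```

Because every accepted size is at least two bytes, the cursor always moves
forward and the walk terminates.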

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260529075645.580362-1-dapeng1.mi@linux.intel.com?part=23

* Re: [Patch v8 00/23] Support SIMD/eGPRs/SSP registers sampling for perf
  2026-05-29  7:56 [Patch v8 00/23] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (22 preceding siblings ...)
  2026-05-29  7:56 ` [Patch v8 23/23] perf/x86/intel: Add sanity check for PEBS fragment size Dapeng Mi
@ 2026-05-29  8:32 ` Mi, Dapeng
  23 siblings, 0 replies; 44+ messages in thread
From: Mi, Dapeng @ 2026-05-29  8:32 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao

The corresponding perf tools support is here.
https://lore.kernel.org/all/20260529082451.591783-1-dapeng1.mi@linux.intel.com/

Thanks.


On 5/29/2026 3:56 PM, Dapeng Mi wrote:
> Patch layout:
> - Patches 1-6: Bug fixes and cleanup needed before enabling XSAVES-based
>   sampling in NMI context
> - Patches 7-9: FPU-related preparation, including xsaves_nmi() and
>   related cleanup/optimization
> - Patches 10-11: PMI-based XMM sampling support through the existing
>   sample_regs_intr/sample_regs_user interfaces for both
>   PERF_SAMPLE_REGS_INTR and PERF_SAMPLE_REGS_USER
> - Patches 12-19: New SIMD register interface and support for
>   XMM/YMM/ZMM/OPMASK, APX eGPRs, and SSP through that interface
> - Patch 20: Extend arch PEBS to support YMM/ZMM/OPMASK, APX eGPRs, and
>   SSP with the new interface
> - Patch 21: Enable new interface-based sampling
> - Patches 22-23: arch PEBS bug fix and sanity check
>
> Changes since V7:
> - Validate the return value of intel_pmu_init_hybrid() (Patch 01/23).
> - Replace pt_regs with x86_perf_regs in xen_pmu_irq_handler()
>   (Patch 06/23).
> - Improve event_has_extended_regs() (Patch 09/23).
> - Explicitly ensure the allocated XSAVE area is 64-byte aligned
>   (Patch 10/23, Sashiko).
> - Clear the SIMD register pointers in x86_user_regs to avoid exposing
>   stale register data to user space (Patch 11/23, Sashiko).
> - Refine the SIMD register interface and sample data layout, and add the
>   missing SIMD data reservation in perf_prepare_sample() for non-x86
>   architectures (Patch 12/23, Sashiko).
> - Improve perf_simd_reg_validate() for x86 (Patch 13/23, Sashiko).
> - Refine SSP sampling and ensure the GPR sub-group flag is set for PEBS
>   (Patch 19/23, Sashiko).
> - Fix the incorrect large-PEBS check for XMM (Patch 20/23, Sashiko).
> - Fix missing handling in x86_pmu_handle_guest_pebs() for back-to-back
>   PMI detection (Patch 22/23, Sashiko).
> - Strengthen the PEBS record header sanity checks to prevent invalid
>   memory access (Patch 23/23, Sashiko).
>
> Changes since V6:
> - Fix a potential overwrite issue in the hybrid PMU structure (patch 01/24)
> - Restrict PEBS events to work on GP counters if no PEBS baseline is
>   suggested (patch 02/24)
> - Use per-cpu x86_intr_regs for perf_event_nmi_handler() instead of
>   temporary variable (patch 06/24)
> - Add helper update_fpu_state_and_flag() to ensure TIF_NEED_FPU_LOAD is
>   set after save_fpregs_to_fpstate() call (patch 09/24)
> - Optimize and simplify x86_pmu_sample_xregs(), etc. (patch 11/24)
> - Add macro word_for_each_set_bit() to simplify u64 set-bit iteration
>   (patch 13/24)
> - Add sanity check for PEBS fragment size (patch 24/24)
>
> Changes since V5:
> - Introduce 3 commits to fix newly found PEBS issues (Patch 01~03/19)
> - Address Peter's comments, including:
>   * Fully support user-regs sampling of the SIMD/eGPRs/SSP registers
>   * Adjust newly added fields in perf_event_attr to avoid holes
>   * Fix the endian issue introduced by for_each_set_bit() in
>     event/core.c
>   * Remove some unnecessary macros from UAPI header perf_regs.h
>   * Enhance b2b NMI detection for all PEBS handlers to ensure identical
>     behaviors of all PEBS handlers
> - Split out the perf-tools patches, which will be posted in a separate
>   patchset later
>
> Changes since V4:
> - Rewrite some function comments and commit messages (Dave)
> - Add arch-PEBS based SIMD/eGPRs/SSP sampling support (Patch 15/19)
> - Fix the "suspicious NMI" warning observed on PTL/NVL P-core and DMR by
>   activating the back-to-back NMI detection mechanism (Patch 16/19)
> - Fix some minor issues on perf-tool patches (Patch 18/19)
>
> Changes since V3:
> - Drop the SIMD registers if an NMI hits kernel mode for REGS_USER.
> - Only dump the available regs, rather than zero and dump the
>   unavailable regs. It's possible that the dumped registers are a subset
>   of the requested registers.
> - Some minor updates to address Dapeng's comments in V3.
>
> Changes since V2:
> - Use the FPU format for the x86_pmu.ext_regs_mask as well
> - Add a check before invoking xsaves_nmi()
> - Add perf_simd_reg_check() to retrieve the number of available
>   registers. If the kernel fails to get the requested registers, e.g.,
>   XSAVES fails, nothing is dumped to userspace (V2 dumped all 0s).
> - Add POC perf tool patches
>
> Changes since V1:
> - Apply the new interfaces to configure and dump the SIMD registers
> - Utilize the existing FPU functions, e.g., xstate_calculate_size,
>   get_xsave_addr().
>
>
> This series adds support on x86 for sampling SIMD registers, APX eGPRs,
> and SSP with both PMI-based and PEBS-based sampling.
>
> Starting with Intel Ice Lake, PEBS can sample XMM registers, but PMI-based
> XMM sampling is still not available. On newer Intel platforms with
> architectural PEBS support, such as Clearwater Forest and Diamond Rapids,
> the hardware also gains support for sampling additional SIMD state
> (XMM/YMM/ZMM/OPMASK), APX extended GPRs, and SSP.
>
> To support these registers consistently across both PMI and PEBS, this
> series makes the following changes:
>
> 1. Adds a new perf_event_attr interface for SIMD register selection.
>    The existing sample_regs_user/sample_regs_intr bitmaps do not have
>    enough space to represent the full SIMD register set, so this series
>    introduces dedicated fields for SIMD and predicate register masks and
>    element widths.
>
> 2. Introduces a new sample data layout for SIMD register data.
>    SIMD register payload is appended after the GPR payload, and a new ABI
>    flag, PERF_SAMPLE_REGS_ABI_SIMD, indicates its presence.
>
> 3. Adds xsaves_nmi() to allow SIMD/eGPR/SSP sampling from PMI handlers in
>    NMI context.
>
> 4. Extends the arch PEBS path to support YMM/ZMM/OPMASK, APX eGPRs, and
>    SSP sampling.
>
>
> New perf_event_attr fields
> --------------------------
>
> This series adds the following fields to perf_event_attr:
>
>     /*
>      * Defines the sampling SIMD/PRED(predicate) register bitmaps and
>      * qword (8-byte) lengths.
>      *
>      * sample_simd_regs_enabled != 0 indicates SIMD/PRED registers are
>      * requested. The register bitmaps and element sizes are described by:
>      *
>      *   sample_simd_{vec,pred}_reg_{intr,user}
>      *   sample_simd_{vec,pred}_reg_qwords
>      *
>      * sample_simd_regs_enabled == 0 indicates no SIMD/PRED registers are
>      * requested.
>      */
>     __u16 sample_simd_regs_enabled;
>     __u16 sample_simd_pred_reg_qwords;
>     __u16 sample_simd_vec_reg_qwords;
>     __u16 __reserved_4;
>
>     __u32 sample_simd_pred_reg_intr;
>     __u32 sample_simd_pred_reg_user;
>     __u64 sample_simd_vec_reg_intr;
>     __u64 sample_simd_vec_reg_user;
>
> Field semantics:
> - sample_simd_vec_reg_qwords: qword count for regular SIMD registers
> - sample_simd_pred_reg_qwords: qword count for predicate registers
> - sample_simd_vec_reg_{intr,user}: SIMD register masks for
>   PERF_SAMPLE_REGS_INTR and PERF_SAMPLE_REGS_USER
> - sample_simd_pred_reg_{intr,user}: predicate register masks for
>   PERF_SAMPLE_REGS_INTR and PERF_SAMPLE_REGS_USER
> - sample_simd_regs_enabled: indicates whether the new SIMD fields are in use
>
> Examples:
>
> To sample ZMM registers for PERF_SAMPLE_REGS_INTR:
>
>     sample_simd_regs_enabled = 1
>     sample_simd_vec_reg_qwords = 8          // 512 bits = 8 qwords
>     sample_simd_vec_reg_intr = 0xffffffff   // zmm0-zmm31
>
> To sample OPMASK registers for PERF_SAMPLE_REGS_USER:
>
>     sample_simd_regs_enabled = 1
>     sample_simd_pred_reg_qwords = 1         // 64 bits = 1 qword
>     sample_simd_pred_reg_user = 0xff        // opmask0-opmask7
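
The two examples above could be expressed in C roughly as follows. This is
only a sketch: struct simd_sample_cfg is a local mirror of the fields
proposed in this series rather than a real UAPI header, and low_bits() and
the two request_*() helpers are hypothetical names used for illustration.

```c
#include <stdint.h>

/* Local mirror of the proposed perf_event_attr fields; not a UAPI type. */
struct simd_sample_cfg {
	uint16_t sample_simd_regs_enabled;
	uint16_t sample_simd_pred_reg_qwords;
	uint16_t sample_simd_vec_reg_qwords;
	uint16_t reserved;
	uint32_t sample_simd_pred_reg_intr;
	uint32_t sample_simd_pred_reg_user;
	uint64_t sample_simd_vec_reg_intr;
	uint64_t sample_simd_vec_reg_user;
};

/* Mask with the low 'nr' bits set, for nr in [0, 64]. */
static uint64_t low_bits(unsigned int nr)
{
	return nr >= 64 ? ~0ULL : (1ULL << nr) - 1;
}

/* zmm0-zmm31 for PERF_SAMPLE_REGS_INTR: 512 bits = 8 qwords per register. */
static void request_zmm_intr(struct simd_sample_cfg *cfg)
{
	cfg->sample_simd_regs_enabled = 1;
	cfg->sample_simd_vec_reg_qwords = 512 / 64;
	cfg->sample_simd_vec_reg_intr = low_bits(32);
}

/* opmask0-opmask7 for PERF_SAMPLE_REGS_USER: 64 bits = 1 qword each. */
static void request_opmask_user(struct simd_sample_cfg *cfg)
{
	cfg->sample_simd_regs_enabled = 1;
	cfg->sample_simd_pred_reg_qwords = 64 / 64;
	cfg->sample_simd_pred_reg_user = (uint32_t)low_bits(8);
}
```
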
>
> After introducing these fields, bits [63:32] in sample_regs_user and
> sample_regs_intr are reclaimed for APX eGPRs and SSP instead of the
> previous XMM0-XMM15 encoding.
>
> Discussion of the new SIMD register interface is available at:
> https://lore.kernel.org/lkml/20250617081458.GI1613376@noisy.programming.kicks-ass.net/
>
> Sample data layout
> ------------------
>
> SIMD register data is appended after the GPR data.
>
> For PERF_SAMPLE_REGS_USER:
>
>     { u64 abi;                      // enum perf_sample_regs_abi
>       u64 regs[weight(mask)];
>       struct {
>             u64 nr_vectors;         // 0 ... weight(sample_simd_vec_reg_user)
>             u64 vector_qwords;      // 0 ... sample_simd_vec_reg_qwords
>             u64 nr_pred;            // 0 ... weight(sample_simd_pred_reg_user)
>             u64 pred_qwords;        // 0 ... sample_simd_pred_reg_qwords
>             u64 data[nr_vectors * vector_qwords +
>                      nr_pred * pred_qwords];
>       } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
>     }
>
> For PERF_SAMPLE_REGS_INTR:
>
>     { u64 abi;                      // enum perf_sample_regs_abi
>       u64 regs[weight(mask)];
>       struct {
>             u64 nr_vectors;         // 0 ... weight(sample_simd_vec_reg_intr)
>             u64 vector_qwords;      // 0 ... sample_simd_vec_reg_qwords
>             u64 nr_pred;            // 0 ... weight(sample_simd_pred_reg_intr)
>             u64 pred_qwords;        // 0 ... sample_simd_pred_reg_qwords
>             u64 data[nr_vectors * vector_qwords +
>                      nr_pred * pred_qwords];
>       } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
>     }
>
> PERF_SAMPLE_REGS_ABI_SIMD indicates that SIMD register data is present.
>
> The metadata fields are encoded as u64 to keep perf tool parsing and
> cross-endian support straightforward.
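
To make the layout concrete, a reader of the sample could size and index
the SIMD block along these lines. This is a sketch under the layout quoted
above, not the perf tool's parser: struct simd_hdr and the helper names are
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* The four u64 metadata fields that precede data[] in the SIMD block. */
struct simd_hdr {
	uint64_t nr_vectors;
	uint64_t vector_qwords;
	uint64_t nr_pred;
	uint64_t pred_qwords;
};

/* Number of u64 entries in data[]. */
static uint64_t simd_data_qwords(const struct simd_hdr *h)
{
	return h->nr_vectors * h->vector_qwords + h->nr_pred * h->pred_qwords;
}

/* Total bytes of the SIMD block: metadata plus data[]. */
static uint64_t simd_block_bytes(const struct simd_hdr *h)
{
	return sizeof(*h) + simd_data_qwords(h) * sizeof(uint64_t);
}

/* Index into data[] of qword 'q' of vector register 'reg'. */
static uint64_t simd_vec_index(const struct simd_hdr *h, uint64_t reg, uint64_t q)
{
	return reg * h->vector_qwords + q;
}

/* Index into data[] of qword 'q' of predicate register 'reg'. */
static uint64_t simd_pred_index(const struct simd_hdr *h, uint64_t reg, uint64_t q)
{
	return h->nr_vectors * h->vector_qwords + reg * h->pred_qwords + q;
}
```

For nr_vectors 32, vector_qwords 8, nr_pred 8, pred_qwords 1, data[] holds
32 * 8 + 8 * 1 = 264 qwords, and the predicate registers start at index 256.
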
>
> Example
> -------
>
>   $ perf record -I?
>   available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>   R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27
>   R28 R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>
>   $ perf record --user-regs=?
>   available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>   R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27
>   R28 R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>
>   $ perf record -e branches:p \
>         -Iax,bx,r8,r16,r31,ssp,xmm,ymm,zmm,opmask \
>         -c 100000 ./test
>   $ perf report -D
>
>   ...
>   14027761992115 0xcf30 [0x8a8]: PERF_RECORD_SAMPLE(IP, 0x1): 29964/29964:
>   0xffffffff9f085e24 period: 100000 addr: 0
>   ... intr regs: mask 0x18001010003 ABI 64-bit
>   .... AX    0xdffffc0000000000
>   .... BX    0xffff8882297685e8
>   .... R8    0x0000000000000000
>   .... R16   0x0000000000000000
>   .... R31   0x0000000000000000
>   .... SSP   0x0000000000000000
>   ... SIMD ABI nr_vectors 32 vector_qwords 8 nr_pred 8 pred_qwords 1
>   .... ZMM[0][0] 0x616c2f656d6f682f
>   .... ZMM[0][1] 0x696c2f7265737562
>   ...
>   .... ZMM[31][7] 0x0000000000000000
>   .... OPMASK[0] 0x00000000fffffe00
>   ....
>   .... OPMASK[7] 0x0000000000000000
>   ...
>
> Testing
> -------
>
> The following intr-regs, user-regs, and combined sampling tests were run
> on DMR and NVL. The sampled register data was reported correctly and no
> issues were observed.
>
>   $ ./perf record -e branches:p \
>         -Iax,bx,r8,r16,r31,ssp,xmm,ymm,zmm,opmask -b -c 10000 sleep 1
>
>   $ ./perf record -e branches \
>         -Iax,bx,r8,r16,r31,ssp,xmm,ymm,zmm,opmask -b -c 10000 sleep 1
>
>   $ ./perf record -e branches:p \
>         --user-regs=ax,bx,r8,r16,r31,ssp,xmm,ymm,zmm,opmask \
>         -b -c 10000 sleep 1
>
>   $ ./perf record -e branches \
>         --user-regs=ax,bx,r8,r16,r31,ssp,xmm,ymm,zmm,opmask \
>         -b -c 10000 sleep 1
>
>   $ ./perf record -e branches:p \
>         -Ixmm,ymm,zmm,opmask \
>         --user-regs=ax,bx,r8,r16,r31,ssp \
>         -b -c 10000 sleep 1
>
>   $ ./perf record -e branches:p \
>         --user-regs=xmm,ymm,zmm,opmask \
>         -Iax,bx,r8,r16,r31,ssp \
>         -b -c 10000 sleep 1
>
>   $ ./perf record -e branches:p \
>         -Iax,bx,r9,r17,r30,ssp \
>         --user-regs=ax,bx,r8,r16,r31,ssp \
>         -b -c 10000 sleep 1
>
>   $ ./perf record -e branches:p \
>         -Ixmm,opmask --user-regs=zmm \
>         -b -c 10000 taskset -c 0 sleep 1
>
>
> History:
>   v7: https://lore.kernel.org/all/20260324004118.3772171-1-dapeng1.mi@linux.intel.com/
>   v6: https://lore.kernel.org/all/20260209072047.2180332-1-dapeng1.mi@linux.intel.com/
>   v5: https://lore.kernel.org/all/20251203065500.2597594-1-dapeng1.mi@linux.intel.com/
>   v4: https://lore.kernel.org/all/20250925061213.178796-1-dapeng1.mi@linux.intel.com/
>   v3: https://lore.kernel.org/lkml/20250815213435.1702022-1-kan.liang@linux.intel.com/
>   v2: https://lore.kernel.org/lkml/20250626195610.405379-1-kan.liang@linux.intel.com/
>   v1: https://lore.kernel.org/lkml/20250613134943.3186517-1-kan.liang@linux.intel.com/
>
> Dapeng Mi (19):
>   perf/x86/intel: Validate return value of intel_pmu_init_hybrid()
>   perf/x86: Move hybrid PMU initialization before x86_pmu_starting_cpu()
>   perf/x86/intel: Enable large PEBS sampling for XMMs
>   perf/x86/intel: Convert x86_perf_regs to per-cpu variables
>   perf: Eliminate duplicate arch-specific functions definations
>   perf/x86: Use x86_perf_regs in the x86 nmi handlers
>   x86/fpu: Ensure TIF_NEED_FPU_LOAD is set after saving FPU state
>   perf/x86: Enable XMM Register Sampling for Non-PEBS Events
>   perf/x86: Enable XMM register sampling for REGS_USER case
>   perf/x86: Support XMM sampling using sample_simd_vec_reg_* fields
>   perf/x86: Support YMM sampling using sample_simd_vec_reg_* fields
>   perf/x86: Support ZMM sampling using sample_simd_vec_reg_* fields
>   perf/x86: Support OPMASK sampling using sample_simd_pred_reg_* fields
>   perf: Enhance perf_reg_validate() with simd_enabled argument
>   perf/x86: Support eGPRs sampling using sample_regs_* fields
>   perf/x86: Support SSP sampling using sample_regs_* fields
>   perf/x86/intel: Support arch-PEBS based SIMD/eGPRs/SSP sampling
>   perf/x86: Activate back-to-back NMI detection for arch-PEBS induced
>     NMIs
>   perf/x86/intel: Add sanity check for PEBS fragment size
>
> Kan Liang (4):
>   x86/fpu/xstate: Add xsaves_nmi() helper
>   perf: Move and enhance has_extended_regs() for arch-specific use
>   perf: Add sampling support for SIMD registers
>   perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS capability
>
>  arch/arm/kernel/perf_regs.c           |   8 +-
>  arch/arm64/kernel/perf_regs.c         |   8 +-
>  arch/csky/kernel/perf_regs.c          |   8 +-
>  arch/loongarch/kernel/perf_regs.c     |   8 +-
>  arch/mips/kernel/perf_regs.c          |   8 +-
>  arch/parisc/kernel/perf_regs.c        |   8 +-
>  arch/powerpc/perf/perf_regs.c         |   2 +-
>  arch/riscv/kernel/perf_regs.c         |   8 +-
>  arch/s390/kernel/perf_regs.c          |   2 +-
>  arch/x86/events/core.c                | 415 +++++++++++++++++++++++++-
>  arch/x86/events/intel/core.c          | 232 ++++++++++++--
>  arch/x86/events/intel/ds.c            | 235 +++++++++++----
>  arch/x86/events/perf_event.h          |  85 +++++-
>  arch/x86/include/asm/fpu/sched.h      |   5 +-
>  arch/x86/include/asm/fpu/xstate.h     |   3 +
>  arch/x86/include/asm/msr-index.h      |   7 +
>  arch/x86/include/asm/perf_event.h     |  35 ++-
>  arch/x86/include/uapi/asm/perf_regs.h |  51 ++++
>  arch/x86/kernel/fpu/core.c            |  27 +-
>  arch/x86/kernel/fpu/xstate.c          |  25 +-
>  arch/x86/kernel/perf_regs.c           | 163 ++++++++--
>  arch/x86/xen/pmu.c                    |   5 +-
>  include/linux/perf_event.h            |  19 ++
>  include/linux/perf_regs.h             |  38 +--
>  include/uapi/linux/perf_event.h       |  49 ++-
>  kernel/events/core.c                  | 189 ++++++++++--
>  26 files changed, 1418 insertions(+), 225 deletions(-)
>
>
> base-commit: 66cc29745f2f5815482587bb9fbc1e8a3e6fcf00

Thread overview: 44+ messages
2026-05-29  7:56 [Patch v8 00/23] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
2026-05-29  7:56 ` [Patch v8 01/23] perf/x86/intel: Validate return value of intel_pmu_init_hybrid() Dapeng Mi
2026-05-29  8:53   ` sashiko-bot
2026-05-29 11:11   ` Peter Zijlstra
2026-05-29  7:56 ` [Patch v8 02/23] perf/x86: Move hybrid PMU initialization before x86_pmu_starting_cpu() Dapeng Mi
2026-05-29  8:51   ` sashiko-bot
2026-05-29  7:56 ` [Patch v8 03/23] perf/x86/intel: Enable large PEBS sampling for XMMs Dapeng Mi
2026-05-29  7:56 ` [Patch v8 04/23] perf/x86/intel: Convert x86_perf_regs to per-cpu variables Dapeng Mi
2026-05-29  7:56 ` [Patch v8 05/23] perf: Eliminate duplicate arch-specific functions definations Dapeng Mi
2026-05-29  7:56 ` [Patch v8 06/23] perf/x86: Use x86_perf_regs in the x86 nmi handlers Dapeng Mi
2026-05-29  7:56 ` [Patch v8 07/23] x86/fpu/xstate: Add xsaves_nmi() helper Dapeng Mi
2026-05-29  8:56   ` sashiko-bot
2026-05-29 11:32   ` Peter Zijlstra
2026-05-29  7:56 ` [Patch v8 08/23] x86/fpu: Ensure TIF_NEED_FPU_LOAD is set after saving FPU state Dapeng Mi
2026-05-29  7:56 ` [Patch v8 09/23] perf: Move and enhance has_extended_regs() for arch-specific use Dapeng Mi
2026-05-29  7:56 ` [Patch v8 10/23] perf/x86: Enable XMM Register Sampling for Non-PEBS Events Dapeng Mi
2026-05-29  9:02   ` sashiko-bot
2026-05-29 11:38   ` Peter Zijlstra
2026-05-29  7:56 ` [Patch v8 11/23] perf/x86: Enable XMM register sampling for REGS_USER case Dapeng Mi
2026-05-29  9:24   ` sashiko-bot
2026-05-29 11:42   ` Peter Zijlstra
2026-05-29  7:56 ` [Patch v8 12/23] perf: Add sampling support for SIMD registers Dapeng Mi
2026-05-29  8:36   ` sashiko-bot
2026-05-29  7:56 ` [Patch v8 13/23] perf/x86: Support XMM sampling using sample_simd_vec_reg_* fields Dapeng Mi
2026-05-29  8:49   ` sashiko-bot
2026-05-29  7:56 ` [Patch v8 14/23] perf/x86: Support YMM " Dapeng Mi
2026-05-29  8:47   ` sashiko-bot
2026-05-29  7:56 ` [Patch v8 15/23] perf/x86: Support ZMM " Dapeng Mi
2026-05-29  7:56 ` [Patch v8 16/23] perf/x86: Support OPMASK sampling using sample_simd_pred_reg_* fields Dapeng Mi
2026-05-29  9:21   ` sashiko-bot
2026-05-29  7:56 ` [Patch v8 17/23] perf: Enhance perf_reg_validate() with simd_enabled argument Dapeng Mi
2026-05-29  7:56 ` [Patch v8 18/23] perf/x86: Support eGPRs sampling using sample_regs_* fields Dapeng Mi
2026-05-29  9:31   ` sashiko-bot
2026-05-29  7:56 ` [Patch v8 19/23] perf/x86: Support SSP " Dapeng Mi
2026-05-29 10:03   ` sashiko-bot
2026-05-29  7:56 ` [Patch v8 20/23] perf/x86/intel: Support arch-PEBS based SIMD/eGPRs/SSP sampling Dapeng Mi
2026-05-29  9:45   ` sashiko-bot
2026-05-29  7:56 ` [Patch v8 21/23] perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS capability Dapeng Mi
2026-05-29 10:43   ` sashiko-bot
2026-05-29  7:56 ` [Patch v8 22/23] perf/x86: Activate back-to-back NMI detection for arch-PEBS induced NMIs Dapeng Mi
2026-05-29  9:34   ` sashiko-bot
2026-05-29  7:56 ` [Patch v8 23/23] perf/x86/intel: Add sanity check for PEBS fragment size Dapeng Mi
2026-05-29  9:54   ` sashiko-bot
2026-05-29  8:32 ` [Patch v8 00/23] Support SIMD/eGPRs/SSP registers sampling for perf Mi, Dapeng
