public inbox for linux-perf-users@vger.kernel.org
* [Patch v6 00/22] Support SIMD/eGPRs/SSP registers sampling for perf
@ 2026-02-09  7:20 Dapeng Mi
  2026-02-09  7:20 ` [Patch v6 01/22] perf/x86/intel: Restrict PEBS_ENABLE writes to PEBS-capable counters Dapeng Mi
                   ` (22 more replies)
  0 siblings, 23 replies; 45+ messages in thread
From: Dapeng Mi @ 2026-02-09  7:20 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Dapeng Mi

Changes since V5:
- Introduce 3 commits to fix newly found PEBS issues (Patch 01~03/19)
- Address Peter's comments, including:
  * Fully support user-regs sampling of the SIMD/eGPRs/SSP registers
  * Adjust newly added fields in perf_event_attr to avoid holes
  * Fix the endian issue introduced by for_each_set_bit() in
    event/core.c
  * Remove some unnecessary macros from UAPI header perf_regs.h
  * Enhance b2b NMI detection for all PEBS handlers to ensure identical
    behaviors of all PEBS handlers
- Split out the perf-tools patches, which will be posted as a separate
  patchset later

Changes since V4:
- Rewrite some function comments and commit messages (Dave)
- Add arch-PEBS based SIMD/eGPRs/SSP sampling support (Patch 15/19)
- Fix the "suspicious NMI" warning observed on PTL/NVL P-core and DMR by
  activating the back-to-back NMI detection mechanism (Patch 16/19)
- Fix some minor issues on perf-tool patches (Patch 18/19)

Changes since V3:
- Drop the SIMD registers if an NMI hits kernel mode for REGS_USER.
- Only dump the available regs, rather than zeroing and dumping the
  unavailable regs. It's possible that the dumped registers are a subset
  of the requested registers.
- Some minor updates to address Dapeng's comments in V3.

Changes since V2:
- Use the FPU format for the x86_pmu.ext_regs_mask as well
- Add a check before invoking xsaves_nmi()
- Add perf_simd_reg_check() to retrieve the number of available
  registers. If the kernel fails to get the requested registers, e.g.,
  XSAVES fails, nothing is dumped to userspace (V2 dumped all 0s).
- Add POC perf tool patches

Changes since V1:
- Apply the new interfaces to configure and dump the SIMD registers
- Utilize the existing FPU functions, e.g., xstate_calculate_size() and
  get_xsave_addr().

Starting from Intel Ice Lake, XMM registers can be collected in a PEBS
record. Future Architecture PEBS will include additional registers such
as YMM, ZMM, OPMASK, SSP and APX eGPRs, contingent on hardware support.

This patch set introduces a software solution that relaxes the hardware
requirement by using the XSAVES instruction to retrieve the requested
registers in the overflow handler. The feature is therefore no longer
limited to PEBS events or specific platforms. While the hardware
solution remains preferable due to its lower overhead and higher
accuracy, this software approach provides a viable alternative.

The solution is theoretically compatible with all x86 platforms but is
currently enabled only on newer platforms: Sapphire Rapids and later
P-core server platforms, Sierra Forest and later E-core server
platforms, and recent client platforms such as Arrow Lake, Panther Lake
and Nova Lake.

Newly supported registers include YMM, ZMM, OPMASK, SSP, and APX eGPRs.
Due to space constraints in sample_regs_user/intr, new fields have been 
introduced in the perf_event_attr structure to accommodate these
registers.

After a long discussion in V1,
https://lore.kernel.org/lkml/3f1c9a9e-cb63-47ff-a5e9-06555fa6cc9a@linux.intel.com/
the new fields below were introduced.

@@ -547,6 +549,25 @@ struct perf_event_attr {

        __u64   config3; /* extension of config2 */
        __u64   config4; /* extension of config3 */
+
+       /*
+        * Defines set of SIMD registers to dump on samples.
+        * The sample_simd_regs_enabled !=0 implies the
+        * set of SIMD registers is used to config all SIMD registers.
+        * If !sample_simd_regs_enabled, sample_regs_XXX may be used to
+        * config some SIMD registers on X86.
+        */
+       union {
+               __u16 sample_simd_regs_enabled;
+               __u16 sample_simd_pred_reg_qwords;
+       };
+       __u16   sample_simd_vec_reg_qwords;
+       __u32   __reserved_4;
+
+       __u32   sample_simd_pred_reg_intr;
+       __u32   sample_simd_pred_reg_user;
+       __u64   sample_simd_vec_reg_intr;
+       __u64   sample_simd_vec_reg_user;
 };

 /*
@@ -1020,7 +1041,15 @@ enum perf_event_type {
         *      } && PERF_SAMPLE_BRANCH_STACK
         *
         *      { u64                   abi; # enum perf_sample_regs_abi
-        *        u64                   regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
+        *        u64                   regs[weight(mask)];
+        *        struct {
+        *              u16 nr_vectors;         # 0 ... weight(sample_simd_vec_reg_user)
+        *              u16 vector_qwords;      # 0 ... sample_simd_vec_reg_qwords
+        *              u16 nr_pred;            # 0 ... weight(sample_simd_pred_reg_user)
+        *              u16 pred_qwords;        # 0 ... sample_simd_pred_reg_qwords
+        *              u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
+        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
+        *      } && PERF_SAMPLE_REGS_USER
         *
         *      { u64                   size;
         *        char                  data[size];
@@ -1047,7 +1076,15 @@ enum perf_event_type {
         *      { u64                   data_src; } && PERF_SAMPLE_DATA_SRC
         *      { u64                   transaction; } && PERF_SAMPLE_TRANSACTION
         *      { u64                   abi; # enum perf_sample_regs_abi
-        *        u64                   regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
+        *        u64                   regs[weight(mask)];
+        *        struct {
+        *              u16 nr_vectors;         # 0 ... weight(sample_simd_vec_reg_intr)
+        *              u16 vector_qwords;      # 0 ... sample_simd_vec_reg_qwords
+        *              u16 nr_pred;            # 0 ... weight(sample_simd_pred_reg_intr)
+        *              u16 pred_qwords;        # 0 ... sample_simd_pred_reg_qwords
+        *              u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
+        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
+        *      } && PERF_SAMPLE_REGS_INTR
         *      { u64                   phys_addr;} && PERF_SAMPLE_PHYS_ADDR
         *      { u64                   cgroup;} && PERF_SAMPLE_CGROUP
         *      { u64                   data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE


To maintain simplicity, a single field per register class,
sample_simd_{vec|pred}_reg_qwords, is introduced to indicate the
register width in qwords. For example:
- sample_simd_vec_reg_qwords = 2 for XMM registers (128 bits) on x86
- sample_simd_vec_reg_qwords = 4 for YMM registers (256 bits) on x86

Four additional fields, sample_simd_{vec|pred}_reg_{intr|user}, hold the
bitmap of registers to sample. For instance, the bitmap for the x86
XMM registers is 0xffff (16 XMM registers). Although users can
theoretically sample a subset of the registers, the current perf-tool
implementation only supports sampling all registers of a given type, to
avoid complexity.

A new ABI flag, PERF_SAMPLE_REGS_ABI_SIMD, is introduced to signal user
space tools about the presence of SIMD registers in sampling records.
When this flag is detected, tools should recognize that extra SIMD
register data follows the general register data. The layout of the
extra SIMD register data is as follows.

   u16 nr_vectors;
   u16 vector_qwords;
   u16 nr_pred;
   u16 pred_qwords;
   u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];

With this patch set, sampling for the aforementioned registers is
supported on the Intel Nova Lake platform.

Examples:
 $perf record -I?
 available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
 R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
 R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7

 $perf record --user-regs=?
 available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
 R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
 R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7

 $perf record -e branches:p -Iax,bx,r8,r16,r31,ssp,xmm,ymm,zmm,opmask -c 100000 ./test
 $perf report -D

 ... ...
 14027761992115 0xcf30 [0x8a8]: PERF_RECORD_SAMPLE(IP, 0x1): 29964/29964:
 0xffffffff9f085e24 period: 100000 addr: 0
 ... intr regs: mask 0x18001010003 ABI 64-bit
 .... AX    0xdffffc0000000000
 .... BX    0xffff8882297685e8
 .... R8    0x0000000000000000
 .... R16   0x0000000000000000
 .... R31   0x0000000000000000
 .... SSP   0x0000000000000000
 ... SIMD ABI nr_vectors 32 vector_qwords 8 nr_pred 8 pred_qwords 1
 .... ZMM  [0] 0xffffffffffffffff
 .... ZMM  [0] 0x0000000000000001
 .... ZMM  [0] 0x0000000000000000
 .... ZMM  [0] 0x0000000000000000
 .... ZMM  [0] 0x0000000000000000
 .... ZMM  [0] 0x0000000000000000
 .... ZMM  [0] 0x0000000000000000
 .... ZMM  [0] 0x0000000000000000
 .... ZMM  [1] 0x003a6b6165506d56
 ... ...
 .... ZMM  [31] 0x0000000000000000
 .... ZMM  [31] 0x0000000000000000
 .... ZMM  [31] 0x0000000000000000
 .... ZMM  [31] 0x0000000000000000
 .... ZMM  [31] 0x0000000000000000
 .... ZMM  [31] 0x0000000000000000
 .... ZMM  [31] 0x0000000000000000
 .... ZMM  [31] 0x0000000000000000
 .... OPMASK[0] 0x00000000fffffe00
 .... OPMASK[1] 0x0000000000ffffff
 .... OPMASK[2] 0x000000000000007f
 .... OPMASK[3] 0x0000000000000000
 .... OPMASK[4] 0x0000000000010080
 .... OPMASK[5] 0x0000000000000000
 .... OPMASK[6] 0x0000400004000000
 .... OPMASK[7] 0x0000000000000000
 ... ...


History:
  v5: https://lore.kernel.org/all/20251203065500.2597594-1-dapeng1.mi@linux.intel.com/
  v4: https://lore.kernel.org/all/20250925061213.178796-1-dapeng1.mi@linux.intel.com/
  v3: https://lore.kernel.org/lkml/20250815213435.1702022-1-kan.liang@linux.intel.com/
  v2: https://lore.kernel.org/lkml/20250626195610.405379-1-kan.liang@linux.intel.com/
  v1: https://lore.kernel.org/lkml/20250613134943.3186517-1-kan.liang@linux.intel.com/


Dapeng Mi (10):
  perf/x86/intel: Restrict PEBS_ENABLE writes to PEBS-capable counters
  perf/x86/intel: Enable large PEBS sampling for XMMs
  perf/x86/intel: Convert x86_perf_regs to per-cpu variables
  perf: Eliminate duplicate arch-specific function definitions
  x86/fpu: Ensure TIF_NEED_FPU_LOAD is set after saving FPU state
  perf/x86: Enable XMM register sampling for non-PEBS events
  perf/x86: Enable XMM register sampling for REGS_USER case
  perf: Enhance perf_reg_validate() with simd_enabled argument
  perf/x86/intel: Enable arch-PEBS based SIMD/eGPRs/SSP sampling
  perf/x86: Activate back-to-back NMI detection for arch-PEBS induced
    NMIs

Kan Liang (12):
  perf/x86: Use x86_perf_regs in the x86 nmi handler
  perf/x86: Introduce x86-specific x86_pmu_setup_regs_data()
  x86/fpu/xstate: Add xsaves_nmi() helper
  perf: Move and rename has_extended_regs() for ARCH-specific use
  perf: Add sampling support for SIMD registers
  perf/x86: Enable XMM sampling using sample_simd_vec_reg_* fields
  perf/x86: Enable YMM sampling using sample_simd_vec_reg_* fields
  perf/x86: Enable ZMM sampling using sample_simd_vec_reg_* fields
  perf/x86: Enable OPMASK sampling using sample_simd_pred_reg_* fields
  perf/x86: Enable eGPRs sampling using sample_regs_* fields
  perf/x86: Enable SSP sampling using sample_regs_* fields
  perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS capability

 arch/arm/kernel/perf_regs.c           |   8 +-
 arch/arm64/kernel/perf_regs.c         |   8 +-
 arch/csky/kernel/perf_regs.c          |   8 +-
 arch/loongarch/kernel/perf_regs.c     |   8 +-
 arch/mips/kernel/perf_regs.c          |   8 +-
 arch/parisc/kernel/perf_regs.c        |   8 +-
 arch/powerpc/perf/perf_regs.c         |   2 +-
 arch/riscv/kernel/perf_regs.c         |   8 +-
 arch/s390/kernel/perf_regs.c          |   2 +-
 arch/x86/events/core.c                | 387 +++++++++++++++++++++++++-
 arch/x86/events/intel/core.c          | 131 ++++++++-
 arch/x86/events/intel/ds.c            | 164 ++++++++---
 arch/x86/events/perf_event.h          |  85 +++++-
 arch/x86/include/asm/fpu/sched.h      |   2 +-
 arch/x86/include/asm/fpu/xstate.h     |   3 +
 arch/x86/include/asm/msr-index.h      |   7 +
 arch/x86/include/asm/perf_event.h     |  38 ++-
 arch/x86/include/uapi/asm/perf_regs.h |  49 ++++
 arch/x86/kernel/fpu/core.c            |  12 +-
 arch/x86/kernel/fpu/xstate.c          |  25 +-
 arch/x86/kernel/perf_regs.c           | 134 +++++++--
 include/linux/perf_event.h            |  16 ++
 include/linux/perf_regs.h             |  36 +--
 include/uapi/linux/perf_event.h       |  45 ++-
 kernel/events/core.c                  | 132 ++++++++-
 25 files changed, 1144 insertions(+), 182 deletions(-)


base-commit: 7db06e329af30dcb170a6782c1714217ad65033d
-- 
2.34.1


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [Patch v6 01/22] perf/x86/intel: Restrict PEBS_ENABLE writes to PEBS-capable counters
  2026-02-09  7:20 [Patch v6 00/22] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
@ 2026-02-09  7:20 ` Dapeng Mi
  2026-02-10 15:36   ` Peter Zijlstra
  2026-02-09  7:20 ` [Patch v6 02/22] perf/x86/intel: Enable large PEBS sampling for XMMs Dapeng Mi
                   ` (21 subsequent siblings)
  22 siblings, 1 reply; 45+ messages in thread
From: Dapeng Mi @ 2026-02-09  7:20 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Dapeng Mi

Before the introduction of extended PEBS, PEBS supported only
general-purpose (GP) counters. In a virtual machine (VM) environment,
the PEBS_BASELINE bit in PERF_CAPABILITIES may not be set, but the PEBS
format could be indicated as 4 or higher. In such cases, PEBS events
might be scheduled to fixed counters, and writing the corresponding bits
into the PEBS_ENABLE MSR could cause a #GP fault.

To prevent writing unsupported bits into the PEBS_ENABLE MSR, ensure
cpuc->pebs_enabled aligns with x86_pmu.pebs_capable and restrict the
writes to only PEBS-capable counter bits.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---

V6: new patch.

 arch/x86/events/intel/core.c |  6 ++++--
 arch/x86/events/intel/ds.c   | 11 +++++++----
 2 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index f3ae1f8ee3cd..546ebc7e1624 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3554,8 +3554,10 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
 		 * cpuc->enabled has been forced to 0 in PMI.
 		 * Update the MSR if pebs_enabled is changed.
 		 */
-		if (pebs_enabled != cpuc->pebs_enabled)
-			wrmsrq(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
+		if (pebs_enabled != cpuc->pebs_enabled) {
+			wrmsrq(MSR_IA32_PEBS_ENABLE,
+			       cpuc->pebs_enabled & x86_pmu.pebs_capable);
+		}
 
 		/*
 		 * Above PEBS handler (PEBS counters snapshotting) has updated fixed
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 5027afc97b65..57805c6ba0c3 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1963,6 +1963,7 @@ void intel_pmu_pebs_disable(struct perf_event *event)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	struct hw_perf_event *hwc = &event->hw;
+	u64 pebs_enabled;
 
 	__intel_pmu_pebs_disable(event);
 
@@ -1974,16 +1975,18 @@ void intel_pmu_pebs_disable(struct perf_event *event)
 
 	intel_pmu_pebs_via_pt_disable(event);
 
-	if (cpuc->enabled)
-		wrmsrq(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
+	pebs_enabled = cpuc->pebs_enabled & x86_pmu.pebs_capable;
+	if (pebs_enabled)
+		wrmsrq(MSR_IA32_PEBS_ENABLE, pebs_enabled);
 }
 
 void intel_pmu_pebs_enable_all(void)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+	u64 pebs_enabled = cpuc->pebs_enabled & x86_pmu.pebs_capable;
 
-	if (cpuc->pebs_enabled)
-		wrmsrq(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
+	if (pebs_enabled)
+		wrmsrq(MSR_IA32_PEBS_ENABLE, pebs_enabled);
 }
 
 void intel_pmu_pebs_disable_all(void)
-- 
2.34.1



* [Patch v6 02/22] perf/x86/intel: Enable large PEBS sampling for XMMs
  2026-02-09  7:20 [Patch v6 00/22] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
  2026-02-09  7:20 ` [Patch v6 01/22] perf/x86/intel: Restrict PEBS_ENABLE writes to PEBS-capable counters Dapeng Mi
@ 2026-02-09  7:20 ` Dapeng Mi
  2026-02-09  7:20 ` [Patch v6 03/22] perf/x86/intel: Convert x86_perf_regs to per-cpu variables Dapeng Mi
                   ` (20 subsequent siblings)
  22 siblings, 0 replies; 45+ messages in thread
From: Dapeng Mi @ 2026-02-09  7:20 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Dapeng Mi

Modern PEBS hardware supports sampling XMM registers directly, so large
PEBS can be enabled for XMM registers just as it is for the GPRs.

Reported-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---

V6: new patch.

 arch/x86/events/intel/core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 546ebc7e1624..5ed26b83c61d 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4425,7 +4425,8 @@ static unsigned long intel_pmu_large_pebs_flags(struct perf_event *event)
 		flags &= ~PERF_SAMPLE_REGS_USER;
 	if (event->attr.sample_regs_user & ~PEBS_GP_REGS)
 		flags &= ~PERF_SAMPLE_REGS_USER;
-	if (event->attr.sample_regs_intr & ~PEBS_GP_REGS)
+	if (event->attr.sample_regs_intr &
+	    ~(PEBS_GP_REGS | PERF_REG_EXTENDED_MASK))
 		flags &= ~PERF_SAMPLE_REGS_INTR;
 	return flags;
 }
-- 
2.34.1



* [Patch v6 03/22] perf/x86/intel: Convert x86_perf_regs to per-cpu variables
  2026-02-09  7:20 [Patch v6 00/22] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
  2026-02-09  7:20 ` [Patch v6 01/22] perf/x86/intel: Restrict PEBS_ENABLE writes to PEBS-capable counters Dapeng Mi
  2026-02-09  7:20 ` [Patch v6 02/22] perf/x86/intel: Enable large PEBS sampling for XMMs Dapeng Mi
@ 2026-02-09  7:20 ` Dapeng Mi
  2026-02-09  7:20 ` [Patch v6 04/22] perf: Eliminate duplicate arch-specific function definitions Dapeng Mi
                   ` (19 subsequent siblings)
  22 siblings, 0 replies; 45+ messages in thread
From: Dapeng Mi @ 2026-02-09  7:20 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Dapeng Mi

Currently, the intel_pmu_drain_pebs_icl() and intel_pmu_drain_arch_pebs()
helpers define many temporary variables. Upcoming patches will add new
fields like *ymm_regs and *zmm_regs to the x86_perf_regs structure to
support sampling for these SIMD registers. This would increase the stack
size consumed by these helpers, potentially triggering the warning:
"the frame size of 1048 bytes is larger than 1024 bytes
 [-Wframe-larger-than=]".

To eliminate this warning, convert x86_perf_regs to per-cpu variables.

No functional changes are intended.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---

V6: new patch.

 arch/x86/events/intel/ds.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 57805c6ba0c3..87bf8672f5a8 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -3174,14 +3174,16 @@ __intel_pmu_handle_last_pebs_record(struct pt_regs *iregs,
 
 }
 
+DEFINE_PER_CPU(struct x86_perf_regs, pebs_perf_regs);
+
 static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_data *data)
 {
 	short counts[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS] = {};
 	void *last[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS];
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	struct debug_store *ds = cpuc->ds;
-	struct x86_perf_regs perf_regs;
-	struct pt_regs *regs = &perf_regs.regs;
+	struct x86_perf_regs *perf_regs = this_cpu_ptr(&pebs_perf_regs);
+	struct pt_regs *regs = &perf_regs->regs;
 	struct pebs_basic *basic;
 	void *base, *at, *top;
 	u64 mask;
@@ -3231,8 +3233,8 @@ static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
 	void *last[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS];
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	union arch_pebs_index index;
-	struct x86_perf_regs perf_regs;
-	struct pt_regs *regs = &perf_regs.regs;
+	struct x86_perf_regs *perf_regs = this_cpu_ptr(&pebs_perf_regs);
+	struct pt_regs *regs = &perf_regs->regs;
 	void *base, *at, *top;
 	u64 mask;
 
-- 
2.34.1



* [Patch v6 04/22] perf: Eliminate duplicate arch-specific function definitions
  2026-02-09  7:20 [Patch v6 00/22] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (2 preceding siblings ...)
  2026-02-09  7:20 ` [Patch v6 03/22] perf/x86/intel: Convert x86_perf_regs to per-cpu variables Dapeng Mi
@ 2026-02-09  7:20 ` Dapeng Mi
  2026-02-09  7:20 ` [Patch v6 05/22] perf/x86: Use x86_perf_regs in the x86 nmi handler Dapeng Mi
                   ` (18 subsequent siblings)
  22 siblings, 0 replies; 45+ messages in thread
From: Dapeng Mi @ 2026-02-09  7:20 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Dapeng Mi

Define common default __weak implementations of perf_reg_value(),
perf_reg_validate(), perf_reg_abi() and perf_get_regs_user(). This
eliminates the duplicated arch-specific definitions.

No functional changes intended.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/arm/kernel/perf_regs.c       |  6 ------
 arch/arm64/kernel/perf_regs.c     |  6 ------
 arch/csky/kernel/perf_regs.c      |  6 ------
 arch/loongarch/kernel/perf_regs.c |  6 ------
 arch/mips/kernel/perf_regs.c      |  6 ------
 arch/parisc/kernel/perf_regs.c    |  6 ------
 arch/riscv/kernel/perf_regs.c     |  6 ------
 arch/x86/kernel/perf_regs.c       |  6 ------
 include/linux/perf_regs.h         | 32 ++++++-------------------------
 kernel/events/core.c              | 22 +++++++++++++++++++++
 10 files changed, 28 insertions(+), 74 deletions(-)

diff --git a/arch/arm/kernel/perf_regs.c b/arch/arm/kernel/perf_regs.c
index 0529f90395c9..d575a4c3ca56 100644
--- a/arch/arm/kernel/perf_regs.c
+++ b/arch/arm/kernel/perf_regs.c
@@ -31,9 +31,3 @@ u64 perf_reg_abi(struct task_struct *task)
 	return PERF_SAMPLE_REGS_ABI_32;
 }
 
-void perf_get_regs_user(struct perf_regs *regs_user,
-			struct pt_regs *regs)
-{
-	regs_user->regs = task_pt_regs(current);
-	regs_user->abi = perf_reg_abi(current);
-}
diff --git a/arch/arm64/kernel/perf_regs.c b/arch/arm64/kernel/perf_regs.c
index b4eece3eb17d..70e2f13f587f 100644
--- a/arch/arm64/kernel/perf_regs.c
+++ b/arch/arm64/kernel/perf_regs.c
@@ -98,9 +98,3 @@ u64 perf_reg_abi(struct task_struct *task)
 		return PERF_SAMPLE_REGS_ABI_64;
 }
 
-void perf_get_regs_user(struct perf_regs *regs_user,
-			struct pt_regs *regs)
-{
-	regs_user->regs = task_pt_regs(current);
-	regs_user->abi = perf_reg_abi(current);
-}
diff --git a/arch/csky/kernel/perf_regs.c b/arch/csky/kernel/perf_regs.c
index 09b7f88a2d6a..94601f37b596 100644
--- a/arch/csky/kernel/perf_regs.c
+++ b/arch/csky/kernel/perf_regs.c
@@ -31,9 +31,3 @@ u64 perf_reg_abi(struct task_struct *task)
 	return PERF_SAMPLE_REGS_ABI_32;
 }
 
-void perf_get_regs_user(struct perf_regs *regs_user,
-			struct pt_regs *regs)
-{
-	regs_user->regs = task_pt_regs(current);
-	regs_user->abi = perf_reg_abi(current);
-}
diff --git a/arch/loongarch/kernel/perf_regs.c b/arch/loongarch/kernel/perf_regs.c
index 263ac4ab5af6..8dd604f01745 100644
--- a/arch/loongarch/kernel/perf_regs.c
+++ b/arch/loongarch/kernel/perf_regs.c
@@ -45,9 +45,3 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 	return regs->regs[idx];
 }
 
-void perf_get_regs_user(struct perf_regs *regs_user,
-			struct pt_regs *regs)
-{
-	regs_user->regs = task_pt_regs(current);
-	regs_user->abi = perf_reg_abi(current);
-}
diff --git a/arch/mips/kernel/perf_regs.c b/arch/mips/kernel/perf_regs.c
index e686780d1647..7736d3c5ebd2 100644
--- a/arch/mips/kernel/perf_regs.c
+++ b/arch/mips/kernel/perf_regs.c
@@ -60,9 +60,3 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 	return (s64)v; /* Sign extend if 32-bit. */
 }
 
-void perf_get_regs_user(struct perf_regs *regs_user,
-			struct pt_regs *regs)
-{
-	regs_user->regs = task_pt_regs(current);
-	regs_user->abi = perf_reg_abi(current);
-}
diff --git a/arch/parisc/kernel/perf_regs.c b/arch/parisc/kernel/perf_regs.c
index 10a1a5f06a18..b9fe1f2fcb9b 100644
--- a/arch/parisc/kernel/perf_regs.c
+++ b/arch/parisc/kernel/perf_regs.c
@@ -53,9 +53,3 @@ u64 perf_reg_abi(struct task_struct *task)
 	return PERF_SAMPLE_REGS_ABI_64;
 }
 
-void perf_get_regs_user(struct perf_regs *regs_user,
-			struct pt_regs *regs)
-{
-	regs_user->regs = task_pt_regs(current);
-	regs_user->abi = perf_reg_abi(current);
-}
diff --git a/arch/riscv/kernel/perf_regs.c b/arch/riscv/kernel/perf_regs.c
index fd304a248de6..3bba8deababb 100644
--- a/arch/riscv/kernel/perf_regs.c
+++ b/arch/riscv/kernel/perf_regs.c
@@ -35,9 +35,3 @@ u64 perf_reg_abi(struct task_struct *task)
 #endif
 }
 
-void perf_get_regs_user(struct perf_regs *regs_user,
-			struct pt_regs *regs)
-{
-	regs_user->regs = task_pt_regs(current);
-	regs_user->abi = perf_reg_abi(current);
-}
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 624703af80a1..81204cb7f723 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -100,12 +100,6 @@ u64 perf_reg_abi(struct task_struct *task)
 	return PERF_SAMPLE_REGS_ABI_32;
 }
 
-void perf_get_regs_user(struct perf_regs *regs_user,
-			struct pt_regs *regs)
-{
-	regs_user->regs = task_pt_regs(current);
-	regs_user->abi = perf_reg_abi(current);
-}
 #else /* CONFIG_X86_64 */
 #define REG_NOSUPPORT ((1ULL << PERF_REG_X86_DS) | \
 		       (1ULL << PERF_REG_X86_ES) | \
diff --git a/include/linux/perf_regs.h b/include/linux/perf_regs.h
index f632c5725f16..144bcc3ff19f 100644
--- a/include/linux/perf_regs.h
+++ b/include/linux/perf_regs.h
@@ -9,6 +9,12 @@ struct perf_regs {
 	struct pt_regs	*regs;
 };
 
+u64 perf_reg_value(struct pt_regs *regs, int idx);
+int perf_reg_validate(u64 mask);
+u64 perf_reg_abi(struct task_struct *task);
+void perf_get_regs_user(struct perf_regs *regs_user,
+			struct pt_regs *regs);
+
 #ifdef CONFIG_HAVE_PERF_REGS
 #include <asm/perf_regs.h>
 
@@ -16,35 +22,9 @@ struct perf_regs {
 #define PERF_REG_EXTENDED_MASK	0
 #endif
 
-u64 perf_reg_value(struct pt_regs *regs, int idx);
-int perf_reg_validate(u64 mask);
-u64 perf_reg_abi(struct task_struct *task);
-void perf_get_regs_user(struct perf_regs *regs_user,
-			struct pt_regs *regs);
 #else
 
 #define PERF_REG_EXTENDED_MASK	0
 
-static inline u64 perf_reg_value(struct pt_regs *regs, int idx)
-{
-	return 0;
-}
-
-static inline int perf_reg_validate(u64 mask)
-{
-	return mask ? -ENOSYS : 0;
-}
-
-static inline u64 perf_reg_abi(struct task_struct *task)
-{
-	return PERF_SAMPLE_REGS_ABI_NONE;
-}
-
-static inline void perf_get_regs_user(struct perf_regs *regs_user,
-				      struct pt_regs *regs)
-{
-	regs_user->regs = task_pt_regs(current);
-	regs_user->abi = perf_reg_abi(current);
-}
 #endif /* CONFIG_HAVE_PERF_REGS */
 #endif /* _LINUX_PERF_REGS_H */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index da013b9a595f..8410b1a7ef3b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7723,6 +7723,28 @@ unsigned long perf_instruction_pointer(struct perf_event *event,
 	return perf_arch_instruction_pointer(regs);
 }
 
+u64 __weak perf_reg_value(struct pt_regs *regs, int idx)
+{
+	return 0;
+}
+
+int __weak perf_reg_validate(u64 mask)
+{
+	return mask ? -ENOSYS : 0;
+}
+
+u64 __weak perf_reg_abi(struct task_struct *task)
+{
+	return PERF_SAMPLE_REGS_ABI_NONE;
+}
+
+void __weak perf_get_regs_user(struct perf_regs *regs_user,
+			       struct pt_regs *regs)
+{
+	regs_user->regs = task_pt_regs(current);
+	regs_user->abi = perf_reg_abi(current);
+}
+
 static void
 perf_output_sample_regs(struct perf_output_handle *handle,
 			struct pt_regs *regs, u64 mask)
-- 
2.34.1



* [Patch v6 05/22] perf/x86: Use x86_perf_regs in the x86 nmi handler
  2026-02-09  7:20 [Patch v6 00/22] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (3 preceding siblings ...)
  2026-02-09  7:20 ` [Patch v6 04/22] perf: Eliminate duplicate arch-specific function definitions Dapeng Mi
@ 2026-02-09  7:20 ` Dapeng Mi
  2026-02-10 18:40   ` Peter Zijlstra
  2026-02-09  7:20 ` [Patch v6 06/22] perf/x86: Introduce x86-specific x86_pmu_setup_regs_data() Dapeng Mi
                   ` (17 subsequent siblings)
  22 siblings, 1 reply; 45+ messages in thread
From: Dapeng Mi @ 2026-02-09  7:20 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

More and more registers will be supported in overflow samples, e.g.,
more vector registers, SSP, etc. The generic pt_regs struct cannot
store all of them. Use the x86-specific x86_perf_regs instead.

The struct pt_regs *regs is still passed to x86_pmu_handle_irq(). There
is no functional change for the existing code.

AMD IBS's NMI handler doesn't utilize the static call
x86_pmu_handle_irq(), so the x86_perf_regs struct doesn't apply to AMD
IBS. Support can be added separately later when AMD IBS samples more
regs.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/core.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 6df73e8398cd..8c80d22864d8 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1785,6 +1785,7 @@ EXPORT_SYMBOL_FOR_KVM(perf_put_guest_lvtpc);
 static int
 perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
 {
+	struct x86_perf_regs x86_regs;
 	u64 start_clock;
 	u64 finish_clock;
 	int ret;
@@ -1808,7 +1809,8 @@ perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
 		return NMI_DONE;
 
 	start_clock = sched_clock();
-	ret = static_call(x86_pmu_handle_irq)(regs);
+	x86_regs.regs = *regs;
+	ret = static_call(x86_pmu_handle_irq)(&x86_regs.regs);
 	finish_clock = sched_clock();
 
 	perf_sample_event_took(finish_clock - start_clock);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [Patch v6 06/22] perf/x86: Introduce x86-specific x86_pmu_setup_regs_data()
  2026-02-09  7:20 [Patch v6 00/22] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (4 preceding siblings ...)
  2026-02-09  7:20 ` [Patch v6 05/22] perf/x86: Use x86_perf_regs in the x86 nmi handler Dapeng Mi
@ 2026-02-09  7:20 ` Dapeng Mi
  2026-02-09  7:20 ` [Patch v6 07/22] x86/fpu/xstate: Add xsaves_nmi() helper Dapeng Mi
                   ` (16 subsequent siblings)
  22 siblings, 0 replies; 45+ messages in thread
From: Dapeng Mi @ 2026-02-09  7:20 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

The current perf/x86 implementation uses the generic functions
perf_sample_regs_user() and perf_sample_regs_intr() to set up registers
data for sampling records. While this approach works for general
registers, it falls short when adding sampling support for SIMD and APX
eGPR registers on x86 platforms.

To address this, we introduce the x86-specific function
x86_pmu_setup_regs_data() for setting up register data on x86 platforms.

At present, x86_pmu_setup_regs_data() mirrors the logic of the generic
functions perf_sample_regs_user() and perf_sample_regs_intr().
Subsequent patches will introduce x86-specific enhancements.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/core.c       | 33 +++++++++++++++++++++++++++++++++
 arch/x86/events/intel/ds.c   |  9 ++++++---
 arch/x86/events/perf_event.h |  4 ++++
 3 files changed, 43 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 8c80d22864d8..d0753592a75b 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1699,6 +1699,39 @@ static void x86_pmu_del(struct perf_event *event, int flags)
 	static_call_cond(x86_pmu_del)(event);
 }
 
+void x86_pmu_setup_regs_data(struct perf_event *event,
+			     struct perf_sample_data *data,
+			     struct pt_regs *regs)
+{
+	struct perf_event_attr *attr = &event->attr;
+	u64 sample_type = attr->sample_type;
+
+	if (sample_type & PERF_SAMPLE_REGS_USER) {
+		if (user_mode(regs)) {
+			data->regs_user.abi = perf_reg_abi(current);
+			data->regs_user.regs = regs;
+		} else if (!(current->flags & PF_KTHREAD)) {
+			perf_get_regs_user(&data->regs_user, regs);
+		} else {
+			data->regs_user.abi = PERF_SAMPLE_REGS_ABI_NONE;
+			data->regs_user.regs = NULL;
+		}
+		data->dyn_size += sizeof(u64);
+		if (data->regs_user.regs)
+			data->dyn_size += hweight64(attr->sample_regs_user) * sizeof(u64);
+		data->sample_flags |= PERF_SAMPLE_REGS_USER;
+	}
+
+	if (sample_type & PERF_SAMPLE_REGS_INTR) {
+		data->regs_intr.regs = regs;
+		data->regs_intr.abi = perf_reg_abi(current);
+		data->dyn_size += sizeof(u64);
+		if (data->regs_intr.regs)
+			data->dyn_size += hweight64(attr->sample_regs_intr) * sizeof(u64);
+		data->sample_flags |= PERF_SAMPLE_REGS_INTR;
+	}
+}
+
 int x86_pmu_handle_irq(struct pt_regs *regs)
 {
 	struct perf_sample_data data;
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 87bf8672f5a8..07c2a670ba02 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -2445,6 +2445,7 @@ static inline void __setup_pebs_basic_group(struct perf_event *event,
 }
 
 static inline void __setup_pebs_gpr_group(struct perf_event *event,
+					  struct perf_sample_data *data,
 					  struct pt_regs *regs,
 					  struct pebs_gprs *gprs,
 					  u64 sample_type)
@@ -2454,8 +2455,10 @@ static inline void __setup_pebs_gpr_group(struct perf_event *event,
 		regs->flags &= ~PERF_EFLAGS_EXACT;
 	}
 
-	if (sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER))
+	if (sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)) {
 		adaptive_pebs_save_regs(regs, gprs);
+		x86_pmu_setup_regs_data(event, data, regs);
+	}
 }
 
 static inline void __setup_pebs_meminfo_group(struct perf_event *event,
@@ -2548,7 +2551,7 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
 		gprs = next_record;
 		next_record = gprs + 1;
 
-		__setup_pebs_gpr_group(event, regs, gprs, sample_type);
+		__setup_pebs_gpr_group(event, data, regs, gprs, sample_type);
 	}
 
 	if (format_group & PEBS_DATACFG_MEMINFO) {
@@ -2672,7 +2675,7 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
 		gprs = next_record;
 		next_record = gprs + 1;
 
-		__setup_pebs_gpr_group(event, regs,
+		__setup_pebs_gpr_group(event, data, regs,
 				       (struct pebs_gprs *)gprs,
 				       sample_type);
 	}
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index cd337f3ffd01..d9ebea3ebee5 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1306,6 +1306,10 @@ void x86_pmu_enable_event(struct perf_event *event);
 
 int x86_pmu_handle_irq(struct pt_regs *regs);
 
+void x86_pmu_setup_regs_data(struct perf_event *event,
+			     struct perf_sample_data *data,
+			     struct pt_regs *regs);
+
 void x86_pmu_show_pmu_cap(struct pmu *pmu);
 
 static inline int x86_pmu_num_counters(struct pmu *pmu)
-- 
2.34.1


* [Patch v6 07/22] x86/fpu/xstate: Add xsaves_nmi() helper
  2026-02-09  7:20 [Patch v6 00/22] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (5 preceding siblings ...)
  2026-02-09  7:20 ` [Patch v6 06/22] perf/x86: Introduce x86-specific x86_pmu_setup_regs_data() Dapeng Mi
@ 2026-02-09  7:20 ` Dapeng Mi
  2026-02-09  7:20 ` [Patch v6 08/22] x86/fpu: Ensure TIF_NEED_FPU_LOAD is set after saving FPU state Dapeng Mi
                   ` (15 subsequent siblings)
  22 siblings, 0 replies; 45+ messages in thread
From: Dapeng Mi @ 2026-02-09  7:20 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

Add xsaves_nmi() to save supported xsave states in NMI handler.

This function is similar to xsaves(), but should only be called within
an NMI handler. It returns the actual register contents at the moment
the NMI occurs.

Currently the perf subsystem is the sole user of this helper. It uses
this function to snapshot the SIMD (XMM/YMM/ZMM) and APX eGPR registers,
support for which is added in subsequent patches.

Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/include/asm/fpu/xstate.h |  1 +
 arch/x86/kernel/fpu/xstate.c      | 23 +++++++++++++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 7a7dc9d56027..38fa8ff26559 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -110,6 +110,7 @@ int xfeature_size(int xfeature_nr);
 
 void xsaves(struct xregs_state *xsave, u64 mask);
 void xrstors(struct xregs_state *xsave, u64 mask);
+void xsaves_nmi(struct xregs_state *xsave, u64 mask);
 
 int xfd_enable_feature(u64 xfd_err);
 
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 48113c5193aa..33e9a4562943 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1475,6 +1475,29 @@ void xrstors(struct xregs_state *xstate, u64 mask)
 	WARN_ON_ONCE(err);
 }
 
+/**
+ * xsaves_nmi - Save selected components to a kernel xstate buffer in NMI
+ * @xstate:	Pointer to the buffer
+ * @mask:	Feature mask to select the components to save
+ *
+ * This function is similar to xsaves(), but should only be called within
+ * an NMI handler. It returns the actual register contents at the moment
+ * the NMI occurs.
+ *
+ * Currently, the perf subsystem is the sole user of this helper. It uses
+ * the function to snapshot SIMD (XMM/YMM/ZMM) and APX eGPR registers.
+ */
+void xsaves_nmi(struct xregs_state *xstate, u64 mask)
+{
+	int err;
+
+	if (!in_nmi())
+		return;
+
+	XSTATE_OP(XSAVES, xstate, (u32)mask, (u32)(mask >> 32), err);
+	WARN_ON_ONCE(err);
+}
+
 #if IS_ENABLED(CONFIG_KVM)
 void fpstate_clear_xstate_component(struct fpstate *fpstate, unsigned int xfeature)
 {
-- 
2.34.1


* [Patch v6 08/22] x86/fpu: Ensure TIF_NEED_FPU_LOAD is set after saving FPU state
  2026-02-09  7:20 [Patch v6 00/22] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (6 preceding siblings ...)
  2026-02-09  7:20 ` [Patch v6 07/22] x86/fpu/xstate: Add xsaves_nmi() helper Dapeng Mi
@ 2026-02-09  7:20 ` Dapeng Mi
  2026-02-11 19:39   ` Chang S. Bae
  2026-02-09  7:20 ` [Patch v6 09/22] perf: Move and rename has_extended_regs() for ARCH-specific use Dapeng Mi
                   ` (14 subsequent siblings)
  22 siblings, 1 reply; 45+ messages in thread
From: Dapeng Mi @ 2026-02-09  7:20 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Dapeng Mi

Ensure that the TIF_NEED_FPU_LOAD flag is always set after saving the
FPU state. This guarantees that the user space FPU state has been saved
whenever the TIF_NEED_FPU_LOAD flag is set.

A subsequent patch will check the TIF_NEED_FPU_LOAD flag in NMI context
to decide whether the user space FPU state can be retrieved from the
saved task FPU state.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---

v6: new patch.

 arch/x86/include/asm/fpu/sched.h |  2 +-
 arch/x86/kernel/fpu/core.c       | 12 ++++++++----
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/fpu/sched.h b/arch/x86/include/asm/fpu/sched.h
index 89004f4ca208..2d57a7bf5406 100644
--- a/arch/x86/include/asm/fpu/sched.h
+++ b/arch/x86/include/asm/fpu/sched.h
@@ -36,8 +36,8 @@ static inline void switch_fpu(struct task_struct *old, int cpu)
 	    !(old->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		struct fpu *old_fpu = x86_task_fpu(old);
 
-		set_tsk_thread_flag(old, TIF_NEED_FPU_LOAD);
 		save_fpregs_to_fpstate(old_fpu);
+		set_tsk_thread_flag(old, TIF_NEED_FPU_LOAD);
 		/*
 		 * The save operation preserved register state, so the
 		 * fpu_fpregs_owner_ctx is still @old_fpu. Store the
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index da233f20ae6f..0f91a0d7e799 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -359,18 +359,22 @@ int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest)
 	struct fpstate *cur_fps = fpu->fpstate;
 
 	fpregs_lock();
-	if (!cur_fps->is_confidential && !test_thread_flag(TIF_NEED_FPU_LOAD))
+	if (!cur_fps->is_confidential && !test_thread_flag(TIF_NEED_FPU_LOAD)) {
 		save_fpregs_to_fpstate(fpu);
+		set_thread_flag(TIF_NEED_FPU_LOAD);
+	}
 
 	/* Swap fpstate */
 	if (enter_guest) {
-		fpu->__task_fpstate = cur_fps;
+		WRITE_ONCE(fpu->__task_fpstate, cur_fps);
+		barrier();
 		fpu->fpstate = guest_fps;
 		guest_fps->in_use = true;
 	} else {
 		guest_fps->in_use = false;
 		fpu->fpstate = fpu->__task_fpstate;
-		fpu->__task_fpstate = NULL;
+		barrier();
+		WRITE_ONCE(fpu->__task_fpstate, NULL);
 	}
 
 	cur_fps = fpu->fpstate;
@@ -456,8 +460,8 @@ void kernel_fpu_begin_mask(unsigned int kfpu_mask)
 
 	if (!(current->flags & (PF_KTHREAD | PF_USER_WORKER)) &&
 	    !test_thread_flag(TIF_NEED_FPU_LOAD)) {
-		set_thread_flag(TIF_NEED_FPU_LOAD);
 		save_fpregs_to_fpstate(x86_task_fpu(current));
+		set_thread_flag(TIF_NEED_FPU_LOAD);
 	}
 	__cpu_invalidate_fpregs_state();
 
-- 
2.34.1


* [Patch v6 09/22] perf: Move and rename has_extended_regs() for ARCH-specific use
  2026-02-09  7:20 [Patch v6 00/22] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (7 preceding siblings ...)
  2026-02-09  7:20 ` [Patch v6 08/22] x86/fpu: Ensure TIF_NEED_FPU_LOAD is set after saving FPU state Dapeng Mi
@ 2026-02-09  7:20 ` Dapeng Mi
  2026-02-09  7:20 ` [Patch v6 10/22] perf/x86: Enable XMM Register Sampling for Non-PEBS Events Dapeng Mi
                   ` (13 subsequent siblings)
  22 siblings, 0 replies; 45+ messages in thread
From: Dapeng Mi @ 2026-02-09  7:20 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

The has_extended_regs() function will be utilized in ARCH-specific code.
To facilitate this, move it to the header file perf_event.h.

Additionally, the function is renamed to event_has_extended_regs(), which
aligns with the existing naming conventions.

No functional change intended.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 include/linux/perf_event.h | 8 ++++++++
 kernel/events/core.c       | 8 +-------
 2 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 82e617fad165..b8a0f77412b3 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1534,6 +1534,14 @@ perf_event__output_id_sample(struct perf_event *event,
 extern void
 perf_log_lost_samples(struct perf_event *event, u64 lost);
 
+static inline bool event_has_extended_regs(struct perf_event *event)
+{
+	struct perf_event_attr *attr = &event->attr;
+
+	return (attr->sample_regs_user & PERF_REG_EXTENDED_MASK) ||
+	       (attr->sample_regs_intr & PERF_REG_EXTENDED_MASK);
+}
+
 static inline bool event_has_any_exclude_flag(struct perf_event *event)
 {
 	struct perf_event_attr *attr = &event->attr;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 8410b1a7ef3b..d487c55a4f3e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -12964,12 +12964,6 @@ int perf_pmu_unregister(struct pmu *pmu)
 }
 EXPORT_SYMBOL_GPL(perf_pmu_unregister);
 
-static inline bool has_extended_regs(struct perf_event *event)
-{
-	return (event->attr.sample_regs_user & PERF_REG_EXTENDED_MASK) ||
-	       (event->attr.sample_regs_intr & PERF_REG_EXTENDED_MASK);
-}
-
 static int perf_try_init_event(struct pmu *pmu, struct perf_event *event)
 {
 	struct perf_event_context *ctx = NULL;
@@ -13004,7 +12998,7 @@ static int perf_try_init_event(struct pmu *pmu, struct perf_event *event)
 		goto err_pmu;
 
 	if (!(pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS) &&
-	    has_extended_regs(event)) {
+	    event_has_extended_regs(event)) {
 		ret = -EOPNOTSUPP;
 		goto err_destroy;
 	}
-- 
2.34.1


* [Patch v6 10/22] perf/x86: Enable XMM Register Sampling for Non-PEBS Events
  2026-02-09  7:20 [Patch v6 00/22] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (8 preceding siblings ...)
  2026-02-09  7:20 ` [Patch v6 09/22] perf: Move and rename has_extended_regs() for ARCH-specific use Dapeng Mi
@ 2026-02-09  7:20 ` Dapeng Mi
  2026-02-15 23:58   ` Chang S. Bae
  2026-02-09  7:20 ` [Patch v6 11/22] perf/x86: Enable XMM register sampling for REGS_USER case Dapeng Mi
                   ` (12 subsequent siblings)
  22 siblings, 1 reply; 45+ messages in thread
From: Dapeng Mi @ 2026-02-09  7:20 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Dapeng Mi, Kan Liang

Previously, XMM register sampling was only available for PEBS events
starting from Icelake. This patch extends support to non-PEBS events by
utilizing the `xsaves` instruction, thereby completing the feature set.

To implement this, a 64-byte aligned buffer is required. A per-CPU
`ext_regs_buf` is introduced to store SIMD and other registers, with an
approximate size of 2K. The buffer is allocated with `kzalloc_node()`;
since `kmalloc()` naturally aligns allocations whose size is a power of
2, this guarantees the required 64-byte alignment.

This patch supports XMM sampling for non-PEBS events in the `REGS_INTR`
case. Support for `REGS_USER` will be added in a subsequent patch. For
PEBS events, XMM register sampling data is directly retrieved from PEBS
records.

Future support for additional vector registers (YMM/ZMM/OPMASK) is
planned. An `ext_regs_mask` is added to track the supported vector
register groups.

Co-developed-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---

V6: functions name refine.

 arch/x86/events/core.c            | 147 +++++++++++++++++++++++++++---
 arch/x86/events/intel/core.c      |  29 +++++-
 arch/x86/events/intel/ds.c        |  20 ++--
 arch/x86/events/perf_event.h      |  11 ++-
 arch/x86/include/asm/fpu/xstate.h |   2 +
 arch/x86/include/asm/perf_event.h |   5 +-
 arch/x86/kernel/fpu/xstate.c      |   2 +-
 7 files changed, 191 insertions(+), 25 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index d0753592a75b..3c0987e13edc 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -410,6 +410,45 @@ set_ext_hw_attr(struct hw_perf_event *hwc, struct perf_event *event)
 	return x86_pmu_extra_regs(val, event);
 }
 
+static DEFINE_PER_CPU(struct xregs_state *, ext_regs_buf);
+
+static void release_ext_regs_buffers(void)
+{
+	int cpu;
+
+	if (!x86_pmu.ext_regs_mask)
+		return;
+
+	for_each_possible_cpu(cpu) {
+		kfree(per_cpu(ext_regs_buf, cpu));
+		per_cpu(ext_regs_buf, cpu) = NULL;
+	}
+}
+
+static void reserve_ext_regs_buffers(void)
+{
+	bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED);
+	unsigned int size;
+	int cpu;
+
+	if (!x86_pmu.ext_regs_mask)
+		return;
+
+	size = xstate_calculate_size(x86_pmu.ext_regs_mask, compacted);
+
+	for_each_possible_cpu(cpu) {
+		per_cpu(ext_regs_buf, cpu) = kzalloc_node(size, GFP_KERNEL,
+							  cpu_to_node(cpu));
+		if (!per_cpu(ext_regs_buf, cpu))
+			goto err;
+	}
+
+	return;
+
+err:
+	release_ext_regs_buffers();
+}
+
 int x86_reserve_hardware(void)
 {
 	int err = 0;
@@ -422,6 +461,7 @@ int x86_reserve_hardware(void)
 			} else {
 				reserve_ds_buffers();
 				reserve_lbr_buffers();
+				reserve_ext_regs_buffers();
 			}
 		}
 		if (!err)
@@ -438,6 +478,7 @@ void x86_release_hardware(void)
 		release_pmc_hardware();
 		release_ds_buffers();
 		release_lbr_buffers();
+		release_ext_regs_buffers();
 		mutex_unlock(&pmc_reserve_mutex);
 	}
 }
@@ -655,18 +696,23 @@ int x86_pmu_hw_config(struct perf_event *event)
 			return -EINVAL;
 	}
 
-	/* sample_regs_user never support XMM registers */
-	if (unlikely(event->attr.sample_regs_user & PERF_REG_EXTENDED_MASK))
-		return -EINVAL;
-	/*
-	 * Besides the general purpose registers, XMM registers may
-	 * be collected in PEBS on some platforms, e.g. Icelake
-	 */
-	if (unlikely(event->attr.sample_regs_intr & PERF_REG_EXTENDED_MASK)) {
-		if (!(event->pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS))
-			return -EINVAL;
+	if (event->attr.sample_type & PERF_SAMPLE_REGS_INTR) {
+		/*
+		 * Besides the general purpose registers, XMM registers may
+		 * be collected as well.
+		 */
+		if (event_has_extended_regs(event)) {
+			if (!(event->pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS))
+				return -EINVAL;
+		}
+	}
 
-		if (!event->attr.precise_ip)
+	if (event->attr.sample_type & PERF_SAMPLE_REGS_USER) {
+		/*
+		 * Currently XMM register sampling for REGS_USER is not
+		 * supported yet.
+		 */
+		if (event_has_extended_regs(event))
 			return -EINVAL;
 	}
 
@@ -1699,9 +1745,9 @@ static void x86_pmu_del(struct perf_event *event, int flags)
 	static_call_cond(x86_pmu_del)(event);
 }
 
-void x86_pmu_setup_regs_data(struct perf_event *event,
-			     struct perf_sample_data *data,
-			     struct pt_regs *regs)
+static void x86_pmu_setup_basic_regs_data(struct perf_event *event,
+					  struct perf_sample_data *data,
+					  struct pt_regs *regs)
 {
 	struct perf_event_attr *attr = &event->attr;
 	u64 sample_type = attr->sample_type;
@@ -1732,6 +1778,79 @@ void x86_pmu_setup_regs_data(struct perf_event *event,
 	}
 }
 
+inline void x86_pmu_clear_perf_regs(struct pt_regs *regs)
+{
+	struct x86_perf_regs *perf_regs = container_of(regs, struct x86_perf_regs, regs);
+
+	perf_regs->xmm_regs = NULL;
+}
+
+static inline void __x86_pmu_sample_ext_regs(u64 mask)
+{
+	struct xregs_state *xsave = per_cpu(ext_regs_buf, smp_processor_id());
+
+	if (WARN_ON_ONCE(!xsave))
+		return;
+
+	xsaves_nmi(xsave, mask);
+}
+
+static inline void x86_pmu_update_ext_regs(struct x86_perf_regs *perf_regs,
+					   struct xregs_state *xsave, u64 bitmap)
+{
+	u64 mask;
+
+	if (!xsave)
+		return;
+
+	/* Filtered by what XSAVE really gives */
+	mask = bitmap & xsave->header.xfeatures;
+
+	if (mask & XFEATURE_MASK_SSE)
+		perf_regs->xmm_space = xsave->i387.xmm_space;
+}
+
+static void x86_pmu_sample_extended_regs(struct perf_event *event,
+					 struct perf_sample_data *data,
+					 struct pt_regs *regs,
+					 u64 ignore_mask)
+{
+	u64 sample_type = event->attr.sample_type;
+	struct x86_perf_regs *perf_regs;
+	struct xregs_state *xsave;
+	u64 intr_mask = 0;
+	u64 mask = 0;
+
+	perf_regs = container_of(regs, struct x86_perf_regs, regs);
+
+	if (event_has_extended_regs(event))
+		mask |= XFEATURE_MASK_SSE;
+
+	mask &= x86_pmu.ext_regs_mask;
+
+	if (sample_type & PERF_SAMPLE_REGS_INTR)
+		intr_mask = mask & ~ignore_mask;
+
+	if (intr_mask) {
+		__x86_pmu_sample_ext_regs(intr_mask);
+		xsave = per_cpu(ext_regs_buf, smp_processor_id());
+		x86_pmu_update_ext_regs(perf_regs, xsave, intr_mask);
+	}
+}
+
+void x86_pmu_setup_regs_data(struct perf_event *event,
+			     struct perf_sample_data *data,
+			     struct pt_regs *regs,
+			     u64 ignore_mask)
+{
+	x86_pmu_setup_basic_regs_data(event, data, regs);
+	/*
+	 * ignore_mask indicates the PEBS sampled extended regs
+	 * which may be unnecessary to sample again.
+	 */
+	x86_pmu_sample_extended_regs(event, data, regs, ignore_mask);
+}
+
 int x86_pmu_handle_irq(struct pt_regs *regs)
 {
 	struct perf_sample_data data;
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 5ed26b83c61d..ae7693e586d3 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3651,6 +3651,9 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
 		if (has_branch_stack(event))
 			intel_pmu_lbr_save_brstack(&data, cpuc, event);
 
+		x86_pmu_clear_perf_regs(regs);
+		x86_pmu_setup_regs_data(event, &data, regs, 0);
+
 		perf_event_overflow(event, &data, regs);
 	}
 
@@ -5880,8 +5883,30 @@ static inline void __intel_update_large_pebs_flags(struct pmu *pmu)
 	}
 }
 
-#define counter_mask(_gp, _fixed) ((_gp) | ((u64)(_fixed) << INTEL_PMC_IDX_FIXED))
+static void intel_extended_regs_init(struct pmu *pmu)
+{
+	/*
+	 * Extend the vector registers support to non-PEBS.
+	 * The feature is limited to newer Intel machines with
+	 * PEBS V4+ or archPerfmonExt (0x23) enabled for now.
+	 * In theory, the vector registers can be retrieved
+	 * as long as the CPU supports them. Support for older
+	 * generations may be added later if there is a
+	 * requirement.
+	 * Only support the extension when XSAVES is available.
+	 */
+	if (!boot_cpu_has(X86_FEATURE_XSAVES))
+		return;
 
+	if (!boot_cpu_has(X86_FEATURE_XMM) ||
+	    !cpu_has_xfeatures(XFEATURE_MASK_SSE, NULL))
+		return;
+
+	x86_pmu.ext_regs_mask |= XFEATURE_MASK_SSE;
+	x86_get_pmu(smp_processor_id())->capabilities |= PERF_PMU_CAP_EXTENDED_REGS;
+}
+
+#define counter_mask(_gp, _fixed) ((_gp) | ((u64)(_fixed) << INTEL_PMC_IDX_FIXED))
 static void update_pmu_cap(struct pmu *pmu)
 {
 	unsigned int eax, ebx, ecx, edx;
@@ -5945,6 +5970,8 @@ static void update_pmu_cap(struct pmu *pmu)
 		/* Perf Metric (Bit 15) and PEBS via PT (Bit 16) are hybrid enumeration */
 		rdmsrq(MSR_IA32_PERF_CAPABILITIES, hybrid(pmu, intel_cap).capabilities);
 	}
+
+	intel_extended_regs_init(pmu);
 }
 
 static void intel_pmu_check_hybrid_pmus(struct x86_hybrid_pmu *pmu)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 07c2a670ba02..229dbe368b65 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1735,8 +1735,7 @@ static u64 pebs_update_adaptive_cfg(struct perf_event *event)
 	if (gprs || (attr->precise_ip < 2) || tsx_weight)
 		pebs_data_cfg |= PEBS_DATACFG_GP;
 
-	if ((sample_type & PERF_SAMPLE_REGS_INTR) &&
-	    (attr->sample_regs_intr & PERF_REG_EXTENDED_MASK))
+	if (event_has_extended_regs(event))
 		pebs_data_cfg |= PEBS_DATACFG_XMMS;
 
 	if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
@@ -2455,10 +2454,8 @@ static inline void __setup_pebs_gpr_group(struct perf_event *event,
 		regs->flags &= ~PERF_EFLAGS_EXACT;
 	}
 
-	if (sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)) {
+	if (sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER))
 		adaptive_pebs_save_regs(regs, gprs);
-		x86_pmu_setup_regs_data(event, data, regs);
-	}
 }
 
 static inline void __setup_pebs_meminfo_group(struct perf_event *event,
@@ -2516,6 +2513,7 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
 	struct pebs_meminfo *meminfo = NULL;
 	struct pebs_gprs *gprs = NULL;
 	struct x86_perf_regs *perf_regs;
+	u64 ignore_mask = 0;
 	u64 format_group;
 	u16 retire;
 
@@ -2523,7 +2521,7 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
 		return;
 
 	perf_regs = container_of(regs, struct x86_perf_regs, regs);
-	perf_regs->xmm_regs = NULL;
+	x86_pmu_clear_perf_regs(regs);
 
 	format_group = basic->format_group;
 
@@ -2570,6 +2568,7 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
 	if (format_group & PEBS_DATACFG_XMMS) {
 		struct pebs_xmm *xmm = next_record;
 
+		ignore_mask |= XFEATURE_MASK_SSE;
 		next_record = xmm + 1;
 		perf_regs->xmm_regs = xmm->xmm;
 	}
@@ -2608,6 +2607,8 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
 		next_record += nr * sizeof(u64);
 	}
 
+	x86_pmu_setup_regs_data(event, data, regs, ignore_mask);
+
 	WARN_ONCE(next_record != __pebs + basic->format_size,
 			"PEBS record size %u, expected %llu, config %llx\n",
 			basic->format_size,
@@ -2633,6 +2634,7 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
 	struct arch_pebs_aux *meminfo = NULL;
 	struct arch_pebs_gprs *gprs = NULL;
 	struct x86_perf_regs *perf_regs;
+	u64 ignore_mask = 0;
 	void *next_record;
 	void *at = __pebs;
 
@@ -2640,7 +2642,7 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
 		return;
 
 	perf_regs = container_of(regs, struct x86_perf_regs, regs);
-	perf_regs->xmm_regs = NULL;
+	x86_pmu_clear_perf_regs(regs);
 
 	__setup_perf_sample_data(event, iregs, data);
 
@@ -2695,6 +2697,7 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
 
 		next_record += sizeof(struct arch_pebs_xer_header);
 
+		ignore_mask |= XFEATURE_MASK_SSE;
 		xmm = next_record;
 		perf_regs->xmm_regs = xmm->xmm;
 		next_record = xmm + 1;
@@ -2742,6 +2745,8 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
 		at = at + header->size;
 		goto again;
 	}
+
+	x86_pmu_setup_regs_data(event, data, regs, ignore_mask);
 }
 
 static inline void *
@@ -3404,6 +3409,7 @@ static void __init intel_ds_pebs_init(void)
 				x86_pmu.flags |= PMU_FL_PEBS_ALL;
 				x86_pmu.pebs_capable = ~0ULL;
 				pebs_qual = "-baseline";
+				x86_pmu.ext_regs_mask |= XFEATURE_MASK_SSE;
 				x86_get_pmu(smp_processor_id())->capabilities |= PERF_PMU_CAP_EXTENDED_REGS;
 			} else {
 				/* Only basic record supported */
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index d9ebea3ebee5..a32ee4f0c891 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1020,6 +1020,12 @@ struct x86_pmu {
 	struct extra_reg *extra_regs;
 	unsigned int flags;
 
+	/*
+	 * Extended regs, e.g., vector registers
+	 * Utilize the same format as the XFEATURE_MASK_*
+	 */
+	u64		ext_regs_mask;
+
 	/*
 	 * Intel host/guest support (KVM)
 	 */
@@ -1306,9 +1312,12 @@ void x86_pmu_enable_event(struct perf_event *event);
 
 int x86_pmu_handle_irq(struct pt_regs *regs);
 
+void x86_pmu_clear_perf_regs(struct pt_regs *regs);
+
 void x86_pmu_setup_regs_data(struct perf_event *event,
 			     struct perf_sample_data *data,
-			     struct pt_regs *regs);
+			     struct pt_regs *regs,
+			     u64 ignore_mask);
 
 void x86_pmu_show_pmu_cap(struct pmu *pmu);
 
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 38fa8ff26559..19dec5f0b1c7 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -112,6 +112,8 @@ void xsaves(struct xregs_state *xsave, u64 mask);
 void xrstors(struct xregs_state *xsave, u64 mask);
 void xsaves_nmi(struct xregs_state *xsave, u64 mask);
 
+unsigned int xstate_calculate_size(u64 xfeatures, bool compacted);
+
 int xfd_enable_feature(u64 xfd_err);
 
 #ifdef CONFIG_X86_64
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index ff5acb8b199b..7baa1b0f889f 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -709,7 +709,10 @@ extern void perf_events_lapic_init(void);
 struct pt_regs;
 struct x86_perf_regs {
 	struct pt_regs	regs;
-	u64		*xmm_regs;
+	union {
+		u64	*xmm_regs;
+		u32	*xmm_space;	/* for xsaves */
+	};
 };
 
 extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 33e9a4562943..7a98769d7ea0 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -587,7 +587,7 @@ static bool __init check_xstate_against_struct(int nr)
 	return true;
 }
 
-static unsigned int xstate_calculate_size(u64 xfeatures, bool compacted)
+unsigned int xstate_calculate_size(u64 xfeatures, bool compacted)
 {
 	unsigned int topmost = fls64(xfeatures) -  1;
 	unsigned int offset, i;
-- 
2.34.1


* [Patch v6 11/22] perf/x86: Enable XMM register sampling for REGS_USER case
  2026-02-09  7:20 [Patch v6 00/22] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (9 preceding siblings ...)
  2026-02-09  7:20 ` [Patch v6 10/22] perf/x86: Enable XMM Register Sampling for Non-PEBS Events Dapeng Mi
@ 2026-02-09  7:20 ` Dapeng Mi
  2026-02-09  7:20 ` [Patch v6 12/22] perf: Add sampling support for SIMD registers Dapeng Mi
                   ` (11 subsequent siblings)
  22 siblings, 0 replies; 45+ messages in thread
From: Dapeng Mi @ 2026-02-09  7:20 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Dapeng Mi, Kan Liang

Add support for XMM register sampling in the REGS_USER case.

To handle simultaneous sampling of XMM registers for both REGS_INTR and
REGS_USER cases, a per-CPU `x86_user_regs` is introduced to store
REGS_USER-specific XMM registers. This prevents REGS_USER-specific XMM
register data from being overwritten by REGS_INTR-specific data if they
share the same `x86_perf_regs` structure.

To sample user-space XMM registers, the `x86_pmu_update_user_ext_regs()`
helper function is added. It checks if the `TIF_NEED_FPU_LOAD` flag is
set. If so, the user-space XMM register data can be directly retrieved
from the cached task FPU state, as the corresponding hardware registers
have been cleared or switched to kernel-space data. Otherwise, the data
must be read from the hardware registers using the `xsaves` instruction.

For PEBS events, `x86_pmu_update_user_ext_regs()` checks if the
PEBS-sampled XMM register data belongs to user-space. If so, no further
action is needed. Otherwise, the user-space XMM register data needs to be
re-sampled using the same method as for non-PEBS events.
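As a non-authoritative sketch, the mask handling described above can be
mirrored in stand-alone C (the helper name and boolean flags are
illustrative stand-ins for the kernel's pt_regs/TIF state, not the real
interfaces):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative mirror of the x86_pmu_update_user_ext_regs() decision:
 * the returned mask is what still has to be sampled from the hardware;
 * 0 means the data comes from the cached task FPU state instead (or
 * nothing usable was captured at all).
 */
static uint64_t user_ext_regs_mask(bool abi_none, bool user_mode,
				   bool need_fpu_load,
				   uint64_t mask, uint64_t ignore_mask)
{
	uint64_t sample_mask = 0;

	if (abi_none)
		return 0;			/* no user regs captured */

	if (user_mode)
		sample_mask = mask & ~ignore_mask;

	if (need_fpu_load)
		sample_mask = 0;		/* copy from cached fpstate */

	return sample_mask;
}
```

With TIF_NEED_FPU_LOAD set, the result is 0: the user-space register
contents are copied from the cached fpstate rather than re-sampled.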

Co-developed-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---

V6: New patch, partly split from the previous patch. Fully supports
    user-regs sampling for SIMD registers, as Peter suggested.

 arch/x86/events/core.c | 99 ++++++++++++++++++++++++++++++++++++------
 1 file changed, 85 insertions(+), 14 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 3c0987e13edc..36b4bc413938 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -696,7 +696,7 @@ int x86_pmu_hw_config(struct perf_event *event)
 			return -EINVAL;
 	}
 
-	if (event->attr.sample_type & PERF_SAMPLE_REGS_INTR) {
+	if (event->attr.sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)) {
 		/*
 		 * Besides the general purpose registers, XMM registers may
 		 * be collected as well.
@@ -707,15 +707,6 @@ int x86_pmu_hw_config(struct perf_event *event)
 		}
 	}
 
-	if (event->attr.sample_type & PERF_SAMPLE_REGS_USER) {
-		/*
-		 * Currently XMM registers sampling for REGS_USER is not
-		 * supported yet.
-		 */
-		if (event_has_extended_regs(event))
-			return -EINVAL;
-	}
-
 	return x86_setup_perfctr(event);
 }
 
@@ -1745,6 +1736,28 @@ static void x86_pmu_del(struct perf_event *event, int flags)
 	static_call_cond(x86_pmu_del)(event);
 }
 
+/*
+ * When both PERF_SAMPLE_REGS_INTR and PERF_SAMPLE_REGS_USER are set,
+ * an additional x86_perf_regs is required to save user-space registers.
+ * Without this, user-space register data may be overwritten by kernel-space
+ * registers.
+ */
+static DEFINE_PER_CPU(struct x86_perf_regs, x86_user_regs);
+static void x86_pmu_perf_get_regs_user(struct perf_sample_data *data,
+				       struct pt_regs *regs)
+{
+	struct x86_perf_regs *x86_regs_user = this_cpu_ptr(&x86_user_regs);
+	struct perf_regs regs_user;
+
+	perf_get_regs_user(&regs_user, regs);
+	data->regs_user.abi = regs_user.abi;
+	if (regs_user.regs) {
+		x86_regs_user->regs = *regs_user.regs;
+		data->regs_user.regs = &x86_regs_user->regs;
+	} else
+		data->regs_user.regs = NULL;
+}
+
 static void x86_pmu_setup_basic_regs_data(struct perf_event *event,
 					  struct perf_sample_data *data,
 					  struct pt_regs *regs)
@@ -1757,7 +1770,14 @@ static void x86_pmu_setup_basic_regs_data(struct perf_event *event,
 			data->regs_user.abi = perf_reg_abi(current);
 			data->regs_user.regs = regs;
 		} else if (!(current->flags & PF_KTHREAD)) {
-			perf_get_regs_user(&data->regs_user, regs);
+			/*
+			 * There is no guarantee that the kernel never
+			 * touches registers beyond those in pt_regs,
+			 * especially as more registers (e.g., SIMD,
+			 * eGPR) are added, so the live pt_regs cannot
+			 * be used directly.
+			 */
+			x86_pmu_perf_get_regs_user(data, regs);
 		} else {
 			data->regs_user.abi = PERF_SAMPLE_REGS_ABI_NONE;
 			data->regs_user.regs = NULL;
@@ -1810,6 +1830,47 @@ static inline void x86_pmu_update_ext_regs(struct x86_perf_regs *perf_regs,
 		perf_regs->xmm_space = xsave->i387.xmm_space;
 }
 
+/*
+ * Retrieve user-space FPU registers (XMM/YMM/ZMM). If TIF_NEED_FPU_LOAD is
+ * set, the user FPU state is cached in the task's fpstate and is copied
+ * from there; otherwise it must be read from the hardware registers.
+ */
+static inline u64 x86_pmu_update_user_ext_regs(struct perf_sample_data *data,
+					       struct pt_regs *regs,
+					       u64 mask, u64 ignore_mask)
+{
+	struct x86_perf_regs *perf_regs;
+	struct xregs_state *xsave;
+	struct fpu *fpu;
+	struct fpstate *fps;
+	u64 sample_mask = 0;
+
+	if (data->regs_user.abi == PERF_SAMPLE_REGS_ABI_NONE)
+		return 0;
+
+	if (user_mode(regs))
+		sample_mask = mask & ~ignore_mask;
+
+	if (test_thread_flag(TIF_NEED_FPU_LOAD)) {
+		perf_regs = container_of(data->regs_user.regs,
+				 struct x86_perf_regs, regs);
+		fpu = x86_task_fpu(current);
+		/*
+		 * If __task_fpstate is set, it holds the right pointer,
+		 * otherwise fpstate will.
+		 */
+		fps = READ_ONCE(fpu->__task_fpstate);
+		if (!fps)
+			fps = fpu->fpstate;
+		xsave = &fps->regs.xsave;
+
+		x86_pmu_update_ext_regs(perf_regs, xsave, mask);
+		sample_mask = 0;
+	}
+
+	return sample_mask;
+}
+
 static void x86_pmu_sample_extended_regs(struct perf_event *event,
 					 struct perf_sample_data *data,
 					 struct pt_regs *regs,
@@ -1818,6 +1879,7 @@ static void x86_pmu_sample_extended_regs(struct perf_event *event,
 	u64 sample_type = event->attr.sample_type;
 	struct x86_perf_regs *perf_regs;
 	struct xregs_state *xsave;
+	u64 user_mask = 0;
 	u64 intr_mask = 0;
 	u64 mask = 0;
 
@@ -1827,15 +1889,24 @@ static void x86_pmu_sample_extended_regs(struct perf_event *event,
 		mask |= XFEATURE_MASK_SSE;
 
 	mask &= x86_pmu.ext_regs_mask;
+	if (sample_type & PERF_SAMPLE_REGS_USER) {
+		user_mask = x86_pmu_update_user_ext_regs(data, regs,
+							 mask, ignore_mask);
+	}
 
 	if (sample_type & PERF_SAMPLE_REGS_INTR)
 		intr_mask = mask & ~ignore_mask;
 
-	if (intr_mask) {
-		__x86_pmu_sample_ext_regs(intr_mask);
+	if (user_mask | intr_mask) {
+		__x86_pmu_sample_ext_regs(user_mask | intr_mask);
 		xsave = per_cpu(ext_regs_buf, smp_processor_id());
-		x86_pmu_update_ext_regs(perf_regs, xsave, intr_mask);
 	}
+
+	if (user_mask)
+		x86_pmu_update_ext_regs(perf_regs, xsave, user_mask);
+
+	if (intr_mask)
+		x86_pmu_update_ext_regs(perf_regs, xsave, intr_mask);
 }
 
 void x86_pmu_setup_regs_data(struct perf_event *event,
-- 
2.34.1



* [Patch v6 12/22] perf: Add sampling support for SIMD registers
  2026-02-09  7:20 [Patch v6 00/22] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (10 preceding siblings ...)
  2026-02-09  7:20 ` [Patch v6 11/22] perf/x86: Enable XMM register sampling for REGS_USER case Dapeng Mi
@ 2026-02-09  7:20 ` Dapeng Mi
  2026-02-10 20:04   ` Peter Zijlstra
  2026-02-09  7:20 ` [Patch v6 13/22] perf/x86: Enable XMM sampling using sample_simd_vec_reg_* fields Dapeng Mi
                   ` (10 subsequent siblings)
  22 siblings, 1 reply; 45+ messages in thread
From: Dapeng Mi @ 2026-02-09  7:20 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

Users may be interested in sampling SIMD registers during profiling.
The current sample_regs_* structure does not have sufficient space
for all SIMD registers.

To address this, new attribute fields sample_simd_{pred,vec}_reg_* are
added to struct perf_event_attr to represent the SIMD registers that are
expected to be sampled.

Currently, the perf/x86 code supports XMM registers in sample_regs_*.
To unify the configuration of SIMD registers and ensure a consistent
method for configuring XMM and other SIMD registers, a new event
attribute field, sample_simd_regs_enabled, is introduced. When
sample_simd_regs_enabled is set, it indicates that all SIMD registers,
including XMM, will be represented by the newly introduced
sample_simd_{pred|vec}_reg_* fields. The original XMM space in
sample_regs_* is reserved for future use.
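A hedged user-space sketch of configuring the new fields (the struct
below is a stand-in mirroring only the fields added by this series; the
value written into the unioned sample_simd_pred_reg_qwords, which makes
sample_simd_regs_enabled non-zero, is an assumption for illustration):

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for the new perf_event_attr fields from this series. */
struct simd_attr {
	uint16_t sample_simd_pred_reg_qwords;	/* unioned with sample_simd_regs_enabled */
	uint16_t sample_simd_vec_reg_qwords;
	uint32_t sample_simd_pred_reg_intr;
	uint32_t sample_simd_pred_reg_user;
	uint64_t sample_simd_vec_reg_intr;
	uint64_t sample_simd_vec_reg_user;
};

/*
 * Request XMM0..XMM15 (2 qwords wide) at interrupt time. A non-zero
 * sample_simd_regs_enabled switches the event over to the new SIMD
 * configuration, so XMM is no longer requested via sample_regs_intr.
 */
static void request_xmm_intr(struct simd_attr *attr)
{
	memset(attr, 0, sizeof(*attr));
	attr->sample_simd_pred_reg_qwords = 1;	/* => sample_simd_regs_enabled != 0 */
	attr->sample_simd_vec_reg_qwords = 2;	/* XMM width in u64s */
	attr->sample_simd_vec_reg_intr = 0xFFFF;	/* XMM0..XMM15 */
}
```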

Since SIMD registers are wider than 64 bits, a new output format is
introduced. The number and width of SIMD registers are dumped first,
followed by the register values. The number and width are based on the
user's configuration. If they differ (e.g., on ARM), an arch-specific
perf_output_sample_simd_regs function can be implemented separately.

A new ABI, PERF_SAMPLE_REGS_ABI_SIMD, is added to indicate the new format.
The enum perf_sample_regs_abi is now a bitmap. This change should not
impact existing tools, as the version and bitmap remain the same for
values 1 and 2.
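A minimal user-space sketch of consuming the new layout (the header
struct and ABI values are copied from this series; the parsing helper
itself is illustrative):

```c
#include <stddef.h>
#include <stdint.h>

enum {
	ABI_NONE = 0,
	ABI_32   = 1 << 0,	/* unchanged: still 1 */
	ABI_64   = 1 << 1,	/* unchanged: still 2 */
	ABI_SIMD = 1 << 2,	/* new: a SIMD block follows the GP regs */
};

/* Header dumped ahead of the SIMD register values. */
struct simd_hdr {
	uint16_t nr_vectors;
	uint16_t vector_qwords;
	uint16_t nr_pred;
	uint16_t pred_qwords;
};

/* Bytes of SIMD register payload that follow the header. */
static size_t simd_payload_bytes(const struct simd_hdr *h)
{
	return (size_t)(h->nr_vectors * h->vector_qwords +
			h->nr_pred * h->pred_qwords) * sizeof(uint64_t);
}
```

A tool only parses the SIMD block when `abi & ABI_SIMD` is set, so
existing consumers of ABI values 1 and 2 are unaffected.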

Additionally, two new __weak functions are introduced:
- perf_simd_reg_value(): Retrieves the value of the requested SIMD
  register.
- perf_simd_reg_validate(): Validates the configuration of the SIMD
  registers.

A new flag, PERF_PMU_CAP_SIMD_REGS, is added to indicate that the PMU
supports SIMD register dumping. An error is generated if
sample_simd_{pred|vec}_reg_* is mistakenly set for a PMU that does not
support this capability.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---

V6: Adjust newly added fields in perf_event_attr to avoid memory holes

 include/linux/perf_event.h      |  8 +++
 include/linux/perf_regs.h       |  4 ++
 include/uapi/linux/perf_event.h | 45 ++++++++++++++--
 kernel/events/core.c            | 96 +++++++++++++++++++++++++++++++--
 4 files changed, 146 insertions(+), 7 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index b8a0f77412b3..172ba199d4ff 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -306,6 +306,7 @@ struct perf_event_pmu_context;
 #define PERF_PMU_CAP_AUX_PAUSE		0x0200
 #define PERF_PMU_CAP_AUX_PREFER_LARGE	0x0400
 #define PERF_PMU_CAP_MEDIATED_VPMU	0x0800
+#define PERF_PMU_CAP_SIMD_REGS		0x1000
 
 /**
  * pmu::scope
@@ -1534,6 +1535,13 @@ perf_event__output_id_sample(struct perf_event *event,
 extern void
 perf_log_lost_samples(struct perf_event *event, u64 lost);
 
+static inline bool event_has_simd_regs(struct perf_event *event)
+{
+	struct perf_event_attr *attr = &event->attr;
+
+	return attr->sample_simd_regs_enabled != 0;
+}
+
 static inline bool event_has_extended_regs(struct perf_event *event)
 {
 	struct perf_event_attr *attr = &event->attr;
diff --git a/include/linux/perf_regs.h b/include/linux/perf_regs.h
index 144bcc3ff19f..518f28c6a7d4 100644
--- a/include/linux/perf_regs.h
+++ b/include/linux/perf_regs.h
@@ -14,6 +14,10 @@ int perf_reg_validate(u64 mask);
 u64 perf_reg_abi(struct task_struct *task);
 void perf_get_regs_user(struct perf_regs *regs_user,
 			struct pt_regs *regs);
+int perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
+			   u16 pred_qwords, u32 pred_mask);
+u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
+			u16 qwords_idx, bool pred);
 
 #ifdef CONFIG_HAVE_PERF_REGS
 #include <asm/perf_regs.h>
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 533393ec94d0..b41ae1b82344 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -314,8 +314,9 @@ enum {
  */
 enum perf_sample_regs_abi {
 	PERF_SAMPLE_REGS_ABI_NONE		= 0,
-	PERF_SAMPLE_REGS_ABI_32			= 1,
-	PERF_SAMPLE_REGS_ABI_64			= 2,
+	PERF_SAMPLE_REGS_ABI_32			= (1 << 0),
+	PERF_SAMPLE_REGS_ABI_64			= (1 << 1),
+	PERF_SAMPLE_REGS_ABI_SIMD		= (1 << 2),
 };
 
 /*
@@ -383,6 +384,7 @@ enum perf_event_read_format {
 #define PERF_ATTR_SIZE_VER7			128	/* Add: sig_data */
 #define PERF_ATTR_SIZE_VER8			136	/* Add: config3 */
 #define PERF_ATTR_SIZE_VER9			144	/* add: config4 */
+#define PERF_ATTR_SIZE_VER10			176	/* Add: sample_simd_{pred,vec}_reg_* */
 
 /*
  * 'struct perf_event_attr' contains various attributes that define
@@ -547,6 +549,25 @@ struct perf_event_attr {
 
 	__u64	config3; /* extension of config2 */
 	__u64	config4; /* extension of config3 */
+
+	/*
+	 * Defines the set of SIMD registers to dump on samples.
+	 * A non-zero sample_simd_regs_enabled means the
+	 * sample_simd_{pred,vec}_reg_* fields configure all SIMD
+	 * registers. If !sample_simd_regs_enabled, sample_regs_XXX
+	 * may still be used to configure some SIMD registers on x86.
+	 */
+	union {
+		__u16 sample_simd_regs_enabled;
+		__u16 sample_simd_pred_reg_qwords;
+	};
+	__u16	sample_simd_vec_reg_qwords;
+	__u32	__reserved_4;
+
+	__u32	sample_simd_pred_reg_intr;
+	__u32	sample_simd_pred_reg_user;
+	__u64	sample_simd_vec_reg_intr;
+	__u64	sample_simd_vec_reg_user;
 };
 
 /*
@@ -1020,7 +1041,15 @@ enum perf_event_type {
 	 *      } && PERF_SAMPLE_BRANCH_STACK
 	 *
 	 *	{ u64			abi; # enum perf_sample_regs_abi
-	 *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
+	 *	  u64			regs[weight(mask)];
+	 *	  struct {
+	 *		u16 nr_vectors;		# 0 ... weight(sample_simd_vec_reg_user)
+	 *		u16 vector_qwords;	# 0 ... sample_simd_vec_reg_qwords
+	 *		u16 nr_pred;		# 0 ... weight(sample_simd_pred_reg_user)
+	 *		u16 pred_qwords;	# 0 ... sample_simd_pred_reg_qwords
+	 *		u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
+	 *	  } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
+	 *	} && PERF_SAMPLE_REGS_USER
 	 *
 	 *	{ u64			size;
 	 *	  char			data[size];
@@ -1047,7 +1076,15 @@ enum perf_event_type {
 	 *	{ u64			data_src; } && PERF_SAMPLE_DATA_SRC
 	 *	{ u64			transaction; } && PERF_SAMPLE_TRANSACTION
 	 *	{ u64			abi; # enum perf_sample_regs_abi
-	 *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
+	 *	  u64			regs[weight(mask)];
+	 *	  struct {
+	 *		u16 nr_vectors;		# 0 ... weight(sample_simd_vec_reg_intr)
+	 *		u16 vector_qwords;	# 0 ... sample_simd_vec_reg_qwords
+	 *		u16 nr_pred;		# 0 ... weight(sample_simd_pred_reg_intr)
+	 *		u16 pred_qwords;	# 0 ... sample_simd_pred_reg_qwords
+	 *		u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
+	 *	  } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
+	 *	} && PERF_SAMPLE_REGS_INTR
 	 *	{ u64			phys_addr;} && PERF_SAMPLE_PHYS_ADDR
 	 *	{ u64			cgroup;} && PERF_SAMPLE_CGROUP
 	 *	{ u64			data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
diff --git a/kernel/events/core.c b/kernel/events/core.c
index d487c55a4f3e..5742126f50cc 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7761,6 +7761,50 @@ perf_output_sample_regs(struct perf_output_handle *handle,
 	}
 }
 
+static void
+perf_output_sample_simd_regs(struct perf_output_handle *handle,
+			     struct perf_event *event,
+			     struct pt_regs *regs,
+			     u64 mask, u32 pred_mask)
+{
+	u16 pred_qwords = event->attr.sample_simd_pred_reg_qwords;
+	u16 vec_qwords = event->attr.sample_simd_vec_reg_qwords;
+	u16 nr_vectors;
+	u16 nr_pred;
+	int bit;
+	u64 val;
+	u16 i;
+
+	nr_vectors = hweight64(mask);
+	nr_pred = hweight32(pred_mask);
+
+	perf_output_put(handle, nr_vectors);
+	perf_output_put(handle, vec_qwords);
+	perf_output_put(handle, nr_pred);
+	perf_output_put(handle, pred_qwords);
+
+	if (nr_vectors) {
+		for (bit = 0; bit < sizeof(mask) * BITS_PER_BYTE; bit++) {
+			if (!(BIT_ULL(bit) & mask))
+				continue;
+			for (i = 0; i < vec_qwords; i++) {
+				val = perf_simd_reg_value(regs, bit, i, false);
+				perf_output_put(handle, val);
+			}
+		}
+	}
+	if (nr_pred) {
+		for (bit = 0; bit < sizeof(pred_mask) * BITS_PER_BYTE; bit++) {
+			if (!(BIT(bit) & pred_mask))
+				continue;
+			for (i = 0; i < pred_qwords; i++) {
+				val = perf_simd_reg_value(regs, bit, i, true);
+				perf_output_put(handle, val);
+			}
+		}
+	}
+}
+
 static void perf_sample_regs_user(struct perf_regs *regs_user,
 				  struct pt_regs *regs)
 {
@@ -7782,6 +7826,17 @@ static void perf_sample_regs_intr(struct perf_regs *regs_intr,
 	regs_intr->abi  = perf_reg_abi(current);
 }
 
+int __weak perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
+				  u16 pred_qwords, u32 pred_mask)
+{
+	return vec_qwords || vec_mask || pred_qwords || pred_mask ? -ENOSYS : 0;
+}
+
+u64 __weak perf_simd_reg_value(struct pt_regs *regs, int idx,
+			       u16 qwords_idx, bool pred)
+{
+	return 0;
+}
 
 /*
  * Get remaining task size from user stack pointer.
@@ -8312,10 +8367,17 @@ void perf_output_sample(struct perf_output_handle *handle,
 		perf_output_put(handle, abi);
 
 		if (abi) {
-			u64 mask = event->attr.sample_regs_user;
+			struct perf_event_attr *attr = &event->attr;
+			u64 mask = attr->sample_regs_user;
 			perf_output_sample_regs(handle,
 						data->regs_user.regs,
 						mask);
+			if (abi & PERF_SAMPLE_REGS_ABI_SIMD) {
+				perf_output_sample_simd_regs(handle, event,
+							     data->regs_user.regs,
+							     attr->sample_simd_vec_reg_user,
+							     attr->sample_simd_pred_reg_user);
+			}
 		}
 	}
 
@@ -8343,11 +8405,18 @@ void perf_output_sample(struct perf_output_handle *handle,
 		perf_output_put(handle, abi);
 
 		if (abi) {
-			u64 mask = event->attr.sample_regs_intr;
+			struct perf_event_attr *attr = &event->attr;
+			u64 mask = attr->sample_regs_intr;
 
 			perf_output_sample_regs(handle,
 						data->regs_intr.regs,
 						mask);
+			if (abi & PERF_SAMPLE_REGS_ABI_SIMD) {
+				perf_output_sample_simd_regs(handle, event,
+							     data->regs_intr.regs,
+							     attr->sample_simd_vec_reg_intr,
+							     attr->sample_simd_pred_reg_intr);
+			}
 		}
 	}
 
@@ -12997,6 +13066,12 @@ static int perf_try_init_event(struct pmu *pmu, struct perf_event *event)
 	if (ret)
 		goto err_pmu;
 
+	if (!(pmu->capabilities & PERF_PMU_CAP_SIMD_REGS) &&
+	    event_has_simd_regs(event)) {
+		ret = -EOPNOTSUPP;
+		goto err_destroy;
+	}
+
 	if (!(pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS) &&
 	    event_has_extended_regs(event)) {
 		ret = -EOPNOTSUPP;
@@ -13542,6 +13617,12 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
 		ret = perf_reg_validate(attr->sample_regs_user);
 		if (ret)
 			return ret;
+		ret = perf_simd_reg_validate(attr->sample_simd_vec_reg_qwords,
+					     attr->sample_simd_vec_reg_user,
+					     attr->sample_simd_pred_reg_qwords,
+					     attr->sample_simd_pred_reg_user);
+		if (ret)
+			return ret;
 	}
 
 	if (attr->sample_type & PERF_SAMPLE_STACK_USER) {
@@ -13562,8 +13643,17 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
 	if (!attr->sample_max_stack)
 		attr->sample_max_stack = sysctl_perf_event_max_stack;
 
-	if (attr->sample_type & PERF_SAMPLE_REGS_INTR)
+	if (attr->sample_type & PERF_SAMPLE_REGS_INTR) {
 		ret = perf_reg_validate(attr->sample_regs_intr);
+		if (ret)
+			return ret;
+		ret = perf_simd_reg_validate(attr->sample_simd_vec_reg_qwords,
+					     attr->sample_simd_vec_reg_intr,
+					     attr->sample_simd_pred_reg_qwords,
+					     attr->sample_simd_pred_reg_intr);
+		if (ret)
+			return ret;
+	}
 
 #ifndef CONFIG_CGROUP_PERF
 	if (attr->sample_type & PERF_SAMPLE_CGROUP)
-- 
2.34.1



* [Patch v6 13/22] perf/x86: Enable XMM sampling using sample_simd_vec_reg_* fields
  2026-02-09  7:20 [Patch v6 00/22] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (11 preceding siblings ...)
  2026-02-09  7:20 ` [Patch v6 12/22] perf: Add sampling support for SIMD registers Dapeng Mi
@ 2026-02-09  7:20 ` Dapeng Mi
  2026-02-09  7:20 ` [Patch v6 14/22] perf/x86: Enable YMM " Dapeng Mi
                   ` (9 subsequent siblings)
  22 siblings, 0 replies; 45+ messages in thread
From: Dapeng Mi @ 2026-02-09  7:20 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

Add support for sampling XMM registers via the sample_simd_vec_reg_*
fields.

When sample_simd_regs_enabled is set, the original XMM space in the
sample_regs_* field is treated as reserved. -EINVAL is returned to user
space if any bit is set in the original XMM space while
sample_simd_regs_enabled is set.

The perf_reg_value function requires ABI information to understand the
layout of sample_regs. To accommodate this, a new abi field is introduced
in struct x86_perf_regs to carry the ABI information.

Additionally, the X86-specific perf_simd_reg_value function is implemented
to retrieve the XMM register values.
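The XMM lookup amounts to flat-array indexing over the xmm_regs buffer;
a hedged user-space sketch (the 2-qwords-per-XMM layout matches this
series, the helper name is hypothetical):

```c
#include <stdint.h>

#define XMM_QWORDS	2	/* each XMM register is 2 u64 words */
#define NR_XMM		16

/*
 * XMM0..XMM15 stored back to back, XMM_QWORDS qwords each, mirroring
 * the indexing used by the new x86 perf_simd_reg_value().
 */
static uint64_t xmm_qword(const uint64_t *xmm_regs, int idx, uint16_t qword)
{
	if (!xmm_regs || idx >= NR_XMM || qword >= XMM_QWORDS)
		return 0;	/* unavailable registers read as zero */
	return xmm_regs[idx * XMM_QWORDS + qword];
}
```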

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---

V6: Remove some unnecessary macros from perf_regs.h, but not all. Macros
like PERF_X86_SIMD_*_REGS and PERF_X86_*_QWORDS are still needed by both
the kernel and perf-tools, and perf_regs.h seems to be the best place to
define them.

 arch/x86/events/core.c                | 90 +++++++++++++++++++++++++--
 arch/x86/events/intel/ds.c            |  2 +-
 arch/x86/events/perf_event.h          | 12 ++++
 arch/x86/include/asm/perf_event.h     |  1 +
 arch/x86/include/uapi/asm/perf_regs.h | 12 ++++
 arch/x86/kernel/perf_regs.c           | 51 ++++++++++++++-
 6 files changed, 161 insertions(+), 7 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 36b4bc413938..bd47127fb84d 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -704,6 +704,22 @@ int x86_pmu_hw_config(struct perf_event *event)
 		if (event_has_extended_regs(event)) {
 			if (!(event->pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS))
 				return -EINVAL;
+			if (event->attr.sample_simd_regs_enabled)
+				return -EINVAL;
+		}
+
+		if (event_has_simd_regs(event)) {
+			if (!(event->pmu->capabilities & PERF_PMU_CAP_SIMD_REGS))
+				return -EINVAL;
+			/* A width is set but no vector registers are requested */
+			if (event->attr.sample_simd_vec_reg_qwords &&
+			    !event->attr.sample_simd_vec_reg_intr &&
+			    !event->attr.sample_simd_vec_reg_user)
+				return -EINVAL;
+			/* The requested vector register set is not supported */
+			if (event_needs_xmm(event) &&
+			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_SSE))
+				return -EINVAL;
 		}
 	}
 
@@ -1749,6 +1765,7 @@ static void x86_pmu_perf_get_regs_user(struct perf_sample_data *data,
 	struct x86_perf_regs *x86_regs_user = this_cpu_ptr(&x86_user_regs);
 	struct perf_regs regs_user;
 
+	x86_regs_user->abi = PERF_SAMPLE_REGS_ABI_NONE;
 	perf_get_regs_user(&regs_user, regs);
 	data->regs_user.abi = regs_user.abi;
 	if (regs_user.regs) {
@@ -1758,12 +1775,26 @@ static void x86_pmu_perf_get_regs_user(struct perf_sample_data *data,
 		data->regs_user.regs = NULL;
 }
 
+static inline void
+x86_pmu_update_ext_regs_size(struct perf_event_attr *attr,
+			     struct perf_sample_data *data,
+			     struct pt_regs *regs,
+			     u64 mask, u64 pred_mask)
+{
+	u16 pred_qwords = attr->sample_simd_pred_reg_qwords;
+	u16 vec_qwords = attr->sample_simd_vec_reg_qwords;
+
+	data->dyn_size += (hweight64(mask) * vec_qwords +
+			   hweight64(pred_mask) * pred_qwords) * sizeof(u64);
+}
+
 static void x86_pmu_setup_basic_regs_data(struct perf_event *event,
 					  struct perf_sample_data *data,
 					  struct pt_regs *regs)
 {
 	struct perf_event_attr *attr = &event->attr;
 	u64 sample_type = attr->sample_type;
+	struct x86_perf_regs *perf_regs;
 
 	if (sample_type & PERF_SAMPLE_REGS_USER) {
 		if (user_mode(regs)) {
@@ -1783,8 +1814,13 @@ static void x86_pmu_setup_basic_regs_data(struct perf_event *event,
 			data->regs_user.regs = NULL;
 		}
 		data->dyn_size += sizeof(u64);
-		if (data->regs_user.regs)
-			data->dyn_size += hweight64(attr->sample_regs_user) * sizeof(u64);
+		if (data->regs_user.regs) {
+			data->dyn_size +=
+				hweight64(attr->sample_regs_user) * sizeof(u64);
+			perf_regs = container_of(data->regs_user.regs,
+						 struct x86_perf_regs, regs);
+			perf_regs->abi = data->regs_user.abi;
+		}
 		data->sample_flags |= PERF_SAMPLE_REGS_USER;
 	}
 
@@ -1792,8 +1828,13 @@ static void x86_pmu_setup_basic_regs_data(struct perf_event *event,
 		data->regs_intr.regs = regs;
 		data->regs_intr.abi = perf_reg_abi(current);
 		data->dyn_size += sizeof(u64);
-		if (data->regs_intr.regs)
-			data->dyn_size += hweight64(attr->sample_regs_intr) * sizeof(u64);
+		if (data->regs_intr.regs) {
+			data->dyn_size +=
+				hweight64(attr->sample_regs_intr) * sizeof(u64);
+			perf_regs = container_of(data->regs_intr.regs,
+						 struct x86_perf_regs, regs);
+			perf_regs->abi = data->regs_intr.abi;
+		}
 		data->sample_flags |= PERF_SAMPLE_REGS_INTR;
 	}
 }
@@ -1885,7 +1926,7 @@ static void x86_pmu_sample_extended_regs(struct perf_event *event,
 
 	perf_regs = container_of(regs, struct x86_perf_regs, regs);
 
-	if (event_has_extended_regs(event))
+	if (event_needs_xmm(event))
 		mask |= XFEATURE_MASK_SSE;
 
 	mask &= x86_pmu.ext_regs_mask;
@@ -1909,6 +1950,44 @@ static void x86_pmu_sample_extended_regs(struct perf_event *event,
 		x86_pmu_update_ext_regs(perf_regs, xsave, intr_mask);
 }
 
+static void x86_pmu_setup_extended_regs_data(struct perf_event *event,
+					     struct perf_sample_data *data,
+					     struct pt_regs *regs)
+{
+	struct perf_event_attr *attr = &event->attr;
+	u64 sample_type = attr->sample_type;
+	struct x86_perf_regs *perf_regs;
+
+	if (!attr->sample_simd_regs_enabled)
+		return;
+
+	if (sample_type & PERF_SAMPLE_REGS_USER && data->regs_user.abi) {
+		perf_regs = container_of(data->regs_user.regs,
+					 struct x86_perf_regs, regs);
+		perf_regs->abi |= PERF_SAMPLE_REGS_ABI_SIMD;
+
+		/* num and qwords of vector and pred registers */
+		data->dyn_size += sizeof(u64);
+		data->regs_user.abi |= PERF_SAMPLE_REGS_ABI_SIMD;
+		x86_pmu_update_ext_regs_size(attr, data, data->regs_user.regs,
+					     attr->sample_simd_vec_reg_user,
+					     attr->sample_simd_pred_reg_user);
+	}
+
+	if (sample_type & PERF_SAMPLE_REGS_INTR && data->regs_intr.abi) {
+		perf_regs = container_of(data->regs_intr.regs,
+					 struct x86_perf_regs, regs);
+		perf_regs->abi |= PERF_SAMPLE_REGS_ABI_SIMD;
+
+		/* num and qwords of vector and pred registers */
+		data->dyn_size += sizeof(u64);
+		data->regs_intr.abi |= PERF_SAMPLE_REGS_ABI_SIMD;
+		x86_pmu_update_ext_regs_size(attr, data, data->regs_intr.regs,
+					     attr->sample_simd_vec_reg_intr,
+					     attr->sample_simd_pred_reg_intr);
+	}
+}
+
 void x86_pmu_setup_regs_data(struct perf_event *event,
 			     struct perf_sample_data *data,
 			     struct pt_regs *regs,
@@ -1920,6 +1999,7 @@ void x86_pmu_setup_regs_data(struct perf_event *event,
 	 * which may be unnecessary to sample again.
 	 */
 	x86_pmu_sample_extended_regs(event, data, regs, ignore_mask);
+	x86_pmu_setup_extended_regs_data(event, data, regs);
 }
 
 int x86_pmu_handle_irq(struct pt_regs *regs)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 229dbe368b65..272725d749df 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1735,7 +1735,7 @@ static u64 pebs_update_adaptive_cfg(struct perf_event *event)
 	if (gprs || (attr->precise_ip < 2) || tsx_weight)
 		pebs_data_cfg |= PEBS_DATACFG_GP;
 
-	if (event_has_extended_regs(event))
+	if (event_needs_xmm(event))
 		pebs_data_cfg |= PEBS_DATACFG_XMMS;
 
 	if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index a32ee4f0c891..02eea137e261 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -137,6 +137,18 @@ static inline bool is_acr_event_group(struct perf_event *event)
 	return check_leader_group(event->group_leader, PERF_X86_EVENT_ACR);
 }
 
+static inline bool event_needs_xmm(struct perf_event *event)
+{
+	if (event->attr.sample_simd_regs_enabled &&
+	    event->attr.sample_simd_vec_reg_qwords >= PERF_X86_XMM_QWORDS)
+		return true;
+
+	if (!event->attr.sample_simd_regs_enabled &&
+	    event_has_extended_regs(event))
+		return true;
+	return false;
+}
+
 struct amd_nb {
 	int nb_id;  /* NorthBridge id */
 	int refcnt; /* reference count */
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 7baa1b0f889f..1f172740916c 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -709,6 +709,7 @@ extern void perf_events_lapic_init(void);
 struct pt_regs;
 struct x86_perf_regs {
 	struct pt_regs	regs;
+	u64		abi;
 	union {
 		u64	*xmm_regs;
 		u32	*xmm_space;	/* for xsaves */
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index 7c9d2bb3833b..342b08448138 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -55,4 +55,16 @@ enum perf_event_x86_regs {
 
 #define PERF_REG_EXTENDED_MASK	(~((1ULL << PERF_REG_X86_XMM0) - 1))
 
+enum {
+	PERF_X86_SIMD_XMM_REGS      = 16,
+	PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_XMM_REGS,
+};
+
+#define PERF_X86_SIMD_VEC_MASK	GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
+
+enum {
+	PERF_X86_XMM_QWORDS      = 2,
+	PERF_X86_SIMD_QWORDS_MAX = PERF_X86_XMM_QWORDS,
+};
+
 #endif /* _ASM_X86_PERF_REGS_H */
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 81204cb7f723..9947a6b5c260 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -63,6 +63,9 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 
 	if (idx >= PERF_REG_X86_XMM0 && idx < PERF_REG_X86_XMM_MAX) {
 		perf_regs = container_of(regs, struct x86_perf_regs, regs);
+		/* SIMD registers are moved to dedicated sample_simd_vec_reg */
+		if (perf_regs->abi & PERF_SAMPLE_REGS_ABI_SIMD)
+			return 0;
 		if (!perf_regs->xmm_regs)
 			return 0;
 		return perf_regs->xmm_regs[idx - PERF_REG_X86_XMM0];
@@ -74,6 +77,51 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 	return regs_get_register(regs, pt_regs_offset[idx]);
 }
 
+u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
+			u16 qwords_idx, bool pred)
+{
+	struct x86_perf_regs *perf_regs =
+			container_of(regs, struct x86_perf_regs, regs);
+
+	if (pred)
+		return 0;
+
+	if (WARN_ON_ONCE(idx >= PERF_X86_SIMD_VEC_REGS_MAX ||
+			 qwords_idx >= PERF_X86_SIMD_QWORDS_MAX))
+		return 0;
+
+	if (qwords_idx < PERF_X86_XMM_QWORDS) {
+		if (!perf_regs->xmm_regs)
+			return 0;
+		return perf_regs->xmm_regs[idx * PERF_X86_XMM_QWORDS +
+					   qwords_idx];
+	}
+
+	return 0;
+}
+
+int perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
+			   u16 pred_qwords, u32 pred_mask)
+{
+	/* A non-zero pred_qwords implies sample_simd_{pred,vec}_reg_* are in use */
+	if (!pred_qwords)
+		return 0;
+
+	if (!vec_qwords) {
+		if (vec_mask)
+			return -EINVAL;
+	} else {
+		if (vec_qwords != PERF_X86_XMM_QWORDS)
+			return -EINVAL;
+		if (vec_mask & ~PERF_X86_SIMD_VEC_MASK)
+			return -EINVAL;
+	}
+	if (pred_mask)
+		return -EINVAL;
+
+	return 0;
+}
+
 #define PERF_REG_X86_RESERVED	(((1ULL << PERF_REG_X86_XMM0) - 1) & \
 				 ~((1ULL << PERF_REG_X86_MAX) - 1))
 
@@ -108,7 +156,8 @@ u64 perf_reg_abi(struct task_struct *task)
 
 int perf_reg_validate(u64 mask)
 {
-	if (!mask || (mask & (REG_NOSUPPORT | PERF_REG_X86_RESERVED)))
+	/* The mask may be 0 if only SIMD registers are of interest */
+	if (mask & (REG_NOSUPPORT | PERF_REG_X86_RESERVED))
 		return -EINVAL;
 
 	return 0;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [Patch v6 14/22] perf/x86: Enable YMM sampling using sample_simd_vec_reg_* fields
  2026-02-09  7:20 [Patch v6 00/22] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (12 preceding siblings ...)
  2026-02-09  7:20 ` [Patch v6 13/22] perf/x86: Enable XMM sampling using sample_simd_vec_reg_* fields Dapeng Mi
@ 2026-02-09  7:20 ` Dapeng Mi
  2026-02-09  7:20 ` [Patch v6 15/22] perf/x86: Enable ZMM " Dapeng Mi
                   ` (8 subsequent siblings)
  22 siblings, 0 replies; 45+ messages in thread
From: Dapeng Mi @ 2026-02-09  7:20 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

Add support for sampling YMM registers via the
sample_simd_vec_reg_* fields.

Each YMM register consists of 4 u64 words, assembled from two halves:
XMM (the lower 2 u64 words) and YMMH (the upper 2 u64 words). Although
both XMM and YMMH data can be retrieved with a single xsaves instruction,
they are stored in separate locations. The perf_simd_reg_value() function
is responsible for assembling these halves into a complete YMM register
for output to userspace.

Additionally, sample_simd_vec_reg_qwords should be set to 4 to indicate
YMM sampling.
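The two-half assembly described above can be sketched in plain C. This is
only an illustration: the array names and flat layouts below are
hypothetical stand-ins for the xsave component areas, not the kernel's
actual structures.

```c
#include <assert.h>
#include <stdint.h>

#define XMM_QWORDS	2
#define YMM_QWORDS	4
#define YMMH_QWORDS	(YMM_QWORDS / 2)

/* Illustrative stand-ins for the two xsave component arrays */
static uint64_t xmm[16 * XMM_QWORDS];
static uint64_t ymmh[16 * YMMH_QWORDS];

/* Mirrors the indexing perf_simd_reg_value() performs for YMM */
static uint64_t ymm_qword(unsigned int reg, unsigned int q)
{
	if (q < XMM_QWORDS)	/* lower half lives in the XMM area */
		return xmm[reg * XMM_QWORDS + q];
	/* upper half lives in the YMMH area */
	return ymmh[reg * YMMH_QWORDS + (q - XMM_QWORDS)];
}
```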

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/core.c                |  8 ++++++++
 arch/x86/events/perf_event.h          |  9 +++++++++
 arch/x86/include/asm/perf_event.h     |  4 ++++
 arch/x86/include/uapi/asm/perf_regs.h |  6 ++++--
 arch/x86/kernel/perf_regs.c           | 10 +++++++++-
 5 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index bd47127fb84d..e80a392e30b0 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -720,6 +720,9 @@ int x86_pmu_hw_config(struct perf_event *event)
 			if (event_needs_xmm(event) &&
 			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_SSE))
 				return -EINVAL;
+			if (event_needs_ymm(event) &&
+			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_YMM))
+				return -EINVAL;
 		}
 	}
 
@@ -1844,6 +1847,7 @@ inline void x86_pmu_clear_perf_regs(struct pt_regs *regs)
 	struct x86_perf_regs *perf_regs = container_of(regs, struct x86_perf_regs, regs);
 
 	perf_regs->xmm_regs = NULL;
+	perf_regs->ymmh_regs = NULL;
 }
 
 static inline void __x86_pmu_sample_ext_regs(u64 mask)
@@ -1869,6 +1873,8 @@ static inline void x86_pmu_update_ext_regs(struct x86_perf_regs *perf_regs,
 
 	if (mask & XFEATURE_MASK_SSE)
 		perf_regs->xmm_space = xsave->i387.xmm_space;
+	if (mask & XFEATURE_MASK_YMM)
+		perf_regs->ymmh = get_xsave_addr(xsave, XFEATURE_YMM);
 }
 
 /*
@@ -1928,6 +1934,8 @@ static void x86_pmu_sample_extended_regs(struct perf_event *event,
 
 	if (event_needs_xmm(event))
 		mask |= XFEATURE_MASK_SSE;
+	if (event_needs_ymm(event))
+		mask |= XFEATURE_MASK_YMM;
 
 	mask &= x86_pmu.ext_regs_mask;
 	if (sample_type & PERF_SAMPLE_REGS_USER) {
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 02eea137e261..4f18ba6ef0c4 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -149,6 +149,15 @@ static inline bool event_needs_xmm(struct perf_event *event)
 	return false;
 }
 
+static inline bool event_needs_ymm(struct perf_event *event)
+{
+	if (event->attr.sample_simd_regs_enabled &&
+	    event->attr.sample_simd_vec_reg_qwords >= PERF_X86_YMM_QWORDS)
+		return true;
+
+	return false;
+}
+
 struct amd_nb {
 	int nb_id;  /* NorthBridge id */
 	int refcnt; /* reference count */
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 1f172740916c..bffe47851676 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -714,6 +714,10 @@ struct x86_perf_regs {
 		u64	*xmm_regs;
 		u32	*xmm_space;	/* for xsaves */
 	};
+	union {
+		u64	*ymmh_regs;
+		struct ymmh_struct *ymmh;
+	};
 };
 
 extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index 342b08448138..eac11a29fce6 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -57,14 +57,16 @@ enum perf_event_x86_regs {
 
 enum {
 	PERF_X86_SIMD_XMM_REGS      = 16,
-	PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_XMM_REGS,
+	PERF_X86_SIMD_YMM_REGS      = 16,
+	PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_YMM_REGS,
 };
 
 #define PERF_X86_SIMD_VEC_MASK	GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
 
 enum {
 	PERF_X86_XMM_QWORDS      = 2,
-	PERF_X86_SIMD_QWORDS_MAX = PERF_X86_XMM_QWORDS,
+	PERF_X86_YMM_QWORDS      = 4,
+	PERF_X86_SIMD_QWORDS_MAX = PERF_X86_YMM_QWORDS,
 };
 
 #endif /* _ASM_X86_PERF_REGS_H */
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 9947a6b5c260..4062a679cc5b 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -77,6 +77,8 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 	return regs_get_register(regs, pt_regs_offset[idx]);
 }
 
+#define PERF_X86_YMMH_QWORDS	(PERF_X86_YMM_QWORDS / 2)
+
 u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
 			u16 qwords_idx, bool pred)
 {
@@ -95,6 +97,11 @@ u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
 			return 0;
 		return perf_regs->xmm_regs[idx * PERF_X86_XMM_QWORDS +
 					   qwords_idx];
+	} else if (qwords_idx < PERF_X86_YMM_QWORDS) {
+		if (!perf_regs->ymmh_regs)
+			return 0;
+		return perf_regs->ymmh_regs[idx * PERF_X86_YMMH_QWORDS +
+					    qwords_idx - PERF_X86_XMM_QWORDS];
 	}
 
 	return 0;
@@ -111,7 +118,8 @@ int perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
 		if (vec_mask)
 			return -EINVAL;
 	} else {
-		if (vec_qwords != PERF_X86_XMM_QWORDS)
+		if (vec_qwords != PERF_X86_XMM_QWORDS &&
+		    vec_qwords != PERF_X86_YMM_QWORDS)
 			return -EINVAL;
 		if (vec_mask & ~PERF_X86_SIMD_VEC_MASK)
 			return -EINVAL;
-- 
2.34.1



* [Patch v6 15/22] perf/x86: Enable ZMM sampling using sample_simd_vec_reg_* fields
  2026-02-09  7:20 [Patch v6 00/22] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (13 preceding siblings ...)
  2026-02-09  7:20 ` [Patch v6 14/22] perf/x86: Enable YMM " Dapeng Mi
@ 2026-02-09  7:20 ` Dapeng Mi
  2026-02-09  7:20 ` [Patch v6 16/22] perf/x86: Enable OPMASK sampling using sample_simd_pred_reg_* fields Dapeng Mi
                   ` (7 subsequent siblings)
  22 siblings, 0 replies; 45+ messages in thread
From: Dapeng Mi @ 2026-02-09  7:20 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

Add support for sampling ZMM registers via the
sample_simd_vec_reg_* fields.

Each ZMM register consists of 8 u64 words. Current x86 hardware supports
up to 32 ZMM registers. Registers ZMM0 through ZMM15 are assembled from
three parts: XMM (the lower 2 u64 words), YMMH (the middle 2 u64 words),
and ZMMH (the upper 4 u64 words). The perf_simd_reg_value() function is
responsible for assembling these three parts into a complete ZMM register
for output to userspace.

Registers ZMM16 through ZMM31 can each be read as a whole and output
directly to userspace.

Additionally, sample_simd_vec_reg_qwords should be set to 8 to indicate
ZMM sampling.
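As a rough illustration of the three-part assembly and the separate
whole-register path for ZMM16 onward, here is a userspace-style sketch.
The array names and flat layouts are hypothetical stand-ins for the xsave
components, not the kernel's data structures.

```c
#include <assert.h>
#include <stdint.h>

#define XMM_QWORDS	2
#define YMM_QWORDS	4
#define ZMM_QWORDS	8
#define YMMH_QWORDS	(YMM_QWORDS / 2)
#define ZMMH_QWORDS	(ZMM_QWORDS / 2)
#define H16ZMM_BASE	16

/* Illustrative stand-ins for the xsave component arrays */
static uint64_t xmm[16 * XMM_QWORDS];
static uint64_t ymmh[16 * YMMH_QWORDS];
static uint64_t zmmh[16 * ZMMH_QWORDS];
static uint64_t h16zmm[16 * ZMM_QWORDS];

/* Mirrors the ZMM indexing in perf_simd_reg_value() */
static uint64_t zmm_qword(unsigned int reg, unsigned int q)
{
	if (reg >= H16ZMM_BASE)	/* ZMM16..ZMM31 are stored whole */
		return h16zmm[(reg - H16ZMM_BASE) * ZMM_QWORDS + q];
	if (q < XMM_QWORDS)
		return xmm[reg * XMM_QWORDS + q];
	if (q < YMM_QWORDS)
		return ymmh[reg * YMMH_QWORDS + (q - XMM_QWORDS)];
	return zmmh[reg * ZMMH_QWORDS + (q - YMM_QWORDS)];
}
```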

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/core.c                | 16 ++++++++++++++++
 arch/x86/events/perf_event.h          | 19 +++++++++++++++++++
 arch/x86/include/asm/perf_event.h     |  8 ++++++++
 arch/x86/include/uapi/asm/perf_regs.h |  8 ++++++--
 arch/x86/kernel/perf_regs.c           | 16 +++++++++++++++-
 5 files changed, 64 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index e80a392e30b0..b279dfc1c97f 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -723,6 +723,12 @@ int x86_pmu_hw_config(struct perf_event *event)
 			if (event_needs_ymm(event) &&
 			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_YMM))
 				return -EINVAL;
+			if (event_needs_low16_zmm(event) &&
+			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_ZMM_Hi256))
+				return -EINVAL;
+			if (event_needs_high16_zmm(event) &&
+			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_Hi16_ZMM))
+				return -EINVAL;
 		}
 	}
 
@@ -1848,6 +1854,8 @@ inline void x86_pmu_clear_perf_regs(struct pt_regs *regs)
 
 	perf_regs->xmm_regs = NULL;
 	perf_regs->ymmh_regs = NULL;
+	perf_regs->zmmh_regs = NULL;
+	perf_regs->h16zmm_regs = NULL;
 }
 
 static inline void __x86_pmu_sample_ext_regs(u64 mask)
@@ -1875,6 +1883,10 @@ static inline void x86_pmu_update_ext_regs(struct x86_perf_regs *perf_regs,
 		perf_regs->xmm_space = xsave->i387.xmm_space;
 	if (mask & XFEATURE_MASK_YMM)
 		perf_regs->ymmh = get_xsave_addr(xsave, XFEATURE_YMM);
+	if (mask & XFEATURE_MASK_ZMM_Hi256)
+		perf_regs->zmmh = get_xsave_addr(xsave, XFEATURE_ZMM_Hi256);
+	if (mask & XFEATURE_MASK_Hi16_ZMM)
+		perf_regs->h16zmm = get_xsave_addr(xsave, XFEATURE_Hi16_ZMM);
 }
 
 /*
@@ -1936,6 +1948,10 @@ static void x86_pmu_sample_extended_regs(struct perf_event *event,
 		mask |= XFEATURE_MASK_SSE;
 	if (event_needs_ymm(event))
 		mask |= XFEATURE_MASK_YMM;
+	if (event_needs_low16_zmm(event))
+		mask |= XFEATURE_MASK_ZMM_Hi256;
+	if (event_needs_high16_zmm(event))
+		mask |= XFEATURE_MASK_Hi16_ZMM;
 
 	mask &= x86_pmu.ext_regs_mask;
 	if (sample_type & PERF_SAMPLE_REGS_USER) {
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 4f18ba6ef0c4..f6379adb8e83 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -158,6 +158,25 @@ static inline bool event_needs_ymm(struct perf_event *event)
 	return false;
 }
 
+static inline bool event_needs_low16_zmm(struct perf_event *event)
+{
+	if (event->attr.sample_simd_regs_enabled &&
+	    event->attr.sample_simd_vec_reg_qwords >= PERF_X86_ZMM_QWORDS)
+		return true;
+
+	return false;
+}
+
+static inline bool event_needs_high16_zmm(struct perf_event *event)
+{
+	if (event->attr.sample_simd_regs_enabled &&
+	    (fls64(event->attr.sample_simd_vec_reg_intr) > PERF_X86_H16ZMM_BASE ||
+	     fls64(event->attr.sample_simd_vec_reg_user) > PERF_X86_H16ZMM_BASE))
+		return true;
+
+	return false;
+}
+
 struct amd_nb {
 	int nb_id;  /* NorthBridge id */
 	int refcnt; /* reference count */
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index bffe47851676..a57386ae70d9 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -718,6 +718,14 @@ struct x86_perf_regs {
 		u64	*ymmh_regs;
 		struct ymmh_struct *ymmh;
 	};
+	union {
+		u64	*zmmh_regs;
+		struct avx_512_zmm_uppers_state *zmmh;
+	};
+	union {
+		u64	*h16zmm_regs;
+		struct avx_512_hi16_state *h16zmm;
+	};
 };
 
 extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index eac11a29fce6..d6362bc8d125 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -58,15 +58,19 @@ enum perf_event_x86_regs {
 enum {
 	PERF_X86_SIMD_XMM_REGS      = 16,
 	PERF_X86_SIMD_YMM_REGS      = 16,
-	PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_YMM_REGS,
+	PERF_X86_SIMD_ZMM_REGS      = 32,
+	PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_ZMM_REGS,
 };
 
 #define PERF_X86_SIMD_VEC_MASK	GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
 
+#define PERF_X86_H16ZMM_BASE		16
+
 enum {
 	PERF_X86_XMM_QWORDS      = 2,
 	PERF_X86_YMM_QWORDS      = 4,
-	PERF_X86_SIMD_QWORDS_MAX = PERF_X86_YMM_QWORDS,
+	PERF_X86_ZMM_QWORDS      = 8,
+	PERF_X86_SIMD_QWORDS_MAX = PERF_X86_ZMM_QWORDS,
 };
 
 #endif /* _ASM_X86_PERF_REGS_H */
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 4062a679cc5b..fe4ff4d2de88 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -78,6 +78,7 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 }
 
 #define PERF_X86_YMMH_QWORDS	(PERF_X86_YMM_QWORDS / 2)
+#define PERF_X86_ZMMH_QWORDS	(PERF_X86_ZMM_QWORDS / 2)
 
 u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
 			u16 qwords_idx, bool pred)
@@ -92,6 +93,13 @@ u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
 			 qwords_idx >= PERF_X86_SIMD_QWORDS_MAX))
 		return 0;
 
+	if (idx >= PERF_X86_H16ZMM_BASE) {
+		if (!perf_regs->h16zmm_regs)
+			return 0;
+		return perf_regs->h16zmm_regs[(idx - PERF_X86_H16ZMM_BASE) *
+					PERF_X86_ZMM_QWORDS + qwords_idx];
+	}
+
 	if (qwords_idx < PERF_X86_XMM_QWORDS) {
 		if (!perf_regs->xmm_regs)
 			return 0;
@@ -102,6 +110,11 @@ u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
 			return 0;
 		return perf_regs->ymmh_regs[idx * PERF_X86_YMMH_QWORDS +
 					    qwords_idx - PERF_X86_XMM_QWORDS];
+	} else if (qwords_idx < PERF_X86_ZMM_QWORDS) {
+		if (!perf_regs->zmmh_regs)
+			return 0;
+		return perf_regs->zmmh_regs[idx * PERF_X86_ZMMH_QWORDS +
+					    qwords_idx - PERF_X86_YMM_QWORDS];
 	}
 
 	return 0;
@@ -119,7 +132,8 @@ int perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
 			return -EINVAL;
 	} else {
 		if (vec_qwords != PERF_X86_XMM_QWORDS &&
-		    vec_qwords != PERF_X86_YMM_QWORDS)
+		    vec_qwords != PERF_X86_YMM_QWORDS &&
+		    vec_qwords != PERF_X86_ZMM_QWORDS)
 			return -EINVAL;
 		if (vec_mask & ~PERF_X86_SIMD_VEC_MASK)
 			return -EINVAL;
-- 
2.34.1



* [Patch v6 16/22] perf/x86: Enable OPMASK sampling using sample_simd_pred_reg_* fields
  2026-02-09  7:20 [Patch v6 00/22] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (14 preceding siblings ...)
  2026-02-09  7:20 ` [Patch v6 15/22] perf/x86: Enable ZMM " Dapeng Mi
@ 2026-02-09  7:20 ` Dapeng Mi
  2026-02-09  7:20 ` [Patch v6 17/22] perf: Enhance perf_reg_validate() with simd_enabled argument Dapeng Mi
                   ` (6 subsequent siblings)
  22 siblings, 0 replies; 45+ messages in thread
From: Dapeng Mi @ 2026-02-09  7:20 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

Add support for sampling OPMASK registers via the
sample_simd_pred_reg_* fields.

Each OPMASK register consists of 1 u64 word. Current x86 hardware
supports 8 OPMASK registers. The perf_simd_reg_value() function is
responsible for outputting the OPMASK values to userspace.

Additionally, sample_simd_pred_reg_qwords should be set to 1 to indicate
OPMASK sampling.
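The validation rules accumulated through this patch can be mirrored in a
small userspace-style sketch. The constant names are illustrative
abbreviations of the uapi ones, and this is a shadow of the kernel logic
for exposition, not the implementation itself.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define XMM_QWORDS	2
#define YMM_QWORDS	4
#define ZMM_QWORDS	8
#define OPMASK_QWORDS	1
#define VEC_REGS_MAX	32
#define PRED_REGS_MAX	8
#define VEC_MASK	((1ULL << VEC_REGS_MAX) - 1)
#define PRED_MASK	((1u << PRED_REGS_MAX) - 1)

/* Userspace-side mirror of perf_simd_reg_validate() after this patch */
static int simd_reg_validate(uint16_t vec_qwords, uint64_t vec_mask,
			     uint16_t pred_qwords, uint32_t pred_mask)
{
	if (!pred_qwords)	/* SIMD register sampling not requested */
		return 0;

	if (!vec_qwords) {
		if (vec_mask)
			return -EINVAL;
	} else {
		/* Only the XMM/YMM/ZMM widths are accepted */
		if (vec_qwords != XMM_QWORDS && vec_qwords != YMM_QWORDS &&
		    vec_qwords != ZMM_QWORDS)
			return -EINVAL;
		if (vec_mask & ~VEC_MASK)
			return -EINVAL;
	}
	if (pred_qwords != OPMASK_QWORDS)
		return -EINVAL;
	if (pred_mask & ~PRED_MASK)
		return -EINVAL;
	return 0;
}
```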

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/core.c                |  8 ++++++++
 arch/x86/events/perf_event.h          | 10 ++++++++++
 arch/x86/include/asm/perf_event.h     |  4 ++++
 arch/x86/include/uapi/asm/perf_regs.h |  5 +++++
 arch/x86/kernel/perf_regs.c           | 15 ++++++++++++---
 5 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index b279dfc1c97f..2a674436f07e 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -729,6 +729,9 @@ int x86_pmu_hw_config(struct perf_event *event)
 			if (event_needs_high16_zmm(event) &&
 			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_Hi16_ZMM))
 				return -EINVAL;
+			if (event_needs_opmask(event) &&
+			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_OPMASK))
+				return -EINVAL;
 		}
 	}
 
@@ -1856,6 +1859,7 @@ inline void x86_pmu_clear_perf_regs(struct pt_regs *regs)
 	perf_regs->ymmh_regs = NULL;
 	perf_regs->zmmh_regs = NULL;
 	perf_regs->h16zmm_regs = NULL;
+	perf_regs->opmask_regs = NULL;
 }
 
 static inline void __x86_pmu_sample_ext_regs(u64 mask)
@@ -1887,6 +1891,8 @@ static inline void x86_pmu_update_ext_regs(struct x86_perf_regs *perf_regs,
 		perf_regs->zmmh = get_xsave_addr(xsave, XFEATURE_ZMM_Hi256);
 	if (mask & XFEATURE_MASK_Hi16_ZMM)
 		perf_regs->h16zmm = get_xsave_addr(xsave, XFEATURE_Hi16_ZMM);
+	if (mask & XFEATURE_MASK_OPMASK)
+		perf_regs->opmask = get_xsave_addr(xsave, XFEATURE_OPMASK);
 }
 
 /*
@@ -1952,6 +1958,8 @@ static void x86_pmu_sample_extended_regs(struct perf_event *event,
 		mask |= XFEATURE_MASK_ZMM_Hi256;
 	if (event_needs_high16_zmm(event))
 		mask |= XFEATURE_MASK_Hi16_ZMM;
+	if (event_needs_opmask(event))
+		mask |= XFEATURE_MASK_OPMASK;
 
 	mask &= x86_pmu.ext_regs_mask;
 	if (sample_type & PERF_SAMPLE_REGS_USER) {
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index f6379adb8e83..c9d6379c4ddb 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -177,6 +177,16 @@ static inline bool event_needs_high16_zmm(struct perf_event *event)
 	return false;
 }
 
+static inline bool event_needs_opmask(struct perf_event *event)
+{
+	if (event->attr.sample_simd_regs_enabled &&
+	    (event->attr.sample_simd_pred_reg_intr ||
+	     event->attr.sample_simd_pred_reg_user))
+		return true;
+
+	return false;
+}
+
 struct amd_nb {
 	int nb_id;  /* NorthBridge id */
 	int refcnt; /* reference count */
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index a57386ae70d9..6c5a34e0dfc8 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -726,6 +726,10 @@ struct x86_perf_regs {
 		u64	*h16zmm_regs;
 		struct avx_512_hi16_state *h16zmm;
 	};
+	union {
+		u64	*opmask_regs;
+		struct avx_512_opmask_state *opmask;
+	};
 };
 
 extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index d6362bc8d125..dae39df134ec 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -60,13 +60,18 @@ enum {
 	PERF_X86_SIMD_YMM_REGS      = 16,
 	PERF_X86_SIMD_ZMM_REGS      = 32,
 	PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_ZMM_REGS,
+
+	PERF_X86_SIMD_OPMASK_REGS   = 8,
+	PERF_X86_SIMD_PRED_REGS_MAX = PERF_X86_SIMD_OPMASK_REGS,
 };
 
+#define PERF_X86_SIMD_PRED_MASK	GENMASK(PERF_X86_SIMD_PRED_REGS_MAX - 1, 0)
 #define PERF_X86_SIMD_VEC_MASK	GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
 
 #define PERF_X86_H16ZMM_BASE		16
 
 enum {
+	PERF_X86_OPMASK_QWORDS   = 1,
 	PERF_X86_XMM_QWORDS      = 2,
 	PERF_X86_YMM_QWORDS      = 4,
 	PERF_X86_ZMM_QWORDS      = 8,
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index fe4ff4d2de88..2e3c10dffb35 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -86,8 +86,14 @@ u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
 	struct x86_perf_regs *perf_regs =
 			container_of(regs, struct x86_perf_regs, regs);
 
-	if (pred)
-		return 0;
+	if (pred) {
+		if (WARN_ON_ONCE(idx >= PERF_X86_SIMD_PRED_REGS_MAX ||
+				 qwords_idx >= PERF_X86_OPMASK_QWORDS))
+			return 0;
+		if (!perf_regs->opmask_regs)
+			return 0;
+		return perf_regs->opmask_regs[idx];
+	}
 
 	if (WARN_ON_ONCE(idx >= PERF_X86_SIMD_VEC_REGS_MAX ||
 			 qwords_idx >= PERF_X86_SIMD_QWORDS_MAX))
@@ -138,7 +144,10 @@ int perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
 		if (vec_mask & ~PERF_X86_SIMD_VEC_MASK)
 			return -EINVAL;
 	}
-	if (pred_mask)
+
+	if (pred_qwords != PERF_X86_OPMASK_QWORDS)
+		return -EINVAL;
+	if (pred_mask & ~PERF_X86_SIMD_PRED_MASK)
 		return -EINVAL;
 
 	return 0;
-- 
2.34.1



* [Patch v6 17/22] perf: Enhance perf_reg_validate() with simd_enabled argument
  2026-02-09  7:20 [Patch v6 00/22] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (15 preceding siblings ...)
  2026-02-09  7:20 ` [Patch v6 16/22] perf/x86: Enable OPMASK sampling using sample_simd_pred_reg_* fields Dapeng Mi
@ 2026-02-09  7:20 ` Dapeng Mi
  2026-02-09  7:20 ` [Patch v6 18/22] perf/x86: Enable eGPRs sampling using sample_regs_* fields Dapeng Mi
                   ` (5 subsequent siblings)
  22 siblings, 0 replies; 45+ messages in thread
From: Dapeng Mi @ 2026-02-09  7:20 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Dapeng Mi

The upcoming patch will support x86 APX eGPRs sampling by using the
reclaimed XMM register space to represent eGPRs in sample_regs_* fields.

To differentiate between XMM and eGPRs in sample_regs_* fields, an
additional argument, simd_enabled, is introduced to the
perf_reg_validate() helper. If simd_enabled is true, the sample_regs_*
fields represent eGPRs on the x86 platform; otherwise, they represent
XMM registers.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---

V6: new patch, split from the next patch.

 arch/arm/kernel/perf_regs.c       | 2 +-
 arch/arm64/kernel/perf_regs.c     | 2 +-
 arch/csky/kernel/perf_regs.c      | 2 +-
 arch/loongarch/kernel/perf_regs.c | 2 +-
 arch/mips/kernel/perf_regs.c      | 2 +-
 arch/parisc/kernel/perf_regs.c    | 2 +-
 arch/powerpc/perf/perf_regs.c     | 2 +-
 arch/riscv/kernel/perf_regs.c     | 2 +-
 arch/s390/kernel/perf_regs.c      | 2 +-
 arch/x86/kernel/perf_regs.c       | 4 ++--
 include/linux/perf_regs.h         | 2 +-
 kernel/events/core.c              | 8 +++++---
 12 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/arch/arm/kernel/perf_regs.c b/arch/arm/kernel/perf_regs.c
index d575a4c3ca56..838d701adf4d 100644
--- a/arch/arm/kernel/perf_regs.c
+++ b/arch/arm/kernel/perf_regs.c
@@ -18,7 +18,7 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 
 #define REG_RESERVED (~((1ULL << PERF_REG_ARM_MAX) - 1))
 
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
 {
 	if (!mask || mask & REG_RESERVED)
 		return -EINVAL;
diff --git a/arch/arm64/kernel/perf_regs.c b/arch/arm64/kernel/perf_regs.c
index 70e2f13f587f..71a3e0238de4 100644
--- a/arch/arm64/kernel/perf_regs.c
+++ b/arch/arm64/kernel/perf_regs.c
@@ -77,7 +77,7 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 
 #define REG_RESERVED (~((1ULL << PERF_REG_ARM64_MAX) - 1))
 
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
 {
 	u64 reserved_mask = REG_RESERVED;
 
diff --git a/arch/csky/kernel/perf_regs.c b/arch/csky/kernel/perf_regs.c
index 94601f37b596..c932a96afc56 100644
--- a/arch/csky/kernel/perf_regs.c
+++ b/arch/csky/kernel/perf_regs.c
@@ -18,7 +18,7 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 
 #define REG_RESERVED (~((1ULL << PERF_REG_CSKY_MAX) - 1))
 
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
 {
 	if (!mask || mask & REG_RESERVED)
 		return -EINVAL;
diff --git a/arch/loongarch/kernel/perf_regs.c b/arch/loongarch/kernel/perf_regs.c
index 8dd604f01745..164514f40ae0 100644
--- a/arch/loongarch/kernel/perf_regs.c
+++ b/arch/loongarch/kernel/perf_regs.c
@@ -25,7 +25,7 @@ u64 perf_reg_abi(struct task_struct *tsk)
 }
 #endif /* CONFIG_32BIT */
 
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
 {
 	if (!mask)
 		return -EINVAL;
diff --git a/arch/mips/kernel/perf_regs.c b/arch/mips/kernel/perf_regs.c
index 7736d3c5ebd2..00a5201dbd5d 100644
--- a/arch/mips/kernel/perf_regs.c
+++ b/arch/mips/kernel/perf_regs.c
@@ -28,7 +28,7 @@ u64 perf_reg_abi(struct task_struct *tsk)
 }
 #endif /* CONFIG_32BIT */
 
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
 {
 	if (!mask)
 		return -EINVAL;
diff --git a/arch/parisc/kernel/perf_regs.c b/arch/parisc/kernel/perf_regs.c
index b9fe1f2fcb9b..4f21aab5405c 100644
--- a/arch/parisc/kernel/perf_regs.c
+++ b/arch/parisc/kernel/perf_regs.c
@@ -34,7 +34,7 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 
 #define REG_RESERVED (~((1ULL << PERF_REG_PARISC_MAX) - 1))
 
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
 {
 	if (!mask || mask & REG_RESERVED)
 		return -EINVAL;
diff --git a/arch/powerpc/perf/perf_regs.c b/arch/powerpc/perf/perf_regs.c
index 350dccb0143c..a01d8a903640 100644
--- a/arch/powerpc/perf/perf_regs.c
+++ b/arch/powerpc/perf/perf_regs.c
@@ -125,7 +125,7 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 	return regs_get_register(regs, pt_regs_offset[idx]);
 }
 
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
 {
 	if (!mask || mask & REG_RESERVED)
 		return -EINVAL;
diff --git a/arch/riscv/kernel/perf_regs.c b/arch/riscv/kernel/perf_regs.c
index 3bba8deababb..1ecc8760b88b 100644
--- a/arch/riscv/kernel/perf_regs.c
+++ b/arch/riscv/kernel/perf_regs.c
@@ -18,7 +18,7 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 
 #define REG_RESERVED (~((1ULL << PERF_REG_RISCV_MAX) - 1))
 
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
 {
 	if (!mask || mask & REG_RESERVED)
 		return -EINVAL;
diff --git a/arch/s390/kernel/perf_regs.c b/arch/s390/kernel/perf_regs.c
index 7b305f1456f8..6496fd23c540 100644
--- a/arch/s390/kernel/perf_regs.c
+++ b/arch/s390/kernel/perf_regs.c
@@ -34,7 +34,7 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 
 #define REG_RESERVED (~((1UL << PERF_REG_S390_MAX) - 1))
 
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
 {
 	if (!mask || mask & REG_RESERVED)
 		return -EINVAL;
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 2e3c10dffb35..9b3134220b3e 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -166,7 +166,7 @@ int perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
 		       (1ULL << PERF_REG_X86_R14) | \
 		       (1ULL << PERF_REG_X86_R15))
 
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
 {
 	if (!mask || (mask & (REG_NOSUPPORT | PERF_REG_X86_RESERVED)))
 		return -EINVAL;
@@ -185,7 +185,7 @@ u64 perf_reg_abi(struct task_struct *task)
 		       (1ULL << PERF_REG_X86_FS) | \
 		       (1ULL << PERF_REG_X86_GS))
 
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
 {
 	/* The mask may be 0 if only SIMD registers are of interest */
 	if (mask & (REG_NOSUPPORT | PERF_REG_X86_RESERVED))
diff --git a/include/linux/perf_regs.h b/include/linux/perf_regs.h
index 518f28c6a7d4..09dbc2fc3859 100644
--- a/include/linux/perf_regs.h
+++ b/include/linux/perf_regs.h
@@ -10,7 +10,7 @@ struct perf_regs {
 };
 
 u64 perf_reg_value(struct pt_regs *regs, int idx);
-int perf_reg_validate(u64 mask);
+int perf_reg_validate(u64 mask, bool simd_enabled);
 u64 perf_reg_abi(struct task_struct *task);
 void perf_get_regs_user(struct perf_regs *regs_user,
 			struct pt_regs *regs);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 5742126f50cc..8b27b4873dd0 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7728,7 +7728,7 @@ u64 __weak perf_reg_value(struct pt_regs *regs, int idx)
 	return 0;
 }
 
-int __weak perf_reg_validate(u64 mask)
+int __weak perf_reg_validate(u64 mask, bool simd_enabled)
 {
 	return mask ? -ENOSYS : 0;
 }
@@ -13614,7 +13614,8 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
 	}
 
 	if (attr->sample_type & PERF_SAMPLE_REGS_USER) {
-		ret = perf_reg_validate(attr->sample_regs_user);
+		ret = perf_reg_validate(attr->sample_regs_user,
+					attr->sample_simd_regs_enabled);
 		if (ret)
 			return ret;
 		ret = perf_simd_reg_validate(attr->sample_simd_vec_reg_qwords,
@@ -13644,7 +13645,8 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
 		attr->sample_max_stack = sysctl_perf_event_max_stack;
 
 	if (attr->sample_type & PERF_SAMPLE_REGS_INTR) {
-		ret = perf_reg_validate(attr->sample_regs_intr);
+		ret = perf_reg_validate(attr->sample_regs_intr,
+					attr->sample_simd_regs_enabled);
 		if (ret)
 			return ret;
 		ret = perf_simd_reg_validate(attr->sample_simd_vec_reg_qwords,
-- 
2.34.1



* [Patch v6 18/22] perf/x86: Enable eGPRs sampling using sample_regs_* fields
  2026-02-09  7:20 [Patch v6 00/22] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (16 preceding siblings ...)
  2026-02-09  7:20 ` [Patch v6 17/22] perf: Enhance perf_reg_validate() with simd_enabled argument Dapeng Mi
@ 2026-02-09  7:20 ` Dapeng Mi
  2026-02-09  7:20 ` [Patch v6 19/22] perf/x86: Enable SSP " Dapeng Mi
                   ` (4 subsequent siblings)
  22 siblings, 0 replies; 45+ messages in thread
From: Dapeng Mi @ 2026-02-09  7:20 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

Enable sampling of the APX eGPRs (R16-R31) via the sample_regs_*
fields.

To sample eGPRs, the sample_simd_regs_enabled field must be set. This
allows the spare space (reclaimed from the original XMM space) in the
sample_regs_* fields to be used for representing eGPRs.

The perf_reg_value() function needs to check if the
PERF_SAMPLE_REGS_ABI_SIMD flag is set first, and then determine whether
to output eGPRs or legacy XMM registers to userspace.

The perf_reg_validate() function first checks the simd_enabled argument
to determine if the eGPRs bitmap is represented in sample_regs_* fields.
It then validates the eGPRs bitmap accordingly.

Currently, eGPR sampling is supported only on the x86_64 architecture,
as APX is available only on x86_64 platforms.
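A consumer-side sketch of the interpretation rule above: with SIMD
sampling enabled, register-bitmap indices in the reclaimed XMM region
carry eGPRs instead of XMM halves. The starting index matches the uapi
PERF_REG_X86_XMM0 value, but the classifier itself is an illustrative
assumption, not a definitive decoder for this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define PERF_REG_X86_XMM0 32	/* first index of the reclaimed XMM space */

/*
 * Hypothetical classifier for a sample_regs_* bit index: when the
 * sample carries the PERF_SAMPLE_REGS_ABI_SIMD flag, indices at and
 * above the old XMM0 index are eGPRs (R16..R31); otherwise they are
 * legacy XMM register halves.
 */
static const char *classify_reg_bit(unsigned int idx, bool abi_simd)
{
	if (idx < PERF_REG_X86_XMM0)
		return "gpr";
	return abi_simd ? "egpr" : "xmm";
}
```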

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/core.c                | 37 ++++++++++++++++-------
 arch/x86/events/perf_event.h          | 10 +++++++
 arch/x86/include/asm/perf_event.h     |  4 +++
 arch/x86/include/uapi/asm/perf_regs.h | 25 ++++++++++++++++
 arch/x86/kernel/perf_regs.c           | 43 ++++++++++++++++-----------
 5 files changed, 90 insertions(+), 29 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 2a674436f07e..b320a58ede3f 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -697,20 +697,21 @@ int x86_pmu_hw_config(struct perf_event *event)
 	}
 
 	if (event->attr.sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)) {
-		/*
-		 * Besides the general purpose registers, XMM registers may
-		 * be collected as well.
-		 */
-		if (event_has_extended_regs(event)) {
-			if (!(event->pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS))
-				return -EINVAL;
-			if (event->attr.sample_simd_regs_enabled)
-				return -EINVAL;
-		}
-
 		if (event_has_simd_regs(event)) {
+			u64 reserved = ~GENMASK_ULL(PERF_REG_MISC_MAX - 1, 0);
+
 			if (!(event->pmu->capabilities & PERF_PMU_CAP_SIMD_REGS))
 				return -EINVAL;
+			/*
+			 * The XMM space in the perf_event_x86_regs is reclaimed
+			 * for eGPRs and other general registers.
+			 */
+			if (event->attr.sample_regs_user & reserved ||
+			    event->attr.sample_regs_intr & reserved)
+				return -EINVAL;
+			if (event_needs_egprs(event) &&
+			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_APX))
+				return -EINVAL;
 			/* Not require any vector registers but set width */
 			if (event->attr.sample_simd_vec_reg_qwords &&
 			    !event->attr.sample_simd_vec_reg_intr &&
@@ -732,6 +733,15 @@ int x86_pmu_hw_config(struct perf_event *event)
 			if (event_needs_opmask(event) &&
 			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_OPMASK))
 				return -EINVAL;
+		} else {
+			/*
+			 * Besides the general purpose registers, XMM registers may
+			 * be collected as well.
+			 */
+			if (event_has_extended_regs(event)) {
+				if (!(event->pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS))
+					return -EINVAL;
+			}
 		}
 	}
 
@@ -1860,6 +1870,7 @@ inline void x86_pmu_clear_perf_regs(struct pt_regs *regs)
 	perf_regs->zmmh_regs = NULL;
 	perf_regs->h16zmm_regs = NULL;
 	perf_regs->opmask_regs = NULL;
+	perf_regs->egpr_regs = NULL;
 }
 
 static inline void __x86_pmu_sample_ext_regs(u64 mask)
@@ -1893,6 +1904,8 @@ static inline void x86_pmu_update_ext_regs(struct x86_perf_regs *perf_regs,
 		perf_regs->h16zmm = get_xsave_addr(xsave, XFEATURE_Hi16_ZMM);
 	if (mask & XFEATURE_MASK_OPMASK)
 		perf_regs->opmask = get_xsave_addr(xsave, XFEATURE_OPMASK);
+	if (mask & XFEATURE_MASK_APX)
+		perf_regs->egpr = get_xsave_addr(xsave, XFEATURE_APX);
 }
 
 /*
@@ -1960,6 +1973,8 @@ static void x86_pmu_sample_extended_regs(struct perf_event *event,
 		mask |= XFEATURE_MASK_Hi16_ZMM;
 	if (event_needs_opmask(event))
 		mask |= XFEATURE_MASK_OPMASK;
+	if (event_needs_egprs(event))
+		mask |= XFEATURE_MASK_APX;
 
 	mask &= x86_pmu.ext_regs_mask;
 	if (sample_type & PERF_SAMPLE_REGS_USER) {
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index c9d6379c4ddb..33c187f9b7ab 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -187,6 +187,16 @@ static inline bool event_needs_opmask(struct perf_event *event)
 	return false;
 }
 
+static inline bool event_needs_egprs(struct perf_event *event)
+{
+	if (event->attr.sample_simd_regs_enabled &&
+	    (event->attr.sample_regs_user & PERF_X86_EGPRS_MASK ||
+	     event->attr.sample_regs_intr & PERF_X86_EGPRS_MASK))
+		return true;
+
+	return false;
+}
+
 struct amd_nb {
 	int nb_id;  /* NorthBridge id */
 	int refcnt; /* reference count */
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 6c5a34e0dfc8..cecf1e8d002f 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -730,6 +730,10 @@ struct x86_perf_regs {
 		u64	*opmask_regs;
 		struct avx_512_opmask_state *opmask;
 	};
+	union {
+		u64	*egpr_regs;
+		struct apx_state *egpr;
+	};
 };
 
 extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index dae39df134ec..f9b4086085bc 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -27,9 +27,33 @@ enum perf_event_x86_regs {
 	PERF_REG_X86_R13,
 	PERF_REG_X86_R14,
 	PERF_REG_X86_R15,
+	/*
+	 * The EGPRs and XMM have overlaps. Only one can be used
+	 * at a time. For the ABI type PERF_SAMPLE_REGS_ABI_SIMD,
+	 * utilize EGPRs. For the other ABI type, XMM is used.
+	 *
+	 * Extended GPRs (EGPRs)
+	 */
+	PERF_REG_X86_R16,
+	PERF_REG_X86_R17,
+	PERF_REG_X86_R18,
+	PERF_REG_X86_R19,
+	PERF_REG_X86_R20,
+	PERF_REG_X86_R21,
+	PERF_REG_X86_R22,
+	PERF_REG_X86_R23,
+	PERF_REG_X86_R24,
+	PERF_REG_X86_R25,
+	PERF_REG_X86_R26,
+	PERF_REG_X86_R27,
+	PERF_REG_X86_R28,
+	PERF_REG_X86_R29,
+	PERF_REG_X86_R30,
+	PERF_REG_X86_R31,
 	/* These are the limits for the GPRs. */
 	PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
 	PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1,
+	PERF_REG_MISC_MAX = PERF_REG_X86_R31 + 1,
 
 	/* These all need two bits set because they are 128bit */
 	PERF_REG_X86_XMM0  = 32,
@@ -54,6 +78,7 @@ enum perf_event_x86_regs {
 };
 
 #define PERF_REG_EXTENDED_MASK	(~((1ULL << PERF_REG_X86_XMM0) - 1))
+#define PERF_X86_EGPRS_MASK	GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16)
 
 enum {
 	PERF_X86_SIMD_XMM_REGS      = 16,
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 9b3134220b3e..1c2a8c2c7bf1 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -61,14 +61,22 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 {
 	struct x86_perf_regs *perf_regs;
 
-	if (idx >= PERF_REG_X86_XMM0 && idx < PERF_REG_X86_XMM_MAX) {
+	if (idx > PERF_REG_X86_R15) {
 		perf_regs = container_of(regs, struct x86_perf_regs, regs);
-		/* SIMD registers are moved to dedicated sample_simd_vec_reg */
-		if (perf_regs->abi & PERF_SAMPLE_REGS_ABI_SIMD)
-			return 0;
-		if (!perf_regs->xmm_regs)
-			return 0;
-		return perf_regs->xmm_regs[idx - PERF_REG_X86_XMM0];
+
+		if (perf_regs->abi & PERF_SAMPLE_REGS_ABI_SIMD) {
+			if (idx <= PERF_REG_X86_R31) {
+				if (!perf_regs->egpr_regs)
+					return 0;
+				return perf_regs->egpr_regs[idx - PERF_REG_X86_R16];
+			}
+		} else {
+			if (idx >= PERF_REG_X86_XMM0 && idx < PERF_REG_X86_XMM_MAX) {
+				if (!perf_regs->xmm_regs)
+					return 0;
+				return perf_regs->xmm_regs[idx - PERF_REG_X86_XMM0];
+			}
+		}
 	}
 
 	if (WARN_ON_ONCE(idx >= ARRAY_SIZE(pt_regs_offset)))
@@ -153,18 +161,12 @@ int perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
 	return 0;
 }
 
-#define PERF_REG_X86_RESERVED	(((1ULL << PERF_REG_X86_XMM0) - 1) & \
-				 ~((1ULL << PERF_REG_X86_MAX) - 1))
+#define PERF_REG_X86_RESERVED	(GENMASK_ULL(PERF_REG_X86_XMM0 - 1, PERF_REG_X86_AX) & \
+				 ~GENMASK_ULL(PERF_REG_X86_R15, PERF_REG_X86_AX))
+#define PERF_REG_X86_EXT_RESERVED	(~GENMASK_ULL(PERF_REG_MISC_MAX - 1, PERF_REG_X86_AX))
 
 #ifdef CONFIG_X86_32
-#define REG_NOSUPPORT ((1ULL << PERF_REG_X86_R8) | \
-		       (1ULL << PERF_REG_X86_R9) | \
-		       (1ULL << PERF_REG_X86_R10) | \
-		       (1ULL << PERF_REG_X86_R11) | \
-		       (1ULL << PERF_REG_X86_R12) | \
-		       (1ULL << PERF_REG_X86_R13) | \
-		       (1ULL << PERF_REG_X86_R14) | \
-		       (1ULL << PERF_REG_X86_R15))
+#define REG_NOSUPPORT GENMASK_ULL(PERF_REG_X86_R15, PERF_REG_X86_R8)
 
 int perf_reg_validate(u64 mask, bool simd_enabled)
 {
@@ -188,7 +190,12 @@ u64 perf_reg_abi(struct task_struct *task)
 int perf_reg_validate(u64 mask, bool simd_enabled)
 {
 	/* The mask could be 0 if only the SIMD registers are interested */
-	if (mask & (REG_NOSUPPORT | PERF_REG_X86_RESERVED))
+	if (!simd_enabled &&
+	    (mask & (REG_NOSUPPORT | PERF_REG_X86_RESERVED)))
+		return -EINVAL;
+
+	if (simd_enabled &&
+	    (mask & (REG_NOSUPPORT | PERF_REG_X86_EXT_RESERVED)))
 		return -EINVAL;
 
 	return 0;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [Patch v6 19/22] perf/x86: Enable SSP sampling using sample_regs_* fields
  2026-02-09  7:20 [Patch v6 00/22] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (17 preceding siblings ...)
  2026-02-09  7:20 ` [Patch v6 18/22] perf/x86: Enable eGPRs sampling using sample_regs_* fields Dapeng Mi
@ 2026-02-09  7:20 ` Dapeng Mi
  2026-02-09  7:20 ` [Patch v6 20/22] perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS capability Dapeng Mi
                   ` (3 subsequent siblings)
  22 siblings, 0 replies; 45+ messages in thread
From: Dapeng Mi @ 2026-02-09  7:20 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

Enable sampling of the CET shadow stack pointer (SSP) register via the
sample_regs_* fields.

To sample SSP, the sample_simd_regs_enabled field must be set. This
allows the spare space (reclaimed from the original XMM space) in the
sample_regs_* fields to be used for representing SSP.

Similar to eGPRs sampling, the perf_reg_value() function must first
check whether the PERF_SAMPLE_REGS_ABI_SIMD flag is set, and then
determine whether to output SSP or legacy XMM registers to userspace.

Additionally, arch-PEBS supports sampling SSP, which is placed into the
GPRs group. This patch also enables arch-PEBS-based SSP sampling.

Currently, SSP sampling is only supported on the x86_64 architecture, as
CET is only available on x86_64 platforms.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---

V6: Ensure SSP value is 0 for non-user-space sampling since currently
SSP is only enabled for user space.

 arch/x86/events/core.c                |  9 +++++++++
 arch/x86/events/intel/ds.c            |  7 +++++++
 arch/x86/events/perf_event.h          | 10 ++++++++++
 arch/x86/include/asm/perf_event.h     |  4 ++++
 arch/x86/include/uapi/asm/perf_regs.h |  7 ++++---
 arch/x86/kernel/perf_regs.c           |  5 +++++
 6 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index b320a58ede3f..81dc23e658f2 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -712,6 +712,10 @@ int x86_pmu_hw_config(struct perf_event *event)
 			if (event_needs_egprs(event) &&
 			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_APX))
 				return -EINVAL;
+			if (event_needs_ssp(event) &&
+			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_CET_USER))
+				return -EINVAL;
+
 			/* Not require any vector registers but set width */
 			if (event->attr.sample_simd_vec_reg_qwords &&
 			    !event->attr.sample_simd_vec_reg_intr &&
@@ -1871,6 +1875,7 @@ inline void x86_pmu_clear_perf_regs(struct pt_regs *regs)
 	perf_regs->h16zmm_regs = NULL;
 	perf_regs->opmask_regs = NULL;
 	perf_regs->egpr_regs = NULL;
+	perf_regs->cet_regs = NULL;
 }
 
 static inline void __x86_pmu_sample_ext_regs(u64 mask)
@@ -1906,6 +1911,8 @@ static inline void x86_pmu_update_ext_regs(struct x86_perf_regs *perf_regs,
 		perf_regs->opmask = get_xsave_addr(xsave, XFEATURE_OPMASK);
 	if (mask & XFEATURE_MASK_APX)
 		perf_regs->egpr = get_xsave_addr(xsave, XFEATURE_APX);
+	if (mask & XFEATURE_MASK_CET_USER)
+		perf_regs->cet = get_xsave_addr(xsave, XFEATURE_CET_USER);
 }
 
 /*
@@ -1975,6 +1982,8 @@ static void x86_pmu_sample_extended_regs(struct perf_event *event,
 		mask |= XFEATURE_MASK_OPMASK;
 	if (event_needs_egprs(event))
 		mask |= XFEATURE_MASK_APX;
+	if (event_needs_ssp(event))
+		mask |= XFEATURE_MASK_CET_USER;
 
 	mask &= x86_pmu.ext_regs_mask;
 	if (sample_type & PERF_SAMPLE_REGS_USER) {
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 272725d749df..ff8707885f74 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -2680,6 +2680,13 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
 		__setup_pebs_gpr_group(event, data, regs,
 				       (struct pebs_gprs *)gprs,
 				       sample_type);
+
+		/* Currently only user space mode enables SSP. */
+		if (user_mode(regs) && (sample_type &
+		    (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER))) {
+			perf_regs->cet_regs = &gprs->r15;
+			ignore_mask = XFEATURE_MASK_CET_USER;
+		}
 	}
 
 	if (header->aux) {
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 33c187f9b7ab..fdfb34d7b1d2 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -197,6 +197,16 @@ static inline bool event_needs_egprs(struct perf_event *event)
 	return false;
 }
 
+static inline bool event_needs_ssp(struct perf_event *event)
+{
+	if (event->attr.sample_simd_regs_enabled &&
+	    (event->attr.sample_regs_user & BIT_ULL(PERF_REG_X86_SSP) ||
+	     event->attr.sample_regs_intr & BIT_ULL(PERF_REG_X86_SSP)))
+		return true;
+
+	return false;
+}
+
 struct amd_nb {
 	int nb_id;  /* NorthBridge id */
 	int refcnt; /* reference count */
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index cecf1e8d002f..98fef9db0aa3 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -734,6 +734,10 @@ struct x86_perf_regs {
 		u64	*egpr_regs;
 		struct apx_state *egpr;
 	};
+	union {
+		u64	*cet_regs;
+		struct cet_user_state *cet;
+	};
 };
 
 extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index f9b4086085bc..6da63e1dbb40 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -28,9 +28,9 @@ enum perf_event_x86_regs {
 	PERF_REG_X86_R14,
 	PERF_REG_X86_R15,
 	/*
-	 * The EGPRs and XMM have overlaps. Only one can be used
+	 * The EGPRs/SSP and XMM have overlaps. Only one can be used
 	 * at a time. For the ABI type PERF_SAMPLE_REGS_ABI_SIMD,
-	 * utilize EGPRs. For the other ABI type, XMM is used.
+	 * utilize EGPRs/SSP. For the other ABI type, XMM is used.
 	 *
 	 * Extended GPRs (EGPRs)
 	 */
@@ -50,10 +50,11 @@ enum perf_event_x86_regs {
 	PERF_REG_X86_R29,
 	PERF_REG_X86_R30,
 	PERF_REG_X86_R31,
+	PERF_REG_X86_SSP,
 	/* These are the limits for the GPRs. */
 	PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
 	PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1,
-	PERF_REG_MISC_MAX = PERF_REG_X86_R31 + 1,
+	PERF_REG_MISC_MAX = PERF_REG_X86_SSP + 1,
 
 	/* These all need two bits set because they are 128bit */
 	PERF_REG_X86_XMM0  = 32,
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 1c2a8c2c7bf1..2e7d83f26cc0 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -70,6 +70,11 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 					return 0;
 				return perf_regs->egpr_regs[idx - PERF_REG_X86_R16];
 			}
+			if (idx == PERF_REG_X86_SSP) {
+				if (!perf_regs->cet_regs)
+					return 0;
+				return perf_regs->cet_regs[1];
+			}
 		} else {
 			if (idx >= PERF_REG_X86_XMM0 && idx < PERF_REG_X86_XMM_MAX) {
 				if (!perf_regs->xmm_regs)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [Patch v6 20/22] perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS capability
  2026-02-09  7:20 [Patch v6 00/22] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (18 preceding siblings ...)
  2026-02-09  7:20 ` [Patch v6 19/22] perf/x86: Enable SSP " Dapeng Mi
@ 2026-02-09  7:20 ` Dapeng Mi
  2026-02-09  7:20 ` [Patch v6 21/22] perf/x86/intel: Enable arch-PEBS based SIMD/eGPRs/SSP sampling Dapeng Mi
                   ` (2 subsequent siblings)
  22 siblings, 0 replies; 45+ messages in thread
From: Dapeng Mi @ 2026-02-09  7:20 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

Enable the PERF_PMU_CAP_SIMD_REGS capability if XSAVES support is
available for YMM, ZMM, OPMASK, eGPRs, or SSP.

Temporarily disable large PEBS sampling for these registers, as the
current arch-PEBS sampling code does not support them yet. Large PEBS
sampling for these registers will be enabled in subsequent patches.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/intel/core.c | 52 ++++++++++++++++++++++++++++++++----
 1 file changed, 47 insertions(+), 5 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index ae7693e586d3..1f063a1418fb 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4426,11 +4426,33 @@ static unsigned long intel_pmu_large_pebs_flags(struct perf_event *event)
 		flags &= ~PERF_SAMPLE_TIME;
 	if (!event->attr.exclude_kernel)
 		flags &= ~PERF_SAMPLE_REGS_USER;
-	if (event->attr.sample_regs_user & ~PEBS_GP_REGS)
-		flags &= ~PERF_SAMPLE_REGS_USER;
-	if (event->attr.sample_regs_intr &
-	    ~(PEBS_GP_REGS | PERF_REG_EXTENDED_MASK))
-		flags &= ~PERF_SAMPLE_REGS_INTR;
+	if (event->attr.sample_simd_regs_enabled) {
+		u64 nolarge = PERF_X86_EGPRS_MASK | BIT_ULL(PERF_REG_X86_SSP);
+
+		/*
+		 * PEBS HW can only collect the XMM0-XMM15 for now.
+		 * Disable large PEBS for other vector registers, predicate
+		 * registers, eGPRs, and SSP.
+		 */
+		if (event->attr.sample_regs_user & nolarge ||
+		    fls64(event->attr.sample_simd_vec_reg_user) > PERF_X86_H16ZMM_BASE ||
+		    event->attr.sample_simd_pred_reg_user)
+			flags &= ~PERF_SAMPLE_REGS_USER;
+
+		if (event->attr.sample_regs_intr & nolarge ||
+		    fls64(event->attr.sample_simd_vec_reg_intr) > PERF_X86_H16ZMM_BASE ||
+		    event->attr.sample_simd_pred_reg_intr)
+			flags &= ~PERF_SAMPLE_REGS_INTR;
+
+		if (event->attr.sample_simd_vec_reg_qwords > PERF_X86_XMM_QWORDS)
+			flags &= ~(PERF_SAMPLE_REGS_USER | PERF_SAMPLE_REGS_INTR);
+	} else {
+		if (event->attr.sample_regs_user & ~PEBS_GP_REGS)
+			flags &= ~PERF_SAMPLE_REGS_USER;
+		if (event->attr.sample_regs_intr &
+		    ~(PEBS_GP_REGS | PERF_REG_EXTENDED_MASK))
+			flags &= ~PERF_SAMPLE_REGS_INTR;
+	}
 	return flags;
 }
 
@@ -5904,6 +5926,26 @@ static void intel_extended_regs_init(struct pmu *pmu)
 
 	x86_pmu.ext_regs_mask |= XFEATURE_MASK_SSE;
 	x86_get_pmu(smp_processor_id())->capabilities |= PERF_PMU_CAP_EXTENDED_REGS;
+
+	if (boot_cpu_has(X86_FEATURE_AVX) &&
+	    cpu_has_xfeatures(XFEATURE_MASK_YMM, NULL))
+		x86_pmu.ext_regs_mask |= XFEATURE_MASK_YMM;
+	if (boot_cpu_has(X86_FEATURE_APX) &&
+	    cpu_has_xfeatures(XFEATURE_MASK_APX, NULL))
+		x86_pmu.ext_regs_mask |= XFEATURE_MASK_APX;
+	if (boot_cpu_has(X86_FEATURE_AVX512F)) {
+		if (cpu_has_xfeatures(XFEATURE_MASK_OPMASK, NULL))
+			x86_pmu.ext_regs_mask |= XFEATURE_MASK_OPMASK;
+		if (cpu_has_xfeatures(XFEATURE_MASK_ZMM_Hi256, NULL))
+			x86_pmu.ext_regs_mask |= XFEATURE_MASK_ZMM_Hi256;
+		if (cpu_has_xfeatures(XFEATURE_MASK_Hi16_ZMM, NULL))
+			x86_pmu.ext_regs_mask |= XFEATURE_MASK_Hi16_ZMM;
+	}
+	if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		x86_pmu.ext_regs_mask |= XFEATURE_MASK_CET_USER;
+
+	if (x86_pmu.ext_regs_mask != XFEATURE_MASK_SSE)
+		x86_get_pmu(smp_processor_id())->capabilities |= PERF_PMU_CAP_SIMD_REGS;
 }
 
 #define counter_mask(_gp, _fixed) ((_gp) | ((u64)(_fixed) << INTEL_PMC_IDX_FIXED))
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [Patch v6 21/22] perf/x86/intel: Enable arch-PEBS based SIMD/eGPRs/SSP sampling
  2026-02-09  7:20 [Patch v6 00/22] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (19 preceding siblings ...)
  2026-02-09  7:20 ` [Patch v6 20/22] perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS capability Dapeng Mi
@ 2026-02-09  7:20 ` Dapeng Mi
  2026-02-09  7:20 ` [Patch v6 22/22] perf/x86: Activate back-to-back NMI detection for arch-PEBS induced NMIs Dapeng Mi
  2026-02-09  8:48 ` [Patch v6 00/22] Support SIMD/eGPRs/SSP registers sampling for perf Mi, Dapeng
  22 siblings, 0 replies; 45+ messages in thread
From: Dapeng Mi @ 2026-02-09  7:20 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Dapeng Mi

Enable arch-PEBS based sampling of the SIMD/eGPRs/SSP registers.

Arch-PEBS supports sampling of these registers, with all except SSP
placed into the XSAVE-Enabled Registers (XER) group with the layout
described below.

Field Name 	Registers Used 			Size
----------------------------------------------------------------------
XSTATE_BV	XINUSE for groups		8 B
----------------------------------------------------------------------
Reserved 	Reserved 			8 B
----------------------------------------------------------------------
SSER 		XMM0-XMM15 			16 regs * 16 B = 256 B
----------------------------------------------------------------------
YMMHIR 		Upper 128 bits of YMM0-YMM15 	16 regs * 16 B = 256 B
----------------------------------------------------------------------
EGPR 		R16-R31 			16 regs * 8 B = 128 B
----------------------------------------------------------------------
OPMASKR 	K0-K7 				8 regs * 8 B = 64 B
----------------------------------------------------------------------
ZMMHIR 		Upper 256 bits of ZMM0-ZMM15 	16 regs * 32 B = 512 B
----------------------------------------------------------------------
Hi16ZMMR 	ZMM16-ZMM31 			16 regs * 64 B = 1024 B
----------------------------------------------------------------------

Memory space in the output buffer is allocated for these sub-groups as
long as the corresponding Format.XER[55:49] bits in the PEBS record
header are set. However, the arch-PEBS hardware engine does not write a
sub-group if it is unused (in INIT state); in that case, the
corresponding bit in the XSTATE_BV bitmap is set to 0. Therefore, the
XSTATE_BV field is checked for each PEBS record to determine whether
the register data was actually written. If not, the register data is
not output to userspace.

The SSP register is sampled and placed into the GPRs group by arch-PEBS.

Additionally, the MSRs IA32_PMC_{GPn|FXm}_CFG_C.[55:49] bits are used to
manage which types of these registers need to be sampled.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/intel/core.c      | 75 ++++++++++++++++++++++--------
 arch/x86/events/intel/ds.c        | 77 ++++++++++++++++++++++++++++---
 arch/x86/include/asm/msr-index.h  |  7 +++
 arch/x86/include/asm/perf_event.h |  8 +++-
 4 files changed, 142 insertions(+), 25 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 1f063a1418fb..c57a70798364 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3221,6 +3221,21 @@ static void intel_pmu_enable_event_ext(struct perf_event *event)
 			if (pebs_data_cfg & PEBS_DATACFG_XMMS)
 				ext |= ARCH_PEBS_VECR_XMM & cap.caps;
 
+			if (pebs_data_cfg & PEBS_DATACFG_YMMHS)
+				ext |= ARCH_PEBS_VECR_YMMH & cap.caps;
+
+			if (pebs_data_cfg & PEBS_DATACFG_EGPRS)
+				ext |= ARCH_PEBS_VECR_EGPRS & cap.caps;
+
+			if (pebs_data_cfg & PEBS_DATACFG_OPMASKS)
+				ext |= ARCH_PEBS_VECR_OPMASK & cap.caps;
+
+			if (pebs_data_cfg & PEBS_DATACFG_ZMMHS)
+				ext |= ARCH_PEBS_VECR_ZMMH & cap.caps;
+
+			if (pebs_data_cfg & PEBS_DATACFG_H16ZMMS)
+				ext |= ARCH_PEBS_VECR_H16ZMM & cap.caps;
+
 			if (pebs_data_cfg & PEBS_DATACFG_LBRS)
 				ext |= ARCH_PEBS_LBR & cap.caps;
 
@@ -4418,6 +4433,34 @@ static void intel_pebs_aliases_skl(struct perf_event *event)
 	return intel_pebs_aliases_precdist(event);
 }
 
+static inline bool intel_pebs_support_regs(struct perf_event *event, u64 regs)
+{
+	struct arch_pebs_cap cap = hybrid(event->pmu, arch_pebs_cap);
+	int pebs_format = x86_pmu.intel_cap.pebs_format;
+	bool supported = true;
+
+	/* SSP */
+	if (regs & PEBS_DATACFG_GP)
+		supported &= x86_pmu.arch_pebs && (ARCH_PEBS_GPR & cap.caps);
+	if (regs & PEBS_DATACFG_XMMS) {
+		supported &= x86_pmu.arch_pebs ?
+			     ARCH_PEBS_VECR_XMM & cap.caps :
+			     pebs_format > 3 && x86_pmu.intel_cap.pebs_baseline;
+	}
+	if (regs & PEBS_DATACFG_YMMHS)
+		supported &= x86_pmu.arch_pebs && (ARCH_PEBS_VECR_YMMH & cap.caps);
+	if (regs & PEBS_DATACFG_EGPRS)
+		supported &= x86_pmu.arch_pebs && (ARCH_PEBS_VECR_EGPRS & cap.caps);
+	if (regs & PEBS_DATACFG_OPMASKS)
+		supported &= x86_pmu.arch_pebs && (ARCH_PEBS_VECR_OPMASK & cap.caps);
+	if (regs & PEBS_DATACFG_ZMMHS)
+		supported &= x86_pmu.arch_pebs && (ARCH_PEBS_VECR_ZMMH & cap.caps);
+	if (regs & PEBS_DATACFG_H16ZMMS)
+		supported &= x86_pmu.arch_pebs && (ARCH_PEBS_VECR_H16ZMM & cap.caps);
+
+	return supported;
+}
+
 static unsigned long intel_pmu_large_pebs_flags(struct perf_event *event)
 {
 	unsigned long flags = x86_pmu.large_pebs_flags;
@@ -4427,24 +4470,20 @@ static unsigned long intel_pmu_large_pebs_flags(struct perf_event *event)
 	if (!event->attr.exclude_kernel)
 		flags &= ~PERF_SAMPLE_REGS_USER;
 	if (event->attr.sample_simd_regs_enabled) {
-		u64 nolarge = PERF_X86_EGPRS_MASK | BIT_ULL(PERF_REG_X86_SSP);
-
-		/*
-		 * PEBS HW can only collect the XMM0-XMM15 for now.
-		 * Disable large PEBS for other vector registers, predicate
-		 * registers, eGPRs, and SSP.
-		 */
-		if (event->attr.sample_regs_user & nolarge ||
-		    fls64(event->attr.sample_simd_vec_reg_user) > PERF_X86_H16ZMM_BASE ||
-		    event->attr.sample_simd_pred_reg_user)
-			flags &= ~PERF_SAMPLE_REGS_USER;
-
-		if (event->attr.sample_regs_intr & nolarge ||
-		    fls64(event->attr.sample_simd_vec_reg_intr) > PERF_X86_H16ZMM_BASE ||
-		    event->attr.sample_simd_pred_reg_intr)
-			flags &= ~PERF_SAMPLE_REGS_INTR;
-
-		if (event->attr.sample_simd_vec_reg_qwords > PERF_X86_XMM_QWORDS)
+		if ((event_needs_ssp(event) &&
+		     !intel_pebs_support_regs(event, PEBS_DATACFG_GP)) ||
+		    (event_needs_xmm(event) &&
+		     !intel_pebs_support_regs(event, PEBS_DATACFG_XMMS)) ||
+		    (event_needs_ymm(event) &&
+		     !intel_pebs_support_regs(event, PEBS_DATACFG_YMMHS)) ||
+		    (event_needs_egprs(event) &&
+		     !intel_pebs_support_regs(event, PEBS_DATACFG_EGPRS)) ||
+		    (event_needs_opmask(event) &&
+		     !intel_pebs_support_regs(event, PEBS_DATACFG_OPMASKS)) ||
+		    (event_needs_low16_zmm(event) &&
+		     !intel_pebs_support_regs(event, PEBS_DATACFG_ZMMHS)) ||
+		    (event_needs_high16_zmm(event) &&
+		     !intel_pebs_support_regs(event, PEBS_DATACFG_H16ZMMS)))
 			flags &= ~(PERF_SAMPLE_REGS_USER | PERF_SAMPLE_REGS_INTR);
 	} else {
 		if (event->attr.sample_regs_user & ~PEBS_GP_REGS)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index ff8707885f74..2851622fbf0f 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1732,11 +1732,22 @@ static u64 pebs_update_adaptive_cfg(struct perf_event *event)
 		     ((attr->config & INTEL_ARCH_EVENT_MASK) ==
 		      x86_pmu.rtm_abort_event);
 
-	if (gprs || (attr->precise_ip < 2) || tsx_weight)
+	if (gprs || (attr->precise_ip < 2) ||
+	    tsx_weight || event_needs_ssp(event))
 		pebs_data_cfg |= PEBS_DATACFG_GP;
 
 	if (event_needs_xmm(event))
 		pebs_data_cfg |= PEBS_DATACFG_XMMS;
+	if (event_needs_ymm(event))
+		pebs_data_cfg |= PEBS_DATACFG_YMMHS;
+	if (event_needs_low16_zmm(event))
+		pebs_data_cfg |= PEBS_DATACFG_ZMMHS;
+	if (event_needs_high16_zmm(event))
+		pebs_data_cfg |= PEBS_DATACFG_H16ZMMS;
+	if (event_needs_opmask(event))
+		pebs_data_cfg |= PEBS_DATACFG_OPMASKS;
+	if (event_needs_egprs(event))
+		pebs_data_cfg |= PEBS_DATACFG_EGPRS;
 
 	if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
 		/*
@@ -2699,15 +2710,69 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
 					   meminfo->tsx_tuning, ax);
 	}
 
-	if (header->xmm) {
+	if (header->xmm || header->ymmh || header->egpr ||
+	    header->opmask || header->zmmh || header->h16zmm) {
+		struct arch_pebs_xer_header *xer_header = next_record;
 		struct pebs_xmm *xmm;
+		struct ymmh_struct *ymmh;
+		struct avx_512_zmm_uppers_state *zmmh;
+		struct avx_512_hi16_state *h16zmm;
+		struct avx_512_opmask_state *opmask;
+		struct apx_state *egpr;
 
 		next_record += sizeof(struct arch_pebs_xer_header);
 
-		ignore_mask |= XFEATURE_MASK_SSE;
-		xmm = next_record;
-		perf_regs->xmm_regs = xmm->xmm;
-		next_record = xmm + 1;
+		if (header->xmm) {
+			ignore_mask |= XFEATURE_MASK_SSE;
+			xmm = next_record;
+			/*
+			 * Only output XMM regs to user space when arch-PEBS
+			 * really writes data into xstate area.
+			 */
+			if (xer_header->xstate & XFEATURE_MASK_SSE)
+				perf_regs->xmm_regs = xmm->xmm;
+			next_record = xmm + 1;
+		}
+
+		if (header->ymmh) {
+			ignore_mask |= XFEATURE_MASK_YMM;
+			ymmh = next_record;
+			if (xer_header->xstate & XFEATURE_MASK_YMM)
+				perf_regs->ymmh = ymmh;
+			next_record = ymmh + 1;
+		}
+
+		if (header->egpr) {
+			ignore_mask |= XFEATURE_MASK_APX;
+			egpr = next_record;
+			if (xer_header->xstate & XFEATURE_MASK_APX)
+				perf_regs->egpr = egpr;
+			next_record = egpr + 1;
+		}
+
+		if (header->opmask) {
+			ignore_mask |= XFEATURE_MASK_OPMASK;
+			opmask = next_record;
+			if (xer_header->xstate & XFEATURE_MASK_OPMASK)
+				perf_regs->opmask = opmask;
+			next_record = opmask + 1;
+		}
+
+		if (header->zmmh) {
+			ignore_mask |= XFEATURE_MASK_ZMM_Hi256;
+			zmmh = next_record;
+			if (xer_header->xstate & XFEATURE_MASK_ZMM_Hi256)
+				perf_regs->zmmh = zmmh;
+			next_record = zmmh + 1;
+		}
+
+		if (header->h16zmm) {
+			ignore_mask |= XFEATURE_MASK_Hi16_ZMM;
+			h16zmm = next_record;
+			if (xer_header->xstate & XFEATURE_MASK_Hi16_ZMM)
+				perf_regs->h16zmm = h16zmm;
+			next_record = h16zmm + 1;
+		}
 	}
 
 	if (header->lbr) {
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 6d1b69ea01c2..6c915781fdd3 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -350,6 +350,13 @@
 #define ARCH_PEBS_LBR_SHIFT		40
 #define ARCH_PEBS_LBR			(0x3ull << ARCH_PEBS_LBR_SHIFT)
 #define ARCH_PEBS_VECR_XMM		BIT_ULL(49)
+#define ARCH_PEBS_VECR_YMMH		BIT_ULL(50)
+#define ARCH_PEBS_VECR_EGPRS		BIT_ULL(51)
+#define ARCH_PEBS_VECR_OPMASK		BIT_ULL(53)
+#define ARCH_PEBS_VECR_ZMMH		BIT_ULL(54)
+#define ARCH_PEBS_VECR_H16ZMM		BIT_ULL(55)
+#define ARCH_PEBS_VECR_EXT_SHIFT	50
+#define ARCH_PEBS_VECR_EXT		(0x3full << ARCH_PEBS_VECR_EXT_SHIFT)
 #define ARCH_PEBS_GPR			BIT_ULL(61)
 #define ARCH_PEBS_AUX			BIT_ULL(62)
 #define ARCH_PEBS_EN			BIT_ULL(63)
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 98fef9db0aa3..3665a0a2148e 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -148,6 +148,11 @@
 #define PEBS_DATACFG_LBRS	BIT_ULL(3)
 #define PEBS_DATACFG_CNTR	BIT_ULL(4)
 #define PEBS_DATACFG_METRICS	BIT_ULL(5)
+#define PEBS_DATACFG_YMMHS	BIT_ULL(6)
+#define PEBS_DATACFG_OPMASKS	BIT_ULL(7)
+#define PEBS_DATACFG_ZMMHS	BIT_ULL(8)
+#define PEBS_DATACFG_H16ZMMS	BIT_ULL(9)
+#define PEBS_DATACFG_EGPRS	BIT_ULL(10)
 #define PEBS_DATACFG_LBR_SHIFT	24
 #define PEBS_DATACFG_CNTR_SHIFT	32
 #define PEBS_DATACFG_CNTR_MASK	GENMASK_ULL(15, 0)
@@ -545,7 +550,8 @@ struct arch_pebs_header {
 			    rsvd3:7,
 			    xmm:1,
 			    ymmh:1,
-			    rsvd4:2,
+			    egpr:1,
+			    rsvd4:1,
 			    opmask:1,
 			    zmmh:1,
 			    h16zmm:1,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [Patch v6 22/22] perf/x86: Activate back-to-back NMI detection for arch-PEBS induced NMIs
  2026-02-09  7:20 [Patch v6 00/22] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (20 preceding siblings ...)
  2026-02-09  7:20 ` [Patch v6 21/22] perf/x86/intel: Enable arch-PEBS based SIMD/eGPRs/SSP sampling Dapeng Mi
@ 2026-02-09  7:20 ` Dapeng Mi
  2026-02-09  8:48 ` [Patch v6 00/22] Support SIMD/eGPRs/SSP registers sampling for perf Mi, Dapeng
  22 siblings, 0 replies; 45+ messages in thread
From: Dapeng Mi @ 2026-02-09  7:20 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Dapeng Mi

When two or more identical PEBS events with the same sampling period are
programmed on a mix of PDIST and non-PDIST counters, multiple
back-to-back NMIs can be triggered.

The Linux PMI handler processes the first NMI and clears the
GLOBAL_STATUS MSR. If a second NMI is triggered immediately after
the first, it is recognized as a "suspicious NMI" because no bits are set
in the GLOBAL_STATUS MSR (cleared by the first NMI).

This issue does not lead to PEBS data corruption or data loss, but it
does result in an annoying warning message.

The current NMI handler supports back-to-back NMI detection, but it
requires the PMI handler to return the count of actually processed events,
which the PEBS handler does not currently do.

Modify the PEBS handlers to return the count of actually processed
events, thereby activating back-to-back NMI detection and avoiding
the "suspicious NMI" warning.
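
As an illustrative sketch (plain C with hypothetical names, not the
kernel code itself): each drain pass ORs the applicable-counter bits of
every record into a bitmap, and the population count of that bitmap is
what the handler returns for back-to-back NMI accounting:

```c
#include <assert.h>
#include <stdint.h>

/* Count set bits; stand-in for the kernel's hweight64(). */
static int popcount64(uint64_t v)
{
	int n = 0;

	for (; v; v &= v - 1)	/* clear lowest set bit each round */
		n++;
	return n;
}

/*
 * Illustrative drain pass: each PEBS record carries a bitmap of the
 * counters it applies to; accumulate them and report the number of
 * distinct events actually processed.
 */
static int drain_pebs_records(const uint64_t *applicable_counters, int nr)
{
	uint64_t events_bitmap = 0;
	int i;

	for (i = 0; i < nr; i++)
		events_bitmap |= applicable_counters[i];

	return popcount64(events_bitmap);
}
```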

Suggested-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---

V6: Enhance b2b NMI detection for all PEBS handlers to ensure identical
    behaviors of all PEBS handlers

 arch/x86/events/intel/core.c |  6 ++----
 arch/x86/events/intel/ds.c   | 40 ++++++++++++++++++++++++------------
 arch/x86/events/perf_event.h |  2 +-
 3 files changed, 30 insertions(+), 18 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index c57a70798364..387205c5d5b5 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3558,9 +3558,8 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
 	if (__test_and_clear_bit(GLOBAL_STATUS_BUFFER_OVF_BIT, (unsigned long *)&status)) {
 		u64 pebs_enabled = cpuc->pebs_enabled;
 
-		handled++;
 		x86_pmu_handle_guest_pebs(regs, &data);
-		static_call(x86_pmu_drain_pebs)(regs, &data);
+		handled += static_call(x86_pmu_drain_pebs)(regs, &data);
 
 		/*
 		 * PMI throttle may be triggered, which stops the PEBS event.
@@ -3589,8 +3588,7 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
 	 */
 	if (__test_and_clear_bit(GLOBAL_STATUS_ARCH_PEBS_THRESHOLD_BIT,
 				 (unsigned long *)&status)) {
-		handled++;
-		static_call(x86_pmu_drain_pebs)(regs, &data);
+		handled += static_call(x86_pmu_drain_pebs)(regs, &data);
 
 		if (cpuc->events[INTEL_PMC_IDX_FIXED_SLOTS] &&
 		    is_pebs_counter_event_group(cpuc->events[INTEL_PMC_IDX_FIXED_SLOTS]))
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 2851622fbf0f..94ada08360f1 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -3029,7 +3029,7 @@ __intel_pmu_pebs_events(struct perf_event *event,
 	__intel_pmu_pebs_last_event(event, iregs, regs, data, at, count, setup_sample);
 }
 
-static void intel_pmu_drain_pebs_core(struct pt_regs *iregs, struct perf_sample_data *data)
+static int intel_pmu_drain_pebs_core(struct pt_regs *iregs, struct perf_sample_data *data)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	struct debug_store *ds = cpuc->ds;
@@ -3038,7 +3038,7 @@ static void intel_pmu_drain_pebs_core(struct pt_regs *iregs, struct perf_sample_
 	int n;
 
 	if (!x86_pmu.pebs_active)
-		return;
+		return 0;
 
 	at  = (struct pebs_record_core *)(unsigned long)ds->pebs_buffer_base;
 	top = (struct pebs_record_core *)(unsigned long)ds->pebs_index;
@@ -3049,22 +3049,24 @@ static void intel_pmu_drain_pebs_core(struct pt_regs *iregs, struct perf_sample_
 	ds->pebs_index = ds->pebs_buffer_base;
 
 	if (!test_bit(0, cpuc->active_mask))
-		return;
+		return 0;
 
 	WARN_ON_ONCE(!event);
 
 	if (!event->attr.precise_ip)
-		return;
+		return 0;
 
 	n = top - at;
 	if (n <= 0) {
 		if (event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD)
 			intel_pmu_save_and_restart_reload(event, 0);
-		return;
+		return 0;
 	}
 
 	__intel_pmu_pebs_events(event, iregs, data, at, top, 0, n,
 				setup_pebs_fixed_sample_data);
+
+	return 1; /* PMC0 only */
 }
 
 static void intel_pmu_pebs_event_update_no_drain(struct cpu_hw_events *cpuc, u64 mask)
@@ -3087,7 +3089,7 @@ static void intel_pmu_pebs_event_update_no_drain(struct cpu_hw_events *cpuc, u64
 	}
 }
 
-static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_data *data)
+static int intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_data *data)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	struct debug_store *ds = cpuc->ds;
@@ -3096,11 +3098,12 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_d
 	short counts[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS] = {};
 	short error[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS] = {};
 	int max_pebs_events = intel_pmu_max_num_pebs(NULL);
+	u64 events_bitmap = 0;
 	int bit, i, size;
 	u64 mask;
 
 	if (!x86_pmu.pebs_active)
-		return;
+		return 0;
 
 	base = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
 	top = (struct pebs_record_nhm *)(unsigned long)ds->pebs_index;
@@ -3116,7 +3119,7 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_d
 
 	if (unlikely(base >= top)) {
 		intel_pmu_pebs_event_update_no_drain(cpuc, mask);
-		return;
+		return 0;
 	}
 
 	for (at = base; at < top; at += x86_pmu.pebs_record_size) {
@@ -3180,6 +3183,7 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_d
 		if ((counts[bit] == 0) && (error[bit] == 0))
 			continue;
 
+		events_bitmap |= BIT_ULL(bit);
 		event = cpuc->events[bit];
 		if (WARN_ON_ONCE(!event))
 			continue;
@@ -3201,6 +3205,8 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_d
 						setup_pebs_fixed_sample_data);
 		}
 	}
+
+	return hweight64(events_bitmap);
 }
 
 static __always_inline void
@@ -3256,7 +3262,7 @@ __intel_pmu_handle_last_pebs_record(struct pt_regs *iregs,
 
 DEFINE_PER_CPU(struct x86_perf_regs, pebs_perf_regs);
 
-static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_data *data)
+static int intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_data *data)
 {
 	short counts[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS] = {};
 	void *last[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS];
@@ -3266,10 +3272,11 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
 	struct pt_regs *regs = &perf_regs->regs;
 	struct pebs_basic *basic;
 	void *base, *at, *top;
+	u64 events_bitmap = 0;
 	u64 mask;
 
 	if (!x86_pmu.pebs_active)
-		return;
+		return 0;
 
 	base = (struct pebs_basic *)(unsigned long)ds->pebs_buffer_base;
 	top = (struct pebs_basic *)(unsigned long)ds->pebs_index;
@@ -3282,7 +3289,7 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
 
 	if (unlikely(base >= top)) {
 		intel_pmu_pebs_event_update_no_drain(cpuc, mask);
-		return;
+		return 0;
 	}
 
 	if (!iregs)
@@ -3297,6 +3304,7 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
 			continue;
 
 		pebs_status = mask & basic->applicable_counters;
+		events_bitmap |= pebs_status;
 		__intel_pmu_handle_pebs_record(iregs, regs, data, at,
 					       pebs_status, counts, last,
 					       setup_pebs_adaptive_sample_data);
@@ -3304,9 +3312,11 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
 
 	__intel_pmu_handle_last_pebs_record(iregs, regs, data, mask, counts, last,
 					    setup_pebs_adaptive_sample_data);
+
+	return hweight64(events_bitmap);
 }
 
-static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
+static int intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
 				      struct perf_sample_data *data)
 {
 	short counts[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS] = {};
@@ -3316,13 +3326,14 @@ static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
 	struct x86_perf_regs *perf_regs = this_cpu_ptr(&pebs_perf_regs);
 	struct pt_regs *regs = &perf_regs->regs;
 	void *base, *at, *top;
+	u64 events_bitmap = 0;
 	u64 mask;
 
 	rdmsrq(MSR_IA32_PEBS_INDEX, index.whole);
 
 	if (unlikely(!index.wr)) {
 		intel_pmu_pebs_event_update_no_drain(cpuc, X86_PMC_IDX_MAX);
-		return;
+		return 0;
 	}
 
 	base = cpuc->pebs_vaddr;
@@ -3361,6 +3372,7 @@ static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
 
 		basic = at + sizeof(struct arch_pebs_header);
 		pebs_status = mask & basic->applicable_counters;
+		events_bitmap |= pebs_status;
 		__intel_pmu_handle_pebs_record(iregs, regs, data, at,
 					       pebs_status, counts, last,
 					       setup_arch_pebs_sample_data);
@@ -3380,6 +3392,8 @@ static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
 	__intel_pmu_handle_last_pebs_record(iregs, regs, data, mask,
 					    counts, last,
 					    setup_arch_pebs_sample_data);
+
+	return hweight64(events_bitmap);
 }
 
 static void __init intel_arch_pebs_init(void)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index fdfb34d7b1d2..0083334f2d33 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1014,7 +1014,7 @@ struct x86_pmu {
 	int		pebs_record_size;
 	int		pebs_buffer_size;
 	u64		pebs_events_mask;
-	void		(*drain_pebs)(struct pt_regs *regs, struct perf_sample_data *data);
+	int		(*drain_pebs)(struct pt_regs *regs, struct perf_sample_data *data);
 	struct event_constraint *pebs_constraints;
 	void		(*pebs_aliases)(struct perf_event *event);
 	u64		(*pebs_latency_data)(struct perf_event *event, u64 status);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [Patch v6 00/22] Support SIMD/eGPRs/SSP registers sampling for perf
  2026-02-09  7:20 [Patch v6 00/22] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (21 preceding siblings ...)
  2026-02-09  7:20 ` [Patch v6 22/22] perf/x86: Activate back-to-back NMI detection for arch-PEBS induced NMIs Dapeng Mi
@ 2026-02-09  8:48 ` Mi, Dapeng
  22 siblings, 0 replies; 45+ messages in thread
From: Mi, Dapeng @ 2026-02-09  8:48 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao


On 2/9/2026 3:20 PM, Dapeng Mi wrote:
> Changes since V5:
> - Introduce 3 commits to fix newly found PEBS issues (Patch 01~03/19)
> - Address Peter comments, including,
>   * Fully support user-regs sampling of the SIMD/eGPRs/SSP registers
>   * Adjust newly added fields in perf_event_attr to avoid holes
>   * Fix the endian issue introduced by for_each_set_bit() in
>     event/core.c
>   * Remove some unnecessary macros from UAPI header perf_regs.h
>   * Enhance b2b NMI detection for all PEBS handlers to ensure identical
>     behaviors of all PEBS handlers
> - Split perf-tools patches which would be posted in a separate patchset
>   later

The corresponding perf-tools patch-set:
https://lore.kernel.org/all/20260209083514.2225115-1-dapeng1.mi@linux.intel.com/

Thanks.


>
> Changes since V4:
> - Rewrite some functions comments and commit messages (Dave)
> - Add arch-PEBS based SIMD/eGPRs/SSP sampling support (Patch 15/19)
> - Fix "suspicious NMI" warning observed on PTL/NVL P-core and DMR by
>   activating the back-to-back NMI detection mechanism (Patch 16/19)
> - Fix some minor issues on perf-tool patches (Patch 18/19)
>
> Changes since V3:
> - Drop the SIMD registers if an NMI hits kernel mode for REGS_USER.
> - Only dump the available regs, rather than zero and dump the
>   unavailable regs. It's possible that the dumped registers are a subset
>   of the requested registers.
> - Some minor updates to address Dapeng's comments in V3.
>
> Changes since V2:
> - Use the FPU format for the x86_pmu.ext_regs_mask as well
> - Add a check before invoking xsaves_nmi()
> - Add perf_simd_reg_check() to retrieve the number of available
>   registers. If the kernel fails to get the requested registers, e.g.,
>   XSAVES fails, nothing dumps to the userspace (the V2 dumps all 0s).
> - Add POC perf tool patches
>
> Changes since V1:
> - Apply the new interfaces to configure and dump the SIMD registers
> - Utilize the existing FPU functions, e.g., xstate_calculate_size,
>   get_xsave_addr().
>
> Starting from Intel Ice Lake, XMM registers can be collected in a PEBS
> record. Future Architecture PEBS will include additional registers such
> as YMM, ZMM, OPMASK, SSP and APX eGPRs, contingent on hardware support.
>
> This patch set introduces a software solution to mitigate the hardware
> requirement by utilizing the XSAVES command to retrieve the requested
> registers in the overflow handler. This feature is no longer limited to
> PEBS events or specific platforms. While the hardware solution remains
> preferable due to its lower overhead and higher accuracy, this software
> approach provides a viable alternative.
>
> The solution is theoretically compatible with all x86 platforms but is
> currently enabled on newer platforms, including Sapphire Rapids and
> later P-core server platforms, Sierra Forest and later E-core server
> platforms and recent Client platforms, like Arrow Lake, Panther Lake and
> Nova Lake.
>
> Newly supported registers include YMM, ZMM, OPMASK, SSP, and APX eGPRs.
> Due to space constraints in sample_regs_user/intr, new fields have been 
> introduced in the perf_event_attr structure to accommodate these
> registers.
>
> After a long discussion in V1
> (https://lore.kernel.org/lkml/3f1c9a9e-cb63-47ff-a5e9-06555fa6cc9a@linux.intel.com/),
> the following new fields are introduced.
>
> @@ -547,6 +549,25 @@ struct perf_event_attr {
>
>         __u64   config3; /* extension of config2 */
>         __u64   config4; /* extension of config3 */
> +
> +       /*
> +        * Defines set of SIMD registers to dump on samples.
> +        * The sample_simd_regs_enabled !=0 implies the
> +        * set of SIMD registers is used to config all SIMD registers.
> +        * If !sample_simd_regs_enabled, sample_regs_XXX may be used to
> +        * config some SIMD registers on X86.
> +        */
> +       union {
> +               __u16 sample_simd_regs_enabled;
> +               __u16 sample_simd_pred_reg_qwords;
> +       };
> +       __u16   sample_simd_vec_reg_qwords;
> +       __u32   __reserved_4;
> +
> +       __u32   sample_simd_pred_reg_intr;
> +       __u32   sample_simd_pred_reg_user;
> +       __u64   sample_simd_vec_reg_intr;
> +       __u64   sample_simd_vec_reg_user;
>  };
>
>  /*
> @@ -1020,7 +1041,15 @@ enum perf_event_type {
>          *      } && PERF_SAMPLE_BRANCH_STACK
>          *
>          *      { u64                   abi; # enum perf_sample_regs_abi
> -        *        u64                   regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
> +        *        u64                   regs[weight(mask)];
> +        *        struct {
> +        *              u16 nr_vectors;         # 0 ... weight(sample_simd_vec_reg_user)
> +        *              u16 vector_qwords;      # 0 ... sample_simd_vec_reg_qwords
> +        *              u16 nr_pred;            # 0 ... weight(sample_simd_pred_reg_user)
> +        *              u16 pred_qwords;        # 0 ... sample_simd_pred_reg_qwords
> +        *              u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
> +        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
> +        *      } && PERF_SAMPLE_REGS_USER
>          *
>          *      { u64                   size;
>          *        char                  data[size];
> @@ -1047,7 +1076,15 @@ enum perf_event_type {
>          *      { u64                   data_src; } && PERF_SAMPLE_DATA_SRC
>          *      { u64                   transaction; } && PERF_SAMPLE_TRANSACTION
>          *      { u64                   abi; # enum perf_sample_regs_abi
> -        *        u64                   regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
> +        *        u64                   regs[weight(mask)];
> +        *        struct {
> +        *              u16 nr_vectors;         # 0 ... weight(sample_simd_vec_reg_intr)
> +        *              u16 vector_qwords;      # 0 ... sample_simd_vec_reg_qwords
> +        *              u16 nr_pred;            # 0 ... weight(sample_simd_pred_reg_intr)
> +        *              u16 pred_qwords;        # 0 ... sample_simd_pred_reg_qwords
> +        *              u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
> +        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
> +        *      } && PERF_SAMPLE_REGS_INTR
>          *      { u64                   phys_addr;} && PERF_SAMPLE_PHYS_ADDR
>          *      { u64                   cgroup;} && PERF_SAMPLE_CGROUP
>          *      { u64                   data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
>
>
> To maintain simplicity, a single field per register class,
> sample_simd_{vec|pred}_reg_qwords, indicates the register width. For
> example:
> - sample_simd_vec_reg_qwords = 2 for XMM registers (128 bits) on x86
> - sample_simd_vec_reg_qwords = 4 for YMM registers (256 bits) on x86
>
> Four additional fields, sample_simd_{vec|pred}_reg_{intr|user}, represent
> the bitmaps of sampled registers. For instance, the bitmap for x86
> XMM registers is 0xffff (16 XMM registers). Although users can
> theoretically sample a subset of registers, the current perf-tool
> implementation supports sampling all registers of each type to avoid
> complexity.
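
A hedged sketch of the arithmetic (hypothetical helper name): the number
of qwords dumped for one register class is the popcount of its bitmap
times the per-register qword width, so all 16 XMM registers at 2 qwords
each contribute 32 qwords to the sample:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: total qwords emitted for one register class. */
static size_t simd_class_qwords(uint64_t reg_mask, unsigned int reg_qwords)
{
	size_t n = 0;

	for (; reg_mask; reg_mask &= reg_mask - 1)	/* count set bits */
		n++;
	return n * reg_qwords;
}
```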
>
> A new ABI, PERF_SAMPLE_REGS_ABI_SIMD, is introduced to signal user space 
> tools about the presence of SIMD registers in sampling records. When this
> flag is detected, tools should recognize that extra SIMD register data
> follows the general register data. The layout of the extra SIMD register
> data is as follows.
>
>    u16 nr_vectors;
>    u16 vector_qwords;
>    u16 nr_pred;
>    u16 pred_qwords;
>    u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
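
Assuming the field layout quoted above (four u16 header fields followed
by the u64 data array), a user-space reader could size the trailer with
a sketch like this; the struct and helper names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the documented SIMD trailer header; name is illustrative. */
struct simd_regs_hdr {
	uint16_t nr_vectors;
	uint16_t vector_qwords;
	uint16_t nr_pred;
	uint16_t pred_qwords;
	/* u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords] follows */
};

/* Total size of the trailer in bytes, header included. */
static size_t simd_regs_size(const struct simd_regs_hdr *h)
{
	size_t qwords = (size_t)h->nr_vectors * h->vector_qwords +
			(size_t)h->nr_pred * h->pred_qwords;

	return sizeof(*h) + qwords * sizeof(uint64_t);
}
```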
>
> With this patch set, sampling for the aforementioned registers is
> supported on the Intel Nova Lake platform.
>
> Examples:
>  $perf record -I?
>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>
>  $perf record --user-regs=?
>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>
>  $perf record -e branches:p -Iax,bx,r8,r16,r31,ssp,xmm,ymm,zmm,opmask -c 100000 ./test
>  $perf report -D
>
>  ... ...
>  14027761992115 0xcf30 [0x8a8]: PERF_RECORD_SAMPLE(IP, 0x1): 29964/29964:
>  0xffffffff9f085e24 period: 100000 addr: 0
>  ... intr regs: mask 0x18001010003 ABI 64-bit
>  .... AX    0xdffffc0000000000
>  .... BX    0xffff8882297685e8
>  .... R8    0x0000000000000000
>  .... R16   0x0000000000000000
>  .... R31   0x0000000000000000
>  .... SSP   0x0000000000000000
>  ... SIMD ABI nr_vectors 32 vector_qwords 8 nr_pred 8 pred_qwords 1
>  .... ZMM  [0] 0xffffffffffffffff
>  .... ZMM  [0] 0x0000000000000001
>  .... ZMM  [0] 0x0000000000000000
>  .... ZMM  [0] 0x0000000000000000
>  .... ZMM  [0] 0x0000000000000000
>  .... ZMM  [0] 0x0000000000000000
>  .... ZMM  [0] 0x0000000000000000
>  .... ZMM  [0] 0x0000000000000000
>  .... ZMM  [1] 0x003a6b6165506d56
>  ... ...
>  .... ZMM  [31] 0x0000000000000000
>  .... ZMM  [31] 0x0000000000000000
>  .... ZMM  [31] 0x0000000000000000
>  .... ZMM  [31] 0x0000000000000000
>  .... ZMM  [31] 0x0000000000000000
>  .... ZMM  [31] 0x0000000000000000
>  .... ZMM  [31] 0x0000000000000000
>  .... ZMM  [31] 0x0000000000000000
>  .... OPMASK[0] 0x00000000fffffe00
>  .... OPMASK[1] 0x0000000000ffffff
>  .... OPMASK[2] 0x000000000000007f
>  .... OPMASK[3] 0x0000000000000000
>  .... OPMASK[4] 0x0000000000010080
>  .... OPMASK[5] 0x0000000000000000
>  .... OPMASK[6] 0x0000400004000000
>  .... OPMASK[7] 0x0000000000000000
>  ... ...
>
>
> History:
>   v5: https://lore.kernel.org/all/20251203065500.2597594-1-dapeng1.mi@linux.intel.com/
>   v4: https://lore.kernel.org/all/20250925061213.178796-1-dapeng1.mi@linux.intel.com/
>   v3: https://lore.kernel.org/lkml/20250815213435.1702022-1-kan.liang@linux.intel.com/
>   v2: https://lore.kernel.org/lkml/20250626195610.405379-1-kan.liang@linux.intel.com/
>   v1: https://lore.kernel.org/lkml/20250613134943.3186517-1-kan.liang@linux.intel.com/
>
>
> Dapeng Mi (10):
>   perf/x86/intel: Restrict PEBS_ENABLE writes to PEBS-capable counters
>   perf/x86/intel: Enable large PEBS sampling for XMMs
>   perf/x86/intel: Convert x86_perf_regs to per-cpu variables
>   perf: Eliminate duplicate arch-specific function definitions
>   x86/fpu: Ensure TIF_NEED_FPU_LOAD is set after saving FPU state
>   perf/x86: Enable XMM Register Sampling for Non-PEBS Events
>   perf/x86: Enable XMM register sampling for REGS_USER case
>   perf: Enhance perf_reg_validate() with simd_enabled argument
>   perf/x86/intel: Enable arch-PEBS based SIMD/eGPRs/SSP sampling
>   perf/x86: Activate back-to-back NMI detection for arch-PEBS induced
>     NMIs
>
> Kan Liang (12):
>   perf/x86: Use x86_perf_regs in the x86 nmi handler
>   perf/x86: Introduce x86-specific x86_pmu_setup_regs_data()
>   x86/fpu/xstate: Add xsaves_nmi() helper
>   perf: Move and rename has_extended_regs() for ARCH-specific use
>   perf: Add sampling support for SIMD registers
>   perf/x86: Enable XMM sampling using sample_simd_vec_reg_* fields
>   perf/x86: Enable YMM sampling using sample_simd_vec_reg_* fields
>   perf/x86: Enable ZMM sampling using sample_simd_vec_reg_* fields
>   perf/x86: Enable OPMASK sampling using sample_simd_pred_reg_* fields
>   perf/x86: Enable eGPRs sampling using sample_regs_* fields
>   perf/x86: Enable SSP sampling using sample_regs_* fields
>   perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS capability
>
>  arch/arm/kernel/perf_regs.c           |   8 +-
>  arch/arm64/kernel/perf_regs.c         |   8 +-
>  arch/csky/kernel/perf_regs.c          |   8 +-
>  arch/loongarch/kernel/perf_regs.c     |   8 +-
>  arch/mips/kernel/perf_regs.c          |   8 +-
>  arch/parisc/kernel/perf_regs.c        |   8 +-
>  arch/powerpc/perf/perf_regs.c         |   2 +-
>  arch/riscv/kernel/perf_regs.c         |   8 +-
>  arch/s390/kernel/perf_regs.c          |   2 +-
>  arch/x86/events/core.c                | 387 +++++++++++++++++++++++++-
>  arch/x86/events/intel/core.c          | 131 ++++++++-
>  arch/x86/events/intel/ds.c            | 164 ++++++++---
>  arch/x86/events/perf_event.h          |  85 +++++-
>  arch/x86/include/asm/fpu/sched.h      |   2 +-
>  arch/x86/include/asm/fpu/xstate.h     |   3 +
>  arch/x86/include/asm/msr-index.h      |   7 +
>  arch/x86/include/asm/perf_event.h     |  38 ++-
>  arch/x86/include/uapi/asm/perf_regs.h |  49 ++++
>  arch/x86/kernel/fpu/core.c            |  12 +-
>  arch/x86/kernel/fpu/xstate.c          |  25 +-
>  arch/x86/kernel/perf_regs.c           | 134 +++++++--
>  include/linux/perf_event.h            |  16 ++
>  include/linux/perf_regs.h             |  36 +--
>  include/uapi/linux/perf_event.h       |  45 ++-
>  kernel/events/core.c                  | 132 ++++++++-
>  25 files changed, 1144 insertions(+), 182 deletions(-)
>
>
> base-commit: 7db06e329af30dcb170a6782c1714217ad65033d

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Patch v6 01/22] perf/x86/intel: Restrict PEBS_ENABLE writes to PEBS-capable counters
  2026-02-09  7:20 ` [Patch v6 01/22] perf/x86/intel: Restrict PEBS_ENABLE writes to PEBS-capable counters Dapeng Mi
@ 2026-02-10 15:36   ` Peter Zijlstra
  2026-02-11  5:47     ` Mi, Dapeng
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2026-02-10 15:36 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao

On Mon, Feb 09, 2026 at 03:20:26PM +0800, Dapeng Mi wrote:
> Before the introduction of extended PEBS, PEBS supported only
> general-purpose (GP) counters. In a virtual machine (VM) environment,
> the PEBS_BASELINE bit in PERF_CAPABILITIES may not be set, but the PEBS
> format could be indicated as 4 or higher. In such cases, PEBS events
> might be scheduled to fixed counters, and writing the corresponding bits
> into the PEBS_ENABLE MSR could cause a #GP fault.
> 
> To prevent writing unsupported bits into the PEBS_ENABLE MSR, ensure
> cpuc->pebs_enabled aligns with x86_pmu.pebs_capable and restrict the
> writes to only PEBS-capable counter bits.

This seems very wrong. Should we not avoid getting those bits set in the
first place?

That is; the fact that we set those cpuc->pebs_enabled bits indicates
that we 'successfully' scheduled PEBS counters. And then we silently
disable PEBS when programming the hardware.

Or am I reading this wrong?

> 
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> ---
> 
> V6: new patch.
> 
>  arch/x86/events/intel/core.c |  6 ++++--
>  arch/x86/events/intel/ds.c   | 11 +++++++----
>  2 files changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index f3ae1f8ee3cd..546ebc7e1624 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -3554,8 +3554,10 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
>  		 * cpuc->enabled has been forced to 0 in PMI.
>  		 * Update the MSR if pebs_enabled is changed.
>  		 */
> -		if (pebs_enabled != cpuc->pebs_enabled)
> -			wrmsrq(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
> +		if (pebs_enabled != cpuc->pebs_enabled) {
> +			wrmsrq(MSR_IA32_PEBS_ENABLE,
> +			       cpuc->pebs_enabled & x86_pmu.pebs_capable);
> +		}
>  
>  		/*
>  		 * Above PEBS handler (PEBS counters snapshotting) has updated fixed
> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> index 5027afc97b65..57805c6ba0c3 100644
> --- a/arch/x86/events/intel/ds.c
> +++ b/arch/x86/events/intel/ds.c
> @@ -1963,6 +1963,7 @@ void intel_pmu_pebs_disable(struct perf_event *event)
>  {
>  	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
>  	struct hw_perf_event *hwc = &event->hw;
> +	u64 pebs_enabled;
>  
>  	__intel_pmu_pebs_disable(event);
>  
> @@ -1974,16 +1975,18 @@ void intel_pmu_pebs_disable(struct perf_event *event)
>  
>  	intel_pmu_pebs_via_pt_disable(event);
>  
> -	if (cpuc->enabled)
> -		wrmsrq(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
> +	pebs_enabled = cpuc->pebs_enabled & x86_pmu.pebs_capable;
> +	if (pebs_enabled)
> +		wrmsrq(MSR_IA32_PEBS_ENABLE, pebs_enabled);
>  }
>  
>  void intel_pmu_pebs_enable_all(void)
>  {
>  	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> +	u64 pebs_enabled = cpuc->pebs_enabled & x86_pmu.pebs_capable;
>  
> -	if (cpuc->pebs_enabled)
> -		wrmsrq(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
> +	if (pebs_enabled)
> +		wrmsrq(MSR_IA32_PEBS_ENABLE, pebs_enabled);
>  }
>  
>  void intel_pmu_pebs_disable_all(void)
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Patch v6 05/22] perf/x86: Use x86_perf_regs in the x86 nmi handler
  2026-02-09  7:20 ` [Patch v6 05/22] perf/x86: Use x86_perf_regs in the x86 nmi handler Dapeng Mi
@ 2026-02-10 18:40   ` Peter Zijlstra
  2026-02-11  6:26     ` Mi, Dapeng
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2026-02-10 18:40 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang

On Mon, Feb 09, 2026 at 03:20:30PM +0800, Dapeng Mi wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
> 
> More and more regs will be supported in the overflow, e.g., more vector
> registers, SSP, etc. The generic pt_regs struct cannot store all of
> them. Use a X86 specific x86_perf_regs instead.
> 
> The struct pt_regs *regs is still passed to x86_pmu_handle_irq(). There
> is no functional change for the existing code.
> 
> AMD IBS's NMI handler doesn't utilize the static call
> x86_pmu_handle_irq(). The x86_perf_regs struct doesn't apply to the AMD
> IBS. It can be added separately later when AMD IBS supports more regs.
> 
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> ---
>  arch/x86/events/core.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 6df73e8398cd..8c80d22864d8 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -1785,6 +1785,7 @@ EXPORT_SYMBOL_FOR_KVM(perf_put_guest_lvtpc);
>  static int
>  perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
>  {
> +	struct x86_perf_regs x86_regs;
>  	u64 start_clock;

So a few patches ago you pulled this off stack because too large, and
then here you stick it on stack again.

That is a wee bit inconsistent.

Furthermore, I think you can re-purpose that same off-stack copy. After
all, the pebs_drain thing can only happen:

 - from NMI (like here);
 - from context switch, when PMU is disabled (and thus no NMIs).

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Patch v6 12/22] perf: Add sampling support for SIMD registers
  2026-02-09  7:20 ` [Patch v6 12/22] perf: Add sampling support for SIMD registers Dapeng Mi
@ 2026-02-10 20:04   ` Peter Zijlstra
  2026-02-11  6:56     ` Mi, Dapeng
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2026-02-10 20:04 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang

On Mon, Feb 09, 2026 at 03:20:37PM +0800, Dapeng Mi wrote:
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index d487c55a4f3e..5742126f50cc 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -7761,6 +7761,50 @@ perf_output_sample_regs(struct perf_output_handle *handle,
>  	}
>  }
>  
> +static void
> +perf_output_sample_simd_regs(struct perf_output_handle *handle,
> +			     struct perf_event *event,
> +			     struct pt_regs *regs,
> +			     u64 mask, u32 pred_mask)
> +{
> +	u16 pred_qwords = event->attr.sample_simd_pred_reg_qwords;
> +	u16 vec_qwords = event->attr.sample_simd_vec_reg_qwords;
> +	u16 nr_vectors;
> +	u16 nr_pred;
> +	int bit;
> +	u64 val;
> +	u16 i;
> +
> +	nr_vectors = hweight64(mask);
> +	nr_pred = hweight32(pred_mask);
> +
> +	perf_output_put(handle, nr_vectors);
> +	perf_output_put(handle, vec_qwords);
> +	perf_output_put(handle, nr_pred);
> +	perf_output_put(handle, pred_qwords);
> +
> +	if (nr_vectors) {
> +		for (bit = 0; bit < sizeof(mask) * BITS_PER_BYTE; bit++) {
> +			if (!(BIT_ULL(bit) & mask))
> +				continue;
> +			for (i = 0; i < vec_qwords; i++) {
> +				val = perf_simd_reg_value(regs, bit, i, false);
> +				perf_output_put(handle, val);
> +			}
> +		}
> +	}
> +	if (nr_pred) {
> +		for (bit = 0; bit < sizeof(pred_mask) * BITS_PER_BYTE; bit++) {
> +			if (!(BIT(bit) & pred_mask))
> +				continue;
> +			for (i = 0; i < pred_qwords; i++) {
> +				val = perf_simd_reg_value(regs, bit, i, true);
> +				perf_output_put(handle, val);
> +			}
> +		}
> +	}
> +}

Yeah, that works, but it does make me sad. The existing
perf_output_sample_regs() has yet another solution.

Wondering how hard it could possibly be to write a for_each_set_bit()
variant that works on a given word (instead of an array), I did the
below.

It works (at least, the assembly looks about right); but I'm not sure
it's all I had hoped for either :-(

---
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7754,18 +7754,27 @@ void __weak perf_get_regs_user(struct pe
 	regs_user->abi = perf_reg_abi(current);
 }
 
+/* Until GCC-14+/clang-19+, which have __builtin_ctzg() */
+#define __ctzg(val, def) \
+	(val) ? _Generic((val), \
+			 unsigned int: __builtin_ctz(val), \
+			 unsigned long: __builtin_ctzl(val), \
+			 unsigned long long: __builtin_ctzll(val)) : (def)
+
+#define __next_bit(val, bit) \
+	({ auto __v = (val); \
+	   __v &= GENMASK(sizeof(__v) * BITS_PER_BYTE - 1, bit); \
+	   __ctzg(__v, -1); })
+
+#define word_for_each_set_bit(bit, val) \
+	for (int bit = 0; bit = __next_bit(val, bit), bit >= 0; bit++)
+
 static void
 perf_output_sample_regs(struct perf_output_handle *handle,
 			struct pt_regs *regs, u64 mask)
 {
-	int bit;
-	DECLARE_BITMAP(_mask, 64);
-
-	bitmap_from_u64(_mask, mask);
-	for_each_set_bit(bit, _mask, sizeof(mask) * BITS_PER_BYTE) {
-		u64 val;
-
-		val = perf_reg_value(regs, bit);
+	word_for_each_set_bit(bit, mask) {
+		u64 val = perf_reg_value(regs, bit);
 		perf_output_put(handle, val);
 	}
 }
@@ -7778,14 +7787,8 @@ perf_output_sample_simd_regs(struct perf
 {
 	u16 pred_qwords = event->attr.sample_simd_pred_reg_qwords;
 	u16 vec_qwords = event->attr.sample_simd_vec_reg_qwords;
-	u16 nr_vectors;
-	u16 nr_pred;
-	int bit;
-	u64 val;
-	u16 i;
-
-	nr_vectors = hweight64(mask);
-	nr_pred = hweight32(pred_mask);
+	u16 nr_vectors = hweight64(mask);
+	u16 nr_pred = hweight32(pred_mask);
 
 	perf_output_put(handle, nr_vectors);
 	perf_output_put(handle, vec_qwords);
@@ -7793,21 +7796,17 @@ perf_output_sample_simd_regs(struct perf
 	perf_output_put(handle, pred_qwords);
 
 	if (nr_vectors) {
-		for (bit = 0; bit < sizeof(mask) * BITS_PER_BYTE; bit++) {
-			if (!(BIT_ULL(bit) & mask))
-				continue;
-			for (i = 0; i < vec_qwords; i++) {
-				val = perf_simd_reg_value(regs, bit, i, false);
+		word_for_each_set_bit(bit, mask) {
+			for (int i = 0; i < vec_qwords; i++) {
+				u64 val = perf_simd_reg_value(regs, bit, i, false);
 				perf_output_put(handle, val);
 			}
 		}
 	}
 	if (nr_pred) {
-		for (bit = 0; bit < sizeof(pred_mask) * BITS_PER_BYTE; bit++) {
-			if (!(BIT(bit) & pred_mask))
-				continue;
-			for (i = 0; i < pred_qwords; i++) {
-				val = perf_simd_reg_value(regs, bit, i, true);
+		word_for_each_set_bit(bit, pred_mask) {
+			for (int i = 0; i < pred_qwords; i++) {
+				u64 val = perf_simd_reg_value(regs, bit, i, true);
 				perf_output_put(handle, val);
 			}
 		}

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Patch v6 01/22] perf/x86/intel: Restrict PEBS_ENABLE writes to PEBS-capable counters
  2026-02-10 15:36   ` Peter Zijlstra
@ 2026-02-11  5:47     ` Mi, Dapeng
  0 siblings, 0 replies; 45+ messages in thread
From: Mi, Dapeng @ 2026-02-11  5:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao


On 2/10/2026 11:36 PM, Peter Zijlstra wrote:
> On Mon, Feb 09, 2026 at 03:20:26PM +0800, Dapeng Mi wrote:
>> Before the introduction of extended PEBS, PEBS supported only
>> general-purpose (GP) counters. In a virtual machine (VM) environment,
>> the PEBS_BASELINE bit in PERF_CAPABILITIES may not be set, but the PEBS
>> format could be indicated as 4 or higher. In such cases, PEBS events
>> might be scheduled to fixed counters, and writing the corresponding bits
>> into the PEBS_ENABLE MSR could cause a #GP fault.
>>
>> To prevent writing unsupported bits into the PEBS_ENABLE MSR, ensure
>> cpuc->pebs_enabled aligns with x86_pmu.pebs_capable and restrict the
>> writes to only PEBS-capable counter bits.
> This seems very wrong. Should we not avoid getting those bits set in the
> first place?

Hmm, yes. I originally thought it was fine to block access to these invalid
bits in the PEBS_ENABLE MSR, but I agree it should be blocked as early as possible.

Currently the intel_pebs_constraints() helper doesn't check whether the
matched PEBS constraint contains fixed counter indexes when extended PEBS
is not supported.

We may need the below change (just built, not tested yet).

diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 94ada08360f1..bc36808bdb7b 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1557,6 +1557,14 @@ struct event_constraint *intel_pebs_constraints(struct perf_event *event)
        if (pebs_constraints) {
                for_each_event_constraint(c, pebs_constraints) {
                        if (constraint_match(c, event->hw.config)) {
+                               /*
+                                * If fixed counters are suggested in the constraints,
+                                * but extended PEBS is not supported, emptyconstraint
+                                * should be returned.
+                                */
+                               if ((c->idxmsk64 & ~PEBS_COUNTER_MASK) &&
+                                   !(x86_pmu.flags & PMU_FL_PEBS_ALL))
+                                       break;
                                event->hw.flags |= c->flags;
                                return c;
                        }

Thanks.


>
> That is; the fact that we set those cpuc->pebs_enabled bits indicates
> that we 'successfully' scheduled PEBS counters. And then we silently
> disable PEBS when programming the hardware.
>
> Or am I reading this wrong?
>
>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> ---
>>
>> V6: new patch.
>>
>>  arch/x86/events/intel/core.c |  6 ++++--
>>  arch/x86/events/intel/ds.c   | 11 +++++++----
>>  2 files changed, 11 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
>> index f3ae1f8ee3cd..546ebc7e1624 100644
>> --- a/arch/x86/events/intel/core.c
>> +++ b/arch/x86/events/intel/core.c
>> @@ -3554,8 +3554,10 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
>>  		 * cpuc->enabled has been forced to 0 in PMI.
>>  		 * Update the MSR if pebs_enabled is changed.
>>  		 */
>> -		if (pebs_enabled != cpuc->pebs_enabled)
>> -			wrmsrq(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
>> +		if (pebs_enabled != cpuc->pebs_enabled) {
>> +			wrmsrq(MSR_IA32_PEBS_ENABLE,
>> +			       cpuc->pebs_enabled & x86_pmu.pebs_capable);
>> +		}
>>  
>>  		/*
>>  		 * Above PEBS handler (PEBS counters snapshotting) has updated fixed
>> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
>> index 5027afc97b65..57805c6ba0c3 100644
>> --- a/arch/x86/events/intel/ds.c
>> +++ b/arch/x86/events/intel/ds.c
>> @@ -1963,6 +1963,7 @@ void intel_pmu_pebs_disable(struct perf_event *event)
>>  {
>>  	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
>>  	struct hw_perf_event *hwc = &event->hw;
>> +	u64 pebs_enabled;
>>  
>>  	__intel_pmu_pebs_disable(event);
>>  
>> @@ -1974,16 +1975,18 @@ void intel_pmu_pebs_disable(struct perf_event *event)
>>  
>>  	intel_pmu_pebs_via_pt_disable(event);
>>  
>> -	if (cpuc->enabled)
>> -		wrmsrq(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
>> +	pebs_enabled = cpuc->pebs_enabled & x86_pmu.pebs_capable;
>> +	if (pebs_enabled)
>> +		wrmsrq(MSR_IA32_PEBS_ENABLE, pebs_enabled);
>>  }
>>  
>>  void intel_pmu_pebs_enable_all(void)
>>  {
>>  	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
>> +	u64 pebs_enabled = cpuc->pebs_enabled & x86_pmu.pebs_capable;
>>  
>> -	if (cpuc->pebs_enabled)
>> -		wrmsrq(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
>> +	if (pebs_enabled)
>> +		wrmsrq(MSR_IA32_PEBS_ENABLE, pebs_enabled);
>>  }
>>  
>>  void intel_pmu_pebs_disable_all(void)
>> -- 
>> 2.34.1
>>

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [Patch v6 05/22] perf/x86: Use x86_perf_regs in the x86 nmi handler
  2026-02-10 18:40   ` Peter Zijlstra
@ 2026-02-11  6:26     ` Mi, Dapeng
  0 siblings, 0 replies; 45+ messages in thread
From: Mi, Dapeng @ 2026-02-11  6:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang


On 2/11/2026 2:40 AM, Peter Zijlstra wrote:
> On Mon, Feb 09, 2026 at 03:20:30PM +0800, Dapeng Mi wrote:
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> More and more regs will be supported in the overflow, e.g., more vector
>> registers, SSP, etc. The generic pt_regs struct cannot store all of
>> them. Use a X86 specific x86_perf_regs instead.
>>
>> The struct pt_regs *regs is still passed to x86_pmu_handle_irq(). There
>> is no functional change for the existing code.
>>
>> AMD IBS's NMI handler doesn't utilize the static call
>> x86_pmu_handle_irq(). The x86_perf_regs struct doesn't apply to the AMD
>> IBS. It can be added separately later when AMD IBS supports more regs.
>>
>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> ---
>>  arch/x86/events/core.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
>> index 6df73e8398cd..8c80d22864d8 100644
>> --- a/arch/x86/events/core.c
>> +++ b/arch/x86/events/core.c
>> @@ -1785,6 +1785,7 @@ EXPORT_SYMBOL_FOR_KVM(perf_put_guest_lvtpc);
>>  static int
>>  perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
>>  {
>> +	struct x86_perf_regs x86_regs;
>>  	u64 start_clock;
> So a few patches ago you pulled this off stack because too large, and
> then here you stick it on stack again.
>
> That is a wee bit inconsistent.

Oh, yes. I just missed this place since no warning was reported here. Thanks.


>
> Furthermore, I think you can re-purpose that same off-stack copy. After
> all, the pebs_drain thing can only happen:
>
>  - from NMI (like here);
>  - from context switch, when PMU is disabled (and thus no NMIs).

I'm not sure whether we can use only one x86_perf_regs instance for both
PEBS and non-PEBS sampling. It may not work: when PEBS and non-PEBS events
overflow simultaneously in a PMI, the GPR values of a non-PEBS event could
be overwritten by the GPR values of the PEBS events if they share the same
x86_perf_regs instance. I need to check this further.

Thanks.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Patch v6 12/22] perf: Add sampling support for SIMD registers
  2026-02-10 20:04   ` Peter Zijlstra
@ 2026-02-11  6:56     ` Mi, Dapeng
  0 siblings, 0 replies; 45+ messages in thread
From: Mi, Dapeng @ 2026-02-11  6:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang


On 2/11/2026 4:04 AM, Peter Zijlstra wrote:
> On Mon, Feb 09, 2026 at 03:20:37PM +0800, Dapeng Mi wrote:
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index d487c55a4f3e..5742126f50cc 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -7761,6 +7761,50 @@ perf_output_sample_regs(struct perf_output_handle *handle,
>>  	}
>>  }
>>  
>> +static void
>> +perf_output_sample_simd_regs(struct perf_output_handle *handle,
>> +			     struct perf_event *event,
>> +			     struct pt_regs *regs,
>> +			     u64 mask, u32 pred_mask)
>> +{
>> +	u16 pred_qwords = event->attr.sample_simd_pred_reg_qwords;
>> +	u16 vec_qwords = event->attr.sample_simd_vec_reg_qwords;
>> +	u16 nr_vectors;
>> +	u16 nr_pred;
>> +	int bit;
>> +	u64 val;
>> +	u16 i;
>> +
>> +	nr_vectors = hweight64(mask);
>> +	nr_pred = hweight32(pred_mask);
>> +
>> +	perf_output_put(handle, nr_vectors);
>> +	perf_output_put(handle, vec_qwords);
>> +	perf_output_put(handle, nr_pred);
>> +	perf_output_put(handle, pred_qwords);
>> +
>> +	if (nr_vectors) {
>> +		for (bit = 0; bit < sizeof(mask) * BITS_PER_BYTE; bit++) {
>> +			if (!(BIT_ULL(bit) & mask))
>> +				continue;
>> +			for (i = 0; i < vec_qwords; i++) {
>> +				val = perf_simd_reg_value(regs, bit, i, false);
>> +				perf_output_put(handle, val);
>> +			}
>> +		}
>> +	}
>> +	if (nr_pred) {
>> +		for (bit = 0; bit < sizeof(pred_mask) * BITS_PER_BYTE; bit++) {
>> +			if (!(BIT(bit) & pred_mask))
>> +				continue;
>> +			for (i = 0; i < pred_qwords; i++) {
>> +				val = perf_simd_reg_value(regs, bit, i, true);
>> +				perf_output_put(handle, val);
>> +			}
>> +		}
>> +	}
>> +}
> Yeah, that works, but it does make me sad. The existing
> perf_output_sample_regs() has yet another solution.
>
> Wondering how hard it could possibly be to write a for_each_set_bit()
> variant that works on a given word (instead of an array), I did the
> below.
>
> It works (at least, the assembly looks about right); but I'm not sure
> it's all I had hoped for either :-(

Pretty code! It looks like I still haven't gotten used to writing this kind
of macro.

The code looks good to me; I'll test it later. Thanks.


>
> ---
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -7754,18 +7754,27 @@ void __weak perf_get_regs_user(struct pe
>  	regs_user->abi = perf_reg_abi(current);
>  }
>  
> +/* Until GCC-14+/clang-19+, which have __builtin_ctzg() */
> +#define __ctzg(val, def) \
> +	(val) ? _Generic((val), \
> +			 unsigned int: __builtin_ctz(val), \
> +			 unsigned long: __builtin_ctzl(val), \
> +			 unsigned long long: __builtin_ctzll(val)) : (def)
> +
> +#define __next_bit(val, bit) \
> +	({ auto __v = (val); \
> +	   __v &= GENMASK(sizeof(__v) * BITS_PER_BYTE - 1, bit); \
> +	   __ctzg(__v, -1); })
> +
> +#define word_for_each_set_bit(bit, val) \
> +	for (int bit = 0; bit = __next_bit(val, bit), bit >= 0; bit++)
> +
>  static void
>  perf_output_sample_regs(struct perf_output_handle *handle,
>  			struct pt_regs *regs, u64 mask)
>  {
> -	int bit;
> -	DECLARE_BITMAP(_mask, 64);
> -
> -	bitmap_from_u64(_mask, mask);
> -	for_each_set_bit(bit, _mask, sizeof(mask) * BITS_PER_BYTE) {
> -		u64 val;
> -
> -		val = perf_reg_value(regs, bit);
> +	word_for_each_set_bit(bit, mask) {
> +		u64 val = perf_reg_value(regs, bit);
>  		perf_output_put(handle, val);
>  	}
>  }
> @@ -7778,14 +7787,8 @@ perf_output_sample_simd_regs(struct perf
>  {
>  	u16 pred_qwords = event->attr.sample_simd_pred_reg_qwords;
>  	u16 vec_qwords = event->attr.sample_simd_vec_reg_qwords;
> -	u16 nr_vectors;
> -	u16 nr_pred;
> -	int bit;
> -	u64 val;
> -	u16 i;
> -
> -	nr_vectors = hweight64(mask);
> -	nr_pred = hweight32(pred_mask);
> +	u16 nr_vectors = hweight64(mask);
> +	u16 nr_pred = hweight32(pred_mask);
>  
>  	perf_output_put(handle, nr_vectors);
>  	perf_output_put(handle, vec_qwords);
> @@ -7793,21 +7796,17 @@ perf_output_sample_simd_regs(struct perf
>  	perf_output_put(handle, pred_qwords);
>  
>  	if (nr_vectors) {
> -		for (bit = 0; bit < sizeof(mask) * BITS_PER_BYTE; bit++) {
> -			if (!(BIT_ULL(bit) & mask))
> -				continue;
> -			for (i = 0; i < vec_qwords; i++) {
> -				val = perf_simd_reg_value(regs, bit, i, false);
> +		word_for_each_set_bit(bit, mask) {
> +			for (int i = 0; i < vec_qwords; i++) {
> +				u64 val = perf_simd_reg_value(regs, bit, i, false);
>  				perf_output_put(handle, val);
>  			}
>  		}
>  	}
>  	if (nr_pred) {
> -		for (bit = 0; bit < sizeof(pred_mask) * BITS_PER_BYTE; bit++) {
> -			if (!(BIT(bit) & pred_mask))
> -				continue;
> -			for (i = 0; i < pred_qwords; i++) {
> -				val = perf_simd_reg_value(regs, bit, i, true);
> +		word_for_each_set_bit(bit, pred_mask) {
> +			for (int i = 0; i < pred_qwords; i++) {
> +				u64 val = perf_simd_reg_value(regs, bit, i, true);
>  				perf_output_put(handle, val);
>  			}
>  		}

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Patch v6 08/22] x86/fpu: Ensure TIF_NEED_FPU_LOAD is set after saving FPU state
  2026-02-09  7:20 ` [Patch v6 08/22] x86/fpu: Ensure TIF_NEED_FPU_LOAD is set after saving FPU state Dapeng Mi
@ 2026-02-11 19:39   ` Chang S. Bae
  2026-02-11 19:55     ` Dave Hansen
  2026-02-24  5:35     ` Mi, Dapeng
  0 siblings, 2 replies; 45+ messages in thread
From: Chang S. Bae @ 2026-02-11 19:39 UTC (permalink / raw)
  To: Dapeng Mi, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao

On 2/8/2026 11:20 PM, Dapeng Mi wrote:
> Ensure that the TIF_NEED_FPU_LOAD flag is always set after saving the
> FPU state. This guarantees that the user space FPU state has been saved
> whenever the TIF_NEED_FPU_LOAD flag is set.
> 
> A subsequent patch will verify if the user space FPU state can be
> retrieved from the saved task FPU state in the NMI context by checking
> the TIF_NEED_FPU_LOAD flag.
> 
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>

I was able to find some previous discussions here:
https://lore.kernel.org/all/20251204151735.GO2528459@noisy.programming.kicks-ass.net/

With that, I think I can get the context of this change -- tightening 
the flag up with the buffer against *NMIs*.

But the changelog doesn't look clear enough to explain the reason to me.

At least, I think Link tags to the past discussions would provide better
context to readers.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Patch v6 08/22] x86/fpu: Ensure TIF_NEED_FPU_LOAD is set after saving FPU state
  2026-02-11 19:39   ` Chang S. Bae
@ 2026-02-11 19:55     ` Dave Hansen
  2026-02-24  6:50       ` Mi, Dapeng
  2026-02-25 13:02       ` Peter Zijlstra
  2026-02-24  5:35     ` Mi, Dapeng
  1 sibling, 2 replies; 45+ messages in thread
From: Dave Hansen @ 2026-02-11 19:55 UTC (permalink / raw)
  To: Chang S. Bae, Dapeng Mi, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Dave Hansen, Ian Rogers, Adrian Hunter, Jiri Olsa,
	Alexander Shishkin, Andi Kleen, Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao

On 2/11/26 11:39, Chang S. Bae wrote:
> But the changelog doesn't look clear enough to explain the reason to me.

Yeah, the changelog could use some improvement.

This also leaves the code rather fragile. Right now, all of the:

	save_fpregs_to_fpstate(fpu);
	set_thread_flag(TIF_NEED_FPU_LOAD);

pairs are open-coded. There's precisely nothing stopping someone from
coming in tomorrow and reversing the order at these or other call sites.
There's also zero comments left in the code to tell folks not to do this.

Are there enough of these to have a helper that does:

	save_fpregs_to_fpstate_before_invalidation(fpu);

which could do the save and set TIF_NEED_FPU_LOAD? (that name is awful
btw, please don't use it).

This:

>  	/* Swap fpstate */
>  	if (enter_guest) {
> -		fpu->__task_fpstate = cur_fps;
> +		WRITE_ONCE(fpu->__task_fpstate, cur_fps);
> +		barrier();
>  		fpu->fpstate = guest_fps;
>  		guest_fps->in_use = true;
>  	} else {
>  		guest_fps->in_use = false;
>  		fpu->fpstate = fpu->__task_fpstate;
> -		fpu->__task_fpstate = NULL;
> +		barrier();
> +		WRITE_ONCE(fpu->__task_fpstate, NULL);
>  	}

also urgently needs comments.

I also can't help but think that there might be a nicer way to do that
without the barrier(). I _think_ two correctly-ordered WRITE_ONCE()'s
would make the compiler do the same thing as the barrier().

But I'm not fully understanding what the barrier() is doing anyway, so
take that with a grain of salt.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Patch v6 10/22] perf/x86: Enable XMM Register Sampling for Non-PEBS Events
  2026-02-09  7:20 ` [Patch v6 10/22] perf/x86: Enable XMM Register Sampling for Non-PEBS Events Dapeng Mi
@ 2026-02-15 23:58   ` Chang S. Bae
  2026-02-24  7:11     ` Mi, Dapeng
  0 siblings, 1 reply; 45+ messages in thread
From: Chang S. Bae @ 2026-02-15 23:58 UTC (permalink / raw)
  To: Dapeng Mi, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang

On 2/8/2026 11:20 PM, Dapeng Mi wrote:
> 
> This patch supports XMM sampling for non-PEBS events in the `REGS_INTR`

Please avoid "This patch".

...

> +static inline void __x86_pmu_sample_ext_regs(u64 mask)
> +{
> +	struct xregs_state *xsave = per_cpu(ext_regs_buf, smp_processor_id());
> +
> +	if (WARN_ON_ONCE(!xsave))
> +		return;
> +
> +	xsaves_nmi(xsave, mask);
> +}
> +
> +static inline void x86_pmu_update_ext_regs(struct x86_perf_regs *perf_regs,
> +					   struct xregs_state *xsave, u64 bitmap)
> +{
> +	u64 mask;
> +
> +	if (!xsave)
> +		return;
> +
> +	/* Filtered by what XSAVE really gives */
> +	mask = bitmap & xsave->header.xfeatures;
> +
> +	if (mask & XFEATURE_MASK_SSE)
> +		perf_regs->xmm_space = xsave->i387.xmm_space;
> +}
> +
> +static void x86_pmu_sample_extended_regs(struct perf_event *event,
> +					 struct perf_sample_data *data,
> +					 struct pt_regs *regs,
> +					 u64 ignore_mask)
> +{

...

> +
> +	if (intr_mask) {
> +		__x86_pmu_sample_ext_regs(intr_mask);
> +		xsave = per_cpu(ext_regs_buf, smp_processor_id());
> +		x86_pmu_update_ext_regs(perf_regs, xsave, intr_mask);

These three lines appear to just update the xcomponent _pointers_ into an 
xsave storage area:

   * Retrieve a per-cpu XSAVE buffer if valid
   * Ensure the component's presence against XSTATE_BV
   * Then, update pointers in perf_regs.

which could be done in one function with a more descriptive name. Given 
that, I don't think __x86_pmu_sample_ext_regs() has any point. I don't 
understand the x86_pmu_sample_extended_regs() naming, either.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Patch v6 08/22] x86/fpu: Ensure TIF_NEED_FPU_LOAD is set after saving FPU state
  2026-02-11 19:39   ` Chang S. Bae
  2026-02-11 19:55     ` Dave Hansen
@ 2026-02-24  5:35     ` Mi, Dapeng
  2026-02-24 19:13       ` Chang S. Bae
  1 sibling, 1 reply; 45+ messages in thread
From: Mi, Dapeng @ 2026-02-24  5:35 UTC (permalink / raw)
  To: Chang S. Bae, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Dave Hansen, Ian Rogers, Adrian Hunter, Jiri Olsa,
	Alexander Shishkin, Andi Kleen, Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao


On 2/12/2026 3:39 AM, Chang S. Bae wrote:
> On 2/8/2026 11:20 PM, Dapeng Mi wrote:
>> Ensure that the TIF_NEED_FPU_LOAD flag is always set after saving the
>> FPU state. This guarantees that the user space FPU state has been saved
>> whenever the TIF_NEED_FPU_LOAD flag is set.
>>
>> A subsequent patch will verify if the user space FPU state can be
>> retrieved from the saved task FPU state in the NMI context by checking
>> the TIF_NEED_FPU_LOAD flag.
>>
>> Suggested-by: Peter Zijlstra <peterz@infradead.org>
>> Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
> I could check some previous discussions there:
> https://lore.kernel.org/all/20251204151735.GO2528459@noisy.programming.kicks-ass.net/
>
> With that, I think I can get the context of this change -- tightening 
> the flag up with the buffer against *NMIs*.
>
> But the changelog doesn't look clear enough to explain the reason to me.
>
> At least, I guess Links to the past discussions provides better context 
> to readers.

Thanks for the reminder. I will add a "Closes" tag here.



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Patch v6 08/22] x86/fpu: Ensure TIF_NEED_FPU_LOAD is set after saving FPU state
  2026-02-11 19:55     ` Dave Hansen
@ 2026-02-24  6:50       ` Mi, Dapeng
  2026-02-25 13:02       ` Peter Zijlstra
  1 sibling, 0 replies; 45+ messages in thread
From: Mi, Dapeng @ 2026-02-24  6:50 UTC (permalink / raw)
  To: Dave Hansen, Chang S. Bae, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Dave Hansen, Ian Rogers, Adrian Hunter, Jiri Olsa,
	Alexander Shishkin, Andi Kleen, Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao


On 2/12/2026 3:55 AM, Dave Hansen wrote:
> On 2/11/26 11:39, Chang S. Bae wrote:
>> But the changelog doesn't look clear enough to explain the reason to me.
> Yeah, the changelog could use some improvement.
>
> This also leaves the code rather fragile. Right now, all of the:
>
> 	save_fpregs_to_fpstate(fpu);
> 	set_thread_flag(TIF_NEED_FPU_LOAD);
>
> pairs are open-coded. There's precisely nothing stopping someone from
> coming in tomorrow and reversing the order at these or other call sites.
> There's also zero comments left in the code to tell folks not to do this.
>
> Are there enough of these to have a helper that does:
>
> 	save_fpregs_to_fpstate_before_invalidation(fpu);
>
> which could do the save and set TIF_NEED_FPU_LOAD? (that name is awful
> btw, please don't use it).

Sure. I'll introduce an inline function and add comments. Thanks.


>
> This:
>
>>  	/* Swap fpstate */
>>  	if (enter_guest) {
>> -		fpu->__task_fpstate = cur_fps;
>> +		WRITE_ONCE(fpu->__task_fpstate, cur_fps);
>> +		barrier();
>>  		fpu->fpstate = guest_fps;
>>  		guest_fps->in_use = true;
>>  	} else {
>>  		guest_fps->in_use = false;
>>  		fpu->fpstate = fpu->__task_fpstate;
>> -		fpu->__task_fpstate = NULL;
>> +		barrier();
>> +		WRITE_ONCE(fpu->__task_fpstate, NULL);
>>  	}
> also urgently needs comments.
>
> I also can't help but think that there might be a nicer way to do that
> without the barrier(). I _think_ two correctly-ordered WRITE_ONCE()'s
> would make the compiler do the same thing as the barrier().
>
> But I'm not fully understanding what the barrier() is doing anyway, so
> take that with a grain of salt.

IMO, barrier() seems the safer way for this situation. I'm not quite sure
whether two WRITE_ONCE()s would prevent the compiler from reordering, so I
asked Gemini, and here is its answer.

"

WRITE_ONCE() provides a certain level of protection, but it isn't a full
barrier.

WRITE_ONCE(): Prevents the compiler from optimizing the store away or
splitting it into multiple instructions.

Two WRITE_ONCE() calls: The compiler is generally discouraged from
reordering two volatile-style accesses (which WRITE_ONCE uses), but the C
standard doesn't strictly guarantee it in all cases.

"

Thanks.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Patch v6 10/22] perf/x86: Enable XMM Register Sampling for Non-PEBS Events
  2026-02-15 23:58   ` Chang S. Bae
@ 2026-02-24  7:11     ` Mi, Dapeng
  2026-02-24 19:13       ` Chang S. Bae
  0 siblings, 1 reply; 45+ messages in thread
From: Mi, Dapeng @ 2026-02-24  7:11 UTC (permalink / raw)
  To: Chang S. Bae, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Dave Hansen, Ian Rogers, Adrian Hunter, Jiri Olsa,
	Alexander Shishkin, Andi Kleen, Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang


On 2/16/2026 7:58 AM, Chang S. Bae wrote:
> On 2/8/2026 11:20 PM, Dapeng Mi wrote:
>> This patch supports XMM sampling for non-PEBS events in the `REGS_INTR`
> Please avoid "This patch".

Sure.


>
> ...
>
>> +static inline void __x86_pmu_sample_ext_regs(u64 mask)
>> +{
>> +	struct xregs_state *xsave = per_cpu(ext_regs_buf, smp_processor_id());
>> +
>> +	if (WARN_ON_ONCE(!xsave))
>> +		return;
>> +
>> +	xsaves_nmi(xsave, mask);
>> +}
>> +
>> +static inline void x86_pmu_update_ext_regs(struct x86_perf_regs *perf_regs,
>> +					   struct xregs_state *xsave, u64 bitmap)
>> +{
>> +	u64 mask;
>> +
>> +	if (!xsave)
>> +		return;
>> +
>> +	/* Filtered by what XSAVE really gives */
>> +	mask = bitmap & xsave->header.xfeatures;
>> +
>> +	if (mask & XFEATURE_MASK_SSE)
>> +		perf_regs->xmm_space = xsave->i387.xmm_space;
>> +}
>> +
>> +static void x86_pmu_sample_extended_regs(struct perf_event *event,
>> +					 struct perf_sample_data *data,
>> +					 struct pt_regs *regs,
>> +					 u64 ignore_mask)
>> +{
> ...
>
>> +
>> +	if (intr_mask) {
>> +		__x86_pmu_sample_ext_regs(intr_mask);
>> +		xsave = per_cpu(ext_regs_buf, smp_processor_id());
>> +		x86_pmu_update_ext_regs(perf_regs, xsave, intr_mask);
> These three lines appear to just update xcomponent's _pointers_ to an 
> xsave storage:
>
>    * Retrieve a per-cpu XSAVE buffer if valid
>    * Ensure the component's presence against XSTATE_BV
>    * Then, update pointers in pref_regs.
>
> which could be done in a function with more descriptive name. Given 
> that, I don't think__x86_pmu_sample_ext_regs() has any point. I don't 
> understand x86_pmu_sample_extended_regs() naming, either.

Hmm, partially agree -- __x86_pmu_sample_ext_regs() can be dropped. The
intent of this whole function is to sample the required SIMD and APX eGPR
registers by leveraging the XSAVES instruction in an NMI handler context.
All the SIMD and eGPR registers are collectively called "extended
registers" here. I'm not sure "extended registers" is a good term; please
suggest a better one if there is.

Thanks.



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Patch v6 08/22] x86/fpu: Ensure TIF_NEED_FPU_LOAD is set after saving FPU state
  2026-02-24  5:35     ` Mi, Dapeng
@ 2026-02-24 19:13       ` Chang S. Bae
  2026-02-25  0:35         ` Mi, Dapeng
  0 siblings, 1 reply; 45+ messages in thread
From: Chang S. Bae @ 2026-02-24 19:13 UTC (permalink / raw)
  To: Mi, Dapeng, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao

On 2/23/2026 9:35 PM, Mi, Dapeng wrote:
> 
> Thanks for the reminder. I will add a "Closes" tag here.

But Closes is for a bug fix:

Another tag is used for linking web pages with additional backgrounds or 
details, for example an earlier discussion which leads to the patch or a 
document with a specification implemented by the patch:

Link: https://example.com/somewhere.html  optional-other-stuff

As per guidance from the Chief Penguin, a Link: tag should only be added 
to a commit if it leads to useful information that is not found in the 
commit itself.

If the URL points to a public bug report being fixed by the patch, use 
the “Closes:” tag instead:

Closes: https://example.com/issues/1234  optional-other-stuff


* Re: [Patch v6 10/22] perf/x86: Enable XMM Register Sampling for Non-PEBS Events
  2026-02-24  7:11     ` Mi, Dapeng
@ 2026-02-24 19:13       ` Chang S. Bae
  2026-02-25  0:55         ` Mi, Dapeng
  0 siblings, 1 reply; 45+ messages in thread
From: Chang S. Bae @ 2026-02-24 19:13 UTC (permalink / raw)
  To: Mi, Dapeng, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang

On 2/23/2026 11:11 PM, Mi, Dapeng wrote:
> 
> intent of this whole function is to sample the required SIMD and APX eGPRs
> registers by leveraging xsaves instruction in a NMI handler context. All
> the SIMD and eGPRs registers are collectively called "extended registers"
> here. Not sure if the "extended registers" is a good word, please suggest
> if there is a better one.

They are sometimes referred to as 'xregs' for short. Also, this is a local 
function. Following the do_something() style, perhaps just 
update_xregs_state() or update_perf_xregs().


* Re: [Patch v6 08/22] x86/fpu: Ensure TIF_NEED_FPU_LOAD is set after saving FPU state
  2026-02-24 19:13       ` Chang S. Bae
@ 2026-02-25  0:35         ` Mi, Dapeng
  0 siblings, 0 replies; 45+ messages in thread
From: Mi, Dapeng @ 2026-02-25  0:35 UTC (permalink / raw)
  To: Chang S. Bae, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Dave Hansen, Ian Rogers, Adrian Hunter, Jiri Olsa,
	Alexander Shishkin, Andi Kleen, Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao


On 2/25/2026 3:13 AM, Chang S. Bae wrote:
> On 2/23/2026 9:35 PM, Mi, Dapeng wrote:
>> Thanks for the reminder. I will add a "Closes" tag here.
> But Closes is for a bug fix:
>
> Another tag is used for linking web pages with additional backgrounds or 
> details, for example an earlier discussion which leads to the patch or a 
> document with a specification implemented by the patch:
>
> Link: https://example.com/somewhere.html  optional-other-stuff
>
> As per guidance from the Chief Penguin, a Link: tag should only be added 
> to a commit if it leads to useful information that is not found in the 
> commit itself.
>
> If the URL points to a public bug report being fixed by the patch, use 
> the “Closes:” tag instead:
>
> Closes: https://example.com/issues/1234  optional-other-stuff

Ok, thanks for pointing this out.




* Re: [Patch v6 10/22] perf/x86: Enable XMM Register Sampling for Non-PEBS Events
  2026-02-24 19:13       ` Chang S. Bae
@ 2026-02-25  0:55         ` Mi, Dapeng
  2026-02-25  1:11           ` Chang S. Bae
  0 siblings, 1 reply; 45+ messages in thread
From: Mi, Dapeng @ 2026-02-25  0:55 UTC (permalink / raw)
  To: Chang S. Bae, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Dave Hansen, Ian Rogers, Adrian Hunter, Jiri Olsa,
	Alexander Shishkin, Andi Kleen, Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang


On 2/25/2026 3:13 AM, Chang S. Bae wrote:
> On 2/23/2026 11:11 PM, Mi, Dapeng wrote:
>> intent of this whole function is to sample the required SIMD and APX eGPRs
>> registers by leveraging xsaves instruction in a NMI handler context. All
>> the SIMD and eGPRs registers are collectively called "extended registers"
>> here. Not sure if the "extended registers" is a good word, please suggest
>> if there is a better one.
> They are sometimes referred as 'xregs' in short. Also this is a local 
> function. Following the do_something() style, perhaps just 
> update_xregs_state() or update_perf_xregs()

Thanks, 'xregs' is a good word. But considering the current naming
convention in arch/x86/events/core.c, I would add the "x86_pmu" prefix and
name the function "x86_pmu_sample_xregs". In the perf/PMU context, "sample"
is a more precise word than "update".




* Re: [Patch v6 10/22] perf/x86: Enable XMM Register Sampling for Non-PEBS Events
  2026-02-25  0:55         ` Mi, Dapeng
@ 2026-02-25  1:11           ` Chang S. Bae
  2026-02-25  1:36             ` Mi, Dapeng
  0 siblings, 1 reply; 45+ messages in thread
From: Chang S. Bae @ 2026-02-25  1:11 UTC (permalink / raw)
  To: Mi, Dapeng, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang

On 2/24/2026 4:55 PM, Mi, Dapeng wrote:
> 
> Thanks, 'xregs' is a good word. But considering current naming convention
> in arch/x86/events/core.c, I would add the "x86_pmu" prefix

Do you know, or can you explain, the purpose of that prefix on every 
function there? Is it for 'git grep x86_pmu'? That is probably useful for 
functions in a header file, but here it looks like it just makes the name 
longer without good reason.



* Re: [Patch v6 10/22] perf/x86: Enable XMM Register Sampling for Non-PEBS Events
  2026-02-25  1:11           ` Chang S. Bae
@ 2026-02-25  1:36             ` Mi, Dapeng
  2026-02-25  3:14               ` Chang S. Bae
  0 siblings, 1 reply; 45+ messages in thread
From: Mi, Dapeng @ 2026-02-25  1:36 UTC (permalink / raw)
  To: Chang S. Bae, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Dave Hansen, Ian Rogers, Adrian Hunter, Jiri Olsa,
	Alexander Shishkin, Andi Kleen, Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang


On 2/25/2026 9:11 AM, Chang S. Bae wrote:
> On 2/24/2026 4:55 PM, Mi, Dapeng wrote:
>> Thanks, 'xregs' is a good word. But considering current naming convention
>> in arch/x86/events/core.c, I would add the "x86_pmu" prefix
> Do you know or can explain the purpose of that prefix on every function 
> there? Is it for 'git grep x86_pmu'? But that's probably useful for 
> those in a header file. Looks like just make it longer without good reason.

Per my understanding, the "x86_pmu" naming prefix emphasizes that the
function is an x86-arch generic function rather than a vendor-specific PMU
function like intel_pmu_* or amd_pmu_*, or another architecture's PMU
function.

E.g., in the current perf code there is the generic x86 event enabling
function x86_pmu_enable_event() and the vendor-specific event enabling
functions intel_pmu_enable_event() and amd_pmu_enable_event().

Thanks.



* Re: [Patch v6 10/22] perf/x86: Enable XMM Register Sampling for Non-PEBS Events
  2026-02-25  1:36             ` Mi, Dapeng
@ 2026-02-25  3:14               ` Chang S. Bae
  2026-02-25  6:13                 ` Mi, Dapeng
  0 siblings, 1 reply; 45+ messages in thread
From: Chang S. Bae @ 2026-02-25  3:14 UTC (permalink / raw)
  To: Mi, Dapeng, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang

On 2/24/2026 5:36 PM, Mi, Dapeng wrote:
> 
> E.g., there is the generic x86 event enabling function
> x86_pmu_enable_event() 
Again that's in the header file, which looks okay to me.

Now looking at the file, core.c

    get_possible_counter_mask()
    reserve_pmc_hardware()
    release_pmc_hardware()
    set_ext_hw_attr()
    precise_br_compat()
    add_nr_metric_event()
    collect_event()
    ...

Then do all of them violate that naming convention?

There is also another pattern: assigning x86_pmu_xxx() to function 
pointers in struct pmu. Certainly this xsaving function isn't that 
case either.






* Re: [Patch v6 10/22] perf/x86: Enable XMM Register Sampling for Non-PEBS Events
  2026-02-25  3:14               ` Chang S. Bae
@ 2026-02-25  6:13                 ` Mi, Dapeng
  0 siblings, 0 replies; 45+ messages in thread
From: Mi, Dapeng @ 2026-02-25  6:13 UTC (permalink / raw)
  To: Chang S. Bae, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Dave Hansen, Ian Rogers, Adrian Hunter, Jiri Olsa,
	Alexander Shishkin, Andi Kleen, Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang


On 2/25/2026 11:14 AM, Chang S. Bae wrote:
> On 2/24/2026 5:36 PM, Mi, Dapeng wrote:
>> E.g., there is the generic x86 event enabling function
>> x86_pmu_enable_event() 
> Again that's in the header file, which looks okay to me.
>
> Now looking at the file, core.c
>
>     get_possible_counter_mask()
>     reserve_pmc_hardware()
>     release_pmc_hardware()
>     set_ext_hw_attr()
>     precise_br_compat()
>     add_nr_metric_event()
>     collect_event()
>     ...
>
> Then all of them violate that naming convention?
>
> There is also another pattern assigning x86_pmu_xxx() to function 
> pointers in struct pmu. Certainly this xsaving function itself isn't the 
> case.

IMO, the functions mentioned above serve as auxiliary functions and are not
directly tied to core PMU functionality such as event enabling, disabling,
counting, and sampling, so it is fine for them not to carry the prefix.

Currently there are no strict naming conventions for functions in perf, so
I can't say which one must be right or better. But at least from my side, I
believe the function x86_pmu_sample_extended_regs() does more than just
wrap an xsaves instruction. It performs the complete sampling process for
extended registers, including parsing the extended register mask, executing
the xsaves instruction, and updating the perf_regs pointers based on the
data obtained from xsaves. Thus, it is not a simple auxiliary function.
Additionally, x86_pmu_sample_xregs() is already concise, and further
shortening seems unnecessary. Thanks.




* Re: [Patch v6 08/22] x86/fpu: Ensure TIF_NEED_FPU_LOAD is set after saving FPU state
  2026-02-11 19:55     ` Dave Hansen
  2026-02-24  6:50       ` Mi, Dapeng
@ 2026-02-25 13:02       ` Peter Zijlstra
  1 sibling, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2026-02-25 13:02 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Chang S. Bae, Dapeng Mi, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane, Mark Rutland, broonie, Ravi Bangoria,
	linux-kernel, linux-perf-users, Zide Chen, Falcon Thomas,
	Dapeng Mi, Xudong Hao

On Wed, Feb 11, 2026 at 11:55:10AM -0800, Dave Hansen wrote:

> This:
> 
> >  	/* Swap fpstate */
> >  	if (enter_guest) {
> > -		fpu->__task_fpstate = cur_fps;
> > +		WRITE_ONCE(fpu->__task_fpstate, cur_fps);
> > +		barrier();
> >  		fpu->fpstate = guest_fps;
> >  		guest_fps->in_use = true;
> >  	} else {
> >  		guest_fps->in_use = false;
> >  		fpu->fpstate = fpu->__task_fpstate;
> > -		fpu->__task_fpstate = NULL;
> > +		barrier();
> > +		WRITE_ONCE(fpu->__task_fpstate, NULL);
> >  	}
> 
> also urgently needs comments.
> 
> I also can't help but think that there might be a nicer way to do that
> without the barrier(). I _think_ two correctly-ordered WRITE_ONCE()'s
> would make the compiler do the same thing as the barrier().
> 
> But I'm not fully understanding what the barrier() is doing anyway, so
> take that with a grain of salt.

barrier() restricts load/store movement by the compiler. A load/store
cannot be moved across barrier().



end of thread, other threads:[~2026-02-25 13:02 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-09  7:20 [Patch v6 00/22] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
2026-02-09  7:20 ` [Patch v6 01/22] perf/x86/intel: Restrict PEBS_ENABLE writes to PEBS-capable counters Dapeng Mi
2026-02-10 15:36   ` Peter Zijlstra
2026-02-11  5:47     ` Mi, Dapeng
2026-02-09  7:20 ` [Patch v6 02/22] perf/x86/intel: Enable large PEBS sampling for XMMs Dapeng Mi
2026-02-09  7:20 ` [Patch v6 03/22] perf/x86/intel: Convert x86_perf_regs to per-cpu variables Dapeng Mi
2026-02-09  7:20 ` [Patch v6 04/22] perf: Eliminate duplicate arch-specific functions definations Dapeng Mi
2026-02-09  7:20 ` [Patch v6 05/22] perf/x86: Use x86_perf_regs in the x86 nmi handler Dapeng Mi
2026-02-10 18:40   ` Peter Zijlstra
2026-02-11  6:26     ` Mi, Dapeng
2026-02-09  7:20 ` [Patch v6 06/22] perf/x86: Introduce x86-specific x86_pmu_setup_regs_data() Dapeng Mi
2026-02-09  7:20 ` [Patch v6 07/22] x86/fpu/xstate: Add xsaves_nmi() helper Dapeng Mi
2026-02-09  7:20 ` [Patch v6 08/22] x86/fpu: Ensure TIF_NEED_FPU_LOAD is set after saving FPU state Dapeng Mi
2026-02-11 19:39   ` Chang S. Bae
2026-02-11 19:55     ` Dave Hansen
2026-02-24  6:50       ` Mi, Dapeng
2026-02-25 13:02       ` Peter Zijlstra
2026-02-24  5:35     ` Mi, Dapeng
2026-02-24 19:13       ` Chang S. Bae
2026-02-25  0:35         ` Mi, Dapeng
2026-02-09  7:20 ` [Patch v6 09/22] perf: Move and rename has_extended_regs() for ARCH-specific use Dapeng Mi
2026-02-09  7:20 ` [Patch v6 10/22] perf/x86: Enable XMM Register Sampling for Non-PEBS Events Dapeng Mi
2026-02-15 23:58   ` Chang S. Bae
2026-02-24  7:11     ` Mi, Dapeng
2026-02-24 19:13       ` Chang S. Bae
2026-02-25  0:55         ` Mi, Dapeng
2026-02-25  1:11           ` Chang S. Bae
2026-02-25  1:36             ` Mi, Dapeng
2026-02-25  3:14               ` Chang S. Bae
2026-02-25  6:13                 ` Mi, Dapeng
2026-02-09  7:20 ` [Patch v6 11/22] perf/x86: Enable XMM register sampling for REGS_USER case Dapeng Mi
2026-02-09  7:20 ` [Patch v6 12/22] perf: Add sampling support for SIMD registers Dapeng Mi
2026-02-10 20:04   ` Peter Zijlstra
2026-02-11  6:56     ` Mi, Dapeng
2026-02-09  7:20 ` [Patch v6 13/22] perf/x86: Enable XMM sampling using sample_simd_vec_reg_* fields Dapeng Mi
2026-02-09  7:20 ` [Patch v6 14/22] perf/x86: Enable YMM " Dapeng Mi
2026-02-09  7:20 ` [Patch v6 15/22] perf/x86: Enable ZMM " Dapeng Mi
2026-02-09  7:20 ` [Patch v6 16/22] perf/x86: Enable OPMASK sampling using sample_simd_pred_reg_* fields Dapeng Mi
2026-02-09  7:20 ` [Patch v6 17/22] perf: Enhance perf_reg_validate() with simd_enabled argument Dapeng Mi
2026-02-09  7:20 ` [Patch v6 18/22] perf/x86: Enable eGPRs sampling using sample_regs_* fields Dapeng Mi
2026-02-09  7:20 ` [Patch v6 19/22] perf/x86: Enable SSP " Dapeng Mi
2026-02-09  7:20 ` [Patch v6 20/22] perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS capability Dapeng Mi
2026-02-09  7:20 ` [Patch v6 21/22] perf/x86/intel: Enable arch-PEBS based SIMD/eGPRs/SSP sampling Dapeng Mi
2026-02-09  7:20 ` [Patch v6 22/22] perf/x86: Activate back-to-back NMI detection for arch-PEBS induced NMIs Dapeng Mi
2026-02-09  8:48 ` [Patch v6 00/22] Support SIMD/eGPRs/SSP registers sampling for perf Mi, Dapeng

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox