* [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf
@ 2025-12-03 6:54 Dapeng Mi
2025-12-03 6:54 ` [Patch v5 01/19] perf: Eliminate duplicate arch-specific function definitions Dapeng Mi
` (19 more replies)
0 siblings, 20 replies; 55+ messages in thread
From: Dapeng Mi @ 2025-12-03 6:54 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
Eranian Stephane
Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Dapeng Mi
Changes since V4:
- Rewrite some function comments and commit messages (Dave)
- Add arch-PEBS based SIMD/eGPRs/SSP sampling support (Patch 15/19)
- Fix "suspecious NMI" warnning observed on PTL/NVL P-core and DMR by
activating back-to-back NMI detection mechanism (Patch 16/19)
- Fix some minor issues on perf-tool patches (Patch 18/19)
Changes since V3:
- Drop the SIMD registers if an NMI hits kernel mode for REGS_USER.
- Only dump the available registers, rather than zeroing and dumping
the unavailable ones. The dumped registers may therefore be a subset
of the requested registers.
- Some minor updates to address Dapeng's comments in V3.
Changes since V2:
- Use the FPU format for the x86_pmu.ext_regs_mask as well
- Add a check before invoking xsaves_nmi()
- Add perf_simd_reg_check() to retrieve the number of available
registers. If the kernel fails to get the requested registers, e.g.,
XSAVES fails, nothing is dumped to userspace (V2 dumped all 0s).
- Add POC perf tool patches
Changes since V1:
- Apply the new interfaces to configure and dump the SIMD registers
- Utilize the existing FPU functions, e.g., xstate_calculate_size() and
get_xsave_addr().
Starting from Intel Ice Lake, XMM registers can be collected in a PEBS
record. Future arch-PEBS will include additional registers such as
YMM, ZMM, OPMASK, SSP and APX eGPRs, contingent on hardware support.
This patch set introduces a software solution that relaxes the hardware
requirement by using the XSAVES instruction to retrieve the requested
registers in the overflow handler. The feature is thus no longer limited
to PEBS events or specific platforms. While the hardware solution
remains preferable due to its lower overhead and higher accuracy, this
software approach provides a viable alternative.
The solution is theoretically compatible with all x86 platforms but is
currently enabled only on newer platforms: Sapphire Rapids and later
P-core server platforms, Sierra Forest and later E-core server
platforms, and recent client platforms such as Arrow Lake, Panther Lake
and Nova Lake.
Newly supported registers include YMM, ZMM, OPMASK, SSP, and APX eGPRs.
Due to space constraints in sample_regs_user/intr, new fields have been
introduced in the perf_event_attr structure to accommodate these
registers.
After a long discussion on V1
(https://lore.kernel.org/lkml/3f1c9a9e-cb63-47ff-a5e9-06555fa6cc9a@linux.intel.com/),
the following new fields are introduced:
@@ -543,6 +545,25 @@ struct perf_event_attr {
__u64 sig_data;
__u64 config3; /* extension of config2 */
+
+
+ /*
+ * Defines the set of SIMD registers to dump on samples.
+ * A non-zero sample_simd_regs_enabled implies the
+ * sample_simd_*_reg_* fields are used to configure all SIMD
+ * registers. If !sample_simd_regs_enabled, sample_regs_XXX
+ * may be used to configure some SIMD registers on x86.
+ */
+ union {
+ __u16 sample_simd_regs_enabled;
+ __u16 sample_simd_pred_reg_qwords;
+ };
+ __u32 sample_simd_pred_reg_intr;
+ __u32 sample_simd_pred_reg_user;
+ __u16 sample_simd_vec_reg_qwords;
+ __u64 sample_simd_vec_reg_intr;
+ __u64 sample_simd_vec_reg_user;
+ __u32 __reserved_4;
};
@@ -1016,7 +1037,15 @@ enum perf_event_type {
* } && PERF_SAMPLE_BRANCH_STACK
*
* { u64 abi; # enum perf_sample_regs_abi
- * u64 regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
+ * u64 regs[weight(mask)];
+ * struct {
+ * u16 nr_vectors;
+ * u16 vector_qwords;
+ * u16 nr_pred;
+ * u16 pred_qwords;
+ * u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
+ * } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
+ * } && PERF_SAMPLE_REGS_USER
*
* { u64 size;
* char data[size];
@@ -1043,7 +1072,15 @@ enum perf_event_type {
* { u64 data_src; } && PERF_SAMPLE_DATA_SRC
* { u64 transaction; } && PERF_SAMPLE_TRANSACTION
* { u64 abi; # enum perf_sample_regs_abi
- * u64 regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
+ * u64 regs[weight(mask)];
+ * struct {
+ * u16 nr_vectors;
+ * u16 vector_qwords;
+ * u16 nr_pred;
+ * u16 pred_qwords;
+ * u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
+ * } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
+ * } && PERF_SAMPLE_REGS_INTR
* { u64 phys_addr;} && PERF_SAMPLE_PHYS_ADDR
* { u64 cgroup;} && PERF_SAMPLE_CGROUP
* { u64 data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
To maintain simplicity, a single width field per register class,
sample_simd_{vec,pred}_reg_qwords, is introduced to indicate the
register width in 64-bit quadwords. For example:
- sample_simd_vec_reg_qwords = 2 for XMM registers (128 bits) on x86
- sample_simd_vec_reg_qwords = 4 for YMM registers (256 bits) on x86
Four additional fields, sample_simd_{vec,pred}_reg_{intr,user},
represent the bitmap of registers to sample. For instance, the bitmap
for the 16 x86 XMM registers is 0xffff. Although users can
theoretically sample a subset of registers, the current perf-tool
implementation samples all registers of each type to avoid complexity.
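For illustration, below is a minimal sketch (not part of the series) of
how user code might fill these fields to request ZMM0-31 and OPMASK0-7
for REGS_INTR, assuming a uapi header that already carries the new
fields; event selection, precise_ip and the mmap/read plumbing are
elided:

#include <string.h>
#include <linux/perf_event.h>

static void setup_simd_sampling(struct perf_event_attr *attr)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);		/* >= PERF_ATTR_SIZE_VER9 */
	attr->sample_type = PERF_SAMPLE_REGS_INTR;

	attr->sample_simd_vec_reg_qwords = 8;		/* ZMM: 512 bits */
	attr->sample_simd_vec_reg_intr = 0xffffffffULL;	/* ZMM0-31 */
	/* union with sample_simd_regs_enabled: non-zero enables SIMD mode */
	attr->sample_simd_pred_reg_qwords = 1;		/* OPMASK: 64 bits */
	attr->sample_simd_pred_reg_intr = 0xff;		/* OPMASK0-7 */
}

This configuration matches the perf-report dump shown later
(nr_vectors 32, vector_qwords 8, nr_pred 8, pred_qwords 1).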
A new ABI flag, PERF_SAMPLE_REGS_ABI_SIMD, is introduced to signal
user-space tools about the presence of SIMD registers in sampling
records. When this flag is set, tools should expect extra SIMD register
data following the general register data. The layout of the extra SIMD
register data is shown below.
u16 nr_vectors;
u16 vector_qwords;
u16 nr_pred;
u16 pred_qwords;
u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
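For example, a consumer could walk this block with a sketch like the
following (hypothetical helper, not the actual perf-tool parser; 'p'
points just past the general-purpose register payload):

#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Sketch: parse the SIMD block that follows the u64 regs[] payload
 * when (abi & PERF_SAMPLE_REGS_ABI_SIMD) is set.
 */
static const unsigned char *parse_simd_block(const unsigned char *p)
{
	uint16_t hdr[4];  /* nr_vectors, vector_qwords, nr_pred, pred_qwords */
	uint64_t val;

	memcpy(hdr, p, sizeof(hdr));
	p += sizeof(hdr);

	for (unsigned int i = 0; i < hdr[0]; i++) {
		for (unsigned int q = 0; q < hdr[1]; q++) {
			memcpy(&val, p, sizeof(val));
			p += sizeof(val);
			printf("VEC[%u] qword %u: 0x%016llx\n", i, q,
			       (unsigned long long)val);
		}
	}
	for (unsigned int i = 0; i < hdr[2]; i++) {
		for (unsigned int q = 0; q < hdr[3]; q++) {
			memcpy(&val, p, sizeof(val));
			p += sizeof(val);
			printf("PRED[%u] qword %u: 0x%016llx\n", i, q,
			       (unsigned long long)val);
		}
	}
	return p;	/* cursor now past the SIMD block */
}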
With this patch set, sampling for the aforementioned registers is
supported on the Intel Nova Lake platform.
Examples:
$perf record -I?
available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
$perf record --user-regs=?
available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
$perf record -e branches:p -Iax,bx,r8,r16,r31,ssp,xmm,ymm,zmm,opmask -c 100000 ./test
$perf report -D
... ...
14027761992115 0xcf30 [0x8a8]: PERF_RECORD_SAMPLE(IP, 0x1): 29964/29964:
0xffffffff9f085e24 period: 100000 addr: 0
... intr regs: mask 0x18001010003 ABI 64-bit
.... AX 0xdffffc0000000000
.... BX 0xffff8882297685e8
.... R8 0x0000000000000000
.... R16 0x0000000000000000
.... R31 0x0000000000000000
.... SSP 0x0000000000000000
... SIMD ABI nr_vectors 32 vector_qwords 8 nr_pred 8 pred_qwords 1
.... ZMM [0] 0xffffffffffffffff
.... ZMM [0] 0x0000000000000001
.... ZMM [0] 0x0000000000000000
.... ZMM [0] 0x0000000000000000
.... ZMM [0] 0x0000000000000000
.... ZMM [0] 0x0000000000000000
.... ZMM [0] 0x0000000000000000
.... ZMM [0] 0x0000000000000000
.... ZMM [1] 0x003a6b6165506d56
... ...
.... ZMM [31] 0x0000000000000000
.... ZMM [31] 0x0000000000000000
.... ZMM [31] 0x0000000000000000
.... ZMM [31] 0x0000000000000000
.... ZMM [31] 0x0000000000000000
.... ZMM [31] 0x0000000000000000
.... ZMM [31] 0x0000000000000000
.... ZMM [31] 0x0000000000000000
.... OPMASK[0] 0x00000000fffffe00
.... OPMASK[1] 0x0000000000ffffff
.... OPMASK[2] 0x000000000000007f
.... OPMASK[3] 0x0000000000000000
.... OPMASK[4] 0x0000000000010080
.... OPMASK[5] 0x0000000000000000
.... OPMASK[6] 0x0000400004000000
.... OPMASK[7] 0x0000000000000000
... ...
History:
v4: https://lore.kernel.org/all/20250925061213.178796-1-dapeng1.mi@linux.intel.com/
v3: https://lore.kernel.org/lkml/20250815213435.1702022-1-kan.liang@linux.intel.com/
v2: https://lore.kernel.org/lkml/20250626195610.405379-1-kan.liang@linux.intel.com/
v1: https://lore.kernel.org/lkml/20250613134943.3186517-1-kan.liang@linux.intel.com/
Dapeng Mi (3):
perf: Eliminate duplicate arch-specific function definitions
perf/x86/intel: Enable arch-PEBS based SIMD/eGPRs/SSP sampling
perf/x86: Activate back-to-back NMI detection for arch-PEBS induced
NMIs
Kan Liang (16):
perf/x86: Use x86_perf_regs in the x86 nmi handler
perf/x86: Introduce x86-specific x86_pmu_setup_regs_data()
x86/fpu/xstate: Add xsaves_nmi() helper
perf: Move and rename has_extended_regs() for ARCH-specific use
perf/x86: Add support for XMM registers in non-PEBS and REGS_USER
perf: Add sampling support for SIMD registers
perf/x86: Enable XMM sampling using sample_simd_vec_reg_* fields
perf/x86: Enable YMM sampling using sample_simd_vec_reg_* fields
perf/x86: Enable ZMM sampling using sample_simd_vec_reg_* fields
perf/x86: Enable OPMASK sampling using sample_simd_pred_reg_* fields
perf/x86: Enable eGPRs sampling using sample_regs_* fields
perf/x86: Enable SSP sampling using sample_regs_* fields
perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS capability
perf headers: Sync with the kernel headers
perf parse-regs: Support new SIMD sampling format
perf regs: Enable dumping of SIMD registers
arch/arm/kernel/perf_regs.c | 8 +-
arch/arm64/kernel/perf_regs.c | 8 +-
arch/csky/kernel/perf_regs.c | 8 +-
arch/loongarch/kernel/perf_regs.c | 8 +-
arch/mips/kernel/perf_regs.c | 8 +-
arch/parisc/kernel/perf_regs.c | 8 +-
arch/powerpc/perf/perf_regs.c | 2 +-
arch/riscv/kernel/perf_regs.c | 8 +-
arch/s390/kernel/perf_regs.c | 2 +-
arch/x86/events/core.c | 326 +++++++++++-
arch/x86/events/intel/core.c | 117 ++++-
arch/x86/events/intel/ds.c | 134 ++++-
arch/x86/events/perf_event.h | 85 +++-
arch/x86/include/asm/fpu/xstate.h | 3 +
arch/x86/include/asm/msr-index.h | 7 +
arch/x86/include/asm/perf_event.h | 38 +-
arch/x86/include/uapi/asm/perf_regs.h | 62 +++
arch/x86/kernel/fpu/xstate.c | 25 +-
arch/x86/kernel/perf_regs.c | 131 ++++-
include/linux/perf_event.h | 16 +
include/linux/perf_regs.h | 36 +-
include/uapi/linux/perf_event.h | 45 +-
kernel/events/core.c | 132 ++++-
tools/arch/x86/include/uapi/asm/perf_regs.h | 62 +++
tools/include/uapi/linux/perf_event.h | 45 +-
tools/perf/arch/x86/util/perf_regs.c | 470 +++++++++++++++++-
tools/perf/util/evsel.c | 47 ++
tools/perf/util/parse-regs-options.c | 151 +++++-
.../perf/util/perf-regs-arch/perf_regs_x86.c | 43 ++
tools/perf/util/perf_event_attr_fprintf.c | 6 +
tools/perf/util/perf_regs.c | 59 +++
tools/perf/util/perf_regs.h | 11 +
tools/perf/util/record.h | 6 +
tools/perf/util/sample.h | 10 +
tools/perf/util/session.c | 78 ++-
35 files changed, 2012 insertions(+), 193 deletions(-)
base-commit: 9929dffce5ed7e2988e0274f4db98035508b16d9
prerequisite-patch-id: a15bcd62a8dcd219d17489eef88b66ea5488a2a0
--
2.34.1
^ permalink raw reply [flat|nested] 55+ messages in thread
* [Patch v5 01/19] perf: Eliminate duplicate arch-specific function definitions
2025-12-03 6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
@ 2025-12-03 6:54 ` Dapeng Mi
2025-12-03 6:54 ` [Patch v5 02/19] perf/x86: Use x86_perf_regs in the x86 nmi handler Dapeng Mi
` (18 subsequent siblings)
19 siblings, 0 replies; 55+ messages in thread
From: Dapeng Mi @ 2025-12-03 6:54 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
Eranian Stephane
Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Dapeng Mi
Define default common __weak functions for perf_reg_value(),
perf_reg_validate(), perf_reg_abi() and perf_get_regs_user(). This helps
eliminate the duplicated arch-specific definitions.
No functional change intended.
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/arm/kernel/perf_regs.c | 6 ------
arch/arm64/kernel/perf_regs.c | 6 ------
arch/csky/kernel/perf_regs.c | 6 ------
arch/loongarch/kernel/perf_regs.c | 6 ------
arch/mips/kernel/perf_regs.c | 6 ------
arch/parisc/kernel/perf_regs.c | 6 ------
arch/riscv/kernel/perf_regs.c | 6 ------
arch/x86/kernel/perf_regs.c | 6 ------
include/linux/perf_regs.h | 32 ++++++-------------------------
kernel/events/core.c | 22 +++++++++++++++++++++
10 files changed, 28 insertions(+), 74 deletions(-)
diff --git a/arch/arm/kernel/perf_regs.c b/arch/arm/kernel/perf_regs.c
index 0529f90395c9..d575a4c3ca56 100644
--- a/arch/arm/kernel/perf_regs.c
+++ b/arch/arm/kernel/perf_regs.c
@@ -31,9 +31,3 @@ u64 perf_reg_abi(struct task_struct *task)
return PERF_SAMPLE_REGS_ABI_32;
}
-void perf_get_regs_user(struct perf_regs *regs_user,
- struct pt_regs *regs)
-{
- regs_user->regs = task_pt_regs(current);
- regs_user->abi = perf_reg_abi(current);
-}
diff --git a/arch/arm64/kernel/perf_regs.c b/arch/arm64/kernel/perf_regs.c
index b4eece3eb17d..70e2f13f587f 100644
--- a/arch/arm64/kernel/perf_regs.c
+++ b/arch/arm64/kernel/perf_regs.c
@@ -98,9 +98,3 @@ u64 perf_reg_abi(struct task_struct *task)
return PERF_SAMPLE_REGS_ABI_64;
}
-void perf_get_regs_user(struct perf_regs *regs_user,
- struct pt_regs *regs)
-{
- regs_user->regs = task_pt_regs(current);
- regs_user->abi = perf_reg_abi(current);
-}
diff --git a/arch/csky/kernel/perf_regs.c b/arch/csky/kernel/perf_regs.c
index 09b7f88a2d6a..94601f37b596 100644
--- a/arch/csky/kernel/perf_regs.c
+++ b/arch/csky/kernel/perf_regs.c
@@ -31,9 +31,3 @@ u64 perf_reg_abi(struct task_struct *task)
return PERF_SAMPLE_REGS_ABI_32;
}
-void perf_get_regs_user(struct perf_regs *regs_user,
- struct pt_regs *regs)
-{
- regs_user->regs = task_pt_regs(current);
- regs_user->abi = perf_reg_abi(current);
-}
diff --git a/arch/loongarch/kernel/perf_regs.c b/arch/loongarch/kernel/perf_regs.c
index 263ac4ab5af6..8dd604f01745 100644
--- a/arch/loongarch/kernel/perf_regs.c
+++ b/arch/loongarch/kernel/perf_regs.c
@@ -45,9 +45,3 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
return regs->regs[idx];
}
-void perf_get_regs_user(struct perf_regs *regs_user,
- struct pt_regs *regs)
-{
- regs_user->regs = task_pt_regs(current);
- regs_user->abi = perf_reg_abi(current);
-}
diff --git a/arch/mips/kernel/perf_regs.c b/arch/mips/kernel/perf_regs.c
index e686780d1647..7736d3c5ebd2 100644
--- a/arch/mips/kernel/perf_regs.c
+++ b/arch/mips/kernel/perf_regs.c
@@ -60,9 +60,3 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
return (s64)v; /* Sign extend if 32-bit. */
}
-void perf_get_regs_user(struct perf_regs *regs_user,
- struct pt_regs *regs)
-{
- regs_user->regs = task_pt_regs(current);
- regs_user->abi = perf_reg_abi(current);
-}
diff --git a/arch/parisc/kernel/perf_regs.c b/arch/parisc/kernel/perf_regs.c
index 68458e2f6197..87e6990569a7 100644
--- a/arch/parisc/kernel/perf_regs.c
+++ b/arch/parisc/kernel/perf_regs.c
@@ -53,9 +53,3 @@ u64 perf_reg_abi(struct task_struct *task)
return PERF_SAMPLE_REGS_ABI_64;
}
-void perf_get_regs_user(struct perf_regs *regs_user,
- struct pt_regs *regs)
-{
- regs_user->regs = task_pt_regs(current);
- regs_user->abi = perf_reg_abi(current);
-}
diff --git a/arch/riscv/kernel/perf_regs.c b/arch/riscv/kernel/perf_regs.c
index fd304a248de6..3bba8deababb 100644
--- a/arch/riscv/kernel/perf_regs.c
+++ b/arch/riscv/kernel/perf_regs.c
@@ -35,9 +35,3 @@ u64 perf_reg_abi(struct task_struct *task)
#endif
}
-void perf_get_regs_user(struct perf_regs *regs_user,
- struct pt_regs *regs)
-{
- regs_user->regs = task_pt_regs(current);
- regs_user->abi = perf_reg_abi(current);
-}
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 624703af80a1..81204cb7f723 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -100,12 +100,6 @@ u64 perf_reg_abi(struct task_struct *task)
return PERF_SAMPLE_REGS_ABI_32;
}
-void perf_get_regs_user(struct perf_regs *regs_user,
- struct pt_regs *regs)
-{
- regs_user->regs = task_pt_regs(current);
- regs_user->abi = perf_reg_abi(current);
-}
#else /* CONFIG_X86_64 */
#define REG_NOSUPPORT ((1ULL << PERF_REG_X86_DS) | \
(1ULL << PERF_REG_X86_ES) | \
diff --git a/include/linux/perf_regs.h b/include/linux/perf_regs.h
index f632c5725f16..144bcc3ff19f 100644
--- a/include/linux/perf_regs.h
+++ b/include/linux/perf_regs.h
@@ -9,6 +9,12 @@ struct perf_regs {
struct pt_regs *regs;
};
+u64 perf_reg_value(struct pt_regs *regs, int idx);
+int perf_reg_validate(u64 mask);
+u64 perf_reg_abi(struct task_struct *task);
+void perf_get_regs_user(struct perf_regs *regs_user,
+ struct pt_regs *regs);
+
#ifdef CONFIG_HAVE_PERF_REGS
#include <asm/perf_regs.h>
@@ -16,35 +22,9 @@ struct perf_regs {
#define PERF_REG_EXTENDED_MASK 0
#endif
-u64 perf_reg_value(struct pt_regs *regs, int idx);
-int perf_reg_validate(u64 mask);
-u64 perf_reg_abi(struct task_struct *task);
-void perf_get_regs_user(struct perf_regs *regs_user,
- struct pt_regs *regs);
#else
#define PERF_REG_EXTENDED_MASK 0
-static inline u64 perf_reg_value(struct pt_regs *regs, int idx)
-{
- return 0;
-}
-
-static inline int perf_reg_validate(u64 mask)
-{
- return mask ? -ENOSYS : 0;
-}
-
-static inline u64 perf_reg_abi(struct task_struct *task)
-{
- return PERF_SAMPLE_REGS_ABI_NONE;
-}
-
-static inline void perf_get_regs_user(struct perf_regs *regs_user,
- struct pt_regs *regs)
-{
- regs_user->regs = task_pt_regs(current);
- regs_user->abi = perf_reg_abi(current);
-}
#endif /* CONFIG_HAVE_PERF_REGS */
#endif /* _LINUX_PERF_REGS_H */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f6a08c73f783..efc938c6a2be 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7431,6 +7431,28 @@ unsigned long perf_instruction_pointer(struct perf_event *event,
return perf_arch_instruction_pointer(regs);
}
+u64 __weak perf_reg_value(struct pt_regs *regs, int idx)
+{
+ return 0;
+}
+
+int __weak perf_reg_validate(u64 mask)
+{
+ return mask ? -ENOSYS : 0;
+}
+
+u64 __weak perf_reg_abi(struct task_struct *task)
+{
+ return PERF_SAMPLE_REGS_ABI_NONE;
+}
+
+void __weak perf_get_regs_user(struct perf_regs *regs_user,
+ struct pt_regs *regs)
+{
+ regs_user->regs = task_pt_regs(current);
+ regs_user->abi = perf_reg_abi(current);
+}
+
static void
perf_output_sample_regs(struct perf_output_handle *handle,
struct pt_regs *regs, u64 mask)
--
2.34.1
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [Patch v5 02/19] perf/x86: Use x86_perf_regs in the x86 nmi handler
2025-12-03 6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
2025-12-03 6:54 ` [Patch v5 01/19] perf: Eliminate duplicate arch-specific function definitions Dapeng Mi
@ 2025-12-03 6:54 ` Dapeng Mi
2025-12-03 6:54 ` [Patch v5 03/19] perf/x86: Introduce x86-specific x86_pmu_setup_regs_data() Dapeng Mi
` (17 subsequent siblings)
19 siblings, 0 replies; 55+ messages in thread
From: Dapeng Mi @ 2025-12-03 6:54 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
Eranian Stephane
Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang, Dapeng Mi
From: Kan Liang <kan.liang@linux.intel.com>
More and more registers will be supported in the overflow handler,
e.g., more vector registers, SSP, etc. The generic pt_regs struct
cannot store all of them. Use the x86-specific x86_perf_regs instead.
The struct pt_regs *regs is still passed to x86_pmu_handle_irq(). There
is no functional change for the existing code.
AMD IBS's NMI handler doesn't utilize the static call
x86_pmu_handle_irq(), so the x86_perf_regs struct doesn't apply to AMD
IBS. Support can be added separately later when AMD IBS handles more
registers.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/core.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 5d0d5e466c62..ef3bf8fbc97f 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1762,6 +1762,7 @@ void perf_events_lapic_init(void)
static int
perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
{
+ struct x86_perf_regs x86_regs;
u64 start_clock;
u64 finish_clock;
int ret;
@@ -1774,7 +1775,8 @@ perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
return NMI_DONE;
start_clock = sched_clock();
- ret = static_call(x86_pmu_handle_irq)(regs);
+ x86_regs.regs = *regs;
+ ret = static_call(x86_pmu_handle_irq)(&x86_regs.regs);
finish_clock = sched_clock();
perf_sample_event_took(finish_clock - start_clock);
--
2.34.1
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [Patch v5 03/19] perf/x86: Introduce x86-specific x86_pmu_setup_regs_data()
2025-12-03 6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
2025-12-03 6:54 ` [Patch v5 01/19] perf: Eliminate duplicate arch-specific function definitions Dapeng Mi
2025-12-03 6:54 ` [Patch v5 02/19] perf/x86: Use x86_perf_regs in the x86 nmi handler Dapeng Mi
@ 2025-12-03 6:54 ` Dapeng Mi
2025-12-03 6:54 ` [Patch v5 04/19] x86/fpu/xstate: Add xsaves_nmi() helper Dapeng Mi
` (16 subsequent siblings)
19 siblings, 0 replies; 55+ messages in thread
From: Dapeng Mi @ 2025-12-03 6:54 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
Eranian Stephane
Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang, Dapeng Mi
From: Kan Liang <kan.liang@linux.intel.com>
The current perf/x86 implementation uses the generic functions
perf_sample_regs_user() and perf_sample_regs_intr() to set up register
data for sampling records. While this approach works for general
registers, it falls short when adding sampling support for SIMD and APX
eGPR registers on x86 platforms.
To address this, we introduce the x86-specific function
x86_pmu_setup_regs_data() for setting up register data on x86 platforms.
At present, x86_pmu_setup_regs_data() mirrors the logic of the generic
functions perf_sample_regs_user() and perf_sample_regs_intr().
Subsequent patches will introduce x86-specific enhancements.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/core.c | 32 ++++++++++++++++++++++++++++++++
arch/x86/events/intel/ds.c | 9 ++++++---
arch/x86/events/perf_event.h | 4 ++++
3 files changed, 42 insertions(+), 3 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index ef3bf8fbc97f..dcdd2c2d68ee 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1695,6 +1695,38 @@ static void x86_pmu_del(struct perf_event *event, int flags)
static_call_cond(x86_pmu_del)(event);
}
+void x86_pmu_setup_regs_data(struct perf_event *event,
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
+{
+ u64 sample_type = event->attr.sample_type;
+
+ if (sample_type & PERF_SAMPLE_REGS_USER) {
+ if (user_mode(regs)) {
+ data->regs_user.abi = perf_reg_abi(current);
+ data->regs_user.regs = regs;
+ } else if (!(current->flags & PF_KTHREAD)) {
+ perf_get_regs_user(&data->regs_user, regs);
+ } else {
+ data->regs_user.abi = PERF_SAMPLE_REGS_ABI_NONE;
+ data->regs_user.regs = NULL;
+ }
+ data->dyn_size += sizeof(u64);
+ if (data->regs_user.regs)
+ data->dyn_size += hweight64(event->attr.sample_regs_user) * sizeof(u64);
+ data->sample_flags |= PERF_SAMPLE_REGS_USER;
+ }
+
+ if (sample_type & PERF_SAMPLE_REGS_INTR) {
+ data->regs_intr.regs = regs;
+ data->regs_intr.abi = perf_reg_abi(current);
+ data->dyn_size += sizeof(u64);
+ if (data->regs_intr.regs)
+ data->dyn_size += hweight64(event->attr.sample_regs_intr) * sizeof(u64);
+ data->sample_flags |= PERF_SAMPLE_REGS_INTR;
+ }
+}
+
int x86_pmu_handle_irq(struct pt_regs *regs)
{
struct perf_sample_data data;
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 2e170f2093ac..c7351f476d8c 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -2180,6 +2180,7 @@ static inline void __setup_pebs_basic_group(struct perf_event *event,
}
static inline void __setup_pebs_gpr_group(struct perf_event *event,
+ struct perf_sample_data *data,
struct pt_regs *regs,
struct pebs_gprs *gprs,
u64 sample_type)
@@ -2189,8 +2190,10 @@ static inline void __setup_pebs_gpr_group(struct perf_event *event,
regs->flags &= ~PERF_EFLAGS_EXACT;
}
- if (sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER))
+ if (sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)) {
adaptive_pebs_save_regs(regs, gprs);
+ x86_pmu_setup_regs_data(event, data, regs);
+ }
}
static inline void __setup_pebs_meminfo_group(struct perf_event *event,
@@ -2283,7 +2286,7 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
gprs = next_record;
next_record = gprs + 1;
- __setup_pebs_gpr_group(event, regs, gprs, sample_type);
+ __setup_pebs_gpr_group(event, data, regs, gprs, sample_type);
}
if (format_group & PEBS_DATACFG_MEMINFO) {
@@ -2407,7 +2410,7 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
gprs = next_record;
next_record = gprs + 1;
- __setup_pebs_gpr_group(event, regs,
+ __setup_pebs_gpr_group(event, data, regs,
(struct pebs_gprs *)gprs,
sample_type);
}
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 3161ec0a3416..80e52e937638 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1294,6 +1294,10 @@ void x86_pmu_enable_event(struct perf_event *event);
int x86_pmu_handle_irq(struct pt_regs *regs);
+void x86_pmu_setup_regs_data(struct perf_event *event,
+ struct perf_sample_data *data,
+ struct pt_regs *regs);
+
void x86_pmu_show_pmu_cap(struct pmu *pmu);
static inline int x86_pmu_num_counters(struct pmu *pmu)
--
2.34.1
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [Patch v5 04/19] x86/fpu/xstate: Add xsaves_nmi() helper
2025-12-03 6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
` (2 preceding siblings ...)
2025-12-03 6:54 ` [Patch v5 03/19] perf/x86: Introduce x86-specific x86_pmu_setup_regs_data() Dapeng Mi
@ 2025-12-03 6:54 ` Dapeng Mi
2025-12-03 6:54 ` [Patch v5 05/19] perf: Move and rename has_extended_regs() for ARCH-specific use Dapeng Mi
` (15 subsequent siblings)
19 siblings, 0 replies; 55+ messages in thread
From: Dapeng Mi @ 2025-12-03 6:54 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
Eranian Stephane
Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang, Dapeng Mi
From: Kan Liang <kan.liang@linux.intel.com>
Add xsaves_nmi() to save the supported xsave states in an NMI handler.
This function is similar to xsaves(), but should only be called within
an NMI handler. It returns the actual register contents at the moment
the NMI occurs.
Currently the perf subsystem is the sole user of this helper. It uses
this function to snapshot the SIMD (XMM/YMM/ZMM) and APX eGPR
registers; sampling support for them is added in subsequent patches.
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/include/asm/fpu/xstate.h | 1 +
arch/x86/kernel/fpu/xstate.c | 23 +++++++++++++++++++++++
2 files changed, 24 insertions(+)
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 7a7dc9d56027..38fa8ff26559 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -110,6 +110,7 @@ int xfeature_size(int xfeature_nr);
void xsaves(struct xregs_state *xsave, u64 mask);
void xrstors(struct xregs_state *xsave, u64 mask);
+void xsaves_nmi(struct xregs_state *xsave, u64 mask);
int xfd_enable_feature(u64 xfd_err);
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 28e4fd65c9da..e3b8afed8b2c 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1474,6 +1474,29 @@ void xrstors(struct xregs_state *xstate, u64 mask)
WARN_ON_ONCE(err);
}
+/**
+ * xsaves_nmi - Save selected components to a kernel xstate buffer in NMI
+ * @xstate: Pointer to the buffer
+ * @mask: Feature mask to select the components to save
+ *
+ * This function is similar to xsaves(), but should only be called within
+ * an NMI handler. It returns the actual register contents at
+ * the moment the NMI occurs.
+ *
+ * Currently, the perf subsystem is the sole user of this helper. It uses
+ * the function to snapshot the SIMD (XMM/YMM/ZMM) and APX eGPR registers.
+ */
+void xsaves_nmi(struct xregs_state *xstate, u64 mask)
+{
+ int err;
+
+ if (!in_nmi())
+ return;
+
+ XSTATE_OP(XSAVES, xstate, (u32)mask, (u32)(mask >> 32), err);
+ WARN_ON_ONCE(err);
+}
+
#if IS_ENABLED(CONFIG_KVM)
void fpstate_clear_xstate_component(struct fpstate *fpstate, unsigned int xfeature)
{
--
2.34.1
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [Patch v5 05/19] perf: Move and rename has_extended_regs() for ARCH-specific use
2025-12-03 6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
` (3 preceding siblings ...)
2025-12-03 6:54 ` [Patch v5 04/19] x86/fpu/xstate: Add xsaves_nmi() helper Dapeng Mi
@ 2025-12-03 6:54 ` Dapeng Mi
2025-12-03 6:54 ` [Patch v5 06/19] perf/x86: Add support for XMM registers in non-PEBS and REGS_USER Dapeng Mi
` (14 subsequent siblings)
19 siblings, 0 replies; 55+ messages in thread
From: Dapeng Mi @ 2025-12-03 6:54 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
Eranian Stephane
Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang, Dapeng Mi
From: Kan Liang <kan.liang@linux.intel.com>
The has_extended_regs() function will be utilized in arch-specific
code. To facilitate this, move it to the header file perf_event.h.
Additionally, the function is renamed to event_has_extended_regs(),
which aligns with the existing naming conventions.
No functional change intended.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
include/linux/perf_event.h | 8 ++++++++
kernel/events/core.c | 8 +-------
2 files changed, 9 insertions(+), 7 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 9870d768db4c..5153b70d09c8 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1526,6 +1526,14 @@ perf_event__output_id_sample(struct perf_event *event,
extern void
perf_log_lost_samples(struct perf_event *event, u64 lost);
+static inline bool event_has_extended_regs(struct perf_event *event)
+{
+ struct perf_event_attr *attr = &event->attr;
+
+ return (attr->sample_regs_user & PERF_REG_EXTENDED_MASK) ||
+ (attr->sample_regs_intr & PERF_REG_EXTENDED_MASK);
+}
+
static inline bool event_has_any_exclude_flag(struct perf_event *event)
{
struct perf_event_attr *attr = &event->attr;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index efc938c6a2be..3e9c48fa2202 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -12664,12 +12664,6 @@ int perf_pmu_unregister(struct pmu *pmu)
}
EXPORT_SYMBOL_GPL(perf_pmu_unregister);
-static inline bool has_extended_regs(struct perf_event *event)
-{
- return (event->attr.sample_regs_user & PERF_REG_EXTENDED_MASK) ||
- (event->attr.sample_regs_intr & PERF_REG_EXTENDED_MASK);
-}
-
static int perf_try_init_event(struct pmu *pmu, struct perf_event *event)
{
struct perf_event_context *ctx = NULL;
@@ -12704,7 +12698,7 @@ static int perf_try_init_event(struct pmu *pmu, struct perf_event *event)
goto err_pmu;
if (!(pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS) &&
- has_extended_regs(event)) {
+ event_has_extended_regs(event)) {
ret = -EOPNOTSUPP;
goto err_destroy;
}
--
2.34.1
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [Patch v5 06/19] perf/x86: Add support for XMM registers in non-PEBS and REGS_USER
2025-12-03 6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
` (4 preceding siblings ...)
2025-12-03 6:54 ` [Patch v5 05/19] perf: Move and rename has_extended_regs() for ARCH-specific use Dapeng Mi
@ 2025-12-03 6:54 ` Dapeng Mi
2025-12-04 15:17 ` Peter Zijlstra
2025-12-03 6:54 ` [Patch v5 07/19] perf: Add sampling support for SIMD registers Dapeng Mi
` (13 subsequent siblings)
19 siblings, 1 reply; 55+ messages in thread
From: Dapeng Mi @ 2025-12-03 6:54 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
Eranian Stephane
Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang, Dapeng Mi
From: Kan Liang <kan.liang@linux.intel.com>
While collecting XMM registers in a PEBS record has been supported since
Ice Lake, non-PEBS events have lacked this feature. By leveraging the
XSAVES instruction, it is now possible to snapshot XMM registers for
non-PEBS events as well, completing the feature set.
To use the XSAVES instruction, a 64-byte-aligned buffer is required.
A per-CPU ext_regs_buf is added to store SIMD and other registers; the
buffer size is approximately 2K. The buffer is allocated using
kzalloc_node(), and since kmalloc() allocations of power-of-2 sizes are
naturally aligned, the required 64-byte alignment is guaranteed.
XMM sampling support is extended for both REGS_USER and REGS_INTR.
For REGS_USER, perf_get_regs_user() returns the registers from
task_pt_regs(current), which is a pt_regs structure. They need to be
copied into the user-space-specific x86_user_regs structure, since the
kernel may modify the pt_regs structure later.
For PEBS, XMM registers are still retrieved from PEBS records.
In cases where a userspace task is trapped in kernel mode (e.g.,
during a syscall) when an NMI arrives, pt_regs information can still be
retrieved from task_pt_regs(). However, capturing SIMD and other
XSAVE-based registers in this scenario is challenging, so snapshots of
these registers are omitted in such cases.
The reasons are:
- Profiling a userspace task that requires SIMD/eGPR registers typically
involves NMIs hitting userspace, not kernel mode.
- Although it is possible to retrieve values when the TIF_NEED_FPU_LOAD
flag is set, the complexity introduced to handle this uncommon case in
the critical path is not justified.
- Additionally, checking the TIF_NEED_FPU_LOAD flag alone is insufficient.
Some corner cases, such as an NMI occurring just after the flag switches
but still in kernel mode, cannot be handled.
Future support for additional vector registers is anticipated.
An ext_regs_mask is added to track the supported vector register groups.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/core.c | 175 ++++++++++++++++++++++++++----
arch/x86/events/intel/core.c | 29 ++++-
arch/x86/events/intel/ds.c | 20 ++--
arch/x86/events/perf_event.h | 11 +-
arch/x86/include/asm/fpu/xstate.h | 2 +
arch/x86/include/asm/perf_event.h | 5 +-
arch/x86/kernel/fpu/xstate.c | 2 +-
7 files changed, 212 insertions(+), 32 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index dcdd2c2d68ee..0d33668b1927 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -406,6 +406,62 @@ set_ext_hw_attr(struct hw_perf_event *hwc, struct perf_event *event)
return x86_pmu_extra_regs(val, event);
}
+static DEFINE_PER_CPU(struct xregs_state *, ext_regs_buf);
+
+static void x86_pmu_get_ext_regs(struct x86_perf_regs *perf_regs, u64 mask)
+{
+ struct xregs_state *xsave = per_cpu(ext_regs_buf, smp_processor_id());
+ u64 valid_mask = x86_pmu.ext_regs_mask & mask;
+
+ if (WARN_ON_ONCE(!xsave))
+ return;
+
+ xsaves_nmi(xsave, valid_mask);
+
+ /* Filtered by what XSAVE really gives */
+ valid_mask &= xsave->header.xfeatures;
+
+ if (valid_mask & XFEATURE_MASK_SSE)
+ perf_regs->xmm_space = xsave->i387.xmm_space;
+}
+
+static void release_ext_regs_buffers(void)
+{
+ int cpu;
+
+ if (!x86_pmu.ext_regs_mask)
+ return;
+
+ for_each_possible_cpu(cpu) {
+ kfree(per_cpu(ext_regs_buf, cpu));
+ per_cpu(ext_regs_buf, cpu) = NULL;
+ }
+}
+
+static void reserve_ext_regs_buffers(void)
+{
+ bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED);
+ unsigned int size;
+ int cpu;
+
+ if (!x86_pmu.ext_regs_mask)
+ return;
+
+ size = xstate_calculate_size(x86_pmu.ext_regs_mask, compacted);
+
+ for_each_possible_cpu(cpu) {
+ per_cpu(ext_regs_buf, cpu) = kzalloc_node(size, GFP_KERNEL,
+ cpu_to_node(cpu));
+ if (!per_cpu(ext_regs_buf, cpu))
+ goto err;
+ }
+
+ return;
+
+err:
+ release_ext_regs_buffers();
+}
+
int x86_reserve_hardware(void)
{
int err = 0;
@@ -418,6 +474,7 @@ int x86_reserve_hardware(void)
} else {
reserve_ds_buffers();
reserve_lbr_buffers();
+ reserve_ext_regs_buffers();
}
}
if (!err)
@@ -434,6 +491,7 @@ void x86_release_hardware(void)
release_pmc_hardware();
release_ds_buffers();
release_lbr_buffers();
+ release_ext_regs_buffers();
mutex_unlock(&pmc_reserve_mutex);
}
}
@@ -651,19 +709,17 @@ int x86_pmu_hw_config(struct perf_event *event)
return -EINVAL;
}
- /* sample_regs_user never support XMM registers */
- if (unlikely(event->attr.sample_regs_user & PERF_REG_EXTENDED_MASK))
- return -EINVAL;
- /*
- * Besides the general purpose registers, XMM registers may
- * be collected in PEBS on some platforms, e.g. Icelake
- */
- if (unlikely(event->attr.sample_regs_intr & PERF_REG_EXTENDED_MASK)) {
- if (!(event->pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS))
- return -EINVAL;
-
- if (!event->attr.precise_ip)
- return -EINVAL;
+ if (event->attr.sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)) {
+ /*
+ * Besides the general purpose registers, XMM registers may
+ * be collected as well.
+ */
+ if (event_has_extended_regs(event)) {
+ if (!(event->pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS))
+ return -EINVAL;
+ if (!event->attr.precise_ip)
+ return -EINVAL;
+ }
}
return x86_setup_perfctr(event);
@@ -1695,38 +1751,115 @@ static void x86_pmu_del(struct perf_event *event, int flags)
static_call_cond(x86_pmu_del)(event);
}
-void x86_pmu_setup_regs_data(struct perf_event *event,
- struct perf_sample_data *data,
- struct pt_regs *regs)
+static DEFINE_PER_CPU(struct x86_perf_regs, x86_user_regs);
+
+static struct x86_perf_regs *
+x86_pmu_perf_get_regs_user(struct perf_sample_data *data,
+ struct pt_regs *regs)
+{
+ struct x86_perf_regs *x86_regs_user = this_cpu_ptr(&x86_user_regs);
+ struct perf_regs regs_user;
+
+ perf_get_regs_user(®s_user, regs);
+ data->regs_user.abi = regs_user.abi;
+ if (regs_user.regs) {
+ x86_regs_user->regs = *regs_user.regs;
+ data->regs_user.regs = &x86_regs_user->regs;
+ } else
+ data->regs_user.regs = NULL;
+ return x86_regs_user;
+}
+
+static bool x86_pmu_user_req_pt_regs_only(struct perf_event *event)
{
- u64 sample_type = event->attr.sample_type;
+ return !(event->attr.sample_regs_user & PERF_REG_EXTENDED_MASK);
+}
+
+inline void x86_pmu_clear_perf_regs(struct pt_regs *regs)
+{
+ struct x86_perf_regs *perf_regs = container_of(regs, struct x86_perf_regs, regs);
+
+ perf_regs->xmm_regs = NULL;
+}
+
+static void x86_pmu_setup_basic_regs_data(struct perf_event *event,
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
+{
+ struct perf_event_attr *attr = &event->attr;
+ u64 sample_type = attr->sample_type;
+ struct x86_perf_regs *perf_regs;
+
+ if (!(attr->sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)))
+ return;
if (sample_type & PERF_SAMPLE_REGS_USER) {
+ perf_regs = container_of(regs, struct x86_perf_regs, regs);
+
if (user_mode(regs)) {
data->regs_user.abi = perf_reg_abi(current);
data->regs_user.regs = regs;
- } else if (!(current->flags & PF_KTHREAD)) {
- perf_get_regs_user(&data->regs_user, regs);
+ } else if (!(current->flags & PF_KTHREAD) &&
+ x86_pmu_user_req_pt_regs_only(event)) {
+ /*
+ * There is no guarantee that the kernel never touches
+ * registers outside of pt_regs, especially as more and
+ * more registers (e.g., SIMD, eGPRs) are added. The live
+ * data cannot be used.
+ * Dump the registers only when just pt_regs is required.
+ */
+ perf_regs = x86_pmu_perf_get_regs_user(data, regs);
} else {
data->regs_user.abi = PERF_SAMPLE_REGS_ABI_NONE;
data->regs_user.regs = NULL;
}
data->dyn_size += sizeof(u64);
if (data->regs_user.regs)
- data->dyn_size += hweight64(event->attr.sample_regs_user) * sizeof(u64);
+ data->dyn_size += hweight64(attr->sample_regs_user) * sizeof(u64);
data->sample_flags |= PERF_SAMPLE_REGS_USER;
}
if (sample_type & PERF_SAMPLE_REGS_INTR) {
+ perf_regs = container_of(regs, struct x86_perf_regs, regs);
+
data->regs_intr.regs = regs;
data->regs_intr.abi = perf_reg_abi(current);
data->dyn_size += sizeof(u64);
if (data->regs_intr.regs)
- data->dyn_size += hweight64(event->attr.sample_regs_intr) * sizeof(u64);
+ data->dyn_size += hweight64(attr->sample_regs_intr) * sizeof(u64);
data->sample_flags |= PERF_SAMPLE_REGS_INTR;
}
}
+static void x86_pmu_sample_ext_regs(struct perf_event *event,
+ struct pt_regs *regs,
+ u64 ignore_mask)
+{
+ struct x86_perf_regs *perf_regs = container_of(regs, struct x86_perf_regs, regs);
+ u64 mask = 0;
+
+ if (event_has_extended_regs(event))
+ mask |= XFEATURE_MASK_SSE;
+
+ mask &= ~ignore_mask;
+ if (mask)
+ x86_pmu_get_ext_regs(perf_regs, mask);
+}
+
+void x86_pmu_setup_regs_data(struct perf_event *event,
+ struct perf_sample_data *data,
+ struct pt_regs *regs,
+ u64 ignore_mask)
+{
+ x86_pmu_setup_basic_regs_data(event, data, regs);
+ /*
+ * ignore_mask indicates the extended regs already sampled
+ * by PEBS, which are unnecessary to sample again.
+ */
+ x86_pmu_sample_ext_regs(event, regs, ignore_mask);
+}
+
int x86_pmu_handle_irq(struct pt_regs *regs)
{
struct perf_sample_data data;
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 81e6c8bcabde..b5c89e8eabb2 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3410,6 +3410,9 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
if (has_branch_stack(event))
intel_pmu_lbr_save_brstack(&data, cpuc, event);
+ x86_pmu_clear_perf_regs(regs);
+ x86_pmu_setup_regs_data(event, &data, regs, 0);
+
perf_event_overflow(event, &data, regs);
}
@@ -5619,8 +5622,30 @@ static inline void __intel_update_large_pebs_flags(struct pmu *pmu)
}
}
-#define counter_mask(_gp, _fixed) ((_gp) | ((u64)(_fixed) << INTEL_PMC_IDX_FIXED))
+static void intel_extended_regs_init(struct pmu *pmu)
+{
+ /*
+ * Extend vector register support to non-PEBS events.
+ * The feature is limited to newer Intel machines with
+ * PEBS V4+ or archPerfmonExt (0x23) enabled for now.
+ * In theory, the vector registers can be retrieved as
+ * long as the CPU supports them. Support for older
+ * generations may be added later if there is a
+ * requirement.
+ * Only support the extension when XSAVES is available.
+ */
+ if (!boot_cpu_has(X86_FEATURE_XSAVES))
+ return;
+ if (!boot_cpu_has(X86_FEATURE_XMM) ||
+ !cpu_has_xfeatures(XFEATURE_MASK_SSE, NULL))
+ return;
+
+ x86_pmu.ext_regs_mask |= XFEATURE_MASK_SSE;
+ x86_get_pmu(smp_processor_id())->capabilities |= PERF_PMU_CAP_EXTENDED_REGS;
+}
+
+#define counter_mask(_gp, _fixed) ((_gp) | ((u64)(_fixed) << INTEL_PMC_IDX_FIXED))
static void update_pmu_cap(struct pmu *pmu)
{
unsigned int eax, ebx, ecx, edx;
@@ -5682,6 +5707,8 @@ static void update_pmu_cap(struct pmu *pmu)
/* Perf Metric (Bit 15) and PEBS via PT (Bit 16) are hybrid enumeration */
rdmsrq(MSR_IA32_PERF_CAPABILITIES, hybrid(pmu, intel_cap).capabilities);
}
+
+ intel_extended_regs_init(pmu);
}
static void intel_pmu_check_hybrid_pmus(struct x86_hybrid_pmu *pmu)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index c7351f476d8c..af462f69cd1c 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1473,8 +1473,7 @@ static u64 pebs_update_adaptive_cfg(struct perf_event *event)
if (gprs || (attr->precise_ip < 2) || tsx_weight)
pebs_data_cfg |= PEBS_DATACFG_GP;
- if ((sample_type & PERF_SAMPLE_REGS_INTR) &&
- (attr->sample_regs_intr & PERF_REG_EXTENDED_MASK))
+ if (event_has_extended_regs(event))
pebs_data_cfg |= PEBS_DATACFG_XMMS;
if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
@@ -2190,10 +2189,8 @@ static inline void __setup_pebs_gpr_group(struct perf_event *event,
regs->flags &= ~PERF_EFLAGS_EXACT;
}
- if (sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)) {
+ if (sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER))
adaptive_pebs_save_regs(regs, gprs);
- x86_pmu_setup_regs_data(event, data, regs);
- }
}
static inline void __setup_pebs_meminfo_group(struct perf_event *event,
@@ -2251,6 +2248,7 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
struct pebs_meminfo *meminfo = NULL;
struct pebs_gprs *gprs = NULL;
struct x86_perf_regs *perf_regs;
+ u64 ignore_mask = 0;
u64 format_group;
u16 retire;
@@ -2258,7 +2256,7 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
return;
perf_regs = container_of(regs, struct x86_perf_regs, regs);
- perf_regs->xmm_regs = NULL;
+ x86_pmu_clear_perf_regs(regs);
format_group = basic->format_group;
@@ -2305,6 +2303,7 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
if (format_group & PEBS_DATACFG_XMMS) {
struct pebs_xmm *xmm = next_record;
+ ignore_mask |= XFEATURE_MASK_SSE;
next_record = xmm + 1;
perf_regs->xmm_regs = xmm->xmm;
}
@@ -2343,6 +2342,8 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
next_record += nr * sizeof(u64);
}
+ x86_pmu_setup_regs_data(event, data, regs, ignore_mask);
+
WARN_ONCE(next_record != __pebs + basic->format_size,
"PEBS record size %u, expected %llu, config %llx\n",
basic->format_size,
@@ -2368,6 +2369,7 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
struct arch_pebs_aux *meminfo = NULL;
struct arch_pebs_gprs *gprs = NULL;
struct x86_perf_regs *perf_regs;
+ u64 ignore_mask = 0;
void *next_record;
void *at = __pebs;
@@ -2375,7 +2377,7 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
return;
perf_regs = container_of(regs, struct x86_perf_regs, regs);
- perf_regs->xmm_regs = NULL;
+ x86_pmu_clear_perf_regs(regs);
__setup_perf_sample_data(event, iregs, data);
@@ -2430,6 +2432,7 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
next_record += sizeof(struct arch_pebs_xer_header);
+ ignore_mask |= XFEATURE_MASK_SSE;
xmm = next_record;
perf_regs->xmm_regs = xmm->xmm;
next_record = xmm + 1;
@@ -2477,6 +2480,8 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
at = at + header->size;
goto again;
}
+
+ x86_pmu_setup_regs_data(event, data, regs, ignore_mask);
}
static inline void *
@@ -3137,6 +3142,7 @@ static void __init intel_ds_pebs_init(void)
x86_pmu.flags |= PMU_FL_PEBS_ALL;
x86_pmu.pebs_capable = ~0ULL;
pebs_qual = "-baseline";
+ x86_pmu.ext_regs_mask |= XFEATURE_MASK_SSE;
x86_get_pmu(smp_processor_id())->capabilities |= PERF_PMU_CAP_EXTENDED_REGS;
} else {
/* Only basic record supported */
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 80e52e937638..3c470d79aa65 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1009,6 +1009,12 @@ struct x86_pmu {
struct extra_reg *extra_regs;
unsigned int flags;
+ /*
+ * Extended regs, e.g., vector registers
+ * Utilize the same format as the XFEATURE_MASK_*
+ */
+ u64 ext_regs_mask;
+
/*
* Intel host/guest support (KVM)
*/
@@ -1294,9 +1300,12 @@ void x86_pmu_enable_event(struct perf_event *event);
int x86_pmu_handle_irq(struct pt_regs *regs);
+void x86_pmu_clear_perf_regs(struct pt_regs *regs);
+
void x86_pmu_setup_regs_data(struct perf_event *event,
struct perf_sample_data *data,
- struct pt_regs *regs);
+ struct pt_regs *regs,
+ u64 ignore_mask);
void x86_pmu_show_pmu_cap(struct pmu *pmu);
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 38fa8ff26559..19dec5f0b1c7 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -112,6 +112,8 @@ void xsaves(struct xregs_state *xsave, u64 mask);
void xrstors(struct xregs_state *xsave, u64 mask);
void xsaves_nmi(struct xregs_state *xsave, u64 mask);
+unsigned int xstate_calculate_size(u64 xfeatures, bool compacted);
+
int xfd_enable_feature(u64 xfd_err);
#ifdef CONFIG_X86_64
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 7276ba70c88a..3b368de9f803 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -704,7 +704,10 @@ extern void perf_events_lapic_init(void);
struct pt_regs;
struct x86_perf_regs {
struct pt_regs regs;
- u64 *xmm_regs;
+ union {
+ u64 *xmm_regs;
+ u32 *xmm_space; /* for xsaves */
+ };
};
extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index e3b8afed8b2c..33142bccc075 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -586,7 +586,7 @@ static bool __init check_xstate_against_struct(int nr)
return true;
}
-static unsigned int xstate_calculate_size(u64 xfeatures, bool compacted)
+unsigned int xstate_calculate_size(u64 xfeatures, bool compacted)
{
unsigned int topmost = fls64(xfeatures) - 1;
unsigned int offset, i;
--
2.34.1
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [Patch v5 07/19] perf: Add sampling support for SIMD registers
2025-12-03 6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
` (5 preceding siblings ...)
2025-12-03 6:54 ` [Patch v5 06/19] perf/x86: Add support for XMM registers in non-PEBS and REGS_USER Dapeng Mi
@ 2025-12-03 6:54 ` Dapeng Mi
2025-12-05 11:07 ` Peter Zijlstra
2025-12-05 11:40 ` Peter Zijlstra
2025-12-03 6:54 ` [Patch v5 08/19] perf/x86: Enable XMM sampling using sample_simd_vec_reg_* fields Dapeng Mi
` (12 subsequent siblings)
19 siblings, 2 replies; 55+ messages in thread
From: Dapeng Mi @ 2025-12-03 6:54 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
Eranian Stephane
Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang, Dapeng Mi
From: Kan Liang <kan.liang@linux.intel.com>
Users may be interested in sampling SIMD registers during profiling.
The current sample_regs_* structure does not have sufficient space
for all SIMD registers.
To address this, new attribute fields sample_simd_{pred,vec}_reg_* are
added to struct perf_event_attr to represent the SIMD registers that are
expected to be sampled.
Currently, the perf/x86 code supports XMM registers in sample_regs_*.
To unify the configuration of SIMD registers and ensure a consistent
method for configuring XMM and other SIMD registers, a new event
attribute field, sample_simd_regs_enabled, is introduced. When
sample_simd_regs_enabled is set, it indicates that all SIMD registers,
including XMM, will be represented by the newly introduced
sample_simd_{pred|vec}_reg_* fields. The original XMM space in
sample_regs_* is reserved for future use.
Since SIMD registers are wider than 64 bits, a new output format is
introduced. The number and width of SIMD registers are dumped first,
followed by the register values. The number and width are based on the
user's configuration. If an architecture needs them to differ (e.g., on
ARM), an arch-specific perf_output_sample_simd_regs() function can be
implemented separately.
A new ABI, PERF_SAMPLE_REGS_ABI_SIMD, is added to indicate the new format.
The enum perf_sample_regs_abi is now a bitmap. This change should not
impact existing tools, as the version and bitmap remain the same for
values 1 and 2.
Additionally, two new __weak functions are introduced:
- perf_simd_reg_value(): Retrieves the value of the requested SIMD
register.
- perf_simd_reg_validate(): Validates the configuration of the SIMD
registers.
A new flag, PERF_PMU_CAP_SIMD_REGS, is added to indicate that the PMU
supports SIMD register dumping. An error is generated if
sample_simd_{pred|vec}_reg_* is mistakenly set for a PMU that does not
support this capability.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
include/linux/perf_event.h | 8 +++
include/linux/perf_regs.h | 4 ++
include/uapi/linux/perf_event.h | 45 ++++++++++++++--
kernel/events/core.c | 96 +++++++++++++++++++++++++++++++--
4 files changed, 146 insertions(+), 7 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 5153b70d09c8..87d3bdbef30e 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -305,6 +305,7 @@ struct perf_event_pmu_context;
#define PERF_PMU_CAP_EXTENDED_HW_TYPE 0x0100
#define PERF_PMU_CAP_AUX_PAUSE 0x0200
#define PERF_PMU_CAP_AUX_PREFER_LARGE 0x0400
+#define PERF_PMU_CAP_SIMD_REGS 0x0800
/**
* pmu::scope
@@ -1526,6 +1527,13 @@ perf_event__output_id_sample(struct perf_event *event,
extern void
perf_log_lost_samples(struct perf_event *event, u64 lost);
+static inline bool event_has_simd_regs(struct perf_event *event)
+{
+ struct perf_event_attr *attr = &event->attr;
+
+ return attr->sample_simd_regs_enabled != 0;
+}
+
static inline bool event_has_extended_regs(struct perf_event *event)
{
struct perf_event_attr *attr = &event->attr;
diff --git a/include/linux/perf_regs.h b/include/linux/perf_regs.h
index 144bcc3ff19f..518f28c6a7d4 100644
--- a/include/linux/perf_regs.h
+++ b/include/linux/perf_regs.h
@@ -14,6 +14,10 @@ int perf_reg_validate(u64 mask);
u64 perf_reg_abi(struct task_struct *task);
void perf_get_regs_user(struct perf_regs *regs_user,
struct pt_regs *regs);
+int perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
+ u16 pred_qwords, u32 pred_mask);
+u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
+ u16 qwords_idx, bool pred);
#ifdef CONFIG_HAVE_PERF_REGS
#include <asm/perf_regs.h>
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index d292f96bc06f..f1474da32622 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -314,8 +314,9 @@ enum {
*/
enum perf_sample_regs_abi {
PERF_SAMPLE_REGS_ABI_NONE = 0,
- PERF_SAMPLE_REGS_ABI_32 = 1,
- PERF_SAMPLE_REGS_ABI_64 = 2,
+ PERF_SAMPLE_REGS_ABI_32 = (1 << 0),
+ PERF_SAMPLE_REGS_ABI_64 = (1 << 1),
+ PERF_SAMPLE_REGS_ABI_SIMD = (1 << 2),
};
/*
@@ -382,6 +383,7 @@ enum perf_event_read_format {
#define PERF_ATTR_SIZE_VER6 120 /* Add: aux_sample_size */
#define PERF_ATTR_SIZE_VER7 128 /* Add: sig_data */
#define PERF_ATTR_SIZE_VER8 136 /* Add: config3 */
+#define PERF_ATTR_SIZE_VER9 168 /* Add: sample_simd_{pred,vec}_reg_* */
/*
* 'struct perf_event_attr' contains various attributes that define
@@ -545,6 +547,25 @@ struct perf_event_attr {
__u64 sig_data;
__u64 config3; /* extension of config2 */
+
+
+ /*
+ * Defines the set of SIMD registers to dump on samples.
+ * A non-zero sample_simd_regs_enabled implies the
+ * sample_simd_*_reg_* fields are used to configure all SIMD
+ * registers. If !sample_simd_regs_enabled, sample_regs_XXX
+ * may be used to configure some SIMD registers on x86.
+ */
+ union {
+ __u16 sample_simd_regs_enabled;
+ __u16 sample_simd_pred_reg_qwords;
+ };
+ __u32 sample_simd_pred_reg_intr;
+ __u32 sample_simd_pred_reg_user;
+ __u16 sample_simd_vec_reg_qwords;
+ __u64 sample_simd_vec_reg_intr;
+ __u64 sample_simd_vec_reg_user;
+ __u32 __reserved_4;
};
/*
@@ -1018,7 +1039,15 @@ enum perf_event_type {
* } && PERF_SAMPLE_BRANCH_STACK
*
* { u64 abi; # enum perf_sample_regs_abi
- * u64 regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
+ * u64 regs[weight(mask)];
+ * struct {
+ * u16 nr_vectors;
+ * u16 vector_qwords;
+ * u16 nr_pred;
+ * u16 pred_qwords;
+ * u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
+ * } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
+ * } && PERF_SAMPLE_REGS_USER
*
* { u64 size;
* char data[size];
@@ -1045,7 +1074,15 @@ enum perf_event_type {
* { u64 data_src; } && PERF_SAMPLE_DATA_SRC
* { u64 transaction; } && PERF_SAMPLE_TRANSACTION
* { u64 abi; # enum perf_sample_regs_abi
- * u64 regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
+ * u64 regs[weight(mask)];
+ * struct {
+ * u16 nr_vectors;
+ * u16 vector_qwords;
+ * u16 nr_pred;
+ * u16 pred_qwords;
+ * u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
+ * } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
+ * } && PERF_SAMPLE_REGS_INTR
* { u64 phys_addr;} && PERF_SAMPLE_PHYS_ADDR
* { u64 cgroup;} && PERF_SAMPLE_CGROUP
* { u64 data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 3e9c48fa2202..b19de038979e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7469,6 +7469,50 @@ perf_output_sample_regs(struct perf_output_handle *handle,
}
}
+static void
+perf_output_sample_simd_regs(struct perf_output_handle *handle,
+ struct perf_event *event,
+ struct pt_regs *regs,
+ u64 mask, u16 pred_mask)
+{
+ u16 pred_qwords = event->attr.sample_simd_pred_reg_qwords;
+ u16 vec_qwords = event->attr.sample_simd_vec_reg_qwords;
+ u64 pred_bitmap = pred_mask;
+ u64 bitmap = mask;
+ u16 nr_vectors;
+ u16 nr_pred;
+ int bit;
+ u64 val;
+ u16 i;
+
+ nr_vectors = hweight64(bitmap);
+ nr_pred = hweight64(pred_bitmap);
+
+ perf_output_put(handle, nr_vectors);
+ perf_output_put(handle, vec_qwords);
+ perf_output_put(handle, nr_pred);
+ perf_output_put(handle, pred_qwords);
+
+ if (nr_vectors) {
+ for_each_set_bit(bit, (unsigned long *)&bitmap,
+ sizeof(bitmap) * BITS_PER_BYTE) {
+ for (i = 0; i < vec_qwords; i++) {
+ val = perf_simd_reg_value(regs, bit, i, false);
+ perf_output_put(handle, val);
+ }
+ }
+ }
+ if (nr_pred) {
+ for_each_set_bit(bit, (unsigned long *)&pred_bitmap,
+ sizeof(pred_bitmap) * BITS_PER_BYTE) {
+ for (i = 0; i < pred_qwords; i++) {
+ val = perf_simd_reg_value(regs, bit, i, true);
+ perf_output_put(handle, val);
+ }
+ }
+ }
+}
+
static void perf_sample_regs_user(struct perf_regs *regs_user,
struct pt_regs *regs)
{
@@ -7490,6 +7534,17 @@ static void perf_sample_regs_intr(struct perf_regs *regs_intr,
regs_intr->abi = perf_reg_abi(current);
}
+int __weak perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
+ u16 pred_qwords, u32 pred_mask)
+{
+ return vec_qwords || vec_mask || pred_qwords || pred_mask ? -ENOSYS : 0;
+}
+
+u64 __weak perf_simd_reg_value(struct pt_regs *regs, int idx,
+ u16 qwords_idx, bool pred)
+{
+ return 0;
+}
/*
* Get remaining task size from user stack pointer.
@@ -8022,10 +8077,17 @@ void perf_output_sample(struct perf_output_handle *handle,
perf_output_put(handle, abi);
if (abi) {
- u64 mask = event->attr.sample_regs_user;
+ struct perf_event_attr *attr = &event->attr;
+ u64 mask = attr->sample_regs_user;
perf_output_sample_regs(handle,
data->regs_user.regs,
mask);
+ if (abi & PERF_SAMPLE_REGS_ABI_SIMD) {
+ perf_output_sample_simd_regs(handle, event,
+ data->regs_user.regs,
+ attr->sample_simd_vec_reg_user,
+ attr->sample_simd_pred_reg_user);
+ }
}
}
@@ -8053,11 +8115,18 @@ void perf_output_sample(struct perf_output_handle *handle,
perf_output_put(handle, abi);
if (abi) {
- u64 mask = event->attr.sample_regs_intr;
+ struct perf_event_attr *attr = &event->attr;
+ u64 mask = attr->sample_regs_intr;
perf_output_sample_regs(handle,
data->regs_intr.regs,
mask);
+ if (abi & PERF_SAMPLE_REGS_ABI_SIMD) {
+ perf_output_sample_simd_regs(handle, event,
+ data->regs_intr.regs,
+ attr->sample_simd_vec_reg_intr,
+ attr->sample_simd_pred_reg_intr);
+ }
}
}
@@ -12697,6 +12766,12 @@ static int perf_try_init_event(struct pmu *pmu, struct perf_event *event)
if (ret)
goto err_pmu;
+ if (!(pmu->capabilities & PERF_PMU_CAP_SIMD_REGS) &&
+ event_has_simd_regs(event)) {
+ ret = -EOPNOTSUPP;
+ goto err_destroy;
+ }
+
if (!(pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS) &&
event_has_extended_regs(event)) {
ret = -EOPNOTSUPP;
@@ -13238,6 +13313,12 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
ret = perf_reg_validate(attr->sample_regs_user);
if (ret)
return ret;
+ ret = perf_simd_reg_validate(attr->sample_simd_vec_reg_qwords,
+ attr->sample_simd_vec_reg_user,
+ attr->sample_simd_pred_reg_qwords,
+ attr->sample_simd_pred_reg_user);
+ if (ret)
+ return ret;
}
if (attr->sample_type & PERF_SAMPLE_STACK_USER) {
@@ -13258,8 +13339,17 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
if (!attr->sample_max_stack)
attr->sample_max_stack = sysctl_perf_event_max_stack;
- if (attr->sample_type & PERF_SAMPLE_REGS_INTR)
+ if (attr->sample_type & PERF_SAMPLE_REGS_INTR) {
ret = perf_reg_validate(attr->sample_regs_intr);
+ if (ret)
+ return ret;
+ ret = perf_simd_reg_validate(attr->sample_simd_vec_reg_qwords,
+ attr->sample_simd_vec_reg_intr,
+ attr->sample_simd_pred_reg_qwords,
+ attr->sample_simd_pred_reg_intr);
+ if (ret)
+ return ret;
+ }
#ifndef CONFIG_CGROUP_PERF
if (attr->sample_type & PERF_SAMPLE_CGROUP)
--
2.34.1
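As an aside on the record layout above, here is a minimal user-space
sketch of how the SIMD block appended to PERF_SAMPLE_REGS_{USER,INTR}
data could be walked when abi carries PERF_SAMPLE_REGS_ABI_SIMD. The
struct and helper names are invented for illustration and are not part
of this series:
#include <stdint.h>
#include <stdio.h>
/* Header layout per the format comment: four u16 fields, one u64 total. */
struct simd_hdr {
        uint16_t nr_vectors;
        uint16_t vector_qwords;
        uint16_t nr_pred;
        uint16_t pred_qwords;
};
/* Walks the SIMD block at p; returns a pointer just past it. */
static const uint64_t *parse_simd_regs(const uint64_t *p)
{
        const struct simd_hdr *hdr = (const struct simd_hdr *)p;
        const uint64_t *data = p + 1;   /* the four u16s occupy one u64 */
        unsigned int i, q;
        for (i = 0; i < hdr->nr_vectors; i++)
                for (q = 0; q < hdr->vector_qwords; q++)
                        printf("vec%u[%u] = 0x%016llx\n", i, q,
                               (unsigned long long)*data++);
        for (i = 0; i < hdr->nr_pred; i++)
                for (q = 0; q < hdr->pred_qwords; q++)
                        printf("pred%u[%u] = 0x%016llx\n", i, q,
                               (unsigned long long)*data++);
        return data;
}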
* [Patch v5 08/19] perf/x86: Enable XMM sampling using sample_simd_vec_reg_* fields
2025-12-03 6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
` (6 preceding siblings ...)
2025-12-03 6:54 ` [Patch v5 07/19] perf: Add sampling support for SIMD registers Dapeng Mi
@ 2025-12-03 6:54 ` Dapeng Mi
2025-12-05 11:25 ` Peter Zijlstra
2025-12-03 6:54 ` [Patch v5 09/19] perf/x86: Enable YMM " Dapeng Mi
` (11 subsequent siblings)
19 siblings, 1 reply; 55+ messages in thread
From: Dapeng Mi @ 2025-12-03 6:54 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
Eranian Stephane
Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang, Dapeng Mi
From: Kan Liang <kan.liang@linux.intel.com>
This patch adds support for sampling XMM registers using the
sample_simd_vec_reg_* fields.
When sample_simd_regs_enabled is set, the original XMM space in the
sample_regs_* fields is treated as reserved. An -EINVAL error is
returned to user space if any bit in the original XMM space is set
while sample_simd_regs_enabled is set.
The perf_reg_value() function requires ABI information to understand
the layout of sample_regs. To accommodate this, a new abi field is
introduced in struct x86_perf_regs to carry that information.
Additionally, the x86-specific perf_simd_reg_value() function is
implemented to retrieve the XMM register values.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
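As a usage illustration only (not part of the change, and assuming a
uapi header updated by this series), configuring an event to sample
XMM0-XMM3 at interrupt time might look as below. Note that
sample_simd_pred_reg_qwords aliases sample_simd_regs_enabled, so
writing it non-zero is what switches the event to the SIMD register
scheme:
#include <string.h>
#include <linux/perf_event.h>
static void setup_xmm_sampling(struct perf_event_attr *attr)
{
        memset(attr, 0, sizeof(*attr));
        attr->size = sizeof(*attr);     /* >= PERF_ATTR_SIZE_VER9 */
        attr->type = PERF_TYPE_HARDWARE;
        attr->config = PERF_COUNT_HW_CPU_CYCLES;
        attr->sample_period = 100000;
        attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_REGS_INTR;
        /* Non-zero width in the union enables the SIMD scheme. */
        attr->sample_simd_pred_reg_qwords = 1;
        attr->sample_simd_vec_reg_qwords = 2;   /* XMM is 2 qwords */
        attr->sample_simd_vec_reg_intr = 0xf;   /* XMM0-XMM3 */
}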
arch/x86/events/core.c | 78 ++++++++++++++++++++++++++-
arch/x86/events/intel/ds.c | 2 +-
arch/x86/events/perf_event.h | 12 +++++
arch/x86/include/asm/perf_event.h | 1 +
arch/x86/include/uapi/asm/perf_regs.h | 17 ++++++
arch/x86/kernel/perf_regs.c | 51 +++++++++++++++++-
6 files changed, 158 insertions(+), 3 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 0d33668b1927..8f7e7e81daaf 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -719,6 +719,22 @@ int x86_pmu_hw_config(struct perf_event *event)
return -EINVAL;
if (!event->attr.precise_ip)
return -EINVAL;
+ if (event->attr.sample_simd_regs_enabled)
+ return -EINVAL;
+ }
+
+ if (event_has_simd_regs(event)) {
+ if (!(event->pmu->capabilities & PERF_PMU_CAP_SIMD_REGS))
+ return -EINVAL;
+ /* Width set but no vector registers requested */
+ if (event->attr.sample_simd_vec_reg_qwords &&
+ !event->attr.sample_simd_vec_reg_intr &&
+ !event->attr.sample_simd_vec_reg_user)
+ return -EINVAL;
+ /* The requested vector register set is not supported */
+ if (event_needs_xmm(event) &&
+ !(x86_pmu.ext_regs_mask & XFEATURE_MASK_SSE))
+ return -EINVAL;
}
}
@@ -1760,6 +1776,7 @@ x86_pmu_perf_get_regs_user(struct perf_sample_data *data,
struct x86_perf_regs *x86_regs_user = this_cpu_ptr(&x86_user_regs);
struct perf_regs regs_user;
+ x86_regs_user->abi = PERF_SAMPLE_REGS_ABI_NONE;
perf_get_regs_user(®s_user, regs);
data->regs_user.abi = regs_user.abi;
if (regs_user.regs) {
@@ -1772,9 +1789,26 @@ x86_pmu_perf_get_regs_user(struct perf_sample_data *data,
static bool x86_pmu_user_req_pt_regs_only(struct perf_event *event)
{
+ if (event->attr.sample_simd_regs_enabled)
+ return false;
return !(event->attr.sample_regs_user & PERF_REG_EXTENDED_MASK);
}
+static inline void
+x86_pmu_update_ext_regs_size(struct perf_event_attr *attr,
+ struct perf_sample_data *data,
+ struct pt_regs *regs,
+ u64 mask, u16 pred_mask)
+{
+ u16 pred_qwords = attr->sample_simd_pred_reg_qwords;
+ u16 vec_qwords = attr->sample_simd_vec_reg_qwords;
+ u64 pred_bitmap = pred_mask;
+ u64 bitmap = mask;
+
+ data->dyn_size += (hweight64(bitmap) * vec_qwords +
+ hweight64(pred_bitmap) * pred_qwords) * sizeof(u64);
+}
+
inline void x86_pmu_clear_perf_regs(struct pt_regs *regs)
{
struct x86_perf_regs *perf_regs = container_of(regs, struct x86_perf_regs, regs);
@@ -1795,6 +1829,7 @@ static void x86_pmu_setup_basic_regs_data(struct perf_event *event,
if (sample_type & PERF_SAMPLE_REGS_USER) {
perf_regs = container_of(regs, struct x86_perf_regs, regs);
+ perf_regs->abi = PERF_SAMPLE_REGS_ABI_NONE;
if (user_mode(regs)) {
data->regs_user.abi = perf_reg_abi(current);
@@ -1817,17 +1852,24 @@ static void x86_pmu_setup_basic_regs_data(struct perf_event *event,
data->dyn_size += sizeof(u64);
if (data->regs_user.regs)
data->dyn_size += hweight64(attr->sample_regs_user) * sizeof(u64);
+ perf_regs->abi |= data->regs_user.abi;
+ if (attr->sample_simd_regs_enabled)
+ perf_regs->abi |= PERF_SAMPLE_REGS_ABI_SIMD;
data->sample_flags |= PERF_SAMPLE_REGS_USER;
}
if (sample_type & PERF_SAMPLE_REGS_INTR) {
perf_regs = container_of(regs, struct x86_perf_regs, regs);
+ perf_regs->abi = PERF_SAMPLE_REGS_ABI_NONE;
data->regs_intr.regs = regs;
data->regs_intr.abi = perf_reg_abi(current);
data->dyn_size += sizeof(u64);
if (data->regs_intr.regs)
data->dyn_size += hweight64(attr->sample_regs_intr) * sizeof(u64);
+ perf_regs->abi |= data->regs_intr.abi;
+ if (attr->sample_simd_regs_enabled)
+ perf_regs->abi |= PERF_SAMPLE_REGS_ABI_SIMD;
data->sample_flags |= PERF_SAMPLE_REGS_INTR;
}
}
@@ -1839,7 +1881,7 @@ static void x86_pmu_sample_ext_regs(struct perf_event *event,
struct x86_perf_regs *perf_regs = container_of(regs, struct x86_perf_regs, regs);
u64 mask = 0;
- if (event_has_extended_regs(event))
+ if (event_needs_xmm(event))
mask |= XFEATURE_MASK_SSE;
mask &= ~ignore_mask;
@@ -1847,6 +1889,39 @@ static void x86_pmu_sample_ext_regs(struct perf_event *event,
x86_pmu_get_ext_regs(perf_regs, mask);
}
+static void x86_pmu_setup_extended_regs_data(struct perf_event *event,
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
+{
+ struct perf_event_attr *attr = &event->attr;
+ u64 sample_type = attr->sample_type;
+
+ if (!attr->sample_simd_regs_enabled)
+ return;
+
+ if (!(attr->sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)))
+ return;
+
+ /* Update the data[] size */
+ if (sample_type & PERF_SAMPLE_REGS_USER && data->regs_user.abi) {
+ /* num and qwords of vector and pred registers */
+ data->dyn_size += sizeof(u64);
+ data->regs_user.abi |= PERF_SAMPLE_REGS_ABI_SIMD;
+ x86_pmu_update_ext_regs_size(attr, data, data->regs_user.regs,
+ attr->sample_simd_vec_reg_user,
+ attr->sample_simd_pred_reg_user);
+ }
+
+ if (sample_type & PERF_SAMPLE_REGS_INTR && data->regs_intr.abi) {
+ /* num and qwords of vector and pred registers */
+ data->dyn_size += sizeof(u64);
+ data->regs_intr.abi |= PERF_SAMPLE_REGS_ABI_SIMD;
+ x86_pmu_update_ext_regs_size(attr, data, data->regs_intr.regs,
+ attr->sample_simd_vec_reg_intr,
+ attr->sample_simd_pred_reg_intr);
+ }
+}
+
void x86_pmu_setup_regs_data(struct perf_event *event,
struct perf_sample_data *data,
struct pt_regs *regs,
@@ -1858,6 +1933,7 @@ void x86_pmu_setup_regs_data(struct perf_event *event,
* which is unnecessary to sample again.
*/
x86_pmu_sample_ext_regs(event, regs, ignore_mask);
+ x86_pmu_setup_extended_regs_data(event, data, regs);
}
int x86_pmu_handle_irq(struct pt_regs *regs)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index af462f69cd1c..79cba323eeb1 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1473,7 +1473,7 @@ static u64 pebs_update_adaptive_cfg(struct perf_event *event)
if (gprs || (attr->precise_ip < 2) || tsx_weight)
pebs_data_cfg |= PEBS_DATACFG_GP;
- if (event_has_extended_regs(event))
+ if (event_needs_xmm(event))
pebs_data_cfg |= PEBS_DATACFG_XMMS;
if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 3c470d79aa65..e5d8ad024553 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -133,6 +133,18 @@ static inline bool is_acr_event_group(struct perf_event *event)
return check_leader_group(event->group_leader, PERF_X86_EVENT_ACR);
}
+static inline bool event_needs_xmm(struct perf_event *event)
+{
+ if (event->attr.sample_simd_regs_enabled &&
+ event->attr.sample_simd_vec_reg_qwords >= PERF_X86_XMM_QWORDS)
+ return true;
+
+ if (!event->attr.sample_simd_regs_enabled &&
+ event_has_extended_regs(event))
+ return true;
+ return false;
+}
+
struct amd_nb {
int nb_id; /* NorthBridge id */
int refcnt; /* reference count */
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 3b368de9f803..5d623805bf87 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -704,6 +704,7 @@ extern void perf_events_lapic_init(void);
struct pt_regs;
struct x86_perf_regs {
struct pt_regs regs;
+ u64 abi;
union {
u64 *xmm_regs;
u32 *xmm_space; /* for xsaves */
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index 7c9d2bb3833b..c3862e5fdd6d 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -55,4 +55,21 @@ enum perf_event_x86_regs {
#define PERF_REG_EXTENDED_MASK (~((1ULL << PERF_REG_X86_XMM0) - 1))
+enum {
+ PERF_REG_X86_XMM,
+ PERF_REG_X86_MAX_SIMD_REGS,
+};
+
+enum {
+ PERF_X86_SIMD_XMM_REGS = 16,
+ PERF_X86_SIMD_VEC_REGS_MAX = PERF_X86_SIMD_XMM_REGS,
+};
+
+#define PERF_X86_SIMD_VEC_MASK GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
+
+enum {
+ PERF_X86_XMM_QWORDS = 2,
+ PERF_X86_SIMD_QWORDS_MAX = PERF_X86_XMM_QWORDS,
+};
+
#endif /* _ASM_X86_PERF_REGS_H */
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 81204cb7f723..9947a6b5c260 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -63,6 +63,9 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
if (idx >= PERF_REG_X86_XMM0 && idx < PERF_REG_X86_XMM_MAX) {
perf_regs = container_of(regs, struct x86_perf_regs, regs);
+ /* SIMD registers are moved to dedicated sample_simd_vec_reg */
+ if (perf_regs->abi & PERF_SAMPLE_REGS_ABI_SIMD)
+ return 0;
if (!perf_regs->xmm_regs)
return 0;
return perf_regs->xmm_regs[idx - PERF_REG_X86_XMM0];
@@ -74,6 +77,51 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
return regs_get_register(regs, pt_regs_offset[idx]);
}
+u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
+ u16 qwords_idx, bool pred)
+{
+ struct x86_perf_regs *perf_regs =
+ container_of(regs, struct x86_perf_regs, regs);
+
+ if (pred)
+ return 0;
+
+ if (WARN_ON_ONCE(idx >= PERF_X86_SIMD_VEC_REGS_MAX ||
+ qwords_idx >= PERF_X86_SIMD_QWORDS_MAX))
+ return 0;
+
+ if (qwords_idx < PERF_X86_XMM_QWORDS) {
+ if (!perf_regs->xmm_regs)
+ return 0;
+ return perf_regs->xmm_regs[idx * PERF_X86_XMM_QWORDS +
+ qwords_idx];
+ }
+
+ return 0;
+}
+
+int perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
+ u16 pred_qwords, u32 pred_mask)
+{
+ /* pred_qwords implies sample_simd_{pred,vec}_reg_* are supported */
+ if (!pred_qwords)
+ return 0;
+
+ if (!vec_qwords) {
+ if (vec_mask)
+ return -EINVAL;
+ } else {
+ if (vec_qwords != PERF_X86_XMM_QWORDS)
+ return -EINVAL;
+ if (vec_mask & ~PERF_X86_SIMD_VEC_MASK)
+ return -EINVAL;
+ }
+ if (pred_mask)
+ return -EINVAL;
+
+ return 0;
+}
+
#define PERF_REG_X86_RESERVED (((1ULL << PERF_REG_X86_XMM0) - 1) & \
~((1ULL << PERF_REG_X86_MAX) - 1))
@@ -108,7 +156,8 @@ u64 perf_reg_abi(struct task_struct *task)
int perf_reg_validate(u64 mask)
{
- if (!mask || (mask & (REG_NOSUPPORT | PERF_REG_X86_RESERVED)))
+ /* The mask could be 0 if only the SIMD registers are of interest */
+ if (mask & (REG_NOSUPPORT | PERF_REG_X86_RESERVED))
return -EINVAL;
return 0;
--
2.34.1
* [Patch v5 09/19] perf/x86: Enable YMM sampling using sample_simd_vec_reg_* fields
2025-12-03 6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
` (7 preceding siblings ...)
2025-12-03 6:54 ` [Patch v5 08/19] perf/x86: Enable XMM sampling using sample_simd_vec_reg_* fields Dapeng Mi
@ 2025-12-03 6:54 ` Dapeng Mi
2025-12-03 6:54 ` [Patch v5 10/19] perf/x86: Enable ZMM " Dapeng Mi
` (10 subsequent siblings)
19 siblings, 0 replies; 55+ messages in thread
From: Dapeng Mi @ 2025-12-03 6:54 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
Eranian Stephane
Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang, Dapeng Mi
From: Kan Liang <kan.liang@linux.intel.com>
This patch introduces support for sampling YMM registers via the
sample_simd_vec_reg_* fields.
Each YMM register consists of 4 u64 words, assembled from two halves:
XMM (the lower 2 u64 words) and YMMH (the upper 2 u64 words). Although
both XMM and YMMH data can be retrieved with a single xsaves instruction,
they are stored in separate locations. The perf_simd_reg_value() function
is responsible for assembling these halves into a complete YMM register
for output to userspace.
Additionally, sample_simd_vec_reg_qwords should be set to 4 to indicate
YMM sampling.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
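As a sketch of the index arithmetic (mirroring the perf_simd_reg_value()
change below; NULL checks omitted, assuming both save areas were filled
by XSAVES):
/* How qword q of YMM register idx is sourced. */
static u64 ymm_qword(struct x86_perf_regs *perf_regs, int idx, u16 q)
{
        if (q < PERF_X86_XMM_QWORDS)    /* q = 0..1: XMM half */
                return perf_regs->xmm_regs[idx * PERF_X86_XMM_QWORDS + q];
        /* q = 2..3: YMMH half, stored separately in the XSAVE area */
        return perf_regs->ymmh_regs[idx * PERF_X86_YMMH_QWORDS +
                                    q - PERF_X86_XMM_QWORDS];
}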
arch/x86/events/core.c | 9 +++++++++
arch/x86/events/perf_event.h | 9 +++++++++
arch/x86/include/asm/perf_event.h | 4 ++++
arch/x86/include/uapi/asm/perf_regs.h | 8 ++++++--
arch/x86/kernel/perf_regs.c | 8 +++++++-
5 files changed, 35 insertions(+), 3 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 8f7e7e81daaf..b1e62c061d9e 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -423,6 +423,9 @@ static void x86_pmu_get_ext_regs(struct x86_perf_regs *perf_regs, u64 mask)
if (valid_mask & XFEATURE_MASK_SSE)
perf_regs->xmm_space = xsave->i387.xmm_space;
+
+ if (valid_mask & XFEATURE_MASK_YMM)
+ perf_regs->ymmh = get_xsave_addr(xsave, XFEATURE_YMM);
}
static void release_ext_regs_buffers(void)
@@ -735,6 +738,9 @@ int x86_pmu_hw_config(struct perf_event *event)
if (event_needs_xmm(event) &&
!(x86_pmu.ext_regs_mask & XFEATURE_MASK_SSE))
return -EINVAL;
+ if (event_needs_ymm(event) &&
+ !(x86_pmu.ext_regs_mask & XFEATURE_MASK_YMM))
+ return -EINVAL;
}
}
@@ -1814,6 +1820,7 @@ inline void x86_pmu_clear_perf_regs(struct pt_regs *regs)
struct x86_perf_regs *perf_regs = container_of(regs, struct x86_perf_regs, regs);
perf_regs->xmm_regs = NULL;
+ perf_regs->ymmh_regs = NULL;
}
static void x86_pmu_setup_basic_regs_data(struct perf_event *event,
@@ -1883,6 +1890,8 @@ static void x86_pmu_sample_ext_regs(struct perf_event *event,
if (event_needs_xmm(event))
mask |= XFEATURE_MASK_SSE;
+ if (event_needs_ymm(event))
+ mask |= XFEATURE_MASK_YMM;
mask &= ~ignore_mask;
if (mask)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index e5d8ad024553..3d4577a1bb7d 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -145,6 +145,15 @@ static inline bool event_needs_xmm(struct perf_event *event)
return false;
}
+static inline bool event_needs_ymm(struct perf_event *event)
+{
+ if (event->attr.sample_simd_regs_enabled &&
+ event->attr.sample_simd_vec_reg_qwords >= PERF_X86_YMM_QWORDS)
+ return true;
+
+ return false;
+}
+
struct amd_nb {
int nb_id; /* NorthBridge id */
int refcnt; /* reference count */
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 5d623805bf87..25f5ae60f72f 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -709,6 +709,10 @@ struct x86_perf_regs {
u64 *xmm_regs;
u32 *xmm_space; /* for xsaves */
};
+ union {
+ u64 *ymmh_regs;
+ struct ymmh_struct *ymmh;
+ };
};
extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index c3862e5fdd6d..4fd598785f6d 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -57,19 +57,23 @@ enum perf_event_x86_regs {
enum {
PERF_REG_X86_XMM,
+ PERF_REG_X86_YMM,
PERF_REG_X86_MAX_SIMD_REGS,
};
enum {
PERF_X86_SIMD_XMM_REGS = 16,
- PERF_X86_SIMD_VEC_REGS_MAX = PERF_X86_SIMD_XMM_REGS,
+ PERF_X86_SIMD_YMM_REGS = 16,
+ PERF_X86_SIMD_VEC_REGS_MAX = PERF_X86_SIMD_YMM_REGS,
};
#define PERF_X86_SIMD_VEC_MASK GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
enum {
PERF_X86_XMM_QWORDS = 2,
- PERF_X86_SIMD_QWORDS_MAX = PERF_X86_XMM_QWORDS,
+ PERF_X86_YMMH_QWORDS = 2,
+ PERF_X86_YMM_QWORDS = 4,
+ PERF_X86_SIMD_QWORDS_MAX = PERF_X86_YMM_QWORDS,
};
#endif /* _ASM_X86_PERF_REGS_H */
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 9947a6b5c260..8aa61a18fd71 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -95,6 +95,11 @@ u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
return 0;
return perf_regs->xmm_regs[idx * PERF_X86_XMM_QWORDS +
qwords_idx];
+ } else if (qwords_idx < PERF_X86_YMM_QWORDS) {
+ if (!perf_regs->ymmh_regs)
+ return 0;
+ return perf_regs->ymmh_regs[idx * PERF_X86_YMMH_QWORDS +
+ qwords_idx - PERF_X86_XMM_QWORDS];
}
return 0;
@@ -111,7 +116,8 @@ int perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
if (vec_mask)
return -EINVAL;
} else {
- if (vec_qwords != PERF_X86_XMM_QWORDS)
+ if (vec_qwords != PERF_X86_XMM_QWORDS &&
+ vec_qwords != PERF_X86_YMM_QWORDS)
return -EINVAL;
if (vec_mask & ~PERF_X86_SIMD_VEC_MASK)
return -EINVAL;
--
2.34.1
* [Patch v5 10/19] perf/x86: Enable ZMM sampling using sample_simd_vec_reg_* fields
2025-12-03 6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
` (8 preceding siblings ...)
2025-12-03 6:54 ` [Patch v5 09/19] perf/x86: Enable YMM " Dapeng Mi
@ 2025-12-03 6:54 ` Dapeng Mi
2025-12-03 6:54 ` [Patch v5 11/19] perf/x86: Enable OPMASK sampling using sample_simd_pred_reg_* fields Dapeng Mi
` (9 subsequent siblings)
19 siblings, 0 replies; 55+ messages in thread
From: Dapeng Mi @ 2025-12-03 6:54 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
Eranian Stephane
Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang, Dapeng Mi
From: Kan Liang <kan.liang@linux.intel.com>
This patch adds support for sampling ZMM registers via the
sample_simd_vec_reg_* fields.
Each ZMM register consists of 8 u64 words. Current x86 hardware supports
up to 32 ZMM registers. ZMM registers ZMM0 through ZMM15 are assembled
from three parts: XMM (the lower 2 u64 words), YMMH (the middle 2 u64
words), and ZMMH (the upper 4 u64 words). The perf_simd_reg_value()
function is responsible for assembling these three parts into a complete
ZMM register for output to userspace.
ZMM registers ZMM16 through ZMM31 can each be read as a whole and output
directly to userspace.
Additionally, sample_simd_vec_reg_qwords should be set to 8 to indicate
ZMM sampling.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
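As a sketch of the resulting qword-to-save-area mapping (mirroring the
perf_simd_reg_value() change below; NULL checks omitted):
/* How qword q of ZMM register idx is sourced. */
static u64 zmm_qword(struct x86_perf_regs *perf_regs, int idx, u16 q)
{
        /* ZMM16-ZMM31 are stored whole in the Hi16_ZMM area. */
        if (idx >= PERF_X86_H16ZMM_BASE)
                return perf_regs->h16zmm_regs[(idx - PERF_X86_H16ZMM_BASE) *
                                              PERF_X86_ZMM_QWORDS + q];
        if (q < PERF_X86_XMM_QWORDS)    /* q = 0..1: XMM part */
                return perf_regs->xmm_regs[idx * PERF_X86_XMM_QWORDS + q];
        if (q < PERF_X86_YMM_QWORDS)    /* q = 2..3: YMMH part */
                return perf_regs->ymmh_regs[idx * PERF_X86_YMMH_QWORDS +
                                            q - PERF_X86_XMM_QWORDS];
        /* q = 4..7: ZMM_Hi256 part */
        return perf_regs->zmmh_regs[idx * PERF_X86_ZMMH_QWORDS +
                                    q - PERF_X86_YMM_QWORDS];
}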
arch/x86/events/core.c | 16 ++++++++++++++++
arch/x86/events/perf_event.h | 19 +++++++++++++++++++
arch/x86/include/asm/perf_event.h | 8 ++++++++
arch/x86/include/uapi/asm/perf_regs.h | 11 +++++++++--
arch/x86/kernel/perf_regs.c | 15 ++++++++++++++-
5 files changed, 66 insertions(+), 3 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index b1e62c061d9e..d9c2cab5dcb9 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -426,6 +426,10 @@ static void x86_pmu_get_ext_regs(struct x86_perf_regs *perf_regs, u64 mask)
if (valid_mask & XFEATURE_MASK_YMM)
perf_regs->ymmh = get_xsave_addr(xsave, XFEATURE_YMM);
+ if (valid_mask & XFEATURE_MASK_ZMM_Hi256)
+ perf_regs->zmmh = get_xsave_addr(xsave, XFEATURE_ZMM_Hi256);
+ if (valid_mask & XFEATURE_MASK_Hi16_ZMM)
+ perf_regs->h16zmm = get_xsave_addr(xsave, XFEATURE_Hi16_ZMM);
}
static void release_ext_regs_buffers(void)
@@ -741,6 +745,12 @@ int x86_pmu_hw_config(struct perf_event *event)
if (event_needs_ymm(event) &&
!(x86_pmu.ext_regs_mask & XFEATURE_MASK_YMM))
return -EINVAL;
+ if (event_needs_low16_zmm(event) &&
+ !(x86_pmu.ext_regs_mask & XFEATURE_MASK_ZMM_Hi256))
+ return -EINVAL;
+ if (event_needs_high16_zmm(event) &&
+ !(x86_pmu.ext_regs_mask & XFEATURE_MASK_Hi16_ZMM))
+ return -EINVAL;
}
}
@@ -1821,6 +1831,8 @@ inline void x86_pmu_clear_perf_regs(struct pt_regs *regs)
perf_regs->xmm_regs = NULL;
perf_regs->ymmh_regs = NULL;
+ perf_regs->zmmh_regs = NULL;
+ perf_regs->h16zmm_regs = NULL;
}
static void x86_pmu_setup_basic_regs_data(struct perf_event *event,
@@ -1892,6 +1904,10 @@ static void x86_pmu_sample_ext_regs(struct perf_event *event,
mask |= XFEATURE_MASK_SSE;
if (event_needs_ymm(event))
mask |= XFEATURE_MASK_YMM;
+ if (event_needs_low16_zmm(event))
+ mask |= XFEATURE_MASK_ZMM_Hi256;
+ if (event_needs_high16_zmm(event))
+ mask |= XFEATURE_MASK_Hi16_ZMM;
mask &= ~ignore_mask;
if (mask)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 3d4577a1bb7d..9a871809a4aa 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -154,6 +154,25 @@ static inline bool event_needs_ymm(struct perf_event *event)
return false;
}
+static inline bool event_needs_low16_zmm(struct perf_event *event)
+{
+ if (event->attr.sample_simd_regs_enabled &&
+ event->attr.sample_simd_vec_reg_qwords >= PERF_X86_ZMM_QWORDS)
+ return true;
+
+ return false;
+}
+
+static inline bool event_needs_high16_zmm(struct perf_event *event)
+{
+ if (event->attr.sample_simd_regs_enabled &&
+ (fls64(event->attr.sample_simd_vec_reg_intr) > PERF_X86_H16ZMM_BASE ||
+ fls64(event->attr.sample_simd_vec_reg_user) > PERF_X86_H16ZMM_BASE))
+ return true;
+
+ return false;
+}
+
struct amd_nb {
int nb_id; /* NorthBridge id */
int refcnt; /* reference count */
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 25f5ae60f72f..e4d9a8ba3e95 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -713,6 +713,14 @@ struct x86_perf_regs {
u64 *ymmh_regs;
struct ymmh_struct *ymmh;
};
+ union {
+ u64 *zmmh_regs;
+ struct avx_512_zmm_uppers_state *zmmh;
+ };
+ union {
+ u64 *h16zmm_regs;
+ struct avx_512_hi16_state *h16zmm;
+ };
};
extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index 4fd598785f6d..96db454c7923 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -58,22 +58,29 @@ enum perf_event_x86_regs {
enum {
PERF_REG_X86_XMM,
PERF_REG_X86_YMM,
+ PERF_REG_X86_ZMM,
PERF_REG_X86_MAX_SIMD_REGS,
};
enum {
PERF_X86_SIMD_XMM_REGS = 16,
PERF_X86_SIMD_YMM_REGS = 16,
- PERF_X86_SIMD_VEC_REGS_MAX = PERF_X86_SIMD_YMM_REGS,
+ PERF_X86_SIMD_ZMMH_REGS = 16,
+ PERF_X86_SIMD_ZMM_REGS = 32,
+ PERF_X86_SIMD_VEC_REGS_MAX = PERF_X86_SIMD_ZMM_REGS,
};
#define PERF_X86_SIMD_VEC_MASK GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
+#define PERF_X86_H16ZMM_BASE PERF_X86_SIMD_ZMMH_REGS
+
enum {
PERF_X86_XMM_QWORDS = 2,
PERF_X86_YMMH_QWORDS = 2,
PERF_X86_YMM_QWORDS = 4,
- PERF_X86_SIMD_QWORDS_MAX = PERF_X86_YMM_QWORDS,
+ PERF_X86_ZMMH_QWORDS = 4,
+ PERF_X86_ZMM_QWORDS = 8,
+ PERF_X86_SIMD_QWORDS_MAX = PERF_X86_ZMM_QWORDS,
};
#endif /* _ASM_X86_PERF_REGS_H */
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 8aa61a18fd71..0a3ffaaea3aa 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -90,6 +90,13 @@ u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
qwords_idx >= PERF_X86_SIMD_QWORDS_MAX))
return 0;
+ if (idx >= PERF_X86_H16ZMM_BASE) {
+ if (!perf_regs->h16zmm_regs)
+ return 0;
+ return perf_regs->h16zmm_regs[(idx - PERF_X86_H16ZMM_BASE) *
+ PERF_X86_ZMM_QWORDS + qwords_idx];
+ }
+
if (qwords_idx < PERF_X86_XMM_QWORDS) {
if (!perf_regs->xmm_regs)
return 0;
@@ -100,6 +107,11 @@ u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
return 0;
return perf_regs->ymmh_regs[idx * PERF_X86_YMMH_QWORDS +
qwords_idx - PERF_X86_XMM_QWORDS];
+ } else if (qwords_idx < PERF_X86_ZMM_QWORDS) {
+ if (!perf_regs->zmmh_regs)
+ return 0;
+ return perf_regs->zmmh_regs[idx * PERF_X86_ZMMH_QWORDS +
+ qwords_idx - PERF_X86_YMM_QWORDS];
}
return 0;
@@ -117,7 +129,8 @@ int perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
return -EINVAL;
} else {
if (vec_qwords != PERF_X86_XMM_QWORDS &&
- vec_qwords != PERF_X86_YMM_QWORDS)
+ vec_qwords != PERF_X86_YMM_QWORDS &&
+ vec_qwords != PERF_X86_ZMM_QWORDS)
return -EINVAL;
if (vec_mask & ~PERF_X86_SIMD_VEC_MASK)
return -EINVAL;
--
2.34.1
* [Patch v5 11/19] perf/x86: Enable OPMASK sampling using sample_simd_pred_reg_* fields
2025-12-03 6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
` (9 preceding siblings ...)
2025-12-03 6:54 ` [Patch v5 10/19] perf/x86: Enable ZMM " Dapeng Mi
@ 2025-12-03 6:54 ` Dapeng Mi
2025-12-03 6:54 ` [Patch v5 12/19] perf/x86: Enable eGPRs sampling using sample_regs_* fields Dapeng Mi
` (8 subsequent siblings)
19 siblings, 0 replies; 55+ messages in thread
From: Dapeng Mi @ 2025-12-03 6:54 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
Eranian Stephane
Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang, Dapeng Mi
From: Kan Liang <kan.liang@linux.intel.com>
This patch adds support for sampling OPMASK registers via the
sample_simd_pred_reg_* fields.
Each OPMASK register consists of 1 u64 word. Current x86 hardware
supports 8 OPMASK registers. The perf_simd_reg_value() function is
responsible for outputting OPMASK values to userspace.
Additionally, sample_simd_pred_reg_qwords should be set to 1 to indicate
OPMASK sampling.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
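A usage sketch under this patch's validation rules (illustrative only,
on top of an otherwise configured perf_event_attr): pred_qwords must be
exactly 1 and the mask may cover at most the eight opmask registers:
        attr.sample_type |= PERF_SAMPLE_REGS_INTR;
        attr.sample_simd_pred_reg_qwords = 1;   /* PERF_X86_OPMASK_QWORDS */
        attr.sample_simd_pred_reg_intr = 0xff;  /* k0-k7 */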
arch/x86/events/core.c | 8 ++++++++
arch/x86/events/perf_event.h | 10 ++++++++++
arch/x86/include/asm/perf_event.h | 4 ++++
arch/x86/include/uapi/asm/perf_regs.h | 8 ++++++++
arch/x86/kernel/perf_regs.c | 15 ++++++++++++---
5 files changed, 42 insertions(+), 3 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index d9c2cab5dcb9..3a4144ee0b7b 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -430,6 +430,8 @@ static void x86_pmu_get_ext_regs(struct x86_perf_regs *perf_regs, u64 mask)
perf_regs->zmmh = get_xsave_addr(xsave, XFEATURE_ZMM_Hi256);
if (valid_mask & XFEATURE_MASK_Hi16_ZMM)
perf_regs->h16zmm = get_xsave_addr(xsave, XFEATURE_Hi16_ZMM);
+ if (valid_mask & XFEATURE_MASK_OPMASK)
+ perf_regs->opmask = get_xsave_addr(xsave, XFEATURE_OPMASK);
}
static void release_ext_regs_buffers(void)
@@ -751,6 +753,9 @@ int x86_pmu_hw_config(struct perf_event *event)
if (event_needs_high16_zmm(event) &&
!(x86_pmu.ext_regs_mask & XFEATURE_MASK_Hi16_ZMM))
return -EINVAL;
+ if (event_needs_opmask(event) &&
+ !(x86_pmu.ext_regs_mask & XFEATURE_MASK_OPMASK))
+ return -EINVAL;
}
}
@@ -1833,6 +1838,7 @@ inline void x86_pmu_clear_perf_regs(struct pt_regs *regs)
perf_regs->ymmh_regs = NULL;
perf_regs->zmmh_regs = NULL;
perf_regs->h16zmm_regs = NULL;
+ perf_regs->opmask_regs = NULL;
}
static void x86_pmu_setup_basic_regs_data(struct perf_event *event,
@@ -1908,6 +1914,8 @@ static void x86_pmu_sample_ext_regs(struct perf_event *event,
mask |= XFEATURE_MASK_ZMM_Hi256;
if (event_needs_high16_zmm(event))
mask |= XFEATURE_MASK_Hi16_ZMM;
+ if (event_needs_opmask(event))
+ mask |= XFEATURE_MASK_OPMASK;
mask &= ~ignore_mask;
if (mask)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 9a871809a4aa..7e081a392ff8 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -173,6 +173,16 @@ static inline bool event_needs_high16_zmm(struct perf_event *event)
return false;
}
+static inline bool event_needs_opmask(struct perf_event *event)
+{
+ if (event->attr.sample_simd_regs_enabled &&
+ (event->attr.sample_simd_pred_reg_intr ||
+ event->attr.sample_simd_pred_reg_user))
+ return true;
+
+ return false;
+}
+
struct amd_nb {
int nb_id; /* NorthBridge id */
int refcnt; /* reference count */
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index e4d9a8ba3e95..caa6df8ac1cd 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -721,6 +721,10 @@ struct x86_perf_regs {
u64 *h16zmm_regs;
struct avx_512_hi16_state *h16zmm;
};
+ union {
+ u64 *opmask_regs;
+ struct avx_512_opmask_state *opmask;
+ };
};
extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index 96db454c7923..6f29fd9495a2 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -60,6 +60,9 @@ enum {
PERF_REG_X86_YMM,
PERF_REG_X86_ZMM,
PERF_REG_X86_MAX_SIMD_REGS,
+
+ PERF_REG_X86_OPMASK = 0,
+ PERF_REG_X86_MAX_PRED_REGS = 1,
};
enum {
@@ -68,13 +71,18 @@ enum {
PERF_X86_SIMD_ZMMH_REGS = 16,
PERF_X86_SIMD_ZMM_REGS = 32,
PERF_X86_SIMD_VEC_REGS_MAX = PERF_X86_SIMD_ZMM_REGS,
+
+ PERF_X86_SIMD_OPMASK_REGS = 8,
+ PERF_X86_SIMD_PRED_REGS_MAX = PERF_X86_SIMD_OPMASK_REGS,
};
+#define PERF_X86_SIMD_PRED_MASK GENMASK(PERF_X86_SIMD_PRED_REGS_MAX - 1, 0)
#define PERF_X86_SIMD_VEC_MASK GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
#define PERF_X86_H16ZMM_BASE PERF_X86_SIMD_ZMMH_REGS
enum {
+ PERF_X86_OPMASK_QWORDS = 1,
PERF_X86_XMM_QWORDS = 2,
PERF_X86_YMMH_QWORDS = 2,
PERF_X86_YMM_QWORDS = 4,
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 0a3ffaaea3aa..1ca24e2a6aa0 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -83,8 +83,14 @@ u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
struct x86_perf_regs *perf_regs =
container_of(regs, struct x86_perf_regs, regs);
- if (pred)
- return 0;
+ if (pred) {
+ if (WARN_ON_ONCE(idx >= PERF_X86_SIMD_PRED_REGS_MAX ||
+ qwords_idx >= PERF_X86_OPMASK_QWORDS))
+ return 0;
+ if (!perf_regs->opmask_regs)
+ return 0;
+ return perf_regs->opmask_regs[idx];
+ }
if (WARN_ON_ONCE(idx >= PERF_X86_SIMD_VEC_REGS_MAX ||
qwords_idx >= PERF_X86_SIMD_QWORDS_MAX))
@@ -135,7 +141,10 @@ int perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
if (vec_mask & ~PERF_X86_SIMD_VEC_MASK)
return -EINVAL;
}
- if (pred_mask)
+
+ if (pred_qwords != PERF_X86_OPMASK_QWORDS)
+ return -EINVAL;
+ if (pred_mask & ~PERF_X86_SIMD_PRED_MASK)
return -EINVAL;
return 0;
--
2.34.1
* [Patch v5 12/19] perf/x86: Enable eGPRs sampling using sample_regs_* fields
2025-12-03 6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
` (10 preceding siblings ...)
2025-12-03 6:54 ` [Patch v5 11/19] perf/x86: Enable OPMASK sampling using sample_simd_pred_reg_* fields Dapeng Mi
@ 2025-12-03 6:54 ` Dapeng Mi
2025-12-05 12:16 ` Peter Zijlstra
2025-12-03 6:54 ` [Patch v5 13/19] perf/x86: Enable SSP " Dapeng Mi
` (7 subsequent siblings)
19 siblings, 1 reply; 55+ messages in thread
From: Dapeng Mi @ 2025-12-03 6:54 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
Eranian Stephane
Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang, Dapeng Mi
From: Kan Liang <kan.liang@linux.intel.com>
This patch enables sampling of APX eGPRs (R16 ~ R31) via the
sample_regs_* fields.
To sample eGPRs, the sample_simd_regs_enabled field must be set. This
allows the spare space (reclaimed from the original XMM space) in the
sample_regs_* fields to be used for representing eGPRs.
The perf_reg_value() function first checks whether the
PERF_SAMPLE_REGS_ABI_SIMD flag is set, and then determines whether to
output eGPRs or legacy XMM registers to userspace.
The perf_reg_validate() function is enhanced with a new argument,
"simd_enabled", to validate the eGPRs bitmap.
Currently, eGPRs sampling is only supported on the x86_64 architecture, as
APX is only available on x86_64 platforms.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
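For illustration only, selecting two eGPRs through the reclaimed
sample_regs_* bits, on top of an otherwise configured perf_event_attr;
the SIMD scheme must first be enabled via the sample_simd_* union,
otherwise these bits are rejected as reserved:
        attr.sample_simd_pred_reg_qwords = 1;   /* enables the SIMD scheme */
        attr.sample_regs_intr = (1ULL << PERF_REG_X86_R16) |
                                (1ULL << PERF_REG_X86_R17);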
arch/arm/kernel/perf_regs.c | 2 +-
arch/arm64/kernel/perf_regs.c | 2 +-
arch/csky/kernel/perf_regs.c | 2 +-
arch/loongarch/kernel/perf_regs.c | 2 +-
arch/mips/kernel/perf_regs.c | 2 +-
arch/parisc/kernel/perf_regs.c | 2 +-
arch/powerpc/perf/perf_regs.c | 2 +-
arch/riscv/kernel/perf_regs.c | 2 +-
arch/s390/kernel/perf_regs.c | 2 +-
arch/x86/events/core.c | 41 +++++++++++++++-------
arch/x86/events/perf_event.h | 10 ++++++
arch/x86/include/asm/perf_event.h | 4 +++
arch/x86/include/uapi/asm/perf_regs.h | 25 ++++++++++++++
arch/x86/kernel/perf_regs.c | 49 +++++++++++++++------------
include/linux/perf_regs.h | 2 +-
kernel/events/core.c | 8 +++--
16 files changed, 110 insertions(+), 47 deletions(-)
diff --git a/arch/arm/kernel/perf_regs.c b/arch/arm/kernel/perf_regs.c
index d575a4c3ca56..838d701adf4d 100644
--- a/arch/arm/kernel/perf_regs.c
+++ b/arch/arm/kernel/perf_regs.c
@@ -18,7 +18,7 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
#define REG_RESERVED (~((1ULL << PERF_REG_ARM_MAX) - 1))
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
{
if (!mask || mask & REG_RESERVED)
return -EINVAL;
diff --git a/arch/arm64/kernel/perf_regs.c b/arch/arm64/kernel/perf_regs.c
index 70e2f13f587f..71a3e0238de4 100644
--- a/arch/arm64/kernel/perf_regs.c
+++ b/arch/arm64/kernel/perf_regs.c
@@ -77,7 +77,7 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
#define REG_RESERVED (~((1ULL << PERF_REG_ARM64_MAX) - 1))
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
{
u64 reserved_mask = REG_RESERVED;
diff --git a/arch/csky/kernel/perf_regs.c b/arch/csky/kernel/perf_regs.c
index 94601f37b596..c932a96afc56 100644
--- a/arch/csky/kernel/perf_regs.c
+++ b/arch/csky/kernel/perf_regs.c
@@ -18,7 +18,7 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
#define REG_RESERVED (~((1ULL << PERF_REG_CSKY_MAX) - 1))
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
{
if (!mask || mask & REG_RESERVED)
return -EINVAL;
diff --git a/arch/loongarch/kernel/perf_regs.c b/arch/loongarch/kernel/perf_regs.c
index 8dd604f01745..164514f40ae0 100644
--- a/arch/loongarch/kernel/perf_regs.c
+++ b/arch/loongarch/kernel/perf_regs.c
@@ -25,7 +25,7 @@ u64 perf_reg_abi(struct task_struct *tsk)
}
#endif /* CONFIG_32BIT */
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
{
if (!mask)
return -EINVAL;
diff --git a/arch/mips/kernel/perf_regs.c b/arch/mips/kernel/perf_regs.c
index 7736d3c5ebd2..00a5201dbd5d 100644
--- a/arch/mips/kernel/perf_regs.c
+++ b/arch/mips/kernel/perf_regs.c
@@ -28,7 +28,7 @@ u64 perf_reg_abi(struct task_struct *tsk)
}
#endif /* CONFIG_32BIT */
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
{
if (!mask)
return -EINVAL;
diff --git a/arch/parisc/kernel/perf_regs.c b/arch/parisc/kernel/perf_regs.c
index 87e6990569a7..169c25c054b2 100644
--- a/arch/parisc/kernel/perf_regs.c
+++ b/arch/parisc/kernel/perf_regs.c
@@ -34,7 +34,7 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
#define REG_RESERVED (~((1ULL << PERF_REG_PARISC_MAX) - 1))
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
{
if (!mask || mask & REG_RESERVED)
return -EINVAL;
diff --git a/arch/powerpc/perf/perf_regs.c b/arch/powerpc/perf/perf_regs.c
index 350dccb0143c..a01d8a903640 100644
--- a/arch/powerpc/perf/perf_regs.c
+++ b/arch/powerpc/perf/perf_regs.c
@@ -125,7 +125,7 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
return regs_get_register(regs, pt_regs_offset[idx]);
}
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
{
if (!mask || mask & REG_RESERVED)
return -EINVAL;
diff --git a/arch/riscv/kernel/perf_regs.c b/arch/riscv/kernel/perf_regs.c
index 3bba8deababb..1ecc8760b88b 100644
--- a/arch/riscv/kernel/perf_regs.c
+++ b/arch/riscv/kernel/perf_regs.c
@@ -18,7 +18,7 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
#define REG_RESERVED (~((1ULL << PERF_REG_RISCV_MAX) - 1))
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
{
if (!mask || mask & REG_RESERVED)
return -EINVAL;
diff --git a/arch/s390/kernel/perf_regs.c b/arch/s390/kernel/perf_regs.c
index a6b058ee4a36..c5ad9e2f489b 100644
--- a/arch/s390/kernel/perf_regs.c
+++ b/arch/s390/kernel/perf_regs.c
@@ -34,7 +34,7 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
#define REG_RESERVED (~((1UL << PERF_REG_S390_MAX) - 1))
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
{
if (!mask || mask & REG_RESERVED)
return -EINVAL;
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 3a4144ee0b7b..ec0838469cae 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -432,6 +432,8 @@ static void x86_pmu_get_ext_regs(struct x86_perf_regs *perf_regs, u64 mask)
perf_regs->h16zmm = get_xsave_addr(xsave, XFEATURE_Hi16_ZMM);
if (valid_mask & XFEATURE_MASK_OPMASK)
perf_regs->opmask = get_xsave_addr(xsave, XFEATURE_OPMASK);
+ if (valid_mask & XFEATURE_MASK_APX)
+ perf_regs->egpr = get_xsave_addr(xsave, XFEATURE_APX);
}
static void release_ext_regs_buffers(void)
@@ -719,22 +721,21 @@ int x86_pmu_hw_config(struct perf_event *event)
}
if (event->attr.sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)) {
- /*
- * Besides the general purpose registers, XMM registers may
- * be collected as well.
- */
- if (event_has_extended_regs(event)) {
- if (!(event->pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS))
- return -EINVAL;
- if (!event->attr.precise_ip)
- return -EINVAL;
- if (event->attr.sample_simd_regs_enabled)
- return -EINVAL;
- }
-
if (event_has_simd_regs(event)) {
+ u64 reserved = ~GENMASK_ULL(PERF_REG_MISC_MAX - 1, 0);
+
if (!(event->pmu->capabilities & PERF_PMU_CAP_SIMD_REGS))
return -EINVAL;
+ /*
+ * The XMM space in the perf_event_x86_regs is reclaimed
+ * for eGPRs and other general registers.
+ */
+ if (event->attr.sample_regs_user & reserved ||
+ event->attr.sample_regs_intr & reserved)
+ return -EINVAL;
+ if (event_needs_egprs(event) &&
+ !(x86_pmu.ext_regs_mask & XFEATURE_MASK_APX))
+ return -EINVAL;
/* Width set but no vector registers requested */
if (event->attr.sample_simd_vec_reg_qwords &&
!event->attr.sample_simd_vec_reg_intr &&
@@ -756,6 +757,17 @@ int x86_pmu_hw_config(struct perf_event *event)
if (event_needs_opmask(event) &&
!(x86_pmu.ext_regs_mask & XFEATURE_MASK_OPMASK))
return -EINVAL;
+ } else {
+ /*
+ * Besides the general purpose registers, XMM registers may
+ * be collected as well.
+ */
+ if (event_has_extended_regs(event)) {
+ if (!(event->pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS))
+ return -EINVAL;
+ if (!event->attr.precise_ip)
+ return -EINVAL;
+ }
}
}
@@ -1839,6 +1851,7 @@ inline void x86_pmu_clear_perf_regs(struct pt_regs *regs)
perf_regs->zmmh_regs = NULL;
perf_regs->h16zmm_regs = NULL;
perf_regs->opmask_regs = NULL;
+ perf_regs->egpr_regs = NULL;
}
static void x86_pmu_setup_basic_regs_data(struct perf_event *event,
@@ -1916,6 +1929,8 @@ static void x86_pmu_sample_ext_regs(struct perf_event *event,
mask |= XFEATURE_MASK_Hi16_ZMM;
if (event_needs_opmask(event))
mask |= XFEATURE_MASK_OPMASK;
+ if (event_needs_egprs(event))
+ mask |= XFEATURE_MASK_APX;
mask &= ~ignore_mask;
if (mask)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 7e081a392ff8..9fb1cbbc1b76 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -183,6 +183,16 @@ static inline bool event_needs_opmask(struct perf_event *event)
return false;
}
+static inline bool event_needs_egprs(struct perf_event *event)
+{
+ if (event->attr.sample_simd_regs_enabled &&
+ (event->attr.sample_regs_user & PERF_X86_EGPRS_MASK ||
+ event->attr.sample_regs_intr & PERF_X86_EGPRS_MASK))
+ return true;
+
+ return false;
+}
+
struct amd_nb {
int nb_id; /* NorthBridge id */
int refcnt; /* reference count */
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index caa6df8ac1cd..ca242db3720f 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -725,6 +725,10 @@ struct x86_perf_regs {
u64 *opmask_regs;
struct avx_512_opmask_state *opmask;
};
+ union {
+ u64 *egpr_regs;
+ struct apx_state *egpr;
+ };
};
extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index 6f29fd9495a2..f145e3b78426 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -27,9 +27,33 @@ enum perf_event_x86_regs {
PERF_REG_X86_R13,
PERF_REG_X86_R14,
PERF_REG_X86_R15,
+ /*
+ * The EGPRs and XMM have overlaps. Only one can be used
+ * at a time. For the ABI type PERF_SAMPLE_REGS_ABI_SIMD,
+ * utilize EGPRs. For the other ABI type, XMM is used.
+ *
+ * Extended GPRs (EGPRs)
+ */
+ PERF_REG_X86_R16,
+ PERF_REG_X86_R17,
+ PERF_REG_X86_R18,
+ PERF_REG_X86_R19,
+ PERF_REG_X86_R20,
+ PERF_REG_X86_R21,
+ PERF_REG_X86_R22,
+ PERF_REG_X86_R23,
+ PERF_REG_X86_R24,
+ PERF_REG_X86_R25,
+ PERF_REG_X86_R26,
+ PERF_REG_X86_R27,
+ PERF_REG_X86_R28,
+ PERF_REG_X86_R29,
+ PERF_REG_X86_R30,
+ PERF_REG_X86_R31,
/* These are the limits for the GPRs. */
PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1,
+ PERF_REG_MISC_MAX = PERF_REG_X86_R31 + 1,
/* These all need two bits set because they are 128bit */
PERF_REG_X86_XMM0 = 32,
@@ -54,6 +78,7 @@ enum perf_event_x86_regs {
};
#define PERF_REG_EXTENDED_MASK (~((1ULL << PERF_REG_X86_XMM0) - 1))
+#define PERF_X86_EGPRS_MASK GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16)
enum {
PERF_REG_X86_XMM,
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 1ca24e2a6aa0..e76de39e1385 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -61,14 +61,22 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
{
struct x86_perf_regs *perf_regs;
- if (idx >= PERF_REG_X86_XMM0 && idx < PERF_REG_X86_XMM_MAX) {
+ if (idx > PERF_REG_X86_R15) {
perf_regs = container_of(regs, struct x86_perf_regs, regs);
- /* SIMD registers are moved to dedicated sample_simd_vec_reg */
- if (perf_regs->abi & PERF_SAMPLE_REGS_ABI_SIMD)
- return 0;
- if (!perf_regs->xmm_regs)
- return 0;
- return perf_regs->xmm_regs[idx - PERF_REG_X86_XMM0];
+
+ if (perf_regs->abi & PERF_SAMPLE_REGS_ABI_SIMD) {
+ if (idx <= PERF_REG_X86_R31) {
+ if (!perf_regs->egpr_regs)
+ return 0;
+ return perf_regs->egpr_regs[idx - PERF_REG_X86_R16];
+ }
+ } else {
+ if (idx >= PERF_REG_X86_XMM0 && idx < PERF_REG_X86_XMM_MAX) {
+ if (!perf_regs->xmm_regs)
+ return 0;
+ return perf_regs->xmm_regs[idx - PERF_REG_X86_XMM0];
+ }
+ }
}
if (WARN_ON_ONCE(idx >= ARRAY_SIZE(pt_regs_offset)))
@@ -150,20 +158,14 @@ int perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
return 0;
}
-#define PERF_REG_X86_RESERVED (((1ULL << PERF_REG_X86_XMM0) - 1) & \
- ~((1ULL << PERF_REG_X86_MAX) - 1))
+#define PERF_REG_X86_RESERVED (GENMASK_ULL(PERF_REG_X86_XMM0 - 1, PERF_REG_X86_AX) & \
+ ~GENMASK_ULL(PERF_REG_X86_R15, PERF_REG_X86_AX))
+#define PERF_REG_X86_EXT_RESERVED (~GENMASK_ULL(PERF_REG_MISC_MAX - 1, PERF_REG_X86_AX))
#ifdef CONFIG_X86_32
-#define REG_NOSUPPORT ((1ULL << PERF_REG_X86_R8) | \
- (1ULL << PERF_REG_X86_R9) | \
- (1ULL << PERF_REG_X86_R10) | \
- (1ULL << PERF_REG_X86_R11) | \
- (1ULL << PERF_REG_X86_R12) | \
- (1ULL << PERF_REG_X86_R13) | \
- (1ULL << PERF_REG_X86_R14) | \
- (1ULL << PERF_REG_X86_R15))
-
-int perf_reg_validate(u64 mask)
+#define REG_NOSUPPORT GENMASK_ULL(PERF_REG_X86_R15, PERF_REG_X86_R8)
+
+int perf_reg_validate(u64 mask, bool simd_enabled)
{
if (!mask || (mask & (REG_NOSUPPORT | PERF_REG_X86_RESERVED)))
return -EINVAL;
@@ -182,10 +184,15 @@ u64 perf_reg_abi(struct task_struct *task)
(1ULL << PERF_REG_X86_FS) | \
(1ULL << PERF_REG_X86_GS))
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
{
/* The mask could be 0 if only the SIMD registers are of interest */
- if (mask & (REG_NOSUPPORT | PERF_REG_X86_RESERVED))
+ if (!simd_enabled &&
+ (mask & (REG_NOSUPPORT | PERF_REG_X86_RESERVED)))
+ return -EINVAL;
+
+ if (simd_enabled &&
+ (mask & (REG_NOSUPPORT | PERF_REG_X86_EXT_RESERVED)))
return -EINVAL;
return 0;
diff --git a/include/linux/perf_regs.h b/include/linux/perf_regs.h
index 518f28c6a7d4..09dbc2fc3859 100644
--- a/include/linux/perf_regs.h
+++ b/include/linux/perf_regs.h
@@ -10,7 +10,7 @@ struct perf_regs {
};
u64 perf_reg_value(struct pt_regs *regs, int idx);
-int perf_reg_validate(u64 mask);
+int perf_reg_validate(u64 mask, bool simd_enabled);
u64 perf_reg_abi(struct task_struct *task);
void perf_get_regs_user(struct perf_regs *regs_user,
struct pt_regs *regs);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index b19de038979e..428ff39e03c5 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7436,7 +7436,7 @@ u64 __weak perf_reg_value(struct pt_regs *regs, int idx)
return 0;
}
-int __weak perf_reg_validate(u64 mask)
+int __weak perf_reg_validate(u64 mask, bool simd_enabled)
{
return mask ? -ENOSYS : 0;
}
@@ -13310,7 +13310,8 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
}
if (attr->sample_type & PERF_SAMPLE_REGS_USER) {
- ret = perf_reg_validate(attr->sample_regs_user);
+ ret = perf_reg_validate(attr->sample_regs_user,
+ attr->sample_simd_regs_enabled);
if (ret)
return ret;
ret = perf_simd_reg_validate(attr->sample_simd_vec_reg_qwords,
@@ -13340,7 +13341,8 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
attr->sample_max_stack = sysctl_perf_event_max_stack;
if (attr->sample_type & PERF_SAMPLE_REGS_INTR) {
- ret = perf_reg_validate(attr->sample_regs_intr);
+ ret = perf_reg_validate(attr->sample_regs_intr,
+ attr->sample_simd_regs_enabled);
if (ret)
return ret;
ret = perf_simd_reg_validate(attr->sample_simd_vec_reg_qwords,
--
2.34.1
* [Patch v5 13/19] perf/x86: Enable SSP sampling using sample_regs_* fields
2025-12-03 6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
` (11 preceding siblings ...)
2025-12-03 6:54 ` [Patch v5 12/19] perf/x86: Enable eGPRs sampling using sample_regs_* fields Dapeng Mi
@ 2025-12-03 6:54 ` Dapeng Mi
2025-12-05 12:20 ` Peter Zijlstra
2025-12-03 6:54 ` [Patch v5 14/19] perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS capability Dapeng Mi
` (6 subsequent siblings)
19 siblings, 1 reply; 55+ messages in thread
From: Dapeng Mi @ 2025-12-03 6:54 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
Eranian Stephane
Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang, Dapeng Mi
From: Kan Liang <kan.liang@linux.intel.com>
This patch enables sampling of CET SSP register via the sample_regs_*
fields.
To sample SSP, the sample_simd_regs_enabled field must be set. This
allows the spare space (reclaimed from the original XMM space) in the
sample_regs_* fields to be used for representing SSP.
As with eGPRs sampling, the perf_reg_value() function first checks
whether the PERF_SAMPLE_REGS_ABI_SIMD flag is set, and then determines
whether to output SSP or legacy XMM registers to userspace.
Additionally, arch-PEBS supports sampling SSP, which is placed into the
GPRs group. This patch also enables arch-PEBS-based SSP sampling.
Currently, SSP sampling is only supported on the x86_64 architecture, as
CET is only available on x86_64 platforms.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
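A matching usage sketch for SSP (illustrative only, on top of an
otherwise configured perf_event_attr; the bit is rejected unless the
SIMD scheme is enabled):
        attr.sample_simd_pred_reg_qwords = 1;   /* enables the SIMD scheme */
        attr.sample_regs_user = (1ULL << PERF_REG_X86_IP) |
                                (1ULL << PERF_REG_X86_SSP);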
arch/x86/events/core.c | 9 +++++++++
arch/x86/events/intel/ds.c | 3 +++
arch/x86/events/perf_event.h | 10 ++++++++++
arch/x86/include/asm/perf_event.h | 4 ++++
arch/x86/include/uapi/asm/perf_regs.h | 7 ++++---
arch/x86/kernel/perf_regs.c | 5 +++++
6 files changed, 35 insertions(+), 3 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index ec0838469cae..b6030dae561d 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -434,6 +434,8 @@ static void x86_pmu_get_ext_regs(struct x86_perf_regs *perf_regs, u64 mask)
perf_regs->opmask = get_xsave_addr(xsave, XFEATURE_OPMASK);
if (valid_mask & XFEATURE_MASK_APX)
perf_regs->egpr = get_xsave_addr(xsave, XFEATURE_APX);
+ if (valid_mask & XFEATURE_MASK_CET_USER)
+ perf_regs->cet = get_xsave_addr(xsave, XFEATURE_CET_USER);
}
static void release_ext_regs_buffers(void)
@@ -736,6 +738,10 @@ int x86_pmu_hw_config(struct perf_event *event)
if (event_needs_egprs(event) &&
!(x86_pmu.ext_regs_mask & XFEATURE_MASK_APX))
return -EINVAL;
+ if (event_needs_ssp(event) &&
+ !(x86_pmu.ext_regs_mask & XFEATURE_MASK_CET_USER))
+ return -EINVAL;
+
/* Width set but no vector registers requested */
if (event->attr.sample_simd_vec_reg_qwords &&
!event->attr.sample_simd_vec_reg_intr &&
@@ -1852,6 +1858,7 @@ inline void x86_pmu_clear_perf_regs(struct pt_regs *regs)
perf_regs->h16zmm_regs = NULL;
perf_regs->opmask_regs = NULL;
perf_regs->egpr_regs = NULL;
+ perf_regs->cet_regs = NULL;
}
static void x86_pmu_setup_basic_regs_data(struct perf_event *event,
@@ -1931,6 +1938,8 @@ static void x86_pmu_sample_ext_regs(struct perf_event *event,
mask |= XFEATURE_MASK_OPMASK;
if (event_needs_egprs(event))
mask |= XFEATURE_MASK_APX;
+ if (event_needs_ssp(event))
+ mask |= XFEATURE_MASK_CET_USER;
mask &= ~ignore_mask;
if (mask)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 79cba323eeb1..3212259d1a16 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -2409,12 +2409,15 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
}
if (header->gpr) {
+ ignore_mask = XFEATURE_MASK_CET_USER;
+
gprs = next_record;
next_record = gprs + 1;
__setup_pebs_gpr_group(event, data, regs,
(struct pebs_gprs *)gprs,
sample_type);
+ perf_regs->cet_regs = &gprs->r15;
}
if (header->aux) {
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 9fb1cbbc1b76..35a1837d0b77 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -193,6 +193,16 @@ static inline bool event_needs_egprs(struct perf_event *event)
return false;
}
+static inline bool event_needs_ssp(struct perf_event *event)
+{
+ if (event->attr.sample_simd_regs_enabled &&
+ (event->attr.sample_regs_user & BIT_ULL(PERF_REG_X86_SSP) ||
+ event->attr.sample_regs_intr & BIT_ULL(PERF_REG_X86_SSP)))
+ return true;
+
+ return false;
+}
+
struct amd_nb {
int nb_id; /* NorthBridge id */
int refcnt; /* reference count */
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index ca242db3720f..c925af4160ad 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -729,6 +729,10 @@ struct x86_perf_regs {
u64 *egpr_regs;
struct apx_state *egpr;
};
+ union {
+ u64 *cet_regs;
+ struct cet_user_state *cet;
+ };
};
extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index f145e3b78426..f3561ed10041 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -28,9 +28,9 @@ enum perf_event_x86_regs {
PERF_REG_X86_R14,
PERF_REG_X86_R15,
/*
- * The EGPRs and XMM have overlaps. Only one can be used
+ * The EGPRs/SSP and XMM have overlaps. Only one can be used
* at a time. For the ABI type PERF_SAMPLE_REGS_ABI_SIMD,
- * utilize EGPRs. For the other ABI type, XMM is used.
+ * utilize EGPRs/SSP. For the other ABI type, XMM is used.
*
* Extended GPRs (EGPRs)
*/
@@ -50,10 +50,11 @@ enum perf_event_x86_regs {
PERF_REG_X86_R29,
PERF_REG_X86_R30,
PERF_REG_X86_R31,
+ PERF_REG_X86_SSP,
/* These are the limits for the GPRs. */
PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1,
- PERF_REG_MISC_MAX = PERF_REG_X86_R31 + 1,
+ PERF_REG_MISC_MAX = PERF_REG_X86_SSP + 1,
/* These all need two bits set because they are 128bit */
PERF_REG_X86_XMM0 = 32,
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index e76de39e1385..518bbe577ee8 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -70,6 +70,11 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
return 0;
return perf_regs->egpr_regs[idx - PERF_REG_X86_R16];
}
+ if (idx == PERF_REG_X86_SSP) {
+ if (!perf_regs->cet)
+ return 0;
+ return perf_regs->cet->user_ssp;
+ }
} else {
if (idx >= PERF_REG_X86_XMM0 && idx < PERF_REG_X86_XMM_MAX) {
if (!perf_regs->xmm_regs)
--
2.34.1
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [Patch v5 14/19] perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS capability
2025-12-03 6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
` (12 preceding siblings ...)
2025-12-03 6:54 ` [Patch v5 13/19] perf/x86: Enable SSP " Dapeng Mi
@ 2025-12-03 6:54 ` Dapeng Mi
2025-12-03 6:54 ` [Patch v5 15/19] perf/x86/intel: Enable arch-PEBS based SIMD/eGPRs/SSP sampling Dapeng Mi
` (5 subsequent siblings)
19 siblings, 0 replies; 55+ messages in thread
From: Dapeng Mi @ 2025-12-03 6:54 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
Eranian Stephane
Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang, Dapeng Mi
From: Kan Liang <kan.liang@linux.intel.com>
Enable the PERF_PMU_CAP_SIMD_REGS capability if XSAVES support is
available for YMM, ZMM, OPMASK, eGPRs, or SSP.
Temporarily disable large PEBS sampling for these registers, as the
current arch-PEBS sampling code does not support them yet. Large PEBS
sampling for these registers will be enabled in subsequent patches.
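For context, an event may use the large PEBS buffer only when every bit
it requests in sample_type is covered by these flags. A reduced sketch
of the existing gating (assumed here to live in intel_pmu_hw_config(),
unchanged by this patch):
	/* all requested sample bits must survive the large-PEBS filter */
	if (!(event->attr.sample_type & ~intel_pmu_large_pebs_flags(event)))
		event->hw.flags |= PERF_X86_EVENT_LARGE_PEBS;
Clearing PERF_SAMPLE_REGS_USER/INTR from the returned flags therefore
forces single-record PEBS for events that request the new registers.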
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/intel/core.c | 50 +++++++++++++++++++++++++++++++++---
1 file changed, 46 insertions(+), 4 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index b5c89e8eabb2..d8cc7abfcdc6 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4160,10 +4160,32 @@ static unsigned long intel_pmu_large_pebs_flags(struct perf_event *event)
flags &= ~PERF_SAMPLE_TIME;
if (!event->attr.exclude_kernel)
flags &= ~PERF_SAMPLE_REGS_USER;
- if (event->attr.sample_regs_user & ~PEBS_GP_REGS)
- flags &= ~PERF_SAMPLE_REGS_USER;
- if (event->attr.sample_regs_intr & ~PEBS_GP_REGS)
- flags &= ~PERF_SAMPLE_REGS_INTR;
+ if (event->attr.sample_simd_regs_enabled) {
+ u64 nolarge = PERF_X86_EGPRS_MASK | BIT_ULL(PERF_REG_X86_SSP);
+
+ /*
+ * PEBS HW can only collect the XMM0-XMM15 for now.
+ * Disable large PEBS for other vector registers, predicate
+ * registers, eGPRs, and SSP.
+ */
+ if (event->attr.sample_regs_user & nolarge ||
+ fls64(event->attr.sample_simd_vec_reg_user) > PERF_X86_H16ZMM_BASE ||
+ event->attr.sample_simd_pred_reg_user)
+ flags &= ~PERF_SAMPLE_REGS_USER;
+
+ if (event->attr.sample_regs_intr & nolarge ||
+ fls64(event->attr.sample_simd_vec_reg_intr) > PERF_X86_H16ZMM_BASE ||
+ event->attr.sample_simd_pred_reg_intr)
+ flags &= ~PERF_SAMPLE_REGS_INTR;
+
+ if (event->attr.sample_simd_vec_reg_qwords > PERF_X86_XMM_QWORDS)
+ flags &= ~(PERF_SAMPLE_REGS_USER | PERF_SAMPLE_REGS_INTR);
+ } else {
+ if (event->attr.sample_regs_user & ~PEBS_GP_REGS)
+ flags &= ~PERF_SAMPLE_REGS_USER;
+ if (event->attr.sample_regs_intr & ~PEBS_GP_REGS)
+ flags &= ~PERF_SAMPLE_REGS_INTR;
+ }
return flags;
}
@@ -5643,6 +5665,26 @@ static void intel_extended_regs_init(struct pmu *pmu)
x86_pmu.ext_regs_mask |= XFEATURE_MASK_SSE;
x86_get_pmu(smp_processor_id())->capabilities |= PERF_PMU_CAP_EXTENDED_REGS;
+
+ if (boot_cpu_has(X86_FEATURE_AVX) &&
+ cpu_has_xfeatures(XFEATURE_MASK_YMM, NULL))
+ x86_pmu.ext_regs_mask |= XFEATURE_MASK_YMM;
+ if (boot_cpu_has(X86_FEATURE_APX) &&
+ cpu_has_xfeatures(XFEATURE_MASK_APX, NULL))
+ x86_pmu.ext_regs_mask |= XFEATURE_MASK_APX;
+ if (boot_cpu_has(X86_FEATURE_AVX512F)) {
+ if (cpu_has_xfeatures(XFEATURE_MASK_OPMASK, NULL))
+ x86_pmu.ext_regs_mask |= XFEATURE_MASK_OPMASK;
+ if (cpu_has_xfeatures(XFEATURE_MASK_ZMM_Hi256, NULL))
+ x86_pmu.ext_regs_mask |= XFEATURE_MASK_ZMM_Hi256;
+ if (cpu_has_xfeatures(XFEATURE_MASK_Hi16_ZMM, NULL))
+ x86_pmu.ext_regs_mask |= XFEATURE_MASK_Hi16_ZMM;
+ }
+ if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+ x86_pmu.ext_regs_mask |= XFEATURE_MASK_CET_USER;
+
+ if (x86_pmu.ext_regs_mask != XFEATURE_MASK_SSE)
+ x86_get_pmu(smp_processor_id())->capabilities |= PERF_PMU_CAP_SIMD_REGS;
}
#define counter_mask(_gp, _fixed) ((_gp) | ((u64)(_fixed) << INTEL_PMC_IDX_FIXED))
--
2.34.1
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [Patch v5 15/19] perf/x86/intel: Enable arch-PEBS based SIMD/eGPRs/SSP sampling
2025-12-03 6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
` (13 preceding siblings ...)
2025-12-03 6:54 ` [Patch v5 14/19] perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS capability Dapeng Mi
@ 2025-12-03 6:54 ` Dapeng Mi
2025-12-03 6:54 ` [Patch v5 16/19] perf/x86: Activate back-to-back NMI detection for arch-PEBS induced NMIs Dapeng Mi
` (4 subsequent siblings)
19 siblings, 0 replies; 55+ messages in thread
From: Dapeng Mi @ 2025-12-03 6:54 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
Eranian Stephane
Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Dapeng Mi
This patch enables arch-PEBS based SIMD/eGPRs/SSP register sampling.
Arch-PEBS supports sampling these registers; all except SSP are placed
into the XSAVE-Enabled Registers (XER) group, with the layout described
below.
Field Name   Registers Used                  Size
----------------------------------------------------------------------
XSTATE_BV    XINUSE for groups               8 B
----------------------------------------------------------------------
Reserved     Reserved                        8 B
----------------------------------------------------------------------
SSER         XMM0-XMM15                      16 regs * 16 B = 256 B
----------------------------------------------------------------------
YMMHIR       Upper 128 bits of YMM0-YMM15    16 regs * 16 B = 256 B
----------------------------------------------------------------------
EGPR         R16-R31                         16 regs * 8 B = 128 B
----------------------------------------------------------------------
OPMASKR      K0-K7                           8 regs * 8 B = 64 B
----------------------------------------------------------------------
ZMMHIR       Upper 256 bits of ZMM0-ZMM15    16 regs * 32 B = 512 B
----------------------------------------------------------------------
Hi16ZMMR     ZMM16-ZMM31                     16 regs * 64 B = 1024 B
----------------------------------------------------------------------
Memory space in the output buffer is allocated for a sub-group whenever
the corresponding Format.XER[55:49] bit is set in the PEBS record
header. However, the arch-PEBS hardware engine does not write a
sub-group that is unused (in INIT state); in that case the corresponding
bit in the XSTATE_BV bitmap is cleared. The XSTATE_BV field is therefore
checked to determine whether the register data was actually written for
each PEBS record. If not, the register data is not output to user space.
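In other words, a consumer trusts a sub-group only when its XINUSE bit
is set; condensed from the setup_arch_pebs_sample_data() hunk below
(XMM case shown, the other sub-groups follow the same pattern):
	struct arch_pebs_xer_header *xer_header = next_record;
	struct pebs_xmm *xmm = (void *)(xer_header + 1);
	/* space was reserved because Format.XER requested XMM, but the
	 * data is valid only if the hardware marked SSE as in-use */
	if (xer_header->xstate & XFEATURE_MASK_SSE)
		perf_regs->xmm_regs = xmm->xmm;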
The SSP register is sampled by arch-PEBS and placed into the GPRs group.
Additionally, bits [55:49] of the IA32_PMC_{GPn|FXm}_CFG_C MSRs control
which of these register groups are sampled.
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/intel/core.c | 71 +++++++++++++++++++++--------
arch/x86/events/intel/ds.c | 76 ++++++++++++++++++++++++++++---
arch/x86/include/asm/msr-index.h | 7 +++
arch/x86/include/asm/perf_event.h | 8 +++-
4 files changed, 137 insertions(+), 25 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index d8cc7abfcdc6..da48bcde8fce 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3008,6 +3008,21 @@ static void intel_pmu_enable_event_ext(struct perf_event *event)
if (pebs_data_cfg & PEBS_DATACFG_XMMS)
ext |= ARCH_PEBS_VECR_XMM & cap.caps;
+ if (pebs_data_cfg & PEBS_DATACFG_YMMHS)
+ ext |= ARCH_PEBS_VECR_YMMH & cap.caps;
+
+ if (pebs_data_cfg & PEBS_DATACFG_EGPRS)
+ ext |= ARCH_PEBS_VECR_EGPRS & cap.caps;
+
+ if (pebs_data_cfg & PEBS_DATACFG_OPMASKS)
+ ext |= ARCH_PEBS_VECR_OPMASK & cap.caps;
+
+ if (pebs_data_cfg & PEBS_DATACFG_ZMMHS)
+ ext |= ARCH_PEBS_VECR_ZMMH & cap.caps;
+
+ if (pebs_data_cfg & PEBS_DATACFG_H16ZMMS)
+ ext |= ARCH_PEBS_VECR_H16ZMM & cap.caps;
+
if (pebs_data_cfg & PEBS_DATACFG_LBRS)
ext |= ARCH_PEBS_LBR & cap.caps;
@@ -4152,6 +4167,30 @@ static void intel_pebs_aliases_skl(struct perf_event *event)
return intel_pebs_aliases_precdist(event);
}
+static inline bool intel_pebs_support_regs(struct perf_event *event, u64 regs)
+{
+ struct arch_pebs_cap cap = hybrid(event->pmu, arch_pebs_cap);
+ bool supported = true;
+
+ /* SSP */
+ if (regs & PEBS_DATACFG_GP)
+ supported &= x86_pmu.arch_pebs && (ARCH_PEBS_GPR & cap.caps);
+ if (regs & PEBS_DATACFG_XMMS)
+ supported &= x86_pmu.intel_cap.pebs_format > 3;
+ if (regs & PEBS_DATACFG_YMMHS)
+ supported &= x86_pmu.arch_pebs && (ARCH_PEBS_VECR_YMMH & cap.caps);
+ if (regs & PEBS_DATACFG_EGPRS)
+ supported &= x86_pmu.arch_pebs && (ARCH_PEBS_VECR_EGPRS & cap.caps);
+ if (regs & PEBS_DATACFG_OPMASKS)
+ supported &= x86_pmu.arch_pebs && (ARCH_PEBS_VECR_OPMASK & cap.caps);
+ if (regs & PEBS_DATACFG_ZMMHS)
+ supported &= x86_pmu.arch_pebs && (ARCH_PEBS_VECR_ZMMH & cap.caps);
+ if (regs & PEBS_DATACFG_H16ZMMS)
+ supported &= x86_pmu.arch_pebs && (ARCH_PEBS_VECR_H16ZMM & cap.caps);
+
+ return supported;
+}
+
static unsigned long intel_pmu_large_pebs_flags(struct perf_event *event)
{
unsigned long flags = x86_pmu.large_pebs_flags;
@@ -4161,24 +4200,20 @@ static unsigned long intel_pmu_large_pebs_flags(struct perf_event *event)
if (!event->attr.exclude_kernel)
flags &= ~PERF_SAMPLE_REGS_USER;
if (event->attr.sample_simd_regs_enabled) {
- u64 nolarge = PERF_X86_EGPRS_MASK | BIT_ULL(PERF_REG_X86_SSP);
-
- /*
- * PEBS HW can only collect the XMM0-XMM15 for now.
- * Disable large PEBS for other vector registers, predicate
- * registers, eGPRs, and SSP.
- */
- if (event->attr.sample_regs_user & nolarge ||
- fls64(event->attr.sample_simd_vec_reg_user) > PERF_X86_H16ZMM_BASE ||
- event->attr.sample_simd_pred_reg_user)
- flags &= ~PERF_SAMPLE_REGS_USER;
-
- if (event->attr.sample_regs_intr & nolarge ||
- fls64(event->attr.sample_simd_vec_reg_intr) > PERF_X86_H16ZMM_BASE ||
- event->attr.sample_simd_pred_reg_intr)
- flags &= ~PERF_SAMPLE_REGS_INTR;
-
- if (event->attr.sample_simd_vec_reg_qwords > PERF_X86_XMM_QWORDS)
+ if ((event_needs_ssp(event) &&
+ !intel_pebs_support_regs(event, PEBS_DATACFG_GP)) ||
+ (event_needs_xmm(event) &&
+ !intel_pebs_support_regs(event, PEBS_DATACFG_XMMS)) ||
+ (event_needs_ymm(event) &&
+ !intel_pebs_support_regs(event, PEBS_DATACFG_YMMHS)) ||
+ (event_needs_egprs(event) &&
+ !intel_pebs_support_regs(event, PEBS_DATACFG_EGPRS)) ||
+ (event_needs_opmask(event) &&
+ !intel_pebs_support_regs(event, PEBS_DATACFG_OPMASKS)) ||
+ (event_needs_low16_zmm(event) &&
+ !intel_pebs_support_regs(event, PEBS_DATACFG_ZMMHS)) ||
+ (event_needs_high16_zmm(event) &&
+ !intel_pebs_support_regs(event, PEBS_DATACFG_H16ZMMS)))
flags &= ~(PERF_SAMPLE_REGS_USER | PERF_SAMPLE_REGS_INTR);
} else {
if (event->attr.sample_regs_user & ~PEBS_GP_REGS)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 3212259d1a16..a01c72c03bd6 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1470,11 +1470,21 @@ static u64 pebs_update_adaptive_cfg(struct perf_event *event)
((attr->config & INTEL_ARCH_EVENT_MASK) ==
x86_pmu.rtm_abort_event);
- if (gprs || (attr->precise_ip < 2) || tsx_weight)
+ if (gprs || (attr->precise_ip < 2) || tsx_weight || event_needs_ssp(event))
pebs_data_cfg |= PEBS_DATACFG_GP;
if (event_needs_xmm(event))
pebs_data_cfg |= PEBS_DATACFG_XMMS;
+ if (event_needs_ymm(event))
+ pebs_data_cfg |= PEBS_DATACFG_YMMHS;
+ if (event_needs_low16_zmm(event))
+ pebs_data_cfg |= PEBS_DATACFG_ZMMHS;
+ if (event_needs_high16_zmm(event))
+ pebs_data_cfg |= PEBS_DATACFG_H16ZMMS;
+ if (event_needs_opmask(event))
+ pebs_data_cfg |= PEBS_DATACFG_OPMASKS;
+ if (event_needs_egprs(event))
+ pebs_data_cfg |= PEBS_DATACFG_EGPRS;
if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
/*
@@ -2430,15 +2440,69 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
meminfo->tsx_tuning, ax);
}
- if (header->xmm) {
+ if (header->xmm || header->ymmh || header->egpr ||
+ header->opmask || header->zmmh || header->h16zmm) {
+ struct arch_pebs_xer_header *xer_header = next_record;
struct pebs_xmm *xmm;
+ struct ymmh_struct *ymmh;
+ struct avx_512_zmm_uppers_state *zmmh;
+ struct avx_512_hi16_state *h16zmm;
+ struct avx_512_opmask_state *opmask;
+ struct apx_state *egpr;
next_record += sizeof(struct arch_pebs_xer_header);
- ignore_mask |= XFEATURE_MASK_SSE;
- xmm = next_record;
- perf_regs->xmm_regs = xmm->xmm;
- next_record = xmm + 1;
+ if (header->xmm) {
+ ignore_mask |= XFEATURE_MASK_SSE;
+ xmm = next_record;
+ /*
+ * Only output XMM regs to user space when arch-PEBS
+ * actually writes data into the xstate area.
+ */
+ if (xer_header->xstate & XFEATURE_MASK_SSE)
+ perf_regs->xmm_regs = xmm->xmm;
+ next_record = xmm + 1;
+ }
+
+ if (header->ymmh) {
+ ignore_mask |= XFEATURE_MASK_YMM;
+ ymmh = next_record;
+ if (xer_header->xstate & XFEATURE_MASK_YMM)
+ perf_regs->ymmh = ymmh;
+ next_record = ymmh + 1;
+ }
+
+ if (header->egpr) {
+ ignore_mask |= XFEATURE_MASK_APX;
+ egpr = next_record;
+ if (xer_header->xstate & XFEATURE_MASK_APX)
+ perf_regs->egpr = egpr;
+ next_record = egpr + 1;
+ }
+
+ if (header->opmask) {
+ ignore_mask |= XFEATURE_MASK_OPMASK;
+ opmask = next_record;
+ if (xer_header->xstate & XFEATURE_MASK_OPMASK)
+ perf_regs->opmask = opmask;
+ next_record = opmask + 1;
+ }
+
+ if (header->zmmh) {
+ ignore_mask |= XFEATURE_MASK_ZMM_Hi256;
+ zmmh = next_record;
+ if (xer_header->xstate & XFEATURE_MASK_ZMM_Hi256)
+ perf_regs->zmmh = zmmh;
+ next_record = zmmh + 1;
+ }
+
+ if (header->h16zmm) {
+ ignore_mask |= XFEATURE_MASK_Hi16_ZMM;
+ h16zmm = next_record;
+ if (xer_header->xstate & XFEATURE_MASK_Hi16_ZMM)
+ perf_regs->h16zmm = h16zmm;
+ next_record = h16zmm + 1;
+ }
}
if (header->lbr) {
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 65cc528fbad8..3f1cc294b1e9 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -341,6 +341,13 @@
#define ARCH_PEBS_LBR_SHIFT 40
#define ARCH_PEBS_LBR (0x3ull << ARCH_PEBS_LBR_SHIFT)
#define ARCH_PEBS_VECR_XMM BIT_ULL(49)
+#define ARCH_PEBS_VECR_YMMH BIT_ULL(50)
+#define ARCH_PEBS_VECR_EGPRS BIT_ULL(51)
+#define ARCH_PEBS_VECR_OPMASK BIT_ULL(53)
+#define ARCH_PEBS_VECR_ZMMH BIT_ULL(54)
+#define ARCH_PEBS_VECR_H16ZMM BIT_ULL(55)
+#define ARCH_PEBS_VECR_EXT_SHIFT 50
+#define ARCH_PEBS_VECR_EXT (0x3full << ARCH_PEBS_VECR_EXT_SHIFT)
#define ARCH_PEBS_GPR BIT_ULL(61)
#define ARCH_PEBS_AUX BIT_ULL(62)
#define ARCH_PEBS_EN BIT_ULL(63)
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index c925af4160ad..41668a4633df 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -146,6 +146,11 @@
#define PEBS_DATACFG_LBRS BIT_ULL(3)
#define PEBS_DATACFG_CNTR BIT_ULL(4)
#define PEBS_DATACFG_METRICS BIT_ULL(5)
+#define PEBS_DATACFG_YMMHS BIT_ULL(6)
+#define PEBS_DATACFG_OPMASKS BIT_ULL(7)
+#define PEBS_DATACFG_ZMMHS BIT_ULL(8)
+#define PEBS_DATACFG_H16ZMMS BIT_ULL(9)
+#define PEBS_DATACFG_EGPRS BIT_ULL(10)
#define PEBS_DATACFG_LBR_SHIFT 24
#define PEBS_DATACFG_CNTR_SHIFT 32
#define PEBS_DATACFG_CNTR_MASK GENMASK_ULL(15, 0)
@@ -540,7 +545,8 @@ struct arch_pebs_header {
rsvd3:7,
xmm:1,
ymmh:1,
- rsvd4:2,
+ egpr:1,
+ rsvd4:1,
opmask:1,
zmmh:1,
h16zmm:1,
--
2.34.1
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [Patch v5 16/19] perf/x86: Activate back-to-back NMI detection for arch-PEBS induced NMIs
2025-12-03 6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
` (14 preceding siblings ...)
2025-12-03 6:54 ` [Patch v5 15/19] perf/x86/intel: Enable arch-PEBS based SIMD/eGPRs/SSP sampling Dapeng Mi
@ 2025-12-03 6:54 ` Dapeng Mi
2025-12-05 12:39 ` Peter Zijlstra
2025-12-03 6:54 ` [Patch v5 17/19] perf headers: Sync with the kernel headers Dapeng Mi
` (3 subsequent siblings)
19 siblings, 1 reply; 55+ messages in thread
From: Dapeng Mi @ 2025-12-03 6:54 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
Eranian Stephane
Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Dapeng Mi
When two or more identical PEBS events with the same sampling period are
programmed on a mix of PDIST and non-PDIST counters, multiple
back-to-back NMIs can be triggered.
The Linux PMI handler processes the first NMI and clears the
GLOBAL_STATUS MSR. If a second NMI is triggered immediately after the
first, it is flagged as a "suspicious NMI" because no bits are set in
the GLOBAL_STATUS MSR (they were already cleared by the first NMI).
This issue does not lead to PEBS data corruption or data loss, but it
does result in an annoying warning message.
The current NMI handler supports back-to-back NMI detection, but it
requires the PMI handler to return the count of actually processed events,
which the PEBS handler does not currently do.
This patch modifies the PEBS handler to return the count of actually
processed events, thereby activating back-to-back NMI detection and
avoiding the "suspicious NMI" warning.
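For illustration, the arch-PEBS drain derives the count from the union
of applicable counters across all records rather than from the record
count; a condensed view of the intel_pmu_drain_arch_pebs() hunk below
(next_record() stands in for the record-size advance):
	u64 events_bitmap = 0;
	for (at = base; at < top; at = next_record(at)) {
		pebs_status = mask & basic->applicable_counters;
		events_bitmap |= pebs_status;	/* one bit per event */
	}
	/* two identical events -> hweight64() == 2, so the NMI core can
	 * treat an immediately following empty NMI as expected */
	return hweight64(events_bitmap);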
Suggested-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/intel/core.c | 3 +--
arch/x86/events/intel/ds.c | 36 +++++++++++++++++++++++-------------
arch/x86/events/perf_event.h | 2 +-
3 files changed, 25 insertions(+), 16 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index da48bcde8fce..a130d3f14844 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3351,8 +3351,7 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
*/
if (__test_and_clear_bit(GLOBAL_STATUS_ARCH_PEBS_THRESHOLD_BIT,
(unsigned long *)&status)) {
- handled++;
- static_call(x86_pmu_drain_pebs)(regs, &data);
+ handled += static_call(x86_pmu_drain_pebs)(regs, &data);
if (cpuc->events[INTEL_PMC_IDX_FIXED_SLOTS] &&
is_pebs_counter_event_group(cpuc->events[INTEL_PMC_IDX_FIXED_SLOTS]))
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index a01c72c03bd6..c7cdcd585574 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -2759,7 +2759,7 @@ __intel_pmu_pebs_events(struct perf_event *event,
__intel_pmu_pebs_last_event(event, iregs, regs, data, at, count, setup_sample);
}
-static void intel_pmu_drain_pebs_core(struct pt_regs *iregs, struct perf_sample_data *data)
+static int intel_pmu_drain_pebs_core(struct pt_regs *iregs, struct perf_sample_data *data)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct debug_store *ds = cpuc->ds;
@@ -2768,7 +2768,7 @@ static void intel_pmu_drain_pebs_core(struct pt_regs *iregs, struct perf_sample_
int n;
if (!x86_pmu.pebs_active)
- return;
+ return 0;
at = (struct pebs_record_core *)(unsigned long)ds->pebs_buffer_base;
top = (struct pebs_record_core *)(unsigned long)ds->pebs_index;
@@ -2779,22 +2779,24 @@ static void intel_pmu_drain_pebs_core(struct pt_regs *iregs, struct perf_sample_
ds->pebs_index = ds->pebs_buffer_base;
if (!test_bit(0, cpuc->active_mask))
- return;
+ return 0;
WARN_ON_ONCE(!event);
if (!event->attr.precise_ip)
- return;
+ return 0;
n = top - at;
if (n <= 0) {
if (event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD)
intel_pmu_save_and_restart_reload(event, 0);
- return;
+ return 0;
}
__intel_pmu_pebs_events(event, iregs, data, at, top, 0, n,
setup_pebs_fixed_sample_data);
+
+ return 0;
}
static void intel_pmu_pebs_event_update_no_drain(struct cpu_hw_events *cpuc, u64 mask)
@@ -2817,7 +2819,7 @@ static void intel_pmu_pebs_event_update_no_drain(struct cpu_hw_events *cpuc, u64
}
}
-static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_data *data)
+static int intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_data *data)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct debug_store *ds = cpuc->ds;
@@ -2830,7 +2832,7 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_d
u64 mask;
if (!x86_pmu.pebs_active)
- return;
+ return 0;
base = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
top = (struct pebs_record_nhm *)(unsigned long)ds->pebs_index;
@@ -2846,7 +2848,7 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_d
if (unlikely(base >= top)) {
intel_pmu_pebs_event_update_no_drain(cpuc, mask);
- return;
+ return 0;
}
for (at = base; at < top; at += x86_pmu.pebs_record_size) {
@@ -2931,6 +2933,8 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_d
setup_pebs_fixed_sample_data);
}
}
+
+ return 0;
}
static __always_inline void
@@ -2984,7 +2988,7 @@ __intel_pmu_handle_last_pebs_record(struct pt_regs *iregs,
}
-static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_data *data)
+static int intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_data *data)
{
short counts[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS] = {};
void *last[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS];
@@ -2997,7 +3001,7 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
u64 mask;
if (!x86_pmu.pebs_active)
- return;
+ return 0;
base = (struct pebs_basic *)(unsigned long)ds->pebs_buffer_base;
top = (struct pebs_basic *)(unsigned long)ds->pebs_index;
@@ -3010,7 +3014,7 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
if (unlikely(base >= top)) {
intel_pmu_pebs_event_update_no_drain(cpuc, mask);
- return;
+ return 0;
}
if (!iregs)
@@ -3032,9 +3036,11 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
__intel_pmu_handle_last_pebs_record(iregs, regs, data, mask, counts, last,
setup_pebs_adaptive_sample_data);
+
+ return 0;
}
-static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
+static int intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
struct perf_sample_data *data)
{
short counts[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS] = {};
@@ -3044,13 +3050,14 @@ static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
struct x86_perf_regs perf_regs;
struct pt_regs *regs = &perf_regs.regs;
void *base, *at, *top;
+ u64 events_bitmap = 0;
u64 mask;
rdmsrq(MSR_IA32_PEBS_INDEX, index.whole);
if (unlikely(!index.wr)) {
intel_pmu_pebs_event_update_no_drain(cpuc, X86_PMC_IDX_MAX);
- return;
+ return 0;
}
base = cpuc->pebs_vaddr;
@@ -3089,6 +3096,7 @@ static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
basic = at + sizeof(struct arch_pebs_header);
pebs_status = mask & basic->applicable_counters;
+ events_bitmap |= pebs_status;
__intel_pmu_handle_pebs_record(iregs, regs, data, at,
pebs_status, counts, last,
setup_arch_pebs_sample_data);
@@ -3108,6 +3116,8 @@ static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
__intel_pmu_handle_last_pebs_record(iregs, regs, data, mask,
counts, last,
setup_arch_pebs_sample_data);
+
+ return hweight64(events_bitmap);
}
static void __init intel_arch_pebs_init(void)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 35a1837d0b77..98958f6d29b6 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1003,7 +1003,7 @@ struct x86_pmu {
int pebs_record_size;
int pebs_buffer_size;
u64 pebs_events_mask;
- void (*drain_pebs)(struct pt_regs *regs, struct perf_sample_data *data);
+ int (*drain_pebs)(struct pt_regs *regs, struct perf_sample_data *data);
struct event_constraint *pebs_constraints;
void (*pebs_aliases)(struct perf_event *event);
u64 (*pebs_latency_data)(struct perf_event *event, u64 status);
--
2.34.1
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [Patch v5 17/19] perf headers: Sync with the kernel headers
2025-12-03 6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
` (15 preceding siblings ...)
2025-12-03 6:54 ` [Patch v5 16/19] perf/x86: Activate back-to-back NMI detection for arch-PEBS induced NMIs Dapeng Mi
@ 2025-12-03 6:54 ` Dapeng Mi
2025-12-03 23:43 ` Ian Rogers
2025-12-03 6:54 ` [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format Dapeng Mi
` (2 subsequent siblings)
19 siblings, 1 reply; 55+ messages in thread
From: Dapeng Mi @ 2025-12-03 6:54 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
Eranian Stephane
Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang, Dapeng Mi
From: Kan Liang <kan.liang@linux.intel.com>
Update include/uapi/linux/perf_event.h and
arch/x86/include/uapi/asm/perf_regs.h to support extended regs.
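As a reference for consumers, here is a user-space sketch of walking the
extended REGS payload documented in the comments below. The struct name
is hypothetical and simply mirrors the documented layout; 'p' is assumed
to point just past the abi word and the regs[weight(mask)] array, with
(abi & PERF_SAMPLE_REGS_ABI_SIMD) set:
	/* hypothetical helper mirroring the documented layout */
	struct simd_regs_hdr {
		__u16 nr_vectors;
		__u16 vector_qwords;
		__u16 nr_pred;
		__u16 pred_qwords;
	};
	static __u64 simd_vec_qword(const void *p, int i, int q)
	{
		const struct simd_regs_hdr *hdr = p;
		const __u64 *data = (const __u64 *)(hdr + 1);
		/* qword q of vector register i; predicate registers
		 * follow at data[nr_vectors * vector_qwords] */
		return data[i * hdr->vector_qwords + q];
	}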
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
tools/arch/x86/include/uapi/asm/perf_regs.h | 62 +++++++++++++++++++++
tools/include/uapi/linux/perf_event.h | 45 +++++++++++++--
2 files changed, 103 insertions(+), 4 deletions(-)
diff --git a/tools/arch/x86/include/uapi/asm/perf_regs.h b/tools/arch/x86/include/uapi/asm/perf_regs.h
index 7c9d2bb3833b..f3561ed10041 100644
--- a/tools/arch/x86/include/uapi/asm/perf_regs.h
+++ b/tools/arch/x86/include/uapi/asm/perf_regs.h
@@ -27,9 +27,34 @@ enum perf_event_x86_regs {
PERF_REG_X86_R13,
PERF_REG_X86_R14,
PERF_REG_X86_R15,
+ /*
+ * The EGPRs/SSP and XMM have overlaps. Only one can be used
+ * at a time. For the ABI type PERF_SAMPLE_REGS_ABI_SIMD,
+ * utilize EGPRs/SSP. For the other ABI type, XMM is used.
+ *
+ * Extended GPRs (EGPRs)
+ */
+ PERF_REG_X86_R16,
+ PERF_REG_X86_R17,
+ PERF_REG_X86_R18,
+ PERF_REG_X86_R19,
+ PERF_REG_X86_R20,
+ PERF_REG_X86_R21,
+ PERF_REG_X86_R22,
+ PERF_REG_X86_R23,
+ PERF_REG_X86_R24,
+ PERF_REG_X86_R25,
+ PERF_REG_X86_R26,
+ PERF_REG_X86_R27,
+ PERF_REG_X86_R28,
+ PERF_REG_X86_R29,
+ PERF_REG_X86_R30,
+ PERF_REG_X86_R31,
+ PERF_REG_X86_SSP,
/* These are the limits for the GPRs. */
PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1,
+ PERF_REG_MISC_MAX = PERF_REG_X86_SSP + 1,
/* These all need two bits set because they are 128bit */
PERF_REG_X86_XMM0 = 32,
@@ -54,5 +79,42 @@ enum perf_event_x86_regs {
};
#define PERF_REG_EXTENDED_MASK (~((1ULL << PERF_REG_X86_XMM0) - 1))
+#define PERF_X86_EGPRS_MASK GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16)
+
+enum {
+ PERF_REG_X86_XMM,
+ PERF_REG_X86_YMM,
+ PERF_REG_X86_ZMM,
+ PERF_REG_X86_MAX_SIMD_REGS,
+
+ PERF_REG_X86_OPMASK = 0,
+ PERF_REG_X86_MAX_PRED_REGS = 1,
+};
+
+enum {
+ PERF_X86_SIMD_XMM_REGS = 16,
+ PERF_X86_SIMD_YMM_REGS = 16,
+ PERF_X86_SIMD_ZMMH_REGS = 16,
+ PERF_X86_SIMD_ZMM_REGS = 32,
+ PERF_X86_SIMD_VEC_REGS_MAX = PERF_X86_SIMD_ZMM_REGS,
+
+ PERF_X86_SIMD_OPMASK_REGS = 8,
+ PERF_X86_SIMD_PRED_REGS_MAX = PERF_X86_SIMD_OPMASK_REGS,
+};
+
+#define PERF_X86_SIMD_PRED_MASK GENMASK(PERF_X86_SIMD_PRED_REGS_MAX - 1, 0)
+#define PERF_X86_SIMD_VEC_MASK GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
+
+#define PERF_X86_H16ZMM_BASE PERF_X86_SIMD_ZMMH_REGS
+
+enum {
+ PERF_X86_OPMASK_QWORDS = 1,
+ PERF_X86_XMM_QWORDS = 2,
+ PERF_X86_YMMH_QWORDS = 2,
+ PERF_X86_YMM_QWORDS = 4,
+ PERF_X86_ZMMH_QWORDS = 4,
+ PERF_X86_ZMM_QWORDS = 8,
+ PERF_X86_SIMD_QWORDS_MAX = PERF_X86_ZMM_QWORDS,
+};
#endif /* _ASM_X86_PERF_REGS_H */
diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index d292f96bc06f..f1474da32622 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -314,8 +314,9 @@ enum {
*/
enum perf_sample_regs_abi {
PERF_SAMPLE_REGS_ABI_NONE = 0,
- PERF_SAMPLE_REGS_ABI_32 = 1,
- PERF_SAMPLE_REGS_ABI_64 = 2,
+ PERF_SAMPLE_REGS_ABI_32 = (1 << 0),
+ PERF_SAMPLE_REGS_ABI_64 = (1 << 1),
+ PERF_SAMPLE_REGS_ABI_SIMD = (1 << 2),
};
/*
@@ -382,6 +383,7 @@ enum perf_event_read_format {
#define PERF_ATTR_SIZE_VER6 120 /* Add: aux_sample_size */
#define PERF_ATTR_SIZE_VER7 128 /* Add: sig_data */
#define PERF_ATTR_SIZE_VER8 136 /* Add: config3 */
+#define PERF_ATTR_SIZE_VER9 168 /* Add: sample_simd_{pred,vec}_reg_* */
/*
* 'struct perf_event_attr' contains various attributes that define
@@ -545,6 +547,25 @@ struct perf_event_attr {
__u64 sig_data;
__u64 config3; /* extension of config2 */
+
+
+ /*
+ * Defines set of SIMD registers to dump on samples.
+ * The sample_simd_regs_enabled !=0 implies the
+ * set of SIMD registers is used to config all SIMD registers.
+ * If !sample_simd_regs_enabled, sample_regs_XXX may be used to
+ * config some SIMD registers on X86.
+ */
+ union {
+ __u16 sample_simd_regs_enabled;
+ __u16 sample_simd_pred_reg_qwords;
+ };
+ __u32 sample_simd_pred_reg_intr;
+ __u32 sample_simd_pred_reg_user;
+ __u16 sample_simd_vec_reg_qwords;
+ __u64 sample_simd_vec_reg_intr;
+ __u64 sample_simd_vec_reg_user;
+ __u32 __reserved_4;
};
/*
@@ -1018,7 +1039,15 @@ enum perf_event_type {
* } && PERF_SAMPLE_BRANCH_STACK
*
* { u64 abi; # enum perf_sample_regs_abi
- * u64 regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
+ * u64 regs[weight(mask)];
+ * struct {
+ * u16 nr_vectors;
+ * u16 vector_qwords;
+ * u16 nr_pred;
+ * u16 pred_qwords;
+ * u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
+ * } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
+ * } && PERF_SAMPLE_REGS_USER
*
* { u64 size;
* char data[size];
@@ -1045,7 +1074,15 @@ enum perf_event_type {
* { u64 data_src; } && PERF_SAMPLE_DATA_SRC
* { u64 transaction; } && PERF_SAMPLE_TRANSACTION
* { u64 abi; # enum perf_sample_regs_abi
- * u64 regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
+ * u64 regs[weight(mask)];
+ * struct {
+ * u16 nr_vectors;
+ * u16 vector_qwords;
+ * u16 nr_pred;
+ * u16 pred_qwords;
+ * u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
+ * } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
+ * } && PERF_SAMPLE_REGS_INTR
* { u64 phys_addr;} && PERF_SAMPLE_PHYS_ADDR
* { u64 cgroup;} && PERF_SAMPLE_CGROUP
* { u64 data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
--
2.34.1
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
2025-12-03 6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
` (16 preceding siblings ...)
2025-12-03 6:54 ` [Patch v5 17/19] perf headers: Sync with the kernel headers Dapeng Mi
@ 2025-12-03 6:54 ` Dapeng Mi
2025-12-04 0:17 ` Ian Rogers
2025-12-03 6:55 ` [Patch v5 19/19] perf regs: Enable dumping of SIMD registers Dapeng Mi
2025-12-04 0:24 ` [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Ian Rogers
19 siblings, 1 reply; 55+ messages in thread
From: Dapeng Mi @ 2025-12-03 6:54 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
Eranian Stephane
Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang, Dapeng Mi
From: Kan Liang <kan.liang@linux.intel.com>
This patch adds support for the newly introduced SIMD register sampling
format through the following new functions:
uint64_t arch__intr_simd_reg_mask(void);
uint64_t arch__user_simd_reg_mask(void);
uint64_t arch__intr_pred_reg_mask(void);
uint64_t arch__user_pred_reg_mask(void);
uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
The arch__{intr|user}_simd_reg_mask() functions retrieve the bitmap of
supported SIMD registers, such as XMM/YMM/ZMM on x86 platforms.
The arch__{intr|user}_pred_reg_mask() functions retrieve the bitmap of
supported predicate (PRED) registers, such as OPMASK on x86 platforms.
The arch__{intr|user}_simd_reg_bitmap_qwords() functions provide the
exact bitmap and number of qwords for a specific type of SIMD register.
For example, for XMM registers on x86 platforms, the returned bitmap is
0xffff (XMM0-XMM15) and the number of qwords is 2 (128 bits per
register).
The arch__{intr|user}_pred_reg_bitmap_qwords() functions do the same for
a specific type of PRED register. For example, for OPMASK registers on
x86 platforms, the returned bitmap is 0xff (OPMASK0-OPMASK7) and the
number of qwords is 1 (64 bits per register).
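A caller-side sketch of consuming the bitmap/qwords pair, mirroring the
tool's __print_simd_regs() added below:
	u16 qwords = 0;
	uint64_t bitmap;
	bitmap = arch__intr_simd_reg_bitmap_qwords(PERF_REG_X86_XMM, &qwords);
	if (bitmap)	/* e.g. 0xffff with qwords == 2 on an XMM-capable CPU */
		fprintf(stderr, "XMM0-%d ", fls64(bitmap) - 1);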
Additionally, the function __parse_regs() is enhanced to support parsing
these newly introduced SIMD registers. Currently, each type of register
can only be sampled collectively; sampling a specific SIMD register is
not supported. For example, all XMM registers are sampled together rather
than sampling only XMM0.
When multiple overlapping register types, such as XMM and YMM, are
sampled simultaneously, only the superset (YMM registers) is sampled.
With this patch, all supported sampling registers on x86 platforms are
displayed as follows.
$ perf record -I?
available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
$ perf record --user-regs=?
available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
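For example, to sample ZMM, OPMASK and SSP on the interrupted context
(attached-argument form, as in the -I? usage above; actual availability
depends on the running kernel and CPU):
	$ perf record -IZMM,OPMASK,SSP -- ./workload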
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
tools/perf/arch/x86/util/perf_regs.c | 470 +++++++++++++++++++++-
tools/perf/util/evsel.c | 27 ++
tools/perf/util/parse-regs-options.c | 151 ++++++-
tools/perf/util/perf_event_attr_fprintf.c | 6 +
tools/perf/util/perf_regs.c | 59 +++
tools/perf/util/perf_regs.h | 11 +
tools/perf/util/record.h | 6 +
7 files changed, 714 insertions(+), 16 deletions(-)
diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
index 12fd93f04802..db41430f3b07 100644
--- a/tools/perf/arch/x86/util/perf_regs.c
+++ b/tools/perf/arch/x86/util/perf_regs.c
@@ -13,6 +13,49 @@
#include "../../../util/pmu.h"
#include "../../../util/pmus.h"
+static const struct sample_reg sample_reg_masks_ext[] = {
+ SMPL_REG(AX, PERF_REG_X86_AX),
+ SMPL_REG(BX, PERF_REG_X86_BX),
+ SMPL_REG(CX, PERF_REG_X86_CX),
+ SMPL_REG(DX, PERF_REG_X86_DX),
+ SMPL_REG(SI, PERF_REG_X86_SI),
+ SMPL_REG(DI, PERF_REG_X86_DI),
+ SMPL_REG(BP, PERF_REG_X86_BP),
+ SMPL_REG(SP, PERF_REG_X86_SP),
+ SMPL_REG(IP, PERF_REG_X86_IP),
+ SMPL_REG(FLAGS, PERF_REG_X86_FLAGS),
+ SMPL_REG(CS, PERF_REG_X86_CS),
+ SMPL_REG(SS, PERF_REG_X86_SS),
+#ifdef HAVE_ARCH_X86_64_SUPPORT
+ SMPL_REG(R8, PERF_REG_X86_R8),
+ SMPL_REG(R9, PERF_REG_X86_R9),
+ SMPL_REG(R10, PERF_REG_X86_R10),
+ SMPL_REG(R11, PERF_REG_X86_R11),
+ SMPL_REG(R12, PERF_REG_X86_R12),
+ SMPL_REG(R13, PERF_REG_X86_R13),
+ SMPL_REG(R14, PERF_REG_X86_R14),
+ SMPL_REG(R15, PERF_REG_X86_R15),
+ SMPL_REG(R16, PERF_REG_X86_R16),
+ SMPL_REG(R17, PERF_REG_X86_R17),
+ SMPL_REG(R18, PERF_REG_X86_R18),
+ SMPL_REG(R19, PERF_REG_X86_R19),
+ SMPL_REG(R20, PERF_REG_X86_R20),
+ SMPL_REG(R21, PERF_REG_X86_R21),
+ SMPL_REG(R22, PERF_REG_X86_R22),
+ SMPL_REG(R23, PERF_REG_X86_R23),
+ SMPL_REG(R24, PERF_REG_X86_R24),
+ SMPL_REG(R25, PERF_REG_X86_R25),
+ SMPL_REG(R26, PERF_REG_X86_R26),
+ SMPL_REG(R27, PERF_REG_X86_R27),
+ SMPL_REG(R28, PERF_REG_X86_R28),
+ SMPL_REG(R29, PERF_REG_X86_R29),
+ SMPL_REG(R30, PERF_REG_X86_R30),
+ SMPL_REG(R31, PERF_REG_X86_R31),
+ SMPL_REG(SSP, PERF_REG_X86_SSP),
+#endif
+ SMPL_REG_END
+};
+
static const struct sample_reg sample_reg_masks[] = {
SMPL_REG(AX, PERF_REG_X86_AX),
SMPL_REG(BX, PERF_REG_X86_BX),
@@ -276,27 +319,404 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
return SDT_ARG_VALID;
}
+static bool support_simd_reg(u64 sample_type, u16 qwords, u64 mask, bool pred)
+{
+ struct perf_event_attr attr = {
+ .type = PERF_TYPE_HARDWARE,
+ .config = PERF_COUNT_HW_CPU_CYCLES,
+ .sample_type = sample_type,
+ .disabled = 1,
+ .exclude_kernel = 1,
+ .sample_simd_regs_enabled = 1,
+ };
+ int fd;
+
+ attr.sample_period = 1;
+
+ if (!pred) {
+ attr.sample_simd_vec_reg_qwords = qwords;
+ if (sample_type == PERF_SAMPLE_REGS_INTR)
+ attr.sample_simd_vec_reg_intr = mask;
+ else
+ attr.sample_simd_vec_reg_user = mask;
+ } else {
+ attr.sample_simd_pred_reg_qwords = PERF_X86_OPMASK_QWORDS;
+ if (sample_type == PERF_SAMPLE_REGS_INTR)
+ attr.sample_simd_pred_reg_intr = PERF_X86_SIMD_PRED_MASK;
+ else
+ attr.sample_simd_pred_reg_user = PERF_X86_SIMD_PRED_MASK;
+ }
+
+ if (perf_pmus__num_core_pmus() > 1) {
+ struct perf_pmu *pmu = NULL;
+ __u64 type = PERF_TYPE_RAW;
+
+ /*
+ * The same register set is supported among different hybrid PMUs.
+ * Only check the first available one.
+ */
+ while ((pmu = perf_pmus__scan_core(pmu)) != NULL) {
+ type = pmu->type;
+ break;
+ }
+ attr.config |= type << PERF_PMU_TYPE_SHIFT;
+ }
+
+ event_attr_init(&attr);
+
+ fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
+ if (fd != -1) {
+ close(fd);
+ return true;
+ }
+
+ return false;
+}
+
+static bool __arch_simd_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
+{
+ bool supported = false;
+ u64 bits;
+
+ *mask = 0;
+ *qwords = 0;
+
+ switch (reg) {
+ case PERF_REG_X86_XMM:
+ bits = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
+ supported = support_simd_reg(sample_type, PERF_X86_XMM_QWORDS, bits, false);
+ if (supported) {
+ *mask = bits;
+ *qwords = PERF_X86_XMM_QWORDS;
+ }
+ break;
+ case PERF_REG_X86_YMM:
+ bits = BIT_ULL(PERF_X86_SIMD_YMM_REGS) - 1;
+ supported = support_simd_reg(sample_type, PERF_X86_YMM_QWORDS, bits, false);
+ if (supported) {
+ *mask = bits;
+ *qwords = PERF_X86_YMM_QWORDS;
+ }
+ break;
+ case PERF_REG_X86_ZMM:
+ bits = BIT_ULL(PERF_X86_SIMD_ZMM_REGS) - 1;
+ supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
+ if (supported) {
+ *mask = bits;
+ *qwords = PERF_X86_ZMM_QWORDS;
+ break;
+ }
+
+ bits = BIT_ULL(PERF_X86_SIMD_ZMMH_REGS) - 1;
+ supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
+ if (supported) {
+ *mask = bits;
+ *qwords = PERF_X86_ZMMH_QWORDS;
+ }
+ break;
+ default:
+ break;
+ }
+
+ return supported;
+}
+
+static bool __arch_pred_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
+{
+ bool supported = false;
+ u64 bits;
+
+ *mask = 0;
+ *qwords = 0;
+
+ switch (reg) {
+ case PERF_REG_X86_OPMASK:
+ bits = BIT_ULL(PERF_X86_SIMD_OPMASK_REGS) - 1;
+ supported = support_simd_reg(sample_type, PERF_X86_OPMASK_QWORDS, bits, true);
+ if (supported) {
+ *mask = bits;
+ *qwords = PERF_X86_OPMASK_QWORDS;
+ }
+ break;
+ default:
+ break;
+ }
+
+ return supported;
+}
+
+static bool has_cap_simd_regs(void)
+{
+ uint64_t mask = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
+ u16 qwords = PERF_X86_XMM_QWORDS;
+ static bool has_cap_simd_regs;
+ static bool cached;
+
+ if (cached)
+ return has_cap_simd_regs;
+
+ has_cap_simd_regs = __arch_simd_reg_mask(PERF_SAMPLE_REGS_INTR,
+ PERF_REG_X86_XMM, &mask, &qwords);
+ has_cap_simd_regs |= __arch_simd_reg_mask(PERF_SAMPLE_REGS_USER,
+ PERF_REG_X86_XMM, &mask, &qwords);
+ cached = true;
+
+ return has_cap_simd_regs;
+}
+
+bool arch_has_simd_regs(u64 mask)
+{
+ return has_cap_simd_regs() &&
+ mask & GENMASK_ULL(PERF_REG_X86_SSP, PERF_REG_X86_R16);
+}
+
+static const struct sample_reg sample_simd_reg_masks[] = {
+ SMPL_REG(XMM, PERF_REG_X86_XMM),
+ SMPL_REG(YMM, PERF_REG_X86_YMM),
+ SMPL_REG(ZMM, PERF_REG_X86_ZMM),
+ SMPL_REG_END
+};
+
+static const struct sample_reg sample_pred_reg_masks[] = {
+ SMPL_REG(OPMASK, PERF_REG_X86_OPMASK),
+ SMPL_REG_END
+};
+
+const struct sample_reg *arch__sample_simd_reg_masks(void)
+{
+ return sample_simd_reg_masks;
+}
+
+const struct sample_reg *arch__sample_pred_reg_masks(void)
+{
+ return sample_pred_reg_masks;
+}
+
+static bool x86_intr_simd_updated;
+static u64 x86_intr_simd_reg_mask;
+static u64 x86_intr_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
+static u16 x86_intr_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
+static bool x86_user_simd_updated;
+static u64 x86_user_simd_reg_mask;
+static u64 x86_user_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
+static u16 x86_user_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
+
+static bool x86_intr_pred_updated;
+static u64 x86_intr_pred_reg_mask;
+static u64 x86_intr_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
+static u16 x86_intr_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
+static bool x86_user_pred_updated;
+static u64 x86_user_pred_reg_mask;
+static u64 x86_user_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
+static u16 x86_user_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
+
+static uint64_t __arch__simd_reg_mask(u64 sample_type)
+{
+ const struct sample_reg *r = NULL;
+ bool supported;
+ u64 mask = 0;
+ int reg;
+
+ if (!has_cap_simd_regs())
+ return 0;
+
+ if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_simd_updated)
+ return x86_intr_simd_reg_mask;
+
+ if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_simd_updated)
+ return x86_user_simd_reg_mask;
+
+ for (r = arch__sample_simd_reg_masks(); r->name; r++) {
+ supported = false;
+
+ if (!r->mask)
+ continue;
+ reg = fls64(r->mask) - 1;
+
+ if (reg >= PERF_REG_X86_MAX_SIMD_REGS)
+ break;
+ if (sample_type == PERF_SAMPLE_REGS_INTR)
+ supported = __arch_simd_reg_mask(sample_type, reg,
+ &x86_intr_simd_mask[reg],
+ &x86_intr_simd_qwords[reg]);
+ else if (sample_type == PERF_SAMPLE_REGS_USER)
+ supported = __arch_simd_reg_mask(sample_type, reg,
+ &x86_user_simd_mask[reg],
+ &x86_user_simd_qwords[reg]);
+ if (supported)
+ mask |= BIT_ULL(reg);
+ }
+
+ if (sample_type == PERF_SAMPLE_REGS_INTR) {
+ x86_intr_simd_reg_mask = mask;
+ x86_intr_simd_updated = true;
+ } else {
+ x86_user_simd_reg_mask = mask;
+ x86_user_simd_updated = true;
+ }
+
+ return mask;
+}
+
+static uint64_t __arch__pred_reg_mask(u64 sample_type)
+{
+ const struct sample_reg *r = NULL;
+ bool supported;
+ u64 mask = 0;
+ int reg;
+
+ if (!has_cap_simd_regs())
+ return 0;
+
+ if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_pred_updated)
+ return x86_intr_pred_reg_mask;
+
+ if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_pred_updated)
+ return x86_user_pred_reg_mask;
+
+ for (r = arch__sample_pred_reg_masks(); r->name; r++) {
+ supported = false;
+
+ if (!r->mask)
+ continue;
+ reg = fls64(r->mask) - 1;
+
+ if (reg >= PERF_REG_X86_MAX_PRED_REGS)
+ break;
+ if (sample_type == PERF_SAMPLE_REGS_INTR)
+ supported = __arch_pred_reg_mask(sample_type, reg,
+ &x86_intr_pred_mask[reg],
+ &x86_intr_pred_qwords[reg]);
+ else if (sample_type == PERF_SAMPLE_REGS_USER)
+ supported = __arch_pred_reg_mask(sample_type, reg,
+ &x86_user_pred_mask[reg],
+ &x86_user_pred_qwords[reg]);
+ if (supported)
+ mask |= BIT_ULL(reg);
+ }
+
+ if (sample_type == PERF_SAMPLE_REGS_INTR) {
+ x86_intr_pred_reg_mask = mask;
+ x86_intr_pred_updated = true;
+ } else {
+ x86_user_pred_reg_mask = mask;
+ x86_user_pred_updated = true;
+ }
+
+ return mask;
+}
+
+uint64_t arch__intr_simd_reg_mask(void)
+{
+ return __arch__simd_reg_mask(PERF_SAMPLE_REGS_INTR);
+}
+
+uint64_t arch__user_simd_reg_mask(void)
+{
+ return __arch__simd_reg_mask(PERF_SAMPLE_REGS_USER);
+}
+
+uint64_t arch__intr_pred_reg_mask(void)
+{
+ return __arch__pred_reg_mask(PERF_SAMPLE_REGS_INTR);
+}
+
+uint64_t arch__user_pred_reg_mask(void)
+{
+ return __arch__pred_reg_mask(PERF_SAMPLE_REGS_USER);
+}
+
+static uint64_t arch__simd_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
+{
+ uint64_t mask = 0;
+
+ *qwords = 0;
+ if (reg < PERF_REG_X86_MAX_SIMD_REGS) {
+ if (intr) {
+ *qwords = x86_intr_simd_qwords[reg];
+ mask = x86_intr_simd_mask[reg];
+ } else {
+ *qwords = x86_user_simd_qwords[reg];
+ mask = x86_user_simd_mask[reg];
+ }
+ }
+
+ return mask;
+}
+
+static uint64_t arch__pred_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
+{
+ uint64_t mask = 0;
+
+ *qwords = 0;
+ if (reg < PERF_REG_X86_MAX_PRED_REGS) {
+ if (intr) {
+ *qwords = x86_intr_pred_qwords[reg];
+ mask = x86_intr_pred_mask[reg];
+ } else {
+ *qwords = x86_user_pred_qwords[reg];
+ mask = x86_user_pred_mask[reg];
+ }
+ }
+
+ return mask;
+}
+
+uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords)
+{
+ if (!x86_intr_simd_updated)
+ arch__intr_simd_reg_mask();
+ return arch__simd_reg_bitmap_qwords(reg, qwords, true);
+}
+
+uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords)
+{
+ if (!x86_user_simd_updated)
+ arch__user_simd_reg_mask();
+ return arch__simd_reg_bitmap_qwords(reg, qwords, false);
+}
+
+uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords)
+{
+ if (!x86_intr_pred_updated)
+ arch__intr_pred_reg_mask();
+ return arch__pred_reg_bitmap_qwords(reg, qwords, true);
+}
+
+uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords)
+{
+ if (!x86_user_pred_updated)
+ arch__user_pred_reg_mask();
+ return arch__pred_reg_bitmap_qwords(reg, qwords, false);
+}
+
const struct sample_reg *arch__sample_reg_masks(void)
{
+ if (has_cap_simd_regs())
+ return sample_reg_masks_ext;
return sample_reg_masks;
}
-uint64_t arch__intr_reg_mask(void)
+static uint64_t __arch__reg_mask(u64 sample_type, u64 mask, bool has_simd_regs)
{
struct perf_event_attr attr = {
- .type = PERF_TYPE_HARDWARE,
- .config = PERF_COUNT_HW_CPU_CYCLES,
- .sample_type = PERF_SAMPLE_REGS_INTR,
- .sample_regs_intr = PERF_REG_EXTENDED_MASK,
- .precise_ip = 1,
- .disabled = 1,
- .exclude_kernel = 1,
+ .type = PERF_TYPE_HARDWARE,
+ .config = PERF_COUNT_HW_CPU_CYCLES,
+ .sample_type = sample_type,
+ .precise_ip = 1,
+ .disabled = 1,
+ .exclude_kernel = 1,
+ .sample_simd_regs_enabled = has_simd_regs,
};
int fd;
/*
* In an unnamed union, init it here to build on older gcc versions
*/
attr.sample_period = 1;
+ if (sample_type == PERF_SAMPLE_REGS_INTR)
+ attr.sample_regs_intr = mask;
+ else
+ attr.sample_regs_user = mask;
if (perf_pmus__num_core_pmus() > 1) {
struct perf_pmu *pmu = NULL;
@@ -318,13 +738,41 @@ uint64_t arch__intr_reg_mask(void)
fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
if (fd != -1) {
close(fd);
- return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
+ return mask;
}
- return PERF_REGS_MASK;
+ return 0;
+}
+
+uint64_t arch__intr_reg_mask(void)
+{
+ uint64_t mask = PERF_REGS_MASK;
+
+ if (has_cap_simd_regs()) {
+ mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
+ GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
+ true);
+ mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
+ BIT_ULL(PERF_REG_X86_SSP),
+ true);
+ } else
+ mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR, PERF_REG_EXTENDED_MASK, false);
+
+ return mask;
}
uint64_t arch__user_reg_mask(void)
{
- return PERF_REGS_MASK;
+ uint64_t mask = PERF_REGS_MASK;
+
+ if (has_cap_simd_regs()) {
+ mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
+ GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
+ true);
+ mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
+ BIT_ULL(PERF_REG_X86_SSP),
+ true);
+ }
+
+ return mask;
}
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 56ebefd075f2..5d1d90cf9488 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1461,12 +1461,39 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
if (opts->sample_intr_regs && !evsel->no_aux_samples &&
!evsel__is_dummy_event(evsel)) {
attr->sample_regs_intr = opts->sample_intr_regs;
+ attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_intr);
+ evsel__set_sample_bit(evsel, REGS_INTR);
+ }
+
+ if ((opts->sample_intr_vec_regs || opts->sample_intr_pred_regs) &&
+ !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
+ /* A non-zero pred qwords implies that the SIMD register set is used */
+ if (opts->sample_pred_regs_qwords)
+ attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
+ else
+ attr->sample_simd_pred_reg_qwords = 1;
+ attr->sample_simd_vec_reg_intr = opts->sample_intr_vec_regs;
+ attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
+ attr->sample_simd_pred_reg_intr = opts->sample_intr_pred_regs;
evsel__set_sample_bit(evsel, REGS_INTR);
}
if (opts->sample_user_regs && !evsel->no_aux_samples &&
!evsel__is_dummy_event(evsel)) {
attr->sample_regs_user |= opts->sample_user_regs;
+ attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_user);
+ evsel__set_sample_bit(evsel, REGS_USER);
+ }
+
+ if ((opts->sample_user_vec_regs || opts->sample_user_pred_regs) &&
+ !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
+ if (opts->sample_pred_regs_qwords)
+ attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
+ else
+ attr->sample_simd_pred_reg_qwords = 1;
+ attr->sample_simd_vec_reg_user = opts->sample_user_vec_regs;
+ attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
+ attr->sample_simd_pred_reg_user = opts->sample_user_pred_regs;
evsel__set_sample_bit(evsel, REGS_USER);
}
diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
index cda1c620968e..0bd100392889 100644
--- a/tools/perf/util/parse-regs-options.c
+++ b/tools/perf/util/parse-regs-options.c
@@ -4,19 +4,139 @@
#include <stdint.h>
#include <string.h>
#include <stdio.h>
+#include <linux/bitops.h>
#include "util/debug.h"
#include <subcmd/parse-options.h>
#include "util/perf_regs.h"
#include "util/parse-regs-options.h"
+#include "record.h"
+
+static void __print_simd_regs(bool intr, uint64_t simd_mask)
+{
+ const struct sample_reg *r = NULL;
+ uint64_t bitmap = 0;
+ u16 qwords = 0;
+ int reg_idx;
+
+ if (!simd_mask)
+ return;
+
+ for (r = arch__sample_simd_reg_masks(); r->name; r++) {
+ if (!(r->mask & simd_mask))
+ continue;
+ reg_idx = fls64(r->mask) - 1;
+ if (intr)
+ bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
+ else
+ bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
+ if (bitmap)
+ fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
+ }
+}
+
+static void __print_pred_regs(bool intr, uint64_t pred_mask)
+{
+ const struct sample_reg *r = NULL;
+ uint64_t bitmap = 0;
+ u16 qwords = 0;
+ int reg_idx;
+
+ if (!pred_mask)
+ return;
+
+ for (r = arch__sample_pred_reg_masks(); r->name; r++) {
+ if (!(r->mask & pred_mask))
+ continue;
+ reg_idx = fls64(r->mask) - 1;
+ if (intr)
+ bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
+ else
+ bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
+ if (bitmap)
+ fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
+ }
+}
+
+static bool __parse_simd_regs(struct record_opts *opts, char *s, bool intr)
+{
+ const struct sample_reg *r = NULL;
+ bool matched = false;
+ uint64_t bitmap = 0;
+ u16 qwords = 0;
+ int reg_idx;
+
+ for (r = arch__sample_simd_reg_masks(); r->name; r++) {
+ if (strcasecmp(s, r->name))
+ continue;
+ if (!fls64(r->mask))
+ continue;
+ reg_idx = fls64(r->mask) - 1;
+ if (intr)
+ bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
+ else
+ bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
+ matched = true;
+ break;
+ }
+
+ /* Just need the highest qwords */
+ if (qwords > opts->sample_vec_regs_qwords) {
+ opts->sample_vec_regs_qwords = qwords;
+ if (intr)
+ opts->sample_intr_vec_regs = bitmap;
+ else
+ opts->sample_user_vec_regs = bitmap;
+ }
+
+ return matched;
+}
+
+static bool __parse_pred_regs(struct record_opts *opts, char *s, bool intr)
+{
+ const struct sample_reg *r = NULL;
+ bool matched = false;
+ uint64_t bitmap = 0;
+ u16 qwords = 0;
+ int reg_idx;
+
+ for (r = arch__sample_pred_reg_masks(); r->name; r++) {
+ if (strcasecmp(s, r->name))
+ continue;
+ if (!fls64(r->mask))
+ continue;
+ reg_idx = fls64(r->mask) - 1;
+ if (intr)
+ bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
+ else
+ bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
+ matched = true;
+ break;
+ }
+
+ /* Just need the highest qwords */
+ if (qwords > opts->sample_pred_regs_qwords) {
+ opts->sample_pred_regs_qwords = qwords;
+ if (intr)
+ opts->sample_intr_pred_regs = bitmap;
+ else
+ opts->sample_user_pred_regs = bitmap;
+ }
+
+ return matched;
+}
static int
__parse_regs(const struct option *opt, const char *str, int unset, bool intr)
{
uint64_t *mode = (uint64_t *)opt->value;
const struct sample_reg *r = NULL;
+ struct record_opts *opts;
char *s, *os = NULL, *p;
- int ret = -1;
+ bool has_simd_regs = false;
uint64_t mask;
+ uint64_t simd_mask;
+ uint64_t pred_mask;
+ int ret = -1;
if (unset)
return 0;
@@ -27,10 +147,17 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
if (*mode)
return -1;
- if (intr)
+ if (intr) {
+ opts = container_of(opt->value, struct record_opts, sample_intr_regs);
mask = arch__intr_reg_mask();
- else
+ simd_mask = arch__intr_simd_reg_mask();
+ pred_mask = arch__intr_pred_reg_mask();
+ } else {
+ opts = container_of(opt->value, struct record_opts, sample_user_regs);
mask = arch__user_reg_mask();
+ simd_mask = arch__user_simd_reg_mask();
+ pred_mask = arch__user_pred_reg_mask();
+ }
/* str may be NULL in case no arg is passed to -I */
if (str) {
@@ -50,10 +177,24 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
if (r->mask & mask)
fprintf(stderr, "%s ", r->name);
}
+ __print_simd_regs(intr, simd_mask);
+ __print_pred_regs(intr, pred_mask);
fputc('\n', stderr);
/* just printing available regs */
goto error;
}
+
+ if (simd_mask) {
+ has_simd_regs = __parse_simd_regs(opts, s, intr);
+ if (has_simd_regs)
+ goto next;
+ }
+ if (pred_mask) {
+ has_simd_regs = __parse_pred_regs(opts, s, intr);
+ if (has_simd_regs)
+ goto next;
+ }
+
for (r = arch__sample_reg_masks(); r->name; r++) {
if ((r->mask & mask) && !strcasecmp(s, r->name))
break;
@@ -65,7 +206,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
}
*mode |= r->mask;
-
+next:
if (!p)
break;
@@ -75,7 +216,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
ret = 0;
/* default to all possible regs */
- if (*mode == 0)
+ if (*mode == 0 && !has_simd_regs)
*mode = mask;
error:
free(os);
diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
index 66b666d9ce64..fb0366d050cf 100644
--- a/tools/perf/util/perf_event_attr_fprintf.c
+++ b/tools/perf/util/perf_event_attr_fprintf.c
@@ -360,6 +360,12 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
PRINT_ATTRf(aux_start_paused, p_unsigned);
PRINT_ATTRf(aux_pause, p_unsigned);
PRINT_ATTRf(aux_resume, p_unsigned);
+ PRINT_ATTRf(sample_simd_pred_reg_qwords, p_unsigned);
+ PRINT_ATTRf(sample_simd_pred_reg_intr, p_hex);
+ PRINT_ATTRf(sample_simd_pred_reg_user, p_hex);
+ PRINT_ATTRf(sample_simd_vec_reg_qwords, p_unsigned);
+ PRINT_ATTRf(sample_simd_vec_reg_intr, p_hex);
+ PRINT_ATTRf(sample_simd_vec_reg_user, p_hex);
return ret;
}
diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
index 44b90bbf2d07..e8a9fabc92e6 100644
--- a/tools/perf/util/perf_regs.c
+++ b/tools/perf/util/perf_regs.c
@@ -11,6 +11,11 @@ int __weak arch_sdt_arg_parse_op(char *old_op __maybe_unused,
return SDT_ARG_SKIP;
}
+bool __weak arch_has_simd_regs(u64 mask __maybe_unused)
+{
+ return false;
+}
+
uint64_t __weak arch__intr_reg_mask(void)
{
return 0;
@@ -21,6 +26,50 @@ uint64_t __weak arch__user_reg_mask(void)
return 0;
}
+uint64_t __weak arch__intr_simd_reg_mask(void)
+{
+ return 0;
+}
+
+uint64_t __weak arch__user_simd_reg_mask(void)
+{
+ return 0;
+}
+
+uint64_t __weak arch__intr_pred_reg_mask(void)
+{
+ return 0;
+}
+
+uint64_t __weak arch__user_pred_reg_mask(void)
+{
+ return 0;
+}
+
+uint64_t __weak arch__intr_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
+{
+ *qwords = 0;
+ return 0;
+}
+
+uint64_t __weak arch__user_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
+{
+ *qwords = 0;
+ return 0;
+}
+
+uint64_t __weak arch__intr_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
+{
+ *qwords = 0;
+ return 0;
+}
+
+uint64_t __weak arch__user_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
+{
+ *qwords = 0;
+ return 0;
+}
+
static const struct sample_reg sample_reg_masks[] = {
SMPL_REG_END
};
@@ -30,6 +79,16 @@ const struct sample_reg * __weak arch__sample_reg_masks(void)
return sample_reg_masks;
}
+const struct sample_reg * __weak arch__sample_simd_reg_masks(void)
+{
+ return sample_reg_masks;
+}
+
+const struct sample_reg * __weak arch__sample_pred_reg_masks(void)
+{
+ return sample_reg_masks;
+}
+
const char *perf_reg_name(int id, const char *arch)
{
const char *reg_name = NULL;
diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
index f2d0736d65cc..bce9c4cfd1bf 100644
--- a/tools/perf/util/perf_regs.h
+++ b/tools/perf/util/perf_regs.h
@@ -24,9 +24,20 @@ enum {
};
int arch_sdt_arg_parse_op(char *old_op, char **new_op);
+bool arch_has_simd_regs(u64 mask);
uint64_t arch__intr_reg_mask(void);
uint64_t arch__user_reg_mask(void);
const struct sample_reg *arch__sample_reg_masks(void);
+const struct sample_reg *arch__sample_simd_reg_masks(void);
+const struct sample_reg *arch__sample_pred_reg_masks(void);
+uint64_t arch__intr_simd_reg_mask(void);
+uint64_t arch__user_simd_reg_mask(void);
+uint64_t arch__intr_pred_reg_mask(void);
+uint64_t arch__user_pred_reg_mask(void);
+uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
+uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
+uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
+uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
const char *perf_reg_name(int id, const char *arch);
int perf_reg_value(u64 *valp, struct regs_dump *regs, int id);
diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
index ea3a6c4657ee..825ffb4cc53f 100644
--- a/tools/perf/util/record.h
+++ b/tools/perf/util/record.h
@@ -59,7 +59,13 @@ struct record_opts {
unsigned int user_freq;
u64 branch_stack;
u64 sample_intr_regs;
+ u64 sample_intr_vec_regs;
u64 sample_user_regs;
+ u64 sample_user_vec_regs;
+ u16 sample_pred_regs_qwords;
+ u16 sample_vec_regs_qwords;
+ u16 sample_intr_pred_regs;
+ u16 sample_user_pred_regs;
u64 default_interval;
u64 user_interval;
size_t auxtrace_snapshot_size;
--
2.34.1
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [Patch v5 19/19] perf regs: Enable dumping of SIMD registers
2025-12-03 6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
` (17 preceding siblings ...)
2025-12-03 6:54 ` [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format Dapeng Mi
@ 2025-12-03 6:55 ` Dapeng Mi
2025-12-04 0:24 ` [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Ian Rogers
19 siblings, 0 replies; 55+ messages in thread
From: Dapeng Mi @ 2025-12-03 6:55 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
Eranian Stephane
Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang, Dapeng Mi
From: Kan Liang <kan.liang@linux.intel.com>
Add support for dumping SIMD registers using the new
PERF_SAMPLE_REGS_ABI_SIMD ABI.
Currently, the XMM, YMM, ZMM, OPMASK, eGPRs, and SSP registers on x86
platforms are supported with the PERF_SAMPLE_REGS_ABI_SIMD ABI.
An example of the output is displayed below.
$perf record -e cycles:p -IXMM,YMM,OPMASK,SSP ./test
$perf report -D
... ...
237538985992962 0x454d0 [0x480]: PERF_RECORD_SAMPLE(IP, 0x1):
179370/179370: 0xffffffff969627fc period: 124999 addr: 0
... intr regs: mask 0x20000000000 ABI 64-bit
.... SSP 0x0000000000000000
... SIMD ABI nr_vectors 32 vector_qwords 4 nr_pred 8 pred_qwords 1
.... YMM [0] 0x0000000000004000
.... YMM [0] 0x000055e828695270
.... YMM [0] 0x0000000000000000
.... YMM [0] 0x0000000000000000
.... YMM [1] 0x000055e8286990e0
.... YMM [1] 0x000055e828698dd0
.... YMM [1] 0x0000000000000000
.... YMM [1] 0x0000000000000000
... ...
.... YMM [31] 0x0000000000000000
.... YMM [31] 0x0000000000000000
.... YMM [31] 0x0000000000000000
.... YMM [31] 0x0000000000000000
.... OPMASK[0] 0x0000000000100221
.... OPMASK[1] 0x0000000000000020
.... OPMASK[2] 0x000000007fffffff
.... OPMASK[3] 0x0000000000000000
.... OPMASK[4] 0x0000000000000000
.... OPMASK[5] 0x0000000000000000
.... OPMASK[6] 0x0000000000000000
.... OPMASK[7] 0x0000000000000000
... ...
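The extra payload can be consumed by reading the leading u64, which
packs the four u16 counts, and then skipping the data qwords. Below is
a minimal consumer-side sketch (a hypothetical helper, not part of the
patch, assuming the x86 little-endian field packing used by the
regs_dump union in this patch; a real parser must also bounds-check,
as OVERFLOW_CHECK does in evsel.c):

	#include <stdint.h>
	#include <stddef.h>

	#ifndef PERF_SAMPLE_REGS_ABI_SIMD
	#define PERF_SAMPLE_REGS_ABI_SIMD (1 << 2)	/* from the updated uapi header */
	#endif

	struct simd_hdr {
		uint16_t nr_vectors;	/* number of vector registers */
		uint16_t vector_qwords;	/* u64s per vector register */
		uint16_t nr_pred;	/* number of predicate registers */
		uint16_t pred_qwords;	/* u64s per predicate register */
	};

	/* Returns the number of u64s consumed, including the header word. */
	static size_t parse_simd_payload(const uint64_t *array, uint64_t abi)
	{
		const struct simd_hdr *hdr = (const struct simd_hdr *)array;
		size_t nr_qwords;

		if (!(abi & PERF_SAMPLE_REGS_ABI_SIMD))
			return 0;

		nr_qwords = (size_t)hdr->nr_vectors * hdr->vector_qwords +
			    (size_t)hdr->nr_pred * hdr->pred_qwords;
		/* (const uint64_t *)(hdr + 1) points at data[0 .. nr_qwords - 1] */
		return 1 + nr_qwords;
	}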
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
tools/perf/util/evsel.c | 20 +++++
.../perf/util/perf-regs-arch/perf_regs_x86.c | 43 ++++++++++
tools/perf/util/sample.h | 10 +++
tools/perf/util/session.c | 78 +++++++++++++++++--
4 files changed, 143 insertions(+), 8 deletions(-)
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 5d1d90cf9488..8f3fafe3a43f 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -3347,6 +3347,16 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
regs->mask = mask;
regs->regs = (u64 *)array;
array = (void *)array + sz;
+
+ if (regs->abi & PERF_SAMPLE_REGS_ABI_SIMD) {
+ regs->config = *(u64 *)array;
+ array = (void *)array + sizeof(u64);
+ regs->data = (u64 *)array;
+ sz = (regs->nr_vectors * regs->vector_qwords +
+ regs->nr_pred * regs->pred_qwords) * sizeof(u64);
+ OVERFLOW_CHECK(array, sz, max_size);
+ array = (void *)array + sz;
+ }
}
}
@@ -3404,6 +3414,16 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
regs->mask = mask;
regs->regs = (u64 *)array;
array = (void *)array + sz;
+
+ if (regs->abi & PERF_SAMPLE_REGS_ABI_SIMD) {
+ regs->config = *(u64 *)array;
+ array = (void *)array + sizeof(u64);
+ regs->data = (u64 *)array;
+ sz = (regs->nr_vectors * regs->vector_qwords +
+ regs->nr_pred * regs->pred_qwords) * sizeof(u64);
+ OVERFLOW_CHECK(array, sz, max_size);
+ array = (void *)array + sz;
+ }
}
}
diff --git a/tools/perf/util/perf-regs-arch/perf_regs_x86.c b/tools/perf/util/perf-regs-arch/perf_regs_x86.c
index 708954a9d35d..32dac438b12d 100644
--- a/tools/perf/util/perf-regs-arch/perf_regs_x86.c
+++ b/tools/perf/util/perf-regs-arch/perf_regs_x86.c
@@ -5,6 +5,49 @@
const char *__perf_reg_name_x86(int id)
{
+ if (id > PERF_REG_X86_R15 && arch__intr_simd_reg_mask()) {
+ switch (id) {
+ case PERF_REG_X86_R16:
+ return "R16";
+ case PERF_REG_X86_R17:
+ return "R17";
+ case PERF_REG_X86_R18:
+ return "R18";
+ case PERF_REG_X86_R19:
+ return "R19";
+ case PERF_REG_X86_R20:
+ return "R20";
+ case PERF_REG_X86_R21:
+ return "R21";
+ case PERF_REG_X86_R22:
+ return "R22";
+ case PERF_REG_X86_R23:
+ return "R23";
+ case PERF_REG_X86_R24:
+ return "R24";
+ case PERF_REG_X86_R25:
+ return "R25";
+ case PERF_REG_X86_R26:
+ return "R26";
+ case PERF_REG_X86_R27:
+ return "R27";
+ case PERF_REG_X86_R28:
+ return "R28";
+ case PERF_REG_X86_R29:
+ return "R29";
+ case PERF_REG_X86_R30:
+ return "R30";
+ case PERF_REG_X86_R31:
+ return "R31";
+ case PERF_REG_X86_SSP:
+ return "SSP";
+ default:
+ return NULL;
+ }
+
+ return NULL;
+ }
+
switch (id) {
case PERF_REG_X86_AX:
return "AX";
diff --git a/tools/perf/util/sample.h b/tools/perf/util/sample.h
index fae834144ef4..3b247e0e8242 100644
--- a/tools/perf/util/sample.h
+++ b/tools/perf/util/sample.h
@@ -12,6 +12,16 @@ struct regs_dump {
u64 abi;
u64 mask;
u64 *regs;
+ union {
+ u64 config;
+ struct {
+ u16 nr_vectors;
+ u16 vector_qwords;
+ u16 nr_pred;
+ u16 pred_qwords;
+ };
+ };
+ u64 *data;
/* Cached values/mask filled by first register access. */
u64 cache_regs[PERF_SAMPLE_REGS_CACHE_SIZE];
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 09af486c83e4..c692be265c21 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -927,18 +927,78 @@ static void regs_dump__printf(u64 mask, u64 *regs, const char *arch)
}
}
-static const char *regs_abi[] = {
- [PERF_SAMPLE_REGS_ABI_NONE] = "none",
- [PERF_SAMPLE_REGS_ABI_32] = "32-bit",
- [PERF_SAMPLE_REGS_ABI_64] = "64-bit",
-};
+static void simd_regs_dump__printf(struct regs_dump *regs, bool intr)
+{
+ const char *name = "unknown";
+ const struct sample_reg *r;
+ int i, idx = 0;
+ u16 qwords;
+ int reg_idx;
+
+ if (!(regs->abi & PERF_SAMPLE_REGS_ABI_SIMD))
+ return;
+
+ printf("... SIMD ABI nr_vectors %d vector_qwords %d nr_pred %d pred_qwords %d\n",
+ regs->nr_vectors, regs->vector_qwords,
+ regs->nr_pred, regs->pred_qwords);
+
+ for (r = arch__sample_simd_reg_masks(); r->name; r++) {
+ if (!fls64(r->mask))
+ continue;
+ reg_idx = fls64(r->mask) - 1;
+ if (intr)
+ arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
+ else
+ arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
+ if (regs->vector_qwords == qwords) {
+ name = r->name;
+ break;
+ }
+ }
+
+ for (i = 0; i < regs->nr_vectors; i++) {
+ printf(".... %-5s[%d] 0x%016" PRIx64 "\n", name, i, regs->data[idx++]);
+ printf(".... %-5s[%d] 0x%016" PRIx64 "\n", name, i, regs->data[idx++]);
+ if (regs->vector_qwords > 2) {
+ printf(".... %-5s[%d] 0x%016" PRIx64 "\n", name, i, regs->data[idx++]);
+ printf(".... %-5s[%d] 0x%016" PRIx64 "\n", name, i, regs->data[idx++]);
+ }
+ if (regs->vector_qwords > 4) {
+ printf(".... %-5s[%d] 0x%016" PRIx64 "\n", name, i, regs->data[idx++]);
+ printf(".... %-5s[%d] 0x%016" PRIx64 "\n", name, i, regs->data[idx++]);
+ printf(".... %-5s[%d] 0x%016" PRIx64 "\n", name, i, regs->data[idx++]);
+ printf(".... %-5s[%d] 0x%016" PRIx64 "\n", name, i, regs->data[idx++]);
+ }
+ }
+
+ name = "unknown";
+ for (r = arch__sample_pred_reg_masks(); r->name; r++) {
+ if (!fls64(r->mask))
+ continue;
+ reg_idx = fls64(r->mask) - 1;
+ if (intr)
+ arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
+ else
+ arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
+ if (regs->pred_qwords == qwords) {
+ name = r->name;
+ break;
+ }
+ }
+ for (i = 0; i < regs->nr_pred; i++)
+ printf(".... %-5s[%d] 0x%016" PRIx64 "\n", name, i, regs->data[idx++]);
+}
static inline const char *regs_dump_abi(struct regs_dump *d)
{
- if (d->abi > PERF_SAMPLE_REGS_ABI_64)
- return "unknown";
+ if (!d->abi)
+ return "none";
+ if (d->abi & PERF_SAMPLE_REGS_ABI_32)
+ return "32-bit";
+ else if (d->abi & PERF_SAMPLE_REGS_ABI_64)
+ return "64-bit";
- return regs_abi[d->abi];
+ return "unknown";
}
static void regs__printf(const char *type, struct regs_dump *regs, const char *arch)
@@ -964,6 +1024,7 @@ static void regs_user__printf(struct perf_sample *sample, const char *arch)
if (user_regs->regs)
regs__printf("user", user_regs, arch);
+ simd_regs_dump__printf(user_regs, false);
}
static void regs_intr__printf(struct perf_sample *sample, const char *arch)
@@ -977,6 +1038,7 @@ static void regs_intr__printf(struct perf_sample *sample, const char *arch)
if (intr_regs->regs)
regs__printf("intr", intr_regs, arch);
+ simd_regs_dump__printf(intr_regs, true);
}
static void stack_user__printf(struct stack_dump *dump)
--
2.34.1
^ permalink raw reply related [flat|nested] 55+ messages in thread
* Re: [Patch v5 17/19] perf headers: Sync with the kernel headers
2025-12-03 6:54 ` [Patch v5 17/19] perf headers: Sync with the kernel headers Dapeng Mi
@ 2025-12-03 23:43 ` Ian Rogers
2025-12-04 1:37 ` Mi, Dapeng
0 siblings, 1 reply; 55+ messages in thread
From: Ian Rogers @ 2025-12-03 23:43 UTC (permalink / raw)
To: Dapeng Mi
Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang
On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
>
> From: Kan Liang <kan.liang@linux.intel.com>
>
> Update include/uapi/linux/perf_event.h and
> arch/x86/include/uapi/asm/perf_regs.h to support extended regs.
>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> ---
> tools/arch/x86/include/uapi/asm/perf_regs.h | 62 +++++++++++++++++++++
> tools/include/uapi/linux/perf_event.h | 45 +++++++++++++--
> 2 files changed, 103 insertions(+), 4 deletions(-)
>
> diff --git a/tools/arch/x86/include/uapi/asm/perf_regs.h b/tools/arch/x86/include/uapi/asm/perf_regs.h
> index 7c9d2bb3833b..f3561ed10041 100644
> --- a/tools/arch/x86/include/uapi/asm/perf_regs.h
> +++ b/tools/arch/x86/include/uapi/asm/perf_regs.h
> @@ -27,9 +27,34 @@ enum perf_event_x86_regs {
> PERF_REG_X86_R13,
> PERF_REG_X86_R14,
> PERF_REG_X86_R15,
> + /*
> + * The EGPRs/SSP and XMM have overlaps. Only one can be used
> + * at a time. For the ABI type PERF_SAMPLE_REGS_ABI_SIMD,
> + * utilize EGPRs/SSP. For the other ABI type, XMM is used.
> + *
> + * Extended GPRs (EGPRs)
> + */
> + PERF_REG_X86_R16,
> + PERF_REG_X86_R17,
> + PERF_REG_X86_R18,
> + PERF_REG_X86_R19,
> + PERF_REG_X86_R20,
> + PERF_REG_X86_R21,
> + PERF_REG_X86_R22,
> + PERF_REG_X86_R23,
> + PERF_REG_X86_R24,
> + PERF_REG_X86_R25,
> + PERF_REG_X86_R26,
> + PERF_REG_X86_R27,
> + PERF_REG_X86_R28,
> + PERF_REG_X86_R29,
> + PERF_REG_X86_R30,
> + PERF_REG_X86_R31,
> + PERF_REG_X86_SSP,
> /* These are the limits for the GPRs. */
> PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
> PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1,
> + PERF_REG_MISC_MAX = PERF_REG_X86_SSP + 1,
I wonder if MISC is the most intention-revealing name. What happens if
things are extended again? Would APX be a better alternative, i.e.
PERF_REG_APX_MAX?
>
> /* These all need two bits set because they are 128bit */
> PERF_REG_X86_XMM0 = 32,
> @@ -54,5 +79,42 @@ enum perf_event_x86_regs {
> };
>
> #define PERF_REG_EXTENDED_MASK (~((1ULL << PERF_REG_X86_XMM0) - 1))
> +#define PERF_X86_EGPRS_MASK GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16)
> +
> +enum {
> + PERF_REG_X86_XMM,
> + PERF_REG_X86_YMM,
> + PERF_REG_X86_ZMM,
> + PERF_REG_X86_MAX_SIMD_REGS,
> +
> + PERF_REG_X86_OPMASK = 0,
> + PERF_REG_X86_MAX_PRED_REGS = 1,
> +};
> +
> +enum {
> + PERF_X86_SIMD_XMM_REGS = 16,
> + PERF_X86_SIMD_YMM_REGS = 16,
> + PERF_X86_SIMD_ZMMH_REGS = 16,
> + PERF_X86_SIMD_ZMM_REGS = 32,
> + PERF_X86_SIMD_VEC_REGS_MAX = PERF_X86_SIMD_ZMM_REGS,
> +
> + PERF_X86_SIMD_OPMASK_REGS = 8,
> + PERF_X86_SIMD_PRED_REGS_MAX = PERF_X86_SIMD_OPMASK_REGS,
> +};
> +
> +#define PERF_X86_SIMD_PRED_MASK GENMASK(PERF_X86_SIMD_PRED_REGS_MAX - 1, 0)
> +#define PERF_X86_SIMD_VEC_MASK GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
> +
> +#define PERF_X86_H16ZMM_BASE PERF_X86_SIMD_ZMMH_REGS
> +
> +enum {
> + PERF_X86_OPMASK_QWORDS = 1,
> + PERF_X86_XMM_QWORDS = 2,
> + PERF_X86_YMMH_QWORDS = 2,
> + PERF_X86_YMM_QWORDS = 4,
> + PERF_X86_ZMMH_QWORDS = 4,
> + PERF_X86_ZMM_QWORDS = 8,
> + PERF_X86_SIMD_QWORDS_MAX = PERF_X86_ZMM_QWORDS,
> +};
>
> #endif /* _ASM_X86_PERF_REGS_H */
> diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
> index d292f96bc06f..f1474da32622 100644
> --- a/tools/include/uapi/linux/perf_event.h
> +++ b/tools/include/uapi/linux/perf_event.h
> @@ -314,8 +314,9 @@ enum {
> */
> enum perf_sample_regs_abi {
> PERF_SAMPLE_REGS_ABI_NONE = 0,
> - PERF_SAMPLE_REGS_ABI_32 = 1,
> - PERF_SAMPLE_REGS_ABI_64 = 2,
> + PERF_SAMPLE_REGS_ABI_32 = (1 << 0),
> + PERF_SAMPLE_REGS_ABI_64 = (1 << 1),
> + PERF_SAMPLE_REGS_ABI_SIMD = (1 << 2),
> };
>
> /*
> @@ -382,6 +383,7 @@ enum perf_event_read_format {
> #define PERF_ATTR_SIZE_VER6 120 /* Add: aux_sample_size */
> #define PERF_ATTR_SIZE_VER7 128 /* Add: sig_data */
> #define PERF_ATTR_SIZE_VER8 136 /* Add: config3 */
> +#define PERF_ATTR_SIZE_VER9 168 /* Add: sample_simd_{pred,vec}_reg_* */
ARM have added a config4 in:
https://lore.kernel.org/lkml/20251111-james-perf-feat_spe_eft-v10-1-1e1b5bf2cd05@linaro.org/
so this will need to be VER10.
Thanks,
Ian
>
> /*
> * 'struct perf_event_attr' contains various attributes that define
> @@ -545,6 +547,25 @@ struct perf_event_attr {
> __u64 sig_data;
>
> __u64 config3; /* extension of config2 */
> +
> +
> + /*
> + * Defines set of SIMD registers to dump on samples.
> + * The sample_simd_regs_enabled !=0 implies the
> + * set of SIMD registers is used to config all SIMD registers.
> + * If !sample_simd_regs_enabled, sample_regs_XXX may be used to
> + * config some SIMD registers on X86.
> + */
> + union {
> + __u16 sample_simd_regs_enabled;
> + __u16 sample_simd_pred_reg_qwords;
> + };
> + __u32 sample_simd_pred_reg_intr;
> + __u32 sample_simd_pred_reg_user;
> + __u16 sample_simd_vec_reg_qwords;
> + __u64 sample_simd_vec_reg_intr;
> + __u64 sample_simd_vec_reg_user;
> + __u32 __reserved_4;
> };
>
> /*
> @@ -1018,7 +1039,15 @@ enum perf_event_type {
> * } && PERF_SAMPLE_BRANCH_STACK
> *
> * { u64 abi; # enum perf_sample_regs_abi
> - * u64 regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
> + * u64 regs[weight(mask)];
> + * struct {
> + * u16 nr_vectors;
> + * u16 vector_qwords;
> + * u16 nr_pred;
> + * u16 pred_qwords;
> + * u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
> + * } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
> + * } && PERF_SAMPLE_REGS_USER
> *
> * { u64 size;
> * char data[size];
> @@ -1045,7 +1074,15 @@ enum perf_event_type {
> * { u64 data_src; } && PERF_SAMPLE_DATA_SRC
> * { u64 transaction; } && PERF_SAMPLE_TRANSACTION
> * { u64 abi; # enum perf_sample_regs_abi
> - * u64 regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
> + * u64 regs[weight(mask)];
> + * struct {
> + * u16 nr_vectors;
> + * u16 vector_qwords;
> + * u16 nr_pred;
> + * u16 pred_qwords;
> + * u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
> + * } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
> + * } && PERF_SAMPLE_REGS_INTR
> * { u64 phys_addr;} && PERF_SAMPLE_PHYS_ADDR
> * { u64 cgroup;} && PERF_SAMPLE_CGROUP
> * { u64 data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
> --
> 2.34.1
>
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
2025-12-03 6:54 ` [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format Dapeng Mi
@ 2025-12-04 0:17 ` Ian Rogers
2025-12-04 2:58 ` Mi, Dapeng
0 siblings, 1 reply; 55+ messages in thread
From: Ian Rogers @ 2025-12-04 0:17 UTC (permalink / raw)
To: Dapeng Mi
Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang
On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
>
> From: Kan Liang <kan.liang@linux.intel.com>
>
> This patch adds support for the newly introduced SIMD register sampling
> format by adding the following functions:
>
> uint64_t arch__intr_simd_reg_mask(void);
> uint64_t arch__user_simd_reg_mask(void);
> uint64_t arch__intr_pred_reg_mask(void);
> uint64_t arch__user_pred_reg_mask(void);
> uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>
> The arch__{intr|user}_simd_reg_mask() functions retrieve the bitmap of
> supported SIMD registers, such as XMM/YMM/ZMM on x86 platforms.
>
> The arch__{intr|user}_pred_reg_mask() functions retrieve the bitmap of
> supported PRED registers, such as OPMASK on x86 platforms.
>
> The arch__{intr|user}_simd_reg_bitmap_qwords() functions provide the
> exact bitmap and number of qwords for a specific type of SIMD register.
> For example, for XMM registers on x86 platforms, the returned bitmap is
> 0xffff (XMM0 ~ XMM15) and the qwords number is 2 (128 bits for each XMM).
>
> The arch__{intr|user}_pred_reg_bitmap_qwords() functions provide the
> exact bitmap and number of qwords for a specific type of PRED register.
> For example, for OPMASK registers on x86 platforms, the returned bitmap
> is 0xff (OPMASK0 ~ OPMASK7) and the qwords number is 1 (64 bits for each
> OPMASK).
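For reference, a minimal caller-side sketch of this bitmap/qwords
contract (hypothetical values, assuming a kernel that supports XMM
sampling; the reg index is the bit position taken from the sample_reg
mask, i.e. PERF_REG_X86_XMM here):

	u16 qwords = 0;
	uint64_t bitmap;

	bitmap = arch__intr_simd_reg_bitmap_qwords(PERF_REG_X86_XMM, &qwords);
	/* expect bitmap == 0xffff (XMM0 ~ XMM15) and qwords == 2 */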
>
> Additionally, the function __parse_regs() is enhanced to support parsing
> these newly introduced SIMD registers. Currently, each type of register
> can only be sampled collectively; sampling a specific SIMD register is
> not supported. For example, all XMM registers are sampled together rather
> than sampling only XMM0.
>
> When multiple overlapping register types, such as XMM and YMM, are
> sampled simultaneously, only the superset (YMM registers) is sampled.
>
> With this patch, all supported sampling registers on x86 platforms are
> displayed as follows.
>
> $perf record -I?
> available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
> R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
> R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>
> $perf record --user-regs=?
> available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
> R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
> R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> ---
> tools/perf/arch/x86/util/perf_regs.c | 470 +++++++++++++++++++++-
> tools/perf/util/evsel.c | 27 ++
> tools/perf/util/parse-regs-options.c | 151 ++++++-
> tools/perf/util/perf_event_attr_fprintf.c | 6 +
> tools/perf/util/perf_regs.c | 59 +++
> tools/perf/util/perf_regs.h | 11 +
> tools/perf/util/record.h | 6 +
> 7 files changed, 714 insertions(+), 16 deletions(-)
>
> diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
> index 12fd93f04802..db41430f3b07 100644
> --- a/tools/perf/arch/x86/util/perf_regs.c
> +++ b/tools/perf/arch/x86/util/perf_regs.c
> @@ -13,6 +13,49 @@
> #include "../../../util/pmu.h"
> #include "../../../util/pmus.h"
>
> +static const struct sample_reg sample_reg_masks_ext[] = {
> + SMPL_REG(AX, PERF_REG_X86_AX),
> + SMPL_REG(BX, PERF_REG_X86_BX),
> + SMPL_REG(CX, PERF_REG_X86_CX),
> + SMPL_REG(DX, PERF_REG_X86_DX),
> + SMPL_REG(SI, PERF_REG_X86_SI),
> + SMPL_REG(DI, PERF_REG_X86_DI),
> + SMPL_REG(BP, PERF_REG_X86_BP),
> + SMPL_REG(SP, PERF_REG_X86_SP),
> + SMPL_REG(IP, PERF_REG_X86_IP),
> + SMPL_REG(FLAGS, PERF_REG_X86_FLAGS),
> + SMPL_REG(CS, PERF_REG_X86_CS),
> + SMPL_REG(SS, PERF_REG_X86_SS),
> +#ifdef HAVE_ARCH_X86_64_SUPPORT
> + SMPL_REG(R8, PERF_REG_X86_R8),
> + SMPL_REG(R9, PERF_REG_X86_R9),
> + SMPL_REG(R10, PERF_REG_X86_R10),
> + SMPL_REG(R11, PERF_REG_X86_R11),
> + SMPL_REG(R12, PERF_REG_X86_R12),
> + SMPL_REG(R13, PERF_REG_X86_R13),
> + SMPL_REG(R14, PERF_REG_X86_R14),
> + SMPL_REG(R15, PERF_REG_X86_R15),
> + SMPL_REG(R16, PERF_REG_X86_R16),
> + SMPL_REG(R17, PERF_REG_X86_R17),
> + SMPL_REG(R18, PERF_REG_X86_R18),
> + SMPL_REG(R19, PERF_REG_X86_R19),
> + SMPL_REG(R20, PERF_REG_X86_R20),
> + SMPL_REG(R21, PERF_REG_X86_R21),
> + SMPL_REG(R22, PERF_REG_X86_R22),
> + SMPL_REG(R23, PERF_REG_X86_R23),
> + SMPL_REG(R24, PERF_REG_X86_R24),
> + SMPL_REG(R25, PERF_REG_X86_R25),
> + SMPL_REG(R26, PERF_REG_X86_R26),
> + SMPL_REG(R27, PERF_REG_X86_R27),
> + SMPL_REG(R28, PERF_REG_X86_R28),
> + SMPL_REG(R29, PERF_REG_X86_R29),
> + SMPL_REG(R30, PERF_REG_X86_R30),
> + SMPL_REG(R31, PERF_REG_X86_R31),
> + SMPL_REG(SSP, PERF_REG_X86_SSP),
> +#endif
> + SMPL_REG_END
> +};
> +
> static const struct sample_reg sample_reg_masks[] = {
> SMPL_REG(AX, PERF_REG_X86_AX),
> SMPL_REG(BX, PERF_REG_X86_BX),
> @@ -276,27 +319,404 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
> return SDT_ARG_VALID;
> }
>
> +static bool support_simd_reg(u64 sample_type, u16 qwords, u64 mask, bool pred)
To make the code easier to read, it'd be nice to document sample_type,
qwords and mask here.
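Something like this, perhaps (just a sketch):

	/*
	 * Probe whether the kernel accepts this SIMD register sampling
	 * config.
	 * @sample_type: PERF_SAMPLE_REGS_INTR or PERF_SAMPLE_REGS_USER
	 * @qwords: per-register width in 64-bit words, e.g. 2 for XMM
	 * @mask: bitmap of the requested registers of that type
	 * @pred: probe the predicate (OPMASK) rather than vector registers
	 */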
> +{
> + struct perf_event_attr attr = {
> + .type = PERF_TYPE_HARDWARE,
> + .config = PERF_COUNT_HW_CPU_CYCLES,
> + .sample_type = sample_type,
> + .disabled = 1,
> + .exclude_kernel = 1,
> + .sample_simd_regs_enabled = 1,
> + };
> + int fd;
> +
> + attr.sample_period = 1;
> +
> + if (!pred) {
> + attr.sample_simd_vec_reg_qwords = qwords;
> + if (sample_type == PERF_SAMPLE_REGS_INTR)
> + attr.sample_simd_vec_reg_intr = mask;
> + else
> + attr.sample_simd_vec_reg_user = mask;
> + } else {
> + attr.sample_simd_pred_reg_qwords = PERF_X86_OPMASK_QWORDS;
> + if (sample_type == PERF_SAMPLE_REGS_INTR)
> + attr.sample_simd_pred_reg_intr = PERF_X86_SIMD_PRED_MASK;
> + else
> + attr.sample_simd_pred_reg_user = PERF_X86_SIMD_PRED_MASK;
> + }
> +
> + if (perf_pmus__num_core_pmus() > 1) {
> + struct perf_pmu *pmu = NULL;
> + __u64 type = PERF_TYPE_RAW;
It should be okay to do:
__u64 type = perf_pmus__find_core_pmu()->type
rather than have the whole loop below.
> +
> + /*
> + * The same register set is supported among different hybrid PMUs.
> + * Only check the first available one.
> + */
> + while ((pmu = perf_pmus__scan_core(pmu)) != NULL) {
> + type = pmu->type;
> + break;
> + }
> + attr.config |= type << PERF_PMU_TYPE_SHIFT;
> + }
> +
> + event_attr_init(&attr);
> +
> + fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
> + if (fd != -1) {
> + close(fd);
> + return true;
> + }
> +
> + return false;
> +}
> +
> +static bool __arch_simd_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
> +{
> + bool supported = false;
> + u64 bits;
> +
> + *mask = 0;
> + *qwords = 0;
> +
> + switch (reg) {
> + case PERF_REG_X86_XMM:
> + bits = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
> + supported = support_simd_reg(sample_type, PERF_X86_XMM_QWORDS, bits, false);
> + if (supported) {
> + *mask = bits;
> + *qwords = PERF_X86_XMM_QWORDS;
> + }
> + break;
> + case PERF_REG_X86_YMM:
> + bits = BIT_ULL(PERF_X86_SIMD_YMM_REGS) - 1;
> + supported = support_simd_reg(sample_type, PERF_X86_YMM_QWORDS, bits, false);
> + if (supported) {
> + *mask = bits;
> + *qwords = PERF_X86_YMM_QWORDS;
> + }
> + break;
> + case PERF_REG_X86_ZMM:
> + bits = BIT_ULL(PERF_X86_SIMD_ZMM_REGS) - 1;
> + supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
> + if (supported) {
> + *mask = bits;
> + *qwords = PERF_X86_ZMM_QWORDS;
> + break;
> + }
> +
> + bits = BIT_ULL(PERF_X86_SIMD_ZMMH_REGS) - 1;
> + supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
> + if (supported) {
> + *mask = bits;
> + *qwords = PERF_X86_ZMMH_QWORDS;
> + }
> + break;
> + default:
> + break;
> + }
> +
> + return supported;
> +}
> +
> +static bool __arch_pred_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
> +{
> + bool supported = false;
> + u64 bits;
> +
> + *mask = 0;
> + *qwords = 0;
> +
> + switch (reg) {
> + case PERF_REG_X86_OPMASK:
> + bits = BIT_ULL(PERF_X86_SIMD_OPMASK_REGS) - 1;
> + supported = support_simd_reg(sample_type, PERF_X86_OPMASK_QWORDS, bits, true);
> + if (supported) {
> + *mask = bits;
> + *qwords = PERF_X86_OPMASK_QWORDS;
> + }
> + break;
> + default:
> + break;
> + }
> +
> + return supported;
> +}
> +
> +static bool has_cap_simd_regs(void)
> +{
> + uint64_t mask = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
> + u16 qwords = PERF_X86_XMM_QWORDS;
> + static bool has_cap_simd_regs;
> + static bool cached;
> +
> + if (cached)
> + return has_cap_simd_regs;
> +
> + has_cap_simd_regs = __arch_simd_reg_mask(PERF_SAMPLE_REGS_INTR,
> + PERF_REG_X86_XMM, &mask, &qwords);
> + has_cap_simd_regs |= __arch_simd_reg_mask(PERF_SAMPLE_REGS_USER,
> + PERF_REG_X86_XMM, &mask, &qwords);
> + cached = true;
> +
> + return has_cap_simd_regs;
> +}
> +
> +bool arch_has_simd_regs(u64 mask)
> +{
> + return has_cap_simd_regs() &&
> + mask & GENMASK_ULL(PERF_REG_X86_SSP, PERF_REG_X86_R16);
> +}
> +
> +static const struct sample_reg sample_simd_reg_masks[] = {
> + SMPL_REG(XMM, PERF_REG_X86_XMM),
> + SMPL_REG(YMM, PERF_REG_X86_YMM),
> + SMPL_REG(ZMM, PERF_REG_X86_ZMM),
> + SMPL_REG_END
> +};
> +
> +static const struct sample_reg sample_pred_reg_masks[] = {
> + SMPL_REG(OPMASK, PERF_REG_X86_OPMASK),
> + SMPL_REG_END
> +};
> +
> +const struct sample_reg *arch__sample_simd_reg_masks(void)
> +{
> + return sample_simd_reg_masks;
> +}
> +
> +const struct sample_reg *arch__sample_pred_reg_masks(void)
> +{
> + return sample_pred_reg_masks;
> +}
> +
> +static bool x86_intr_simd_updated;
> +static u64 x86_intr_simd_reg_mask;
> +static u64 x86_intr_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
> +static u16 x86_intr_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
Could we add some comments? I can kind of figure out that "updated" is
a flag for lazy initialization and what the masks are, but "qwords" is
an odd one. The comment could also point out that SIMD doesn't mean the
machine supports SIMD, but that SIMD registers are supported in perf
events.
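Something along these lines, maybe (just a sketch):

	/*
	 * Cached results of probing the kernel for SIMD register sampling
	 * support ("SIMD" meaning the registers can be sampled by perf
	 * events, not that the machine has SIMD). The *_updated flags
	 * implement the lazy initialization, *_mask is the bitmap of
	 * supported registers of each type, and *_qwords is the width of
	 * each register in 64-bit words.
	 */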
> +static bool x86_user_simd_updated;
> +static u64 x86_user_simd_reg_mask;
> +static u64 x86_user_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
> +static u16 x86_user_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
> +
> +static bool x86_intr_pred_updated;
> +static u64 x86_intr_pred_reg_mask;
> +static u64 x86_intr_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
> +static u16 x86_intr_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
> +static bool x86_user_pred_updated;
> +static u64 x86_user_pred_reg_mask;
> +static u64 x86_user_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
> +static u16 x86_user_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
> +
> +static uint64_t __arch__simd_reg_mask(u64 sample_type)
> +{
> + const struct sample_reg *r = NULL;
> + bool supported;
> + u64 mask = 0;
> + int reg;
> +
> + if (!has_cap_simd_regs())
> + return 0;
> +
> + if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_simd_updated)
> + return x86_intr_simd_reg_mask;
> +
> + if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_simd_updated)
> + return x86_user_simd_reg_mask;
> +
> + for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> + supported = false;
> +
> + if (!r->mask)
> + continue;
> + reg = fls64(r->mask) - 1;
> +
> + if (reg >= PERF_REG_X86_MAX_SIMD_REGS)
> + break;
> + if (sample_type == PERF_SAMPLE_REGS_INTR)
> + supported = __arch_simd_reg_mask(sample_type, reg,
> + &x86_intr_simd_mask[reg],
> + &x86_intr_simd_qwords[reg]);
> + else if (sample_type == PERF_SAMPLE_REGS_USER)
> + supported = __arch_simd_reg_mask(sample_type, reg,
> + &x86_user_simd_mask[reg],
> + &x86_user_simd_qwords[reg]);
> + if (supported)
> + mask |= BIT_ULL(reg);
> + }
> +
> + if (sample_type == PERF_SAMPLE_REGS_INTR) {
> + x86_intr_simd_reg_mask = mask;
> + x86_intr_simd_updated = true;
> + } else {
> + x86_user_simd_reg_mask = mask;
> + x86_user_simd_updated = true;
> + }
> +
> + return mask;
> +}
> +
> +static uint64_t __arch__pred_reg_mask(u64 sample_type)
> +{
> + const struct sample_reg *r = NULL;
> + bool supported;
> + u64 mask = 0;
> + int reg;
> +
> + if (!has_cap_simd_regs())
> + return 0;
> +
> + if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_pred_updated)
> + return x86_intr_pred_reg_mask;
> +
> + if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_pred_updated)
> + return x86_user_pred_reg_mask;
> +
> + for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> + supported = false;
> +
> + if (!r->mask)
> + continue;
> + reg = fls64(r->mask) - 1;
> +
> + if (reg >= PERF_REG_X86_MAX_PRED_REGS)
> + break;
> + if (sample_type == PERF_SAMPLE_REGS_INTR)
> + supported = __arch_pred_reg_mask(sample_type, reg,
> + &x86_intr_pred_mask[reg],
> + &x86_intr_pred_qwords[reg]);
> + else if (sample_type == PERF_SAMPLE_REGS_USER)
> + supported = __arch_pred_reg_mask(sample_type, reg,
> + &x86_user_pred_mask[reg],
> + &x86_user_pred_qwords[reg]);
> + if (supported)
> + mask |= BIT_ULL(reg);
> + }
> +
> + if (sample_type == PERF_SAMPLE_REGS_INTR) {
> + x86_intr_pred_reg_mask = mask;
> + x86_intr_pred_updated = true;
> + } else {
> + x86_user_pred_reg_mask = mask;
> + x86_user_pred_updated = true;
> + }
> +
> + return mask;
> +}
This feels repetitive with __arch__simd_reg_mask; could they be
refactored together?
> +
> +uint64_t arch__intr_simd_reg_mask(void)
> +{
> + return __arch__simd_reg_mask(PERF_SAMPLE_REGS_INTR);
> +}
> +
> +uint64_t arch__user_simd_reg_mask(void)
> +{
> + return __arch__simd_reg_mask(PERF_SAMPLE_REGS_USER);
> +}
> +
> +uint64_t arch__intr_pred_reg_mask(void)
> +{
> + return __arch__pred_reg_mask(PERF_SAMPLE_REGS_INTR);
> +}
> +
> +uint64_t arch__user_pred_reg_mask(void)
> +{
> + return __arch__pred_reg_mask(PERF_SAMPLE_REGS_USER);
> +}
> +
> +static uint64_t arch__simd_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
> +{
> + uint64_t mask = 0;
> +
> + *qwords = 0;
> + if (reg < PERF_REG_X86_MAX_SIMD_REGS) {
> + if (intr) {
> + *qwords = x86_intr_simd_qwords[reg];
> + mask = x86_intr_simd_mask[reg];
> + } else {
> + *qwords = x86_user_simd_qwords[reg];
> + mask = x86_user_simd_mask[reg];
> + }
> + }
> +
> + return mask;
> +}
> +
> +static uint64_t arch__pred_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
> +{
> + uint64_t mask = 0;
> +
> + *qwords = 0;
> + if (reg < PERF_REG_X86_MAX_PRED_REGS) {
> + if (intr) {
> + *qwords = x86_intr_pred_qwords[reg];
> + mask = x86_intr_pred_mask[reg];
> + } else {
> + *qwords = x86_user_pred_qwords[reg];
> + mask = x86_user_pred_mask[reg];
> + }
> + }
> +
> + return mask;
> +}
> +
> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords)
> +{
> + if (!x86_intr_simd_updated)
> + arch__intr_simd_reg_mask();
> + return arch__simd_reg_bitmap_qwords(reg, qwords, true);
> +}
> +
> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords)
> +{
> + if (!x86_user_simd_updated)
> + arch__user_simd_reg_mask();
> + return arch__simd_reg_bitmap_qwords(reg, qwords, false);
> +}
> +
> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords)
> +{
> + if (!x86_intr_pred_updated)
> + arch__intr_pred_reg_mask();
> + return arch__pred_reg_bitmap_qwords(reg, qwords, true);
> +}
> +
> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords)
> +{
> + if (!x86_user_pred_updated)
> + arch__user_pred_reg_mask();
> + return arch__pred_reg_bitmap_qwords(reg, qwords, false);
> +}
> +
> const struct sample_reg *arch__sample_reg_masks(void)
> {
> + if (has_cap_simd_regs())
> + return sample_reg_masks_ext;
> return sample_reg_masks;
> }
>
> -uint64_t arch__intr_reg_mask(void)
> +static uint64_t __arch__reg_mask(u64 sample_type, u64 mask, bool has_simd_regs)
> {
> struct perf_event_attr attr = {
> - .type = PERF_TYPE_HARDWARE,
> - .config = PERF_COUNT_HW_CPU_CYCLES,
> - .sample_type = PERF_SAMPLE_REGS_INTR,
> - .sample_regs_intr = PERF_REG_EXTENDED_MASK,
> - .precise_ip = 1,
> - .disabled = 1,
> - .exclude_kernel = 1,
> + .type = PERF_TYPE_HARDWARE,
> + .config = PERF_COUNT_HW_CPU_CYCLES,
> + .sample_type = sample_type,
> + .precise_ip = 1,
> + .disabled = 1,
> + .exclude_kernel = 1,
> + .sample_simd_regs_enabled = has_simd_regs,
> };
> int fd;
> /*
> * In an unnamed union, init it here to build on older gcc versions
> */
> attr.sample_period = 1;
> + if (sample_type == PERF_SAMPLE_REGS_INTR)
> + attr.sample_regs_intr = mask;
> + else
> + attr.sample_regs_user = mask;
>
> if (perf_pmus__num_core_pmus() > 1) {
> struct perf_pmu *pmu = NULL;
> @@ -318,13 +738,41 @@ uint64_t arch__intr_reg_mask(void)
> fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
> if (fd != -1) {
> close(fd);
> - return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
> + return mask;
> }
>
> - return PERF_REGS_MASK;
> + return 0;
> +}
> +
> +uint64_t arch__intr_reg_mask(void)
> +{
> + uint64_t mask = PERF_REGS_MASK;
> +
> + if (has_cap_simd_regs()) {
> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
> + GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
> + true);
It's nice to label constant arguments like this with something like:
/*has_simd_regs=*/true);
Tools like clang-tidy even try to enforce that the argument names match
the comments.
> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
> + BIT_ULL(PERF_REG_X86_SSP),
> + true);
> + } else
> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR, PERF_REG_EXTENDED_MASK, false);
> +
> + return mask;
> }
>
> uint64_t arch__user_reg_mask(void)
> {
> - return PERF_REGS_MASK;
> + uint64_t mask = PERF_REGS_MASK;
> +
> + if (has_cap_simd_regs()) {
> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
> + GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
> + true);
> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
> + BIT_ULL(PERF_REG_X86_SSP),
> + true);
> + }
> +
> + return mask;
The code is repetitive here; could we refactor it into a single
function passing in a user or intr value?
> }
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index 56ebefd075f2..5d1d90cf9488 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -1461,12 +1461,39 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
> if (opts->sample_intr_regs && !evsel->no_aux_samples &&
> !evsel__is_dummy_event(evsel)) {
> attr->sample_regs_intr = opts->sample_intr_regs;
> + attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_intr);
> + evsel__set_sample_bit(evsel, REGS_INTR);
> + }
> +
> + if ((opts->sample_intr_vec_regs || opts->sample_intr_pred_regs) &&
> + !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
> + /* A non-zero pred qwords implies that the SIMD register set is used */
> + if (opts->sample_pred_regs_qwords)
> + attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
> + else
> + attr->sample_simd_pred_reg_qwords = 1;
> + attr->sample_simd_vec_reg_intr = opts->sample_intr_vec_regs;
> + attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
> + attr->sample_simd_pred_reg_intr = opts->sample_intr_pred_regs;
> evsel__set_sample_bit(evsel, REGS_INTR);
> }
>
> if (opts->sample_user_regs && !evsel->no_aux_samples &&
> !evsel__is_dummy_event(evsel)) {
> attr->sample_regs_user |= opts->sample_user_regs;
> + attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_user);
> + evsel__set_sample_bit(evsel, REGS_USER);
> + }
> +
> + if ((opts->sample_user_vec_regs || opts->sample_user_pred_regs) &&
> + !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
> + if (opts->sample_pred_regs_qwords)
> + attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
> + else
> + attr->sample_simd_pred_reg_qwords = 1;
> + attr->sample_simd_vec_reg_user = opts->sample_user_vec_regs;
> + attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
> + attr->sample_simd_pred_reg_user = opts->sample_user_pred_regs;
> evsel__set_sample_bit(evsel, REGS_USER);
> }
>
> diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
> index cda1c620968e..0bd100392889 100644
> --- a/tools/perf/util/parse-regs-options.c
> +++ b/tools/perf/util/parse-regs-options.c
> @@ -4,19 +4,139 @@
> #include <stdint.h>
> #include <string.h>
> #include <stdio.h>
> +#include <linux/bitops.h>
> #include "util/debug.h"
> #include <subcmd/parse-options.h>
> #include "util/perf_regs.h"
> #include "util/parse-regs-options.h"
> +#include "record.h"
> +
> +static void __print_simd_regs(bool intr, uint64_t simd_mask)
> +{
> + const struct sample_reg *r = NULL;
> + uint64_t bitmap = 0;
> + u16 qwords = 0;
> + int reg_idx;
> +
> + if (!simd_mask)
> + return;
> +
> + for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> + if (!(r->mask & simd_mask))
> + continue;
> + reg_idx = fls64(r->mask) - 1;
> + if (intr)
> + bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
> + else
> + bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
> + if (bitmap)
> + fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
> + }
> +}
> +
> +static void __print_pred_regs(bool intr, uint64_t pred_mask)
> +{
> + const struct sample_reg *r = NULL;
> + uint64_t bitmap = 0;
> + u16 qwords = 0;
> + int reg_idx;
> +
> + if (!pred_mask)
> + return;
> +
> + for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> + if (!(r->mask & pred_mask))
> + continue;
> + reg_idx = fls64(r->mask) - 1;
> + if (intr)
> + bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
> + else
> + bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
> + if (bitmap)
> + fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
> + }
> +}
> +
> +static bool __parse_simd_regs(struct record_opts *opts, char *s, bool intr)
> +{
> + const struct sample_reg *r = NULL;
> + bool matched = false;
> + uint64_t bitmap = 0;
> + u16 qwords = 0;
> + int reg_idx;
> +
> + for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> + if (strcasecmp(s, r->name))
> + continue;
> + if (!fls64(r->mask))
> + continue;
> + reg_idx = fls64(r->mask) - 1;
> + if (intr)
> + bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
> + else
> + bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
> + matched = true;
> + break;
> + }
> +
> + /* Just need the highest qwords */
I'm not following here. Does the bitmap need to handle gaps?
> + if (qwords > opts->sample_vec_regs_qwords) {
> + opts->sample_vec_regs_qwords = qwords;
> + if (intr)
> + opts->sample_intr_vec_regs = bitmap;
> + else
> + opts->sample_user_vec_regs = bitmap;
> + }
> +
> + return matched;
> +}
> +
> +static bool __parse_pred_regs(struct record_opts *opts, char *s, bool intr)
> +{
> + const struct sample_reg *r = NULL;
> + bool matched = false;
> + uint64_t bitmap = 0;
> + u16 qwords = 0;
> + int reg_idx;
> +
> + for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> + if (strcasecmp(s, r->name))
> + continue;
> + if (!fls64(r->mask))
> + continue;
> + reg_idx = fls64(r->mask) - 1;
> + if (intr)
> + bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
> + else
> + bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
> + matched = true;
> + break;
> + }
> +
> + /* Just need the highest qwords */
Again repetitive; could we have a single function?
> + if (qwords > opts->sample_pred_regs_qwords) {
> + opts->sample_pred_regs_qwords = qwords;
> + if (intr)
> + opts->sample_intr_pred_regs = bitmap;
> + else
> + opts->sample_user_pred_regs = bitmap;
> + }
> +
> + return matched;
> +}
>
> static int
> __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> {
> uint64_t *mode = (uint64_t *)opt->value;
> const struct sample_reg *r = NULL;
> + struct record_opts *opts;
> char *s, *os = NULL, *p;
> - int ret = -1;
> + bool has_simd_regs = false;
> uint64_t mask;
> + uint64_t simd_mask;
> + uint64_t pred_mask;
> + int ret = -1;
>
> if (unset)
> return 0;
> @@ -27,10 +147,17 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> if (*mode)
> return -1;
>
> - if (intr)
> + if (intr) {
> + opts = container_of(opt->value, struct record_opts, sample_intr_regs);
> mask = arch__intr_reg_mask();
> - else
> + simd_mask = arch__intr_simd_reg_mask();
> + pred_mask = arch__intr_pred_reg_mask();
> + } else {
> + opts = container_of(opt->value, struct record_opts, sample_user_regs);
> mask = arch__user_reg_mask();
> + simd_mask = arch__user_simd_reg_mask();
> + pred_mask = arch__user_pred_reg_mask();
> + }
>
> /* str may be NULL in case no arg is passed to -I */
> if (str) {
> @@ -50,10 +177,24 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> if (r->mask & mask)
> fprintf(stderr, "%s ", r->name);
> }
> + __print_simd_regs(intr, simd_mask);
> + __print_pred_regs(intr, pred_mask);
> fputc('\n', stderr);
> /* just printing available regs */
> goto error;
> }
> +
> + if (simd_mask) {
> + has_simd_regs = __parse_simd_regs(opts, s, intr);
> + if (has_simd_regs)
> + goto next;
> + }
> + if (pred_mask) {
> + has_simd_regs = __parse_pred_regs(opts, s, intr);
> + if (has_simd_regs)
> + goto next;
> + }
> +
> for (r = arch__sample_reg_masks(); r->name; r++) {
> if ((r->mask & mask) && !strcasecmp(s, r->name))
> break;
> @@ -65,7 +206,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> }
>
> *mode |= r->mask;
> -
> +next:
> if (!p)
> break;
>
> @@ -75,7 +216,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> ret = 0;
>
> /* default to all possible regs */
> - if (*mode == 0)
> + if (*mode == 0 && !has_simd_regs)
> *mode = mask;
> error:
> free(os);
> diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
> index 66b666d9ce64..fb0366d050cf 100644
> --- a/tools/perf/util/perf_event_attr_fprintf.c
> +++ b/tools/perf/util/perf_event_attr_fprintf.c
> @@ -360,6 +360,12 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
> PRINT_ATTRf(aux_start_paused, p_unsigned);
> PRINT_ATTRf(aux_pause, p_unsigned);
> PRINT_ATTRf(aux_resume, p_unsigned);
> + PRINT_ATTRf(sample_simd_pred_reg_qwords, p_unsigned);
> + PRINT_ATTRf(sample_simd_pred_reg_intr, p_hex);
> + PRINT_ATTRf(sample_simd_pred_reg_user, p_hex);
> + PRINT_ATTRf(sample_simd_vec_reg_qwords, p_unsigned);
> + PRINT_ATTRf(sample_simd_vec_reg_intr, p_hex);
> + PRINT_ATTRf(sample_simd_vec_reg_user, p_hex);
>
> return ret;
> }
> diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
> index 44b90bbf2d07..e8a9fabc92e6 100644
> --- a/tools/perf/util/perf_regs.c
> +++ b/tools/perf/util/perf_regs.c
> @@ -11,6 +11,11 @@ int __weak arch_sdt_arg_parse_op(char *old_op __maybe_unused,
> return SDT_ARG_SKIP;
> }
>
> +bool __weak arch_has_simd_regs(u64 mask __maybe_unused)
> +{
> + return false;
> +}
> +
> uint64_t __weak arch__intr_reg_mask(void)
> {
> return 0;
> @@ -21,6 +26,50 @@ uint64_t __weak arch__user_reg_mask(void)
> return 0;
> }
>
> +uint64_t __weak arch__intr_simd_reg_mask(void)
> +{
> + return 0;
> +}
> +
> +uint64_t __weak arch__user_simd_reg_mask(void)
> +{
> + return 0;
> +}
> +
> +uint64_t __weak arch__intr_pred_reg_mask(void)
> +{
> + return 0;
> +}
> +
> +uint64_t __weak arch__user_pred_reg_mask(void)
> +{
> + return 0;
> +}
> +
> +uint64_t __weak arch__intr_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> +{
> + *qwords = 0;
> + return 0;
> +}
> +
> +uint64_t __weak arch__user_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> +{
> + *qwords = 0;
> + return 0;
> +}
> +
> +uint64_t __weak arch__intr_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> +{
> + *qwords = 0;
> + return 0;
> +}
> +
> +uint64_t __weak arch__user_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> +{
> + *qwords = 0;
> + return 0;
> +}
> +
> static const struct sample_reg sample_reg_masks[] = {
> SMPL_REG_END
> };
> @@ -30,6 +79,16 @@ const struct sample_reg * __weak arch__sample_reg_masks(void)
> return sample_reg_masks;
> }
>
> +const struct sample_reg * __weak arch__sample_simd_reg_masks(void)
> +{
> + return sample_reg_masks;
> +}
> +
> +const struct sample_reg * __weak arch__sample_pred_reg_masks(void)
> +{
> + return sample_reg_masks;
> +}
Thinking out loud: I wonder if there is a way to hide the weak
functions. It seems the support is tied to PMUs, particularly core
PMUs; perhaps we can push things into the pmu and arch pmu code. Then
we'd ask the PMU to parse the register strings, set up the
perf_event_attr, etc. I'm somewhat scared these functions will be used
on the report rather than the record side of things, thereby breaking
perf.data support when the host kernel does or doesn't have SIMD
support.
Thanks,
Ian
> +
> const char *perf_reg_name(int id, const char *arch)
> {
> const char *reg_name = NULL;
> diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
> index f2d0736d65cc..bce9c4cfd1bf 100644
> --- a/tools/perf/util/perf_regs.h
> +++ b/tools/perf/util/perf_regs.h
> @@ -24,9 +24,20 @@ enum {
> };
>
> int arch_sdt_arg_parse_op(char *old_op, char **new_op);
> +bool arch_has_simd_regs(u64 mask);
> uint64_t arch__intr_reg_mask(void);
> uint64_t arch__user_reg_mask(void);
> const struct sample_reg *arch__sample_reg_masks(void);
> +const struct sample_reg *arch__sample_simd_reg_masks(void);
> +const struct sample_reg *arch__sample_pred_reg_masks(void);
> +uint64_t arch__intr_simd_reg_mask(void);
> +uint64_t arch__user_simd_reg_mask(void);
> +uint64_t arch__intr_pred_reg_mask(void);
> +uint64_t arch__user_pred_reg_mask(void);
> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>
> const char *perf_reg_name(int id, const char *arch);
> int perf_reg_value(u64 *valp, struct regs_dump *regs, int id);
> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
> index ea3a6c4657ee..825ffb4cc53f 100644
> --- a/tools/perf/util/record.h
> +++ b/tools/perf/util/record.h
> @@ -59,7 +59,13 @@ struct record_opts {
> unsigned int user_freq;
> u64 branch_stack;
> u64 sample_intr_regs;
> + u64 sample_intr_vec_regs;
> u64 sample_user_regs;
> + u64 sample_user_vec_regs;
> + u16 sample_pred_regs_qwords;
> + u16 sample_vec_regs_qwords;
> + u16 sample_intr_pred_regs;
> + u16 sample_user_pred_regs;
> u64 default_interval;
> u64 user_interval;
> size_t auxtrace_snapshot_size;
> --
> 2.34.1
>
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf
2025-12-03 6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
` (18 preceding siblings ...)
2025-12-03 6:55 ` [Patch v5 19/19] perf regs: Enable dumping of SIMD registers Dapeng Mi
@ 2025-12-04 0:24 ` Ian Rogers
2025-12-04 3:28 ` Mi, Dapeng
19 siblings, 1 reply; 55+ messages in thread
From: Ian Rogers @ 2025-12-04 0:24 UTC (permalink / raw)
To: Dapeng Mi
Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao
On Tue, Dec 2, 2025 at 10:58 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
>
> Changes since V4:
> - Rewrite some functions comments and commit messages (Dave)
> - Add arch-PEBS based SIMD/eGPRs/SSP sampling support (Patch 15/19)
> - Fix "suspicious NMI" warning observed on PTL/NVL P-core and DMR by
> activating back-to-back NMI detection mechanism (Patch 16/19)
> - Fix some minor issues on perf-tool patches (Patch 18/19)
>
> Changes since V3:
> - Drop the SIMD registers if an NMI hits kernel mode for REGS_USER.
> - Only dump the available regs, rather than zero and dump the
> unavailable regs. It's possible that the dumped registers are a subset
> of the requested registers.
> - Some minor updates to address Dapeng's comments in V3.
>
> Changes since V2:
> - Use the FPU format for the x86_pmu.ext_regs_mask as well
> - Add a check before invoking xsaves_nmi()
> - Add perf_simd_reg_check() to retrieve the number of available
> registers. If the kernel fails to get the requested registers, e.g.,
> XSAVES fails, nothing dumps to the userspace (the V2 dumps all 0s).
> - Add POC perf tool patches
>
> Changes since V1:
> - Apply the new interfaces to configure and dump the SIMD registers
> - Utilize the existing FPU functions, e.g., xstate_calculate_size,
> get_xsave_addr().
>
> Starting from Intel Ice Lake, XMM registers can be collected in a PEBS
> record. Future Architecture PEBS will include additional registers such
> as YMM, ZMM, OPMASK, SSP and APX eGPRs, contingent on hardware support.
>
> This patch set introduces a software solution to mitigate the hardware
> requirement by utilizing the XSAVES command to retrieve the requested
> registers in the overflow handler. This feature is no longer limited to
> PEBS events or specific platforms. While the hardware solution remains
> preferable due to its lower overhead and higher accuracy, this software
> approach provides a viable alternative.
>
> The solution is theoretically compatible with all x86 platforms but is
> currently enabled on newer platforms, including Sapphire Rapids and
> later P-core server platforms, Sierra Forest and later E-core server
> platforms and recent Client platforms, like Arrow Lake, Panther Lake and
> Nova Lake.
>
> Newly supported registers include YMM, ZMM, OPMASK, SSP, and APX eGPRs.
> Due to space constraints in sample_regs_user/intr, new fields have been
> introduced in the perf_event_attr structure to accommodate these
> registers.
>
> After a long discussion in V1,
> https://lore.kernel.org/lkml/3f1c9a9e-cb63-47ff-a5e9-06555fa6cc9a@linux.intel.com/
> The below new fields are introduced.
>
> @@ -543,6 +545,25 @@ struct perf_event_attr {
> __u64 sig_data;
>
> __u64 config3; /* extension of config2 */
> +
> +
> + /*
> + * Defines set of SIMD registers to dump on samples.
> + * The sample_simd_regs_enabled !=0 implies the
> + * set of SIMD registers is used to config all SIMD registers.
> + * If !sample_simd_regs_enabled, sample_regs_XXX may be used to
> + * config some SIMD registers on X86.
> + */
> + union {
> + __u16 sample_simd_regs_enabled;
> + __u16 sample_simd_pred_reg_qwords;
> + };
> + __u32 sample_simd_pred_reg_intr;
> + __u32 sample_simd_pred_reg_user;
> + __u16 sample_simd_vec_reg_qwords;
> + __u64 sample_simd_vec_reg_intr;
> + __u64 sample_simd_vec_reg_user;
> + __u32 __reserved_4;
> };
> @@ -1016,7 +1037,15 @@ enum perf_event_type {
> * } && PERF_SAMPLE_BRANCH_STACK
> *
> * { u64 abi; # enum perf_sample_regs_abi
> - * u64 regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
> + * u64 regs[weight(mask)];
> + * struct {
> + * u16 nr_vectors;
> + * u16 vector_qwords;
> + * u16 nr_pred;
> + * u16 pred_qwords;
> + * u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
> + * } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
> + * } && PERF_SAMPLE_REGS_USER
> *
> * { u64 size;
> * char data[size];
> @@ -1043,7 +1072,15 @@ enum perf_event_type {
> * { u64 data_src; } && PERF_SAMPLE_DATA_SRC
> * { u64 transaction; } && PERF_SAMPLE_TRANSACTION
> * { u64 abi; # enum perf_sample_regs_abi
> - * u64 regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
> + * u64 regs[weight(mask)];
> + * struct {
> + * u16 nr_vectors;
> + * u16 vector_qwords;
> + * u16 nr_pred;
> + * u16 pred_qwords;
> + * u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
> + * } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
> + * } && PERF_SAMPLE_REGS_INTR
> * { u64 phys_addr;} && PERF_SAMPLE_PHYS_ADDR
> * { u64 cgroup;} && PERF_SAMPLE_CGROUP
> * { u64 data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
>
>
> To maintain simplicity, a single field, sample_simd_{vec|pred}_reg_qwords,
> is introduced to indicate register width. For example:
> - sample_simd_vec_reg_qwords = 2 for XMM registers (128 bits) on x86
> - sample_simd_vec_reg_qwords = 4 for YMM registers (256 bits) on x86
>
> Four additional fields, sample_simd_{vec|pred}_reg_{intr|user}, represent
> the bitmap of sampling registers. For instance, the bitmap for x86
> XMM registers is 0xffff (16 XMM registers). Although users can
> theoretically sample a subset of registers, the current perf-tool
> implementation supports sampling all registers of each type to avoid
> complexity.
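For illustration, the attr configuration implied by the description
above (assembled from the examples here, not taken verbatim from the
patches) to sample all 16 YMM registers on interrupt would be:

	attr.sample_simd_regs_enabled   = 1;      /* select the SIMD config scheme */
	attr.sample_simd_vec_reg_qwords = 4;      /* 256-bit YMM */
	attr.sample_simd_vec_reg_intr   = 0xffff; /* YMM0 ~ YMM15 */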
>
> A new ABI, PERF_SAMPLE_REGS_ABI_SIMD, is introduced to signal user space
> tools about the presence of SIMD registers in sampling records. When this
> flag is detected, tools should recognize that extra SIMD register data
> follows the general register data. The layout of the extra SIMD register
> data is displayed as follows.
>
> u16 nr_vectors;
> u16 vector_qwords;
> u16 nr_pred;
> u16 pred_qwords;
> u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
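(For example, with the values from the ZMM dump below, nr_vectors = 32,
vector_qwords = 8, nr_pred = 8 and pred_qwords = 1, data[] holds
32 * 8 + 8 * 1 = 264 u64s, i.e. 2112 extra bytes per sample.)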
>
> With this patch set, sampling for the aforementioned registers is
> supported on the Intel Nova Lake platform.
>
> Examples:
> $perf record -I?
> available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
> R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
> R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
nit: It seems strange in this output to mix ranges like "XMM0-15" but
then list out "R8....R31". That said, we have tests that explicitly
look for the non-range pattern:
https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/tests/shell/record.sh?h=perf-tools-next#n106
Thanks,
Ian
> $perf record --user-regs=?
> available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
> R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
> R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>
> $perf record -e branches:p -Iax,bx,r8,r16,r31,ssp,xmm,ymm,zmm,opmask -c 100000 ./test
> $perf report -D
>
> ... ...
> 14027761992115 0xcf30 [0x8a8]: PERF_RECORD_SAMPLE(IP, 0x1): 29964/29964:
> 0xffffffff9f085e24 period: 100000 addr: 0
> ... intr regs: mask 0x18001010003 ABI 64-bit
> .... AX 0xdffffc0000000000
> .... BX 0xffff8882297685e8
> .... R8 0x0000000000000000
> .... R16 0x0000000000000000
> .... R31 0x0000000000000000
> .... SSP 0x0000000000000000
> ... SIMD ABI nr_vectors 32 vector_qwords 8 nr_pred 8 pred_qwords 1
> .... ZMM [0] 0xffffffffffffffff
> .... ZMM [0] 0x0000000000000001
> .... ZMM [0] 0x0000000000000000
> .... ZMM [0] 0x0000000000000000
> .... ZMM [0] 0x0000000000000000
> .... ZMM [0] 0x0000000000000000
> .... ZMM [0] 0x0000000000000000
> .... ZMM [0] 0x0000000000000000
> .... ZMM [1] 0x003a6b6165506d56
> ... ...
> .... ZMM [31] 0x0000000000000000
> .... ZMM [31] 0x0000000000000000
> .... ZMM [31] 0x0000000000000000
> .... ZMM [31] 0x0000000000000000
> .... ZMM [31] 0x0000000000000000
> .... ZMM [31] 0x0000000000000000
> .... ZMM [31] 0x0000000000000000
> .... ZMM [31] 0x0000000000000000
> .... OPMASK[0] 0x00000000fffffe00
> .... OPMASK[1] 0x0000000000ffffff
> .... OPMASK[2] 0x000000000000007f
> .... OPMASK[3] 0x0000000000000000
> .... OPMASK[4] 0x0000000000010080
> .... OPMASK[5] 0x0000000000000000
> .... OPMASK[6] 0x0000400004000000
> .... OPMASK[7] 0x0000000000000000
> ... ...
>
>
> History:
> v4: https://lore.kernel.org/all/20250925061213.178796-1-dapeng1.mi@linux.intel.com/
> v3: https://lore.kernel.org/lkml/20250815213435.1702022-1-kan.liang@linux.intel.com/
> v2: https://lore.kernel.org/lkml/20250626195610.405379-1-kan.liang@linux.intel.com/
> v1: https://lore.kernel.org/lkml/20250613134943.3186517-1-kan.liang@linux.intel.com/
>
> Dapeng Mi (3):
> perf: Eliminate duplicate arch-specific functions definations
> perf/x86/intel: Enable arch-PEBS based SIMD/eGPRs/SSP sampling
> perf/x86: Activate back-to-back NMI detection for arch-PEBS induced
> NMIs
>
> Kan Liang (16):
> perf/x86: Use x86_perf_regs in the x86 nmi handler
> perf/x86: Introduce x86-specific x86_pmu_setup_regs_data()
> x86/fpu/xstate: Add xsaves_nmi() helper
> perf: Move and rename has_extended_regs() for ARCH-specific use
> perf/x86: Add support for XMM registers in non-PEBS and REGS_USER
> perf: Add sampling support for SIMD registers
> perf/x86: Enable XMM sampling using sample_simd_vec_reg_* fields
> perf/x86: Enable YMM sampling using sample_simd_vec_reg_* fields
> perf/x86: Enable ZMM sampling using sample_simd_vec_reg_* fields
> perf/x86: Enable OPMASK sampling using sample_simd_pred_reg_* fields
> perf/x86: Enable eGPRs sampling using sample_regs_* fields
> perf/x86: Enable SSP sampling using sample_regs_* fields
> perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS capability
> perf headers: Sync with the kernel headers
> perf parse-regs: Support new SIMD sampling format
> perf regs: Enable dumping of SIMD registers
>
> arch/arm/kernel/perf_regs.c | 8 +-
> arch/arm64/kernel/perf_regs.c | 8 +-
> arch/csky/kernel/perf_regs.c | 8 +-
> arch/loongarch/kernel/perf_regs.c | 8 +-
> arch/mips/kernel/perf_regs.c | 8 +-
> arch/parisc/kernel/perf_regs.c | 8 +-
> arch/powerpc/perf/perf_regs.c | 2 +-
> arch/riscv/kernel/perf_regs.c | 8 +-
> arch/s390/kernel/perf_regs.c | 2 +-
> arch/x86/events/core.c | 326 +++++++++++-
> arch/x86/events/intel/core.c | 117 ++++-
> arch/x86/events/intel/ds.c | 134 ++++-
> arch/x86/events/perf_event.h | 85 +++-
> arch/x86/include/asm/fpu/xstate.h | 3 +
> arch/x86/include/asm/msr-index.h | 7 +
> arch/x86/include/asm/perf_event.h | 38 +-
> arch/x86/include/uapi/asm/perf_regs.h | 62 +++
> arch/x86/kernel/fpu/xstate.c | 25 +-
> arch/x86/kernel/perf_regs.c | 131 ++++-
> include/linux/perf_event.h | 16 +
> include/linux/perf_regs.h | 36 +-
> include/uapi/linux/perf_event.h | 45 +-
> kernel/events/core.c | 132 ++++-
> tools/arch/x86/include/uapi/asm/perf_regs.h | 62 +++
> tools/include/uapi/linux/perf_event.h | 45 +-
> tools/perf/arch/x86/util/perf_regs.c | 470 +++++++++++++++++-
> tools/perf/util/evsel.c | 47 ++
> tools/perf/util/parse-regs-options.c | 151 +++++-
> .../perf/util/perf-regs-arch/perf_regs_x86.c | 43 ++
> tools/perf/util/perf_event_attr_fprintf.c | 6 +
> tools/perf/util/perf_regs.c | 59 +++
> tools/perf/util/perf_regs.h | 11 +
> tools/perf/util/record.h | 6 +
> tools/perf/util/sample.h | 10 +
> tools/perf/util/session.c | 78 ++-
> 35 files changed, 2012 insertions(+), 193 deletions(-)
>
>
> base-commit: 9929dffce5ed7e2988e0274f4db98035508b16d9
> prerequisite-patch-id: a15bcd62a8dcd219d17489eef88b66ea5488a2a0
> --
> 2.34.1
>
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [Patch v5 17/19] perf headers: Sync with the kernel headers
2025-12-03 23:43 ` Ian Rogers
@ 2025-12-04 1:37 ` Mi, Dapeng
2025-12-04 7:28 ` Ian Rogers
0 siblings, 1 reply; 55+ messages in thread
From: Mi, Dapeng @ 2025-12-04 1:37 UTC (permalink / raw)
To: Ian Rogers
Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang
On 12/4/2025 7:43 AM, Ian Rogers wrote:
> On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> Update include/uapi/linux/perf_event.h and
>> arch/x86/include/uapi/asm/perf_regs.h to support extended regs.
>>
>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> ---
>> tools/arch/x86/include/uapi/asm/perf_regs.h | 62 +++++++++++++++++++++
>> tools/include/uapi/linux/perf_event.h | 45 +++++++++++++--
>> 2 files changed, 103 insertions(+), 4 deletions(-)
>>
>> diff --git a/tools/arch/x86/include/uapi/asm/perf_regs.h b/tools/arch/x86/include/uapi/asm/perf_regs.h
>> index 7c9d2bb3833b..f3561ed10041 100644
>> --- a/tools/arch/x86/include/uapi/asm/perf_regs.h
>> +++ b/tools/arch/x86/include/uapi/asm/perf_regs.h
>> @@ -27,9 +27,34 @@ enum perf_event_x86_regs {
>> PERF_REG_X86_R13,
>> PERF_REG_X86_R14,
>> PERF_REG_X86_R15,
>> + /*
>> + * The EGPRs/SSP and XMM have overlaps. Only one can be used
>> + * at a time. For the ABI type PERF_SAMPLE_REGS_ABI_SIMD,
>> + * utilize EGPRs/SSP. For the other ABI type, XMM is used.
>> + *
>> + * Extended GPRs (EGPRs)
>> + */
>> + PERF_REG_X86_R16,
>> + PERF_REG_X86_R17,
>> + PERF_REG_X86_R18,
>> + PERF_REG_X86_R19,
>> + PERF_REG_X86_R20,
>> + PERF_REG_X86_R21,
>> + PERF_REG_X86_R22,
>> + PERF_REG_X86_R23,
>> + PERF_REG_X86_R24,
>> + PERF_REG_X86_R25,
>> + PERF_REG_X86_R26,
>> + PERF_REG_X86_R27,
>> + PERF_REG_X86_R28,
>> + PERF_REG_X86_R29,
>> + PERF_REG_X86_R30,
>> + PERF_REG_X86_R31,
>> + PERF_REG_X86_SSP,
>> /* These are the limits for the GPRs. */
>> PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
>> PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1,
>> + PERF_REG_MISC_MAX = PERF_REG_X86_SSP + 1,
> I wonder MISC isn't the most intention revealing name. What happens if
> things are extended again? Would APX be a better alternative, so
> PERF_REG_APX_MAX ?
Hmm, I don't think PERF_REG_APX_MAX is a good name either, since SSP is
included as well as the APX eGPRs, and more registers could be introduced
in the future.
How about PERF_REG_X86_EXTD_MAX?
>
>> /* These all need two bits set because they are 128bit */
>> PERF_REG_X86_XMM0 = 32,
>> @@ -54,5 +79,42 @@ enum perf_event_x86_regs {
>> };
>>
>> #define PERF_REG_EXTENDED_MASK (~((1ULL << PERF_REG_X86_XMM0) - 1))
>> +#define PERF_X86_EGPRS_MASK GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16)
>> +
>> +enum {
>> + PERF_REG_X86_XMM,
>> + PERF_REG_X86_YMM,
>> + PERF_REG_X86_ZMM,
>> + PERF_REG_X86_MAX_SIMD_REGS,
>> +
>> + PERF_REG_X86_OPMASK = 0,
>> + PERF_REG_X86_MAX_PRED_REGS = 1,
>> +};
>> +
>> +enum {
>> + PERF_X86_SIMD_XMM_REGS = 16,
>> + PERF_X86_SIMD_YMM_REGS = 16,
>> + PERF_X86_SIMD_ZMMH_REGS = 16,
>> + PERF_X86_SIMD_ZMM_REGS = 32,
>> + PERF_X86_SIMD_VEC_REGS_MAX = PERF_X86_SIMD_ZMM_REGS,
>> +
>> + PERF_X86_SIMD_OPMASK_REGS = 8,
>> + PERF_X86_SIMD_PRED_REGS_MAX = PERF_X86_SIMD_OPMASK_REGS,
>> +};
>> +
>> +#define PERF_X86_SIMD_PRED_MASK GENMASK(PERF_X86_SIMD_PRED_REGS_MAX - 1, 0)
>> +#define PERF_X86_SIMD_VEC_MASK GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
>> +
>> +#define PERF_X86_H16ZMM_BASE PERF_X86_SIMD_ZMMH_REGS
>> +
>> +enum {
>> + PERF_X86_OPMASK_QWORDS = 1,
>> + PERF_X86_XMM_QWORDS = 2,
>> + PERF_X86_YMMH_QWORDS = 2,
>> + PERF_X86_YMM_QWORDS = 4,
>> + PERF_X86_ZMMH_QWORDS = 4,
>> + PERF_X86_ZMM_QWORDS = 8,
>> + PERF_X86_SIMD_QWORDS_MAX = PERF_X86_ZMM_QWORDS,
>> +};
>>
>> #endif /* _ASM_X86_PERF_REGS_H */
>> diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
>> index d292f96bc06f..f1474da32622 100644
>> --- a/tools/include/uapi/linux/perf_event.h
>> +++ b/tools/include/uapi/linux/perf_event.h
>> @@ -314,8 +314,9 @@ enum {
>> */
>> enum perf_sample_regs_abi {
>> PERF_SAMPLE_REGS_ABI_NONE = 0,
>> - PERF_SAMPLE_REGS_ABI_32 = 1,
>> - PERF_SAMPLE_REGS_ABI_64 = 2,
>> + PERF_SAMPLE_REGS_ABI_32 = (1 << 0),
>> + PERF_SAMPLE_REGS_ABI_64 = (1 << 1),
>> + PERF_SAMPLE_REGS_ABI_SIMD = (1 << 2),
>> };
>>
>> /*
>> @@ -382,6 +383,7 @@ enum perf_event_read_format {
>> #define PERF_ATTR_SIZE_VER6 120 /* Add: aux_sample_size */
>> #define PERF_ATTR_SIZE_VER7 128 /* Add: sig_data */
>> #define PERF_ATTR_SIZE_VER8 136 /* Add: config3 */
>> +#define PERF_ATTR_SIZE_VER9 168 /* Add: sample_simd_{pred,vec}_reg_* */
> ARM have added a config4 in:
> https://lore.kernel.org/lkml/20251111-james-perf-feat_spe_eft-v10-1-1e1b5bf2cd05@linaro.org/
> so this will need to be VER10.
Thanks. It looks like the ARM changes have been merged, so we can change
it to VER10 in the next version.
>
> Thanks,
> Ian
>
>> /*
>> * 'struct perf_event_attr' contains various attributes that define
>> @@ -545,6 +547,25 @@ struct perf_event_attr {
>> __u64 sig_data;
>>
>> __u64 config3; /* extension of config2 */
>> +
>> +
>> + /*
>> + * Defines set of SIMD registers to dump on samples.
>> + * The sample_simd_regs_enabled !=0 implies the
>> + * set of SIMD registers is used to config all SIMD registers.
>> + * If !sample_simd_regs_enabled, sample_regs_XXX may be used to
>> + * config some SIMD registers on X86.
>> + */
>> + union {
>> + __u16 sample_simd_regs_enabled;
>> + __u16 sample_simd_pred_reg_qwords;
>> + };
>> + __u32 sample_simd_pred_reg_intr;
>> + __u32 sample_simd_pred_reg_user;
>> + __u16 sample_simd_vec_reg_qwords;
>> + __u64 sample_simd_vec_reg_intr;
>> + __u64 sample_simd_vec_reg_user;
>> + __u32 __reserved_4;
>> };
>>
>> /*
>> @@ -1018,7 +1039,15 @@ enum perf_event_type {
>> * } && PERF_SAMPLE_BRANCH_STACK
>> *
>> * { u64 abi; # enum perf_sample_regs_abi
>> - * u64 regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
>> + * u64 regs[weight(mask)];
>> + * struct {
>> + * u16 nr_vectors;
>> + * u16 vector_qwords;
>> + * u16 nr_pred;
>> + * u16 pred_qwords;
>> + * u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
>> + * } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
>> + * } && PERF_SAMPLE_REGS_USER
>> *
>> * { u64 size;
>> * char data[size];
>> @@ -1045,7 +1074,15 @@ enum perf_event_type {
>> * { u64 data_src; } && PERF_SAMPLE_DATA_SRC
>> * { u64 transaction; } && PERF_SAMPLE_TRANSACTION
>> * { u64 abi; # enum perf_sample_regs_abi
>> - * u64 regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
>> + * u64 regs[weight(mask)];
>> + * struct {
>> + * u16 nr_vectors;
>> + * u16 vector_qwords;
>> + * u16 nr_pred;
>> + * u16 pred_qwords;
>> + * u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
>> + * } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
>> + * } && PERF_SAMPLE_REGS_INTR
>> * { u64 phys_addr;} && PERF_SAMPLE_PHYS_ADDR
>> * { u64 cgroup;} && PERF_SAMPLE_CGROUP
>> * { u64 data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
>> --
>> 2.34.1
>>
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
2025-12-04 0:17 ` Ian Rogers
@ 2025-12-04 2:58 ` Mi, Dapeng
2025-12-04 7:49 ` Ian Rogers
0 siblings, 1 reply; 55+ messages in thread
From: Mi, Dapeng @ 2025-12-04 2:58 UTC (permalink / raw)
To: Ian Rogers
Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang
On 12/4/2025 8:17 AM, Ian Rogers wrote:
> On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> This patch adds support for the newly introduced SIMD register sampling
>> format by adding the following functions:
>>
>> uint64_t arch__intr_simd_reg_mask(void);
>> uint64_t arch__user_simd_reg_mask(void);
>> uint64_t arch__intr_pred_reg_mask(void);
>> uint64_t arch__user_pred_reg_mask(void);
>> uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>> uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>> uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>> uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>
>> The arch__{intr|user}_simd_reg_mask() functions retrieve the bitmap of
>> supported SIMD registers, such as XMM/YMM/ZMM on x86 platforms.
>>
>> The arch__{intr|user}_pred_reg_mask() functions retrieve the bitmap of
>> supported PRED registers, such as OPMASK on x86 platforms.
>>
>> The arch__{intr|user}_simd_reg_bitmap_qwords() functions provide the
>> exact bitmap and number of qwords for a specific type of SIMD register.
>> For example, for XMM registers on x86 platforms, the returned bitmap is
>> 0xffff (XMM0 ~ XMM15) and the qwords number is 2 (128 bits for each XMM).
>>
>> The arch__{intr|user}_pred_reg_bitmap_qwords() functions provide the
>> exact bitmap and number of qwords for a specific type of PRED register.
>> For example, for OPMASK registers on x86 platforms, the returned bitmap
>> is 0xff (OPMASK0 ~ OPMASK7) and the qwords number is 1 (64 bits for each
>> OPMASK).
>>
>> Additionally, the function __parse_regs() is enhanced to support parsing
>> these newly introduced SIMD registers. Currently, each type of register
>> can only be sampled collectively; sampling a specific SIMD register is
>> not supported. For example, all XMM registers are sampled together rather
>> than sampling only XMM0.
>>
>> When multiple overlapping register types, such as XMM and YMM, are
>> sampled simultaneously, only the superset (YMM registers) is sampled.
>>
>> With this patch, all supported sampling registers on x86 platforms are
>> displayed as follows.
>>
>> $perf record -I?
>> available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>> R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>> R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>>
>> $perf record --user-regs=?
>> available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>> R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>> R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>>
>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> ---
>> tools/perf/arch/x86/util/perf_regs.c | 470 +++++++++++++++++++++-
>> tools/perf/util/evsel.c | 27 ++
>> tools/perf/util/parse-regs-options.c | 151 ++++++-
>> tools/perf/util/perf_event_attr_fprintf.c | 6 +
>> tools/perf/util/perf_regs.c | 59 +++
>> tools/perf/util/perf_regs.h | 11 +
>> tools/perf/util/record.h | 6 +
>> 7 files changed, 714 insertions(+), 16 deletions(-)
>>
>> diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
>> index 12fd93f04802..db41430f3b07 100644
>> --- a/tools/perf/arch/x86/util/perf_regs.c
>> +++ b/tools/perf/arch/x86/util/perf_regs.c
>> @@ -13,6 +13,49 @@
>> #include "../../../util/pmu.h"
>> #include "../../../util/pmus.h"
>>
>> +static const struct sample_reg sample_reg_masks_ext[] = {
>> + SMPL_REG(AX, PERF_REG_X86_AX),
>> + SMPL_REG(BX, PERF_REG_X86_BX),
>> + SMPL_REG(CX, PERF_REG_X86_CX),
>> + SMPL_REG(DX, PERF_REG_X86_DX),
>> + SMPL_REG(SI, PERF_REG_X86_SI),
>> + SMPL_REG(DI, PERF_REG_X86_DI),
>> + SMPL_REG(BP, PERF_REG_X86_BP),
>> + SMPL_REG(SP, PERF_REG_X86_SP),
>> + SMPL_REG(IP, PERF_REG_X86_IP),
>> + SMPL_REG(FLAGS, PERF_REG_X86_FLAGS),
>> + SMPL_REG(CS, PERF_REG_X86_CS),
>> + SMPL_REG(SS, PERF_REG_X86_SS),
>> +#ifdef HAVE_ARCH_X86_64_SUPPORT
>> + SMPL_REG(R8, PERF_REG_X86_R8),
>> + SMPL_REG(R9, PERF_REG_X86_R9),
>> + SMPL_REG(R10, PERF_REG_X86_R10),
>> + SMPL_REG(R11, PERF_REG_X86_R11),
>> + SMPL_REG(R12, PERF_REG_X86_R12),
>> + SMPL_REG(R13, PERF_REG_X86_R13),
>> + SMPL_REG(R14, PERF_REG_X86_R14),
>> + SMPL_REG(R15, PERF_REG_X86_R15),
>> + SMPL_REG(R16, PERF_REG_X86_R16),
>> + SMPL_REG(R17, PERF_REG_X86_R17),
>> + SMPL_REG(R18, PERF_REG_X86_R18),
>> + SMPL_REG(R19, PERF_REG_X86_R19),
>> + SMPL_REG(R20, PERF_REG_X86_R20),
>> + SMPL_REG(R21, PERF_REG_X86_R21),
>> + SMPL_REG(R22, PERF_REG_X86_R22),
>> + SMPL_REG(R23, PERF_REG_X86_R23),
>> + SMPL_REG(R24, PERF_REG_X86_R24),
>> + SMPL_REG(R25, PERF_REG_X86_R25),
>> + SMPL_REG(R26, PERF_REG_X86_R26),
>> + SMPL_REG(R27, PERF_REG_X86_R27),
>> + SMPL_REG(R28, PERF_REG_X86_R28),
>> + SMPL_REG(R29, PERF_REG_X86_R29),
>> + SMPL_REG(R30, PERF_REG_X86_R30),
>> + SMPL_REG(R31, PERF_REG_X86_R31),
>> + SMPL_REG(SSP, PERF_REG_X86_SSP),
>> +#endif
>> + SMPL_REG_END
>> +};
>> +
>> static const struct sample_reg sample_reg_masks[] = {
>> SMPL_REG(AX, PERF_REG_X86_AX),
>> SMPL_REG(BX, PERF_REG_X86_BX),
>> @@ -276,27 +319,404 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
>> return SDT_ARG_VALID;
>> }
>>
>> +static bool support_simd_reg(u64 sample_type, u16 qwords, u64 mask, bool pred)
> To make the code easier to read, it'd be nice to document sample_type,
> qwords and mask here.
Sure.
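For reference, a sketch of what such a comment could say (the parameter
meanings below are inferred from the code and the wording is only a
suggestion):

  /*
   * support_simd_reg - probe whether the kernel accepts a SIMD regs config
   * @sample_type: PERF_SAMPLE_REGS_INTR or PERF_SAMPLE_REGS_USER
   * @qwords:      width of one vector register, in 64-bit words
   * @mask:        bitmap of the vector registers to request
   * @pred:        probe predicate (OPMASK) registers instead of vector ones
   *
   * Opens and immediately closes a throwaway event; returns true if
   * perf_event_open() accepted the attribute.
   */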
>
>> +{
>> + struct perf_event_attr attr = {
>> + .type = PERF_TYPE_HARDWARE,
>> + .config = PERF_COUNT_HW_CPU_CYCLES,
>> + .sample_type = sample_type,
>> + .disabled = 1,
>> + .exclude_kernel = 1,
>> + .sample_simd_regs_enabled = 1,
>> + };
>> + int fd;
>> +
>> + attr.sample_period = 1;
>> +
>> + if (!pred) {
>> + attr.sample_simd_vec_reg_qwords = qwords;
>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
>> + attr.sample_simd_vec_reg_intr = mask;
>> + else
>> + attr.sample_simd_vec_reg_user = mask;
>> + } else {
>> + attr.sample_simd_pred_reg_qwords = PERF_X86_OPMASK_QWORDS;
>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
>> + attr.sample_simd_pred_reg_intr = PERF_X86_SIMD_PRED_MASK;
>> + else
>> + attr.sample_simd_pred_reg_user = PERF_X86_SIMD_PRED_MASK;
>> + }
>> +
>> + if (perf_pmus__num_core_pmus() > 1) {
>> + struct perf_pmu *pmu = NULL;
>> + __u64 type = PERF_TYPE_RAW;
> It should be okay to do:
> __u64 type = perf_pmus__find_core_pmu()->type
> rather than have the whole loop below.
Sure. Thanks.
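i.e. roughly (sketch):

  if (perf_pmus__num_core_pmus() > 1) {
          __u64 type = perf_pmus__find_core_pmu()->type;

          /* all hybrid core PMUs expose the same register set */
          attr.config |= type << PERF_PMU_TYPE_SHIFT;
  }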
>
>> +
>> + /*
>> + * The same register set is supported among different hybrid PMUs.
>> + * Only check the first available one.
>> + */
>> + while ((pmu = perf_pmus__scan_core(pmu)) != NULL) {
>> + type = pmu->type;
>> + break;
>> + }
>> + attr.config |= type << PERF_PMU_TYPE_SHIFT;
>> + }
>> +
>> + event_attr_init(&attr);
>> +
>> + fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
>> + if (fd != -1) {
>> + close(fd);
>> + return true;
>> + }
>> +
>> + return false;
>> +}
>> +
>> +static bool __arch_simd_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
>> +{
>> + bool supported = false;
>> + u64 bits;
>> +
>> + *mask = 0;
>> + *qwords = 0;
>> +
>> + switch (reg) {
>> + case PERF_REG_X86_XMM:
>> + bits = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
>> + supported = support_simd_reg(sample_type, PERF_X86_XMM_QWORDS, bits, false);
>> + if (supported) {
>> + *mask = bits;
>> + *qwords = PERF_X86_XMM_QWORDS;
>> + }
>> + break;
>> + case PERF_REG_X86_YMM:
>> + bits = BIT_ULL(PERF_X86_SIMD_YMM_REGS) - 1;
>> + supported = support_simd_reg(sample_type, PERF_X86_YMM_QWORDS, bits, false);
>> + if (supported) {
>> + *mask = bits;
>> + *qwords = PERF_X86_YMM_QWORDS;
>> + }
>> + break;
>> + case PERF_REG_X86_ZMM:
>> + bits = BIT_ULL(PERF_X86_SIMD_ZMM_REGS) - 1;
>> + supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
>> + if (supported) {
>> + *mask = bits;
>> + *qwords = PERF_X86_ZMM_QWORDS;
>> + break;
>> + }
>> +
>> + bits = BIT_ULL(PERF_X86_SIMD_ZMMH_REGS) - 1;
>> + supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
>> + if (supported) {
>> + *mask = bits;
>> + *qwords = PERF_X86_ZMMH_QWORDS;
>> + }
>> + break;
>> + default:
>> + break;
>> + }
>> +
>> + return supported;
>> +}
>> +
>> +static bool __arch_pred_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
>> +{
>> + bool supported = false;
>> + u64 bits;
>> +
>> + *mask = 0;
>> + *qwords = 0;
>> +
>> + switch (reg) {
>> + case PERF_REG_X86_OPMASK:
>> + bits = BIT_ULL(PERF_X86_SIMD_OPMASK_REGS) - 1;
>> + supported = support_simd_reg(sample_type, PERF_X86_OPMASK_QWORDS, bits, true);
>> + if (supported) {
>> + *mask = bits;
>> + *qwords = PERF_X86_OPMASK_QWORDS;
>> + }
>> + break;
>> + default:
>> + break;
>> + }
>> +
>> + return supported;
>> +}
>> +
>> +static bool has_cap_simd_regs(void)
>> +{
>> + uint64_t mask = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
>> + u16 qwords = PERF_X86_XMM_QWORDS;
>> + static bool has_cap_simd_regs;
>> + static bool cached;
>> +
>> + if (cached)
>> + return has_cap_simd_regs;
>> +
>> + has_cap_simd_regs = __arch_simd_reg_mask(PERF_SAMPLE_REGS_INTR,
>> + PERF_REG_X86_XMM, &mask, &qwords);
>> + has_cap_simd_regs |= __arch_simd_reg_mask(PERF_SAMPLE_REGS_USER,
>> + PERF_REG_X86_XMM, &mask, &qwords);
>> + cached = true;
>> +
>> + return has_cap_simd_regs;
>> +}
>> +
>> +bool arch_has_simd_regs(u64 mask)
>> +{
>> + return has_cap_simd_regs() &&
>> + mask & GENMASK_ULL(PERF_REG_X86_SSP, PERF_REG_X86_R16);
>> +}
>> +
>> +static const struct sample_reg sample_simd_reg_masks[] = {
>> + SMPL_REG(XMM, PERF_REG_X86_XMM),
>> + SMPL_REG(YMM, PERF_REG_X86_YMM),
>> + SMPL_REG(ZMM, PERF_REG_X86_ZMM),
>> + SMPL_REG_END
>> +};
>> +
>> +static const struct sample_reg sample_pred_reg_masks[] = {
>> + SMPL_REG(OPMASK, PERF_REG_X86_OPMASK),
>> + SMPL_REG_END
>> +};
>> +
>> +const struct sample_reg *arch__sample_simd_reg_masks(void)
>> +{
>> + return sample_simd_reg_masks;
>> +}
>> +
>> +const struct sample_reg *arch__sample_pred_reg_masks(void)
>> +{
>> + return sample_pred_reg_masks;
>> +}
>> +
>> +static bool x86_intr_simd_updated;
>> +static u64 x86_intr_simd_reg_mask;
>> +static u64 x86_intr_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
>> +static u16 x86_intr_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
> Could we add some comments? I can kind of figure out that "updated" is a
> lazy-initialization check and what the masks are; "qwords" is an odd
> one. The comment could also point out that SIMD here doesn't mean the
> machine supports SIMD, but that SIMD registers are supported in perf
> events.
Sure.
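For example, something along these lines (the wording is only a
suggestion):

  /*
   * Cached results of lazily probing the kernel, not the CPU, for SIMD
   * register sampling support, per sample type:
   *   *_updated      - the probe has run and the cached values are valid
   *   *_reg_mask     - bitmap of supported register types (XMM/YMM/ZMM)
   *   *_mask[type]   - bitmap of sampleable registers of that type
   *   *_qwords[type] - width of one register of that type, in u64s
   */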
>
>> +static bool x86_user_simd_updated;
>> +static u64 x86_user_simd_reg_mask;
>> +static u64 x86_user_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
>> +static u16 x86_user_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
>> +
>> +static bool x86_intr_pred_updated;
>> +static u64 x86_intr_pred_reg_mask;
>> +static u64 x86_intr_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
>> +static u16 x86_intr_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
>> +static bool x86_user_pred_updated;
>> +static u64 x86_user_pred_reg_mask;
>> +static u64 x86_user_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
>> +static u16 x86_user_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
>> +
>> +static uint64_t __arch__simd_reg_mask(u64 sample_type)
>> +{
>> + const struct sample_reg *r = NULL;
>> + bool supported;
>> + u64 mask = 0;
>> + int reg;
>> +
>> + if (!has_cap_simd_regs())
>> + return 0;
>> +
>> + if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_simd_updated)
>> + return x86_intr_simd_reg_mask;
>> +
>> + if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_simd_updated)
>> + return x86_user_simd_reg_mask;
>> +
>> + for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>> + supported = false;
>> +
>> + if (!r->mask)
>> + continue;
>> + reg = fls64(r->mask) - 1;
>> +
>> + if (reg >= PERF_REG_X86_MAX_SIMD_REGS)
>> + break;
>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
>> + supported = __arch_simd_reg_mask(sample_type, reg,
>> + &x86_intr_simd_mask[reg],
>> + &x86_intr_simd_qwords[reg]);
>> + else if (sample_type == PERF_SAMPLE_REGS_USER)
>> + supported = __arch_simd_reg_mask(sample_type, reg,
>> + &x86_user_simd_mask[reg],
>> + &x86_user_simd_qwords[reg]);
>> + if (supported)
>> + mask |= BIT_ULL(reg);
>> + }
>> +
>> + if (sample_type == PERF_SAMPLE_REGS_INTR) {
>> + x86_intr_simd_reg_mask = mask;
>> + x86_intr_simd_updated = true;
>> + } else {
>> + x86_user_simd_reg_mask = mask;
>> + x86_user_simd_updated = true;
>> + }
>> +
>> + return mask;
>> +}
>> +
>> +static uint64_t __arch__pred_reg_mask(u64 sample_type)
>> +{
>> + const struct sample_reg *r = NULL;
>> + bool supported;
>> + u64 mask = 0;
>> + int reg;
>> +
>> + if (!has_cap_simd_regs())
>> + return 0;
>> +
>> + if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_pred_updated)
>> + return x86_intr_pred_reg_mask;
>> +
>> + if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_pred_updated)
>> + return x86_user_pred_reg_mask;
>> +
>> + for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>> + supported = false;
>> +
>> + if (!r->mask)
>> + continue;
>> + reg = fls64(r->mask) - 1;
>> +
>> + if (reg >= PERF_REG_X86_MAX_PRED_REGS)
>> + break;
>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
>> + supported = __arch_pred_reg_mask(sample_type, reg,
>> + &x86_intr_pred_mask[reg],
>> + &x86_intr_pred_qwords[reg]);
>> + else if (sample_type == PERF_SAMPLE_REGS_USER)
>> + supported = __arch_pred_reg_mask(sample_type, reg,
>> + &x86_user_pred_mask[reg],
>> + &x86_user_pred_qwords[reg]);
>> + if (supported)
>> + mask |= BIT_ULL(reg);
>> + }
>> +
>> + if (sample_type == PERF_SAMPLE_REGS_INTR) {
>> + x86_intr_pred_reg_mask = mask;
>> + x86_intr_pred_updated = true;
>> + } else {
>> + x86_user_pred_reg_mask = mask;
>> + x86_user_pred_updated = true;
>> + }
>> +
>> + return mask;
>> +}
> This feels repetitive with __arch__simd_reg_mask, could they be
> refactored together?
Hmm, it looks like we can extract the for loop into a common function. The
other parts are hard to generalize since they manipulate different
variables. If we tried to generalize them, we would have to introduce lots
of "if ... else" branches, which would make the code hard to read.
>
>> +
>> +uint64_t arch__intr_simd_reg_mask(void)
>> +{
>> + return __arch__simd_reg_mask(PERF_SAMPLE_REGS_INTR);
>> +}
>> +
>> +uint64_t arch__user_simd_reg_mask(void)
>> +{
>> + return __arch__simd_reg_mask(PERF_SAMPLE_REGS_USER);
>> +}
>> +
>> +uint64_t arch__intr_pred_reg_mask(void)
>> +{
>> + return __arch__pred_reg_mask(PERF_SAMPLE_REGS_INTR);
>> +}
>> +
>> +uint64_t arch__user_pred_reg_mask(void)
>> +{
>> + return __arch__pred_reg_mask(PERF_SAMPLE_REGS_USER);
>> +}
>> +
>> +static uint64_t arch__simd_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
>> +{
>> + uint64_t mask = 0;
>> +
>> + *qwords = 0;
>> + if (reg < PERF_REG_X86_MAX_SIMD_REGS) {
>> + if (intr) {
>> + *qwords = x86_intr_simd_qwords[reg];
>> + mask = x86_intr_simd_mask[reg];
>> + } else {
>> + *qwords = x86_user_simd_qwords[reg];
>> + mask = x86_user_simd_mask[reg];
>> + }
>> + }
>> +
>> + return mask;
>> +}
>> +
>> +static uint64_t arch__pred_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
>> +{
>> + uint64_t mask = 0;
>> +
>> + *qwords = 0;
>> + if (reg < PERF_REG_X86_MAX_PRED_REGS) {
>> + if (intr) {
>> + *qwords = x86_intr_pred_qwords[reg];
>> + mask = x86_intr_pred_mask[reg];
>> + } else {
>> + *qwords = x86_user_pred_qwords[reg];
>> + mask = x86_user_pred_mask[reg];
>> + }
>> + }
>> +
>> + return mask;
>> +}
>> +
>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords)
>> +{
>> + if (!x86_intr_simd_updated)
>> + arch__intr_simd_reg_mask();
>> + return arch__simd_reg_bitmap_qwords(reg, qwords, true);
>> +}
>> +
>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords)
>> +{
>> + if (!x86_user_simd_updated)
>> + arch__user_simd_reg_mask();
>> + return arch__simd_reg_bitmap_qwords(reg, qwords, false);
>> +}
>> +
>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords)
>> +{
>> + if (!x86_intr_pred_updated)
>> + arch__intr_pred_reg_mask();
>> + return arch__pred_reg_bitmap_qwords(reg, qwords, true);
>> +}
>> +
>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords)
>> +{
>> + if (!x86_user_pred_updated)
>> + arch__user_pred_reg_mask();
>> + return arch__pred_reg_bitmap_qwords(reg, qwords, false);
>> +}
>> +
>> const struct sample_reg *arch__sample_reg_masks(void)
>> {
>> + if (has_cap_simd_regs())
>> + return sample_reg_masks_ext;
>> return sample_reg_masks;
>> }
>>
>> -uint64_t arch__intr_reg_mask(void)
>> +static uint64_t __arch__reg_mask(u64 sample_type, u64 mask, bool has_simd_regs)
>> {
>> struct perf_event_attr attr = {
>> - .type = PERF_TYPE_HARDWARE,
>> - .config = PERF_COUNT_HW_CPU_CYCLES,
>> - .sample_type = PERF_SAMPLE_REGS_INTR,
>> - .sample_regs_intr = PERF_REG_EXTENDED_MASK,
>> - .precise_ip = 1,
>> - .disabled = 1,
>> - .exclude_kernel = 1,
>> + .type = PERF_TYPE_HARDWARE,
>> + .config = PERF_COUNT_HW_CPU_CYCLES,
>> + .sample_type = sample_type,
>> + .precise_ip = 1,
>> + .disabled = 1,
>> + .exclude_kernel = 1,
>> + .sample_simd_regs_enabled = has_simd_regs,
>> };
>> int fd;
>> /*
>> * In an unnamed union, init it here to build on older gcc versions
>> */
>> attr.sample_period = 1;
>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
>> + attr.sample_regs_intr = mask;
>> + else
>> + attr.sample_regs_user = mask;
>>
>> if (perf_pmus__num_core_pmus() > 1) {
>> struct perf_pmu *pmu = NULL;
>> @@ -318,13 +738,41 @@ uint64_t arch__intr_reg_mask(void)
>> fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
>> if (fd != -1) {
>> close(fd);
>> - return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
>> + return mask;
>> }
>>
>> - return PERF_REGS_MASK;
>> + return 0;
>> +}
>> +
>> +uint64_t arch__intr_reg_mask(void)
>> +{
>> + uint64_t mask = PERF_REGS_MASK;
>> +
>> + if (has_cap_simd_regs()) {
>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
>> + GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
>> + true);
> It's nice to label constant arguments like this with something like:
> /*has_simd_regs=*/true);
>
> Tools like clang-tidy even try to enforce that the argument names match the comments.
Sure.
>
>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
>> + BIT_ULL(PERF_REG_X86_SSP),
>> + true);
>> + } else
>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR, PERF_REG_EXTENDED_MASK, false);
>> +
>> + return mask;
>> }
>>
>> uint64_t arch__user_reg_mask(void)
>> {
>> - return PERF_REGS_MASK;
>> + uint64_t mask = PERF_REGS_MASK;
>> +
>> + if (has_cap_simd_regs()) {
>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
>> + GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
>> + true);
>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
>> + BIT_ULL(PERF_REG_X86_SSP),
>> + true);
>> + }
>> +
>> + return mask;
> The code is repetitive here, could we refactor into a single function
> passing in a user or intr value?
Sure. Would extract the common part.
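Perhaps something like this sketch:

  static uint64_t x86__reg_mask(u64 sample_type)
  {
          uint64_t mask = PERF_REGS_MASK;

          if (has_cap_simd_regs()) {
                  mask |= __arch__reg_mask(sample_type,
                                           GENMASK_ULL(PERF_REG_X86_R31,
                                                       PERF_REG_X86_R16),
                                           /*has_simd_regs=*/true);
                  mask |= __arch__reg_mask(sample_type,
                                           BIT_ULL(PERF_REG_X86_SSP),
                                           /*has_simd_regs=*/true);
          } else if (sample_type == PERF_SAMPLE_REGS_INTR) {
                  mask |= __arch__reg_mask(sample_type, PERF_REG_EXTENDED_MASK,
                                           /*has_simd_regs=*/false);
          }

          return mask;
  }

  uint64_t arch__intr_reg_mask(void)
  {
          return x86__reg_mask(PERF_SAMPLE_REGS_INTR);
  }

  uint64_t arch__user_reg_mask(void)
  {
          return x86__reg_mask(PERF_SAMPLE_REGS_USER);
  }

The else-if keeps the current behaviour where only the intr path probes
the XMM extended mask when SIMD caps are absent.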
>
>> }
>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>> index 56ebefd075f2..5d1d90cf9488 100644
>> --- a/tools/perf/util/evsel.c
>> +++ b/tools/perf/util/evsel.c
>> @@ -1461,12 +1461,39 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
>> if (opts->sample_intr_regs && !evsel->no_aux_samples &&
>> !evsel__is_dummy_event(evsel)) {
>> attr->sample_regs_intr = opts->sample_intr_regs;
>> + attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_intr);
>> + evsel__set_sample_bit(evsel, REGS_INTR);
>> + }
>> +
>> + if ((opts->sample_intr_vec_regs || opts->sample_intr_pred_regs) &&
>> + !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
>> + /* A non-zero pred qwords implies the set of SIMD registers is used */
>> + if (opts->sample_pred_regs_qwords)
>> + attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
>> + else
>> + attr->sample_simd_pred_reg_qwords = 1;
>> + attr->sample_simd_vec_reg_intr = opts->sample_intr_vec_regs;
>> + attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
>> + attr->sample_simd_pred_reg_intr = opts->sample_intr_pred_regs;
>> evsel__set_sample_bit(evsel, REGS_INTR);
>> }
>>
>> if (opts->sample_user_regs && !evsel->no_aux_samples &&
>> !evsel__is_dummy_event(evsel)) {
>> attr->sample_regs_user |= opts->sample_user_regs;
>> + attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_user);
>> + evsel__set_sample_bit(evsel, REGS_USER);
>> + }
>> +
>> + if ((opts->sample_user_vec_regs || opts->sample_user_pred_regs) &&
>> + !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
>> + if (opts->sample_pred_regs_qwords)
>> + attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
>> + else
>> + attr->sample_simd_pred_reg_qwords = 1;
>> + attr->sample_simd_vec_reg_user = opts->sample_user_vec_regs;
>> + attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
>> + attr->sample_simd_pred_reg_user = opts->sample_user_pred_regs;
>> evsel__set_sample_bit(evsel, REGS_USER);
>> }
>>
>> diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
>> index cda1c620968e..0bd100392889 100644
>> --- a/tools/perf/util/parse-regs-options.c
>> +++ b/tools/perf/util/parse-regs-options.c
>> @@ -4,19 +4,139 @@
>> #include <stdint.h>
>> #include <string.h>
>> #include <stdio.h>
>> +#include <linux/bitops.h>
>> #include "util/debug.h"
>> #include <subcmd/parse-options.h>
>> #include "util/perf_regs.h"
>> #include "util/parse-regs-options.h"
>> +#include "record.h"
>> +
>> +static void __print_simd_regs(bool intr, uint64_t simd_mask)
>> +{
>> + const struct sample_reg *r = NULL;
>> + uint64_t bitmap = 0;
>> + u16 qwords = 0;
>> + int reg_idx;
>> +
>> + if (!simd_mask)
>> + return;
>> +
>> + for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>> + if (!(r->mask & simd_mask))
>> + continue;
>> + reg_idx = fls64(r->mask) - 1;
>> + if (intr)
>> + bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
>> + else
>> + bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
>> + if (bitmap)
>> + fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
>> + }
>> +}
>> +
>> +static void __print_pred_regs(bool intr, uint64_t pred_mask)
>> +{
>> + const struct sample_reg *r = NULL;
>> + uint64_t bitmap = 0;
>> + u16 qwords = 0;
>> + int reg_idx;
>> +
>> + if (!pred_mask)
>> + return;
>> +
>> + for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>> + if (!(r->mask & pred_mask))
>> + continue;
>> + reg_idx = fls64(r->mask) - 1;
>> + if (intr)
>> + bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
>> + else
>> + bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
>> + if (bitmap)
>> + fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
>> + }
>> +}
>> +
>> +static bool __parse_simd_regs(struct record_opts *opts, char *s, bool intr)
>> +{
>> + const struct sample_reg *r = NULL;
>> + bool matched = false;
>> + uint64_t bitmap = 0;
>> + u16 qwords = 0;
>> + int reg_idx;
>> +
>> + for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>> + if (strcasecmp(s, r->name))
>> + continue;
>> + if (!fls64(r->mask))
>> + continue;
>> + reg_idx = fls64(r->mask) - 1;
>> + if (intr)
>> + bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
>> + else
>> + bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
>> + matched = true;
>> + break;
>> + }
>> +
>> + /* Just need the highest qwords */
> I'm not following here. Does the bitmap need to handle gaps?
Currently no. In theory, the kernel allows user space to sample only a
subset of the SIMD registers, e.g., 0xff or 0xf0f for the XMM registers
(HW supports 16 XMM registers on x86), but the tool doesn't support that,
to avoid introducing too much complexity in perf tools. Moreover, I don't
think end users have such a requirement. In most cases, users know which
kinds of SIMD registers their programs use, but usually don't know or
care exactly which SIMD register is used.
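At the attr level such a subset request would just be a different bitmap,
e.g. (hypothetical, not wired up in the tool):

  attr.sample_simd_regs_enabled   = 1;
  attr.sample_simd_vec_reg_qwords = PERF_X86_XMM_QWORDS; /* 2: XMM width */
  attr.sample_simd_vec_reg_intr   = 0xf0f;               /* XMM0-3, XMM8-11 */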
>
>> + if (qwords > opts->sample_vec_regs_qwords) {
>> + opts->sample_vec_regs_qwords = qwords;
>> + if (intr)
>> + opts->sample_intr_vec_regs = bitmap;
>> + else
>> + opts->sample_user_vec_regs = bitmap;
>> + }
>> +
>> + return matched;
>> +}
>> +
>> +static bool __parse_pred_regs(struct record_opts *opts, char *s, bool intr)
>> +{
>> + const struct sample_reg *r = NULL;
>> + bool matched = false;
>> + uint64_t bitmap = 0;
>> + u16 qwords = 0;
>> + int reg_idx;
>> +
>> + for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>> + if (strcasecmp(s, r->name))
>> + continue;
>> + if (!fls64(r->mask))
>> + continue;
>> + reg_idx = fls64(r->mask) - 1;
>> + if (intr)
>> + bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
>> + else
>> + bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
>> + matched = true;
>> + break;
>> + }
>> +
>> + /* Just need the highest qwords */
> Again repetitive, could we have a single function?
Yes, I suppose the for loop at least can be extracted as a common function.
>
>> + if (qwords > opts->sample_pred_regs_qwords) {
>> + opts->sample_pred_regs_qwords = qwords;
>> + if (intr)
>> + opts->sample_intr_pred_regs = bitmap;
>> + else
>> + opts->sample_user_pred_regs = bitmap;
>> + }
>> +
>> + return matched;
>> +}
>>
>> static int
>> __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>> {
>> uint64_t *mode = (uint64_t *)opt->value;
>> const struct sample_reg *r = NULL;
>> + struct record_opts *opts;
>> char *s, *os = NULL, *p;
>> - int ret = -1;
>> + bool has_simd_regs = false;
>> uint64_t mask;
>> + uint64_t simd_mask;
>> + uint64_t pred_mask;
>> + int ret = -1;
>>
>> if (unset)
>> return 0;
>> @@ -27,10 +147,17 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>> if (*mode)
>> return -1;
>>
>> - if (intr)
>> + if (intr) {
>> + opts = container_of(opt->value, struct record_opts, sample_intr_regs);
>> mask = arch__intr_reg_mask();
>> - else
>> + simd_mask = arch__intr_simd_reg_mask();
>> + pred_mask = arch__intr_pred_reg_mask();
>> + } else {
>> + opts = container_of(opt->value, struct record_opts, sample_user_regs);
>> mask = arch__user_reg_mask();
>> + simd_mask = arch__user_simd_reg_mask();
>> + pred_mask = arch__user_pred_reg_mask();
>> + }
>>
>> /* str may be NULL in case no arg is passed to -I */
>> if (str) {
>> @@ -50,10 +177,24 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>> if (r->mask & mask)
>> fprintf(stderr, "%s ", r->name);
>> }
>> + __print_simd_regs(intr, simd_mask);
>> + __print_pred_regs(intr, pred_mask);
>> fputc('\n', stderr);
>> /* just printing available regs */
>> goto error;
>> }
>> +
>> + if (simd_mask) {
>> + has_simd_regs = __parse_simd_regs(opts, s, intr);
>> + if (has_simd_regs)
>> + goto next;
>> + }
>> + if (pred_mask) {
>> + has_simd_regs = __parse_pred_regs(opts, s, intr);
>> + if (has_simd_regs)
>> + goto next;
>> + }
>> +
>> for (r = arch__sample_reg_masks(); r->name; r++) {
>> if ((r->mask & mask) && !strcasecmp(s, r->name))
>> break;
>> @@ -65,7 +206,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>> }
>>
>> *mode |= r->mask;
>> -
>> +next:
>> if (!p)
>> break;
>>
>> @@ -75,7 +216,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>> ret = 0;
>>
>> /* default to all possible regs */
>> - if (*mode == 0)
>> + if (*mode == 0 && !has_simd_regs)
>> *mode = mask;
>> error:
>> free(os);
>> diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
>> index 66b666d9ce64..fb0366d050cf 100644
>> --- a/tools/perf/util/perf_event_attr_fprintf.c
>> +++ b/tools/perf/util/perf_event_attr_fprintf.c
>> @@ -360,6 +360,12 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
>> PRINT_ATTRf(aux_start_paused, p_unsigned);
>> PRINT_ATTRf(aux_pause, p_unsigned);
>> PRINT_ATTRf(aux_resume, p_unsigned);
>> + PRINT_ATTRf(sample_simd_pred_reg_qwords, p_unsigned);
>> + PRINT_ATTRf(sample_simd_pred_reg_intr, p_hex);
>> + PRINT_ATTRf(sample_simd_pred_reg_user, p_hex);
>> + PRINT_ATTRf(sample_simd_vec_reg_qwords, p_unsigned);
>> + PRINT_ATTRf(sample_simd_vec_reg_intr, p_hex);
>> + PRINT_ATTRf(sample_simd_vec_reg_user, p_hex);
>>
>> return ret;
>> }
>> diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
>> index 44b90bbf2d07..e8a9fabc92e6 100644
>> --- a/tools/perf/util/perf_regs.c
>> +++ b/tools/perf/util/perf_regs.c
>> @@ -11,6 +11,11 @@ int __weak arch_sdt_arg_parse_op(char *old_op __maybe_unused,
>> return SDT_ARG_SKIP;
>> }
>>
>> +bool __weak arch_has_simd_regs(u64 mask __maybe_unused)
>> +{
>> + return false;
>> +}
>> +
>> uint64_t __weak arch__intr_reg_mask(void)
>> {
>> return 0;
>> @@ -21,6 +26,50 @@ uint64_t __weak arch__user_reg_mask(void)
>> return 0;
>> }
>>
>> +uint64_t __weak arch__intr_simd_reg_mask(void)
>> +{
>> + return 0;
>> +}
>> +
>> +uint64_t __weak arch__user_simd_reg_mask(void)
>> +{
>> + return 0;
>> +}
>> +
>> +uint64_t __weak arch__intr_pred_reg_mask(void)
>> +{
>> + return 0;
>> +}
>> +
>> +uint64_t __weak arch__user_pred_reg_mask(void)
>> +{
>> + return 0;
>> +}
>> +
>> +uint64_t __weak arch__intr_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>> +{
>> + *qwords = 0;
>> + return 0;
>> +}
>> +
>> +uint64_t __weak arch__user_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>> +{
>> + *qwords = 0;
>> + return 0;
>> +}
>> +
>> +uint64_t __weak arch__intr_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>> +{
>> + *qwords = 0;
>> + return 0;
>> +}
>> +
>> +uint64_t __weak arch__user_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>> +{
>> + *qwords = 0;
>> + return 0;
>> +}
>> +
>> static const struct sample_reg sample_reg_masks[] = {
>> SMPL_REG_END
>> };
>> @@ -30,6 +79,16 @@ const struct sample_reg * __weak arch__sample_reg_masks(void)
>> return sample_reg_masks;
>> }
>>
>> +const struct sample_reg * __weak arch__sample_simd_reg_masks(void)
>> +{
>> + return sample_reg_masks;
>> +}
>> +
>> +const struct sample_reg * __weak arch__sample_pred_reg_masks(void)
>> +{
>> + return sample_reg_masks;
>> +}
> Thinking out loud. I wonder if there is a way to hide the weak
> functions. It seems the support is tied to PMUs, particularly core
> PMUs, perhaps we can push things into pmu and arch pmu code. Then we
> ask the PMU to parse the register strings, set up the perf_event_attr,
> etc. I'm somewhat scared these functions will be used on the report
> rather than record side of things, thereby breaking perf.data support
> when the host kernel does or doesn't have the SIMD support.
Ian, I don't quite follow. I don't quite understand what we should do to
"push things into pmu and arch pmu code". The current SIMD register
support follows the same approach as the general register support. If we
intend to change the approach entirely, we'd better do it in an
independent patch set.

Also, why would these functions break perf.data reporting? perf-report
checks whether the PERF_SAMPLE_REGS_ABI_SIMD flag is set for each record;
only when the flag is set (indicating that SIMD register data is appended
to the record) does perf-report try to parse the SIMD register data.
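In sketch form (the helper name below is made up):

  /* report side: regs->abi comes straight from the perf.data record */
  if (regs->abi & PERF_SAMPLE_REGS_ABI_SIMD)
          parse_simd_reg_data(regs, sample);  /* hypothetical helper */
  /* records written without SIMD support never set the bit */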
>
> Thanks,
> Ian
>
>> +
>> const char *perf_reg_name(int id, const char *arch)
>> {
>> const char *reg_name = NULL;
>> diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
>> index f2d0736d65cc..bce9c4cfd1bf 100644
>> --- a/tools/perf/util/perf_regs.h
>> +++ b/tools/perf/util/perf_regs.h
>> @@ -24,9 +24,20 @@ enum {
>> };
>>
>> int arch_sdt_arg_parse_op(char *old_op, char **new_op);
>> +bool arch_has_simd_regs(u64 mask);
>> uint64_t arch__intr_reg_mask(void);
>> uint64_t arch__user_reg_mask(void);
>> const struct sample_reg *arch__sample_reg_masks(void);
>> +const struct sample_reg *arch__sample_simd_reg_masks(void);
>> +const struct sample_reg *arch__sample_pred_reg_masks(void);
>> +uint64_t arch__intr_simd_reg_mask(void);
>> +uint64_t arch__user_simd_reg_mask(void);
>> +uint64_t arch__intr_pred_reg_mask(void);
>> +uint64_t arch__user_pred_reg_mask(void);
>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>
>> const char *perf_reg_name(int id, const char *arch);
>> int perf_reg_value(u64 *valp, struct regs_dump *regs, int id);
>> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
>> index ea3a6c4657ee..825ffb4cc53f 100644
>> --- a/tools/perf/util/record.h
>> +++ b/tools/perf/util/record.h
>> @@ -59,7 +59,13 @@ struct record_opts {
>> unsigned int user_freq;
>> u64 branch_stack;
>> u64 sample_intr_regs;
>> + u64 sample_intr_vec_regs;
>> u64 sample_user_regs;
>> + u64 sample_user_vec_regs;
>> + u16 sample_pred_regs_qwords;
>> + u16 sample_vec_regs_qwords;
>> + u16 sample_intr_pred_regs;
>> + u16 sample_user_pred_regs;
>> u64 default_interval;
>> u64 user_interval;
>> size_t auxtrace_snapshot_size;
>> --
>> 2.34.1
>>
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf
2025-12-04 0:24 ` [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Ian Rogers
@ 2025-12-04 3:28 ` Mi, Dapeng
0 siblings, 0 replies; 55+ messages in thread
From: Mi, Dapeng @ 2025-12-04 3:28 UTC (permalink / raw)
To: Ian Rogers
Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao
On 12/4/2025 8:24 AM, Ian Rogers wrote:
> On Tue, Dec 2, 2025 at 10:58 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
>> [...]
>> Examples:
>> $perf record -I?
>> available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>> R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>> R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
> nit: It seems strange in this output to mix ranges like "XMM0-15" but
> then list out "R8....R31". That said, we have tests that explicitly
> look for the non-range pattern:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/tests/shell/record.sh?h=perf-tools-next#n106
The reason we list each GPR separately is that each GPR, including R16 ~
R31, can be sampled independently, even though the kernel reads the eGPRs
(R16 ~ R31) as a whole by leveraging the XSAVES instruction. SIMD
registers, however, can only be sampled and shown as a whole.
That's why we display the registers in the current format.
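For example, both of the following are accepted, but the SIMD sets cannot
be split any further:

  $perf record -e cycles:p -Ir16,r18 -c 100000 ./test  # individual eGPRs
  $perf record -e cycles:p -Ixmm -c 100000 ./test      # XMM0-15, as a set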
>
> Thanks,
> Ian
>
>> [...]
* Re: [Patch v5 17/19] perf headers: Sync with the kernel headers
2025-12-04 1:37 ` Mi, Dapeng
@ 2025-12-04 7:28 ` Ian Rogers
0 siblings, 0 replies; 55+ messages in thread
From: Ian Rogers @ 2025-12-04 7:28 UTC (permalink / raw)
To: Mi, Dapeng
Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang
On Wed, Dec 3, 2025 at 5:38 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>
>
> On 12/4/2025 7:43 AM, Ian Rogers wrote:
> > On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
> >> From: Kan Liang <kan.liang@linux.intel.com>
> >>
> >> Update include/uapi/linux/perf_event.h and
> >> arch/x86/include/uapi/asm/perf_regs.h to support extended regs.
> >>
> >> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> >> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >> ---
> >> tools/arch/x86/include/uapi/asm/perf_regs.h | 62 +++++++++++++++++++++
> >> tools/include/uapi/linux/perf_event.h | 45 +++++++++++++--
> >> 2 files changed, 103 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/tools/arch/x86/include/uapi/asm/perf_regs.h b/tools/arch/x86/include/uapi/asm/perf_regs.h
> >> index 7c9d2bb3833b..f3561ed10041 100644
> >> --- a/tools/arch/x86/include/uapi/asm/perf_regs.h
> >> +++ b/tools/arch/x86/include/uapi/asm/perf_regs.h
> >> @@ -27,9 +27,34 @@ enum perf_event_x86_regs {
> >> PERF_REG_X86_R13,
> >> PERF_REG_X86_R14,
> >> PERF_REG_X86_R15,
> >> + /*
> >> + * The EGPRs/SSP and XMM have overlaps. Only one can be used
> >> + * at a time. For the ABI type PERF_SAMPLE_REGS_ABI_SIMD,
> >> + * utilize EGPRs/SSP. For the other ABI type, XMM is used.
> >> + *
> >> + * Extended GPRs (EGPRs)
> >> + */
> >> + PERF_REG_X86_R16,
> >> + PERF_REG_X86_R17,
> >> + PERF_REG_X86_R18,
> >> + PERF_REG_X86_R19,
> >> + PERF_REG_X86_R20,
> >> + PERF_REG_X86_R21,
> >> + PERF_REG_X86_R22,
> >> + PERF_REG_X86_R23,
> >> + PERF_REG_X86_R24,
> >> + PERF_REG_X86_R25,
> >> + PERF_REG_X86_R26,
> >> + PERF_REG_X86_R27,
> >> + PERF_REG_X86_R28,
> >> + PERF_REG_X86_R29,
> >> + PERF_REG_X86_R30,
> >> + PERF_REG_X86_R31,
> >> + PERF_REG_X86_SSP,
> >> /* These are the limits for the GPRs. */
> >> PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
> >> PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1,
> >> + PERF_REG_MISC_MAX = PERF_REG_X86_SSP + 1,
> > I wonder whether MISC is the most intention-revealing name. What happens if
> > things are extended again? Would APX be a better alternative, i.e.
> > PERF_REG_APX_MAX?
>
> Hmm, I don't think PERF_REG_APX_MAX is a good name either, since besides
> the APX eGPRs there is SSP as well, and more registers could be introduced
> in the future.
>
> How about PERF_REG_X86_EXTD_MAX?
Sounds good to me, especially since eGPR already uses the term "extended".
Thanks,
Ian
> >
> >> /* These all need two bits set because they are 128bit */
> >> PERF_REG_X86_XMM0 = 32,
> >> @@ -54,5 +79,42 @@ enum perf_event_x86_regs {
> >> };
> >>
> >> #define PERF_REG_EXTENDED_MASK (~((1ULL << PERF_REG_X86_XMM0) - 1))
> >> +#define PERF_X86_EGPRS_MASK GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16)
> >> +
> >> +enum {
> >> + PERF_REG_X86_XMM,
> >> + PERF_REG_X86_YMM,
> >> + PERF_REG_X86_ZMM,
> >> + PERF_REG_X86_MAX_SIMD_REGS,
> >> +
> >> + PERF_REG_X86_OPMASK = 0,
> >> + PERF_REG_X86_MAX_PRED_REGS = 1,
> >> +};
> >> +
> >> +enum {
> >> + PERF_X86_SIMD_XMM_REGS = 16,
> >> + PERF_X86_SIMD_YMM_REGS = 16,
> >> + PERF_X86_SIMD_ZMMH_REGS = 16,
> >> + PERF_X86_SIMD_ZMM_REGS = 32,
> >> + PERF_X86_SIMD_VEC_REGS_MAX = PERF_X86_SIMD_ZMM_REGS,
> >> +
> >> + PERF_X86_SIMD_OPMASK_REGS = 8,
> >> + PERF_X86_SIMD_PRED_REGS_MAX = PERF_X86_SIMD_OPMASK_REGS,
> >> +};
> >> +
> >> +#define PERF_X86_SIMD_PRED_MASK GENMASK(PERF_X86_SIMD_PRED_REGS_MAX - 1, 0)
> >> +#define PERF_X86_SIMD_VEC_MASK GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
> >> +
> >> +#define PERF_X86_H16ZMM_BASE PERF_X86_SIMD_ZMMH_REGS
> >> +
> >> +enum {
> >> + PERF_X86_OPMASK_QWORDS = 1,
> >> + PERF_X86_XMM_QWORDS = 2,
> >> + PERF_X86_YMMH_QWORDS = 2,
> >> + PERF_X86_YMM_QWORDS = 4,
> >> + PERF_X86_ZMMH_QWORDS = 4,
> >> + PERF_X86_ZMM_QWORDS = 8,
> >> + PERF_X86_SIMD_QWORDS_MAX = PERF_X86_ZMM_QWORDS,
> >> +};
> >>
> >> #endif /* _ASM_X86_PERF_REGS_H */
> >> diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
> >> index d292f96bc06f..f1474da32622 100644
> >> --- a/tools/include/uapi/linux/perf_event.h
> >> +++ b/tools/include/uapi/linux/perf_event.h
> >> @@ -314,8 +314,9 @@ enum {
> >> */
> >> enum perf_sample_regs_abi {
> >> PERF_SAMPLE_REGS_ABI_NONE = 0,
> >> - PERF_SAMPLE_REGS_ABI_32 = 1,
> >> - PERF_SAMPLE_REGS_ABI_64 = 2,
> >> + PERF_SAMPLE_REGS_ABI_32 = (1 << 0),
> >> + PERF_SAMPLE_REGS_ABI_64 = (1 << 1),
> >> + PERF_SAMPLE_REGS_ABI_SIMD = (1 << 2),
> >> };
> >>
> >> /*
> >> @@ -382,6 +383,7 @@ enum perf_event_read_format {
> >> #define PERF_ATTR_SIZE_VER6 120 /* Add: aux_sample_size */
> >> #define PERF_ATTR_SIZE_VER7 128 /* Add: sig_data */
> >> #define PERF_ATTR_SIZE_VER8 136 /* Add: config3 */
> >> +#define PERF_ATTR_SIZE_VER9 168 /* Add: sample_simd_{pred,vec}_reg_* */
> > ARM have added a config4 in:
> > https://lore.kernel.org/lkml/20251111-james-perf-feat_spe_eft-v10-1-1e1b5bf2cd05@linaro.org/
> > so this will need to be VER10.
>
> Thanks. It looks like the ARM changes have been merged, so we can change it
> to VER10 in the next version.
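> For context, userspace can detect an older kernel at open time; a minimal
> sketch (the E2BIG size write-back is longstanding perf_event_open()
> behavior, the exact recovery policy here is illustrative):
>
> 	int fd;
>
> 	attr.size = PERF_ATTR_SIZE_VER9;	/* 168 today; VER10 after the rebase */
> 	fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
> 	if (fd < 0 && errno == E2BIG) {
> 		/* the kernel writes the attr size it supports back into
> 		 * attr.size, so the tool can retry without the new fields */
> 	}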
>
>
> >
> > Thanks,
> > Ian
> >
> >> /*
> >> * 'struct perf_event_attr' contains various attributes that define
> >> @@ -545,6 +547,25 @@ struct perf_event_attr {
> >> __u64 sig_data;
> >>
> >> __u64 config3; /* extension of config2 */
> >> +
> >> +
> >> + /*
> >> + * Defines set of SIMD registers to dump on samples.
> >> + * The sample_simd_regs_enabled !=0 implies the
> >> + * set of SIMD registers is used to config all SIMD registers.
> >> + * If !sample_simd_regs_enabled, sample_regs_XXX may be used to
> >> + * config some SIMD registers on X86.
> >> + */
> >> + union {
> >> + __u16 sample_simd_regs_enabled;
> >> + __u16 sample_simd_pred_reg_qwords;
> >> + };
> >> + __u32 sample_simd_pred_reg_intr;
> >> + __u32 sample_simd_pred_reg_user;
> >> + __u16 sample_simd_vec_reg_qwords;
> >> + __u64 sample_simd_vec_reg_intr;
> >> + __u64 sample_simd_vec_reg_user;
> >> + __u32 __reserved_4;
> >> };
> >>
> >> /*
> >> @@ -1018,7 +1039,15 @@ enum perf_event_type {
> >> * } && PERF_SAMPLE_BRANCH_STACK
> >> *
> >> * { u64 abi; # enum perf_sample_regs_abi
> >> - * u64 regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
> >> + * u64 regs[weight(mask)];
> >> + * struct {
> >> + * u16 nr_vectors;
> >> + * u16 vector_qwords;
> >> + * u16 nr_pred;
> >> + * u16 pred_qwords;
> >> + * u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
> >> + * } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
> >> + * } && PERF_SAMPLE_REGS_USER
> >> *
> >> * { u64 size;
> >> * char data[size];
> >> @@ -1045,7 +1074,15 @@ enum perf_event_type {
> >> * { u64 data_src; } && PERF_SAMPLE_DATA_SRC
> >> * { u64 transaction; } && PERF_SAMPLE_TRANSACTION
> >> * { u64 abi; # enum perf_sample_regs_abi
> >> - * u64 regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
> >> + * u64 regs[weight(mask)];
> >> + * struct {
> >> + * u16 nr_vectors;
> >> + * u16 vector_qwords;
> >> + * u16 nr_pred;
> >> + * u16 pred_qwords;
> >> + * u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
> >> + * } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
> >> + * } && PERF_SAMPLE_REGS_INTR
> >> * { u64 phys_addr;} && PERF_SAMPLE_PHYS_ADDR
> >> * { u64 cgroup;} && PERF_SAMPLE_CGROUP
> >> * { u64 data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
> >> --
> >> 2.34.1
> >>
* Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
2025-12-04 2:58 ` Mi, Dapeng
@ 2025-12-04 7:49 ` Ian Rogers
2025-12-04 9:20 ` Mi, Dapeng
0 siblings, 1 reply; 55+ messages in thread
From: Ian Rogers @ 2025-12-04 7:49 UTC (permalink / raw)
To: Mi, Dapeng
Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang
On Wed, Dec 3, 2025 at 6:58 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>
>
> On 12/4/2025 8:17 AM, Ian Rogers wrote:
> > On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
> >> From: Kan Liang <kan.liang@linux.intel.com>
> >>
> >> This patch adds support for the newly introduced SIMD register sampling
> >> format by adding the following functions:
> >>
> >> uint64_t arch__intr_simd_reg_mask(void);
> >> uint64_t arch__user_simd_reg_mask(void);
> >> uint64_t arch__intr_pred_reg_mask(void);
> >> uint64_t arch__user_pred_reg_mask(void);
> >> uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >> uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >> uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >> uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>
> >> The arch__{intr|user}_simd_reg_mask() functions retrieve the bitmap of
> >> supported SIMD registers, such as XMM/YMM/ZMM on x86 platforms.
> >>
> >> The arch__{intr|user}_pred_reg_mask() functions retrieve the bitmap of
> >> supported PRED registers, such as OPMASK on x86 platforms.
> >>
> >> The arch__{intr|user}_simd_reg_bitmap_qwords() functions provide the
> >> exact bitmap and number of qwords for a specific type of SIMD register.
> >> For example, for XMM registers on x86 platforms, the returned bitmap is
> >> 0xffff (XMM0 ~ XMM15) and the qwords number is 2 (128 bits for each XMM).
> >>
> >> The arch__{intr|user}_pred_reg_bitmap_qwords() functions provide the
> >> exact bitmap and number of qwords for a specific type of PRED register.
> >> For example, for OPMASK registers on x86 platforms, the returned bitmap
> >> is 0xff (OPMASK0 ~ OPMASK7) and the qwords number is 1 (64 bits for each
> >> OPMASK).
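> >> In both cases the bitmap and the qwords value together determine the size
> >> of the register payload in each sample. As a sketch of the arithmetic
> >> (the helper name is illustrative):
> >>
> >> 	/* e.g. XMM: 16 bits set * 2 qwords * 8 bytes = 256 bytes */
> >> 	static size_t simd_payload_bytes(uint64_t bitmap, u16 qwords)
> >> 	{
> >> 		return (size_t)hweight64(bitmap) * qwords * sizeof(u64);
> >> 	}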
> >>
> >> Additionally, the function __parse_regs() is enhanced to support parsing
> >> these newly introduced SIMD registers. Currently, each type of register
> >> can only be sampled collectively; sampling a specific SIMD register is
> >> not supported. For example, all XMM registers are sampled together rather
> >> than sampling only XMM0.
> >>
> >> When multiple overlapping register types, such as XMM and YMM, are
> >> sampled simultaneously, only the superset (YMM registers) is sampled.
> >>
> >> With this patch, all supported sampling registers on x86 platforms are
> >> displayed as follows.
> >>
> >> $perf record -I?
> >> available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
> >> R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
> >> R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
> >>
> >> $perf record --user-regs=?
> >> available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
> >> R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
> >> R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
> >>
> >> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> >> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >> ---
> >> tools/perf/arch/x86/util/perf_regs.c | 470 +++++++++++++++++++++-
> >> tools/perf/util/evsel.c | 27 ++
> >> tools/perf/util/parse-regs-options.c | 151 ++++++-
> >> tools/perf/util/perf_event_attr_fprintf.c | 6 +
> >> tools/perf/util/perf_regs.c | 59 +++
> >> tools/perf/util/perf_regs.h | 11 +
> >> tools/perf/util/record.h | 6 +
> >> 7 files changed, 714 insertions(+), 16 deletions(-)
> >>
> >> diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
> >> index 12fd93f04802..db41430f3b07 100644
> >> --- a/tools/perf/arch/x86/util/perf_regs.c
> >> +++ b/tools/perf/arch/x86/util/perf_regs.c
> >> @@ -13,6 +13,49 @@
> >> #include "../../../util/pmu.h"
> >> #include "../../../util/pmus.h"
> >>
> >> +static const struct sample_reg sample_reg_masks_ext[] = {
> >> + SMPL_REG(AX, PERF_REG_X86_AX),
> >> + SMPL_REG(BX, PERF_REG_X86_BX),
> >> + SMPL_REG(CX, PERF_REG_X86_CX),
> >> + SMPL_REG(DX, PERF_REG_X86_DX),
> >> + SMPL_REG(SI, PERF_REG_X86_SI),
> >> + SMPL_REG(DI, PERF_REG_X86_DI),
> >> + SMPL_REG(BP, PERF_REG_X86_BP),
> >> + SMPL_REG(SP, PERF_REG_X86_SP),
> >> + SMPL_REG(IP, PERF_REG_X86_IP),
> >> + SMPL_REG(FLAGS, PERF_REG_X86_FLAGS),
> >> + SMPL_REG(CS, PERF_REG_X86_CS),
> >> + SMPL_REG(SS, PERF_REG_X86_SS),
> >> +#ifdef HAVE_ARCH_X86_64_SUPPORT
> >> + SMPL_REG(R8, PERF_REG_X86_R8),
> >> + SMPL_REG(R9, PERF_REG_X86_R9),
> >> + SMPL_REG(R10, PERF_REG_X86_R10),
> >> + SMPL_REG(R11, PERF_REG_X86_R11),
> >> + SMPL_REG(R12, PERF_REG_X86_R12),
> >> + SMPL_REG(R13, PERF_REG_X86_R13),
> >> + SMPL_REG(R14, PERF_REG_X86_R14),
> >> + SMPL_REG(R15, PERF_REG_X86_R15),
> >> + SMPL_REG(R16, PERF_REG_X86_R16),
> >> + SMPL_REG(R17, PERF_REG_X86_R17),
> >> + SMPL_REG(R18, PERF_REG_X86_R18),
> >> + SMPL_REG(R19, PERF_REG_X86_R19),
> >> + SMPL_REG(R20, PERF_REG_X86_R20),
> >> + SMPL_REG(R21, PERF_REG_X86_R21),
> >> + SMPL_REG(R22, PERF_REG_X86_R22),
> >> + SMPL_REG(R23, PERF_REG_X86_R23),
> >> + SMPL_REG(R24, PERF_REG_X86_R24),
> >> + SMPL_REG(R25, PERF_REG_X86_R25),
> >> + SMPL_REG(R26, PERF_REG_X86_R26),
> >> + SMPL_REG(R27, PERF_REG_X86_R27),
> >> + SMPL_REG(R28, PERF_REG_X86_R28),
> >> + SMPL_REG(R29, PERF_REG_X86_R29),
> >> + SMPL_REG(R30, PERF_REG_X86_R30),
> >> + SMPL_REG(R31, PERF_REG_X86_R31),
> >> + SMPL_REG(SSP, PERF_REG_X86_SSP),
> >> +#endif
> >> + SMPL_REG_END
> >> +};
> >> +
> >> static const struct sample_reg sample_reg_masks[] = {
> >> SMPL_REG(AX, PERF_REG_X86_AX),
> >> SMPL_REG(BX, PERF_REG_X86_BX),
> >> @@ -276,27 +319,404 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
> >> return SDT_ARG_VALID;
> >> }
> >>
> >> +static bool support_simd_reg(u64 sample_type, u16 qwords, u64 mask, bool pred)
> > To make the code easier to read, it'd be nice to document sample_type,
> > qwords and mask here.
>
> Sure.
>
>
> >
> >> +{
> >> + struct perf_event_attr attr = {
> >> + .type = PERF_TYPE_HARDWARE,
> >> + .config = PERF_COUNT_HW_CPU_CYCLES,
> >> + .sample_type = sample_type,
> >> + .disabled = 1,
> >> + .exclude_kernel = 1,
> >> + .sample_simd_regs_enabled = 1,
> >> + };
> >> + int fd;
> >> +
> >> + attr.sample_period = 1;
> >> +
> >> + if (!pred) {
> >> + attr.sample_simd_vec_reg_qwords = qwords;
> >> + if (sample_type == PERF_SAMPLE_REGS_INTR)
> >> + attr.sample_simd_vec_reg_intr = mask;
> >> + else
> >> + attr.sample_simd_vec_reg_user = mask;
> >> + } else {
> >> + attr.sample_simd_pred_reg_qwords = PERF_X86_OPMASK_QWORDS;
> >> + if (sample_type == PERF_SAMPLE_REGS_INTR)
> >> + attr.sample_simd_pred_reg_intr = PERF_X86_SIMD_PRED_MASK;
> >> + else
> >> + attr.sample_simd_pred_reg_user = PERF_X86_SIMD_PRED_MASK;
> >> + }
> >> +
> >> + if (perf_pmus__num_core_pmus() > 1) {
> >> + struct perf_pmu *pmu = NULL;
> >> + __u64 type = PERF_TYPE_RAW;
> > It should be okay to do:
> > __u64 type = perf_pmus__find_core_pmu()->type
> > rather than have the whole loop below.
>
> Sure. Thanks.
>
>
> >
> >> +
> >> + /*
> >> + * The same register set is supported among different hybrid PMUs.
> >> + * Only check the first available one.
> >> + */
> >> + while ((pmu = perf_pmus__scan_core(pmu)) != NULL) {
> >> + type = pmu->type;
> >> + break;
> >> + }
> >> + attr.config |= type << PERF_PMU_TYPE_SHIFT;
> >> + }
> >> +
> >> + event_attr_init(&attr);
> >> +
> >> + fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
> >> + if (fd != -1) {
> >> + close(fd);
> >> + return true;
> >> + }
> >> +
> >> + return false;
> >> +}
> >> +
> >> +static bool __arch_simd_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
> >> +{
> >> + bool supported = false;
> >> + u64 bits;
> >> +
> >> + *mask = 0;
> >> + *qwords = 0;
> >> +
> >> + switch (reg) {
> >> + case PERF_REG_X86_XMM:
> >> + bits = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
> >> + supported = support_simd_reg(sample_type, PERF_X86_XMM_QWORDS, bits, false);
> >> + if (supported) {
> >> + *mask = bits;
> >> + *qwords = PERF_X86_XMM_QWORDS;
> >> + }
> >> + break;
> >> + case PERF_REG_X86_YMM:
> >> + bits = BIT_ULL(PERF_X86_SIMD_YMM_REGS) - 1;
> >> + supported = support_simd_reg(sample_type, PERF_X86_YMM_QWORDS, bits, false);
> >> + if (supported) {
> >> + *mask = bits;
> >> + *qwords = PERF_X86_YMM_QWORDS;
> >> + }
> >> + break;
> >> + case PERF_REG_X86_ZMM:
> >> + bits = BIT_ULL(PERF_X86_SIMD_ZMM_REGS) - 1;
> >> + supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
> >> + if (supported) {
> >> + *mask = bits;
> >> + *qwords = PERF_X86_ZMM_QWORDS;
> >> + break;
> >> + }
> >> +
> >> + bits = BIT_ULL(PERF_X86_SIMD_ZMMH_REGS) - 1;
> >> + supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
> >> + if (supported) {
> >> + *mask = bits;
> >> + *qwords = PERF_X86_ZMMH_QWORDS;
> >> + }
> >> + break;
> >> + default:
> >> + break;
> >> + }
> >> +
> >> + return supported;
> >> +}
> >> +
> >> +static bool __arch_pred_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
> >> +{
> >> + bool supported = false;
> >> + u64 bits;
> >> +
> >> + *mask = 0;
> >> + *qwords = 0;
> >> +
> >> + switch (reg) {
> >> + case PERF_REG_X86_OPMASK:
> >> + bits = BIT_ULL(PERF_X86_SIMD_OPMASK_REGS) - 1;
> >> + supported = support_simd_reg(sample_type, PERF_X86_OPMASK_QWORDS, bits, true);
> >> + if (supported) {
> >> + *mask = bits;
> >> + *qwords = PERF_X86_OPMASK_QWORDS;
> >> + }
> >> + break;
> >> + default:
> >> + break;
> >> + }
> >> +
> >> + return supported;
> >> +}
> >> +
> >> +static bool has_cap_simd_regs(void)
> >> +{
> >> + uint64_t mask = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
> >> + u16 qwords = PERF_X86_XMM_QWORDS;
> >> + static bool has_cap_simd_regs;
> >> + static bool cached;
> >> +
> >> + if (cached)
> >> + return has_cap_simd_regs;
> >> +
> >> + has_cap_simd_regs = __arch_simd_reg_mask(PERF_SAMPLE_REGS_INTR,
> >> + PERF_REG_X86_XMM, &mask, &qwords);
> >> + has_cap_simd_regs |= __arch_simd_reg_mask(PERF_SAMPLE_REGS_USER,
> >> + PERF_REG_X86_XMM, &mask, &qwords);
> >> + cached = true;
> >> +
> >> + return has_cap_simd_regs;
> >> +}
> >> +
> >> +bool arch_has_simd_regs(u64 mask)
> >> +{
> >> + return has_cap_simd_regs() &&
> >> + mask & GENMASK_ULL(PERF_REG_X86_SSP, PERF_REG_X86_R16);
> >> +}
> >> +
> >> +static const struct sample_reg sample_simd_reg_masks[] = {
> >> + SMPL_REG(XMM, PERF_REG_X86_XMM),
> >> + SMPL_REG(YMM, PERF_REG_X86_YMM),
> >> + SMPL_REG(ZMM, PERF_REG_X86_ZMM),
> >> + SMPL_REG_END
> >> +};
> >> +
> >> +static const struct sample_reg sample_pred_reg_masks[] = {
> >> + SMPL_REG(OPMASK, PERF_REG_X86_OPMASK),
> >> + SMPL_REG_END
> >> +};
> >> +
> >> +const struct sample_reg *arch__sample_simd_reg_masks(void)
> >> +{
> >> + return sample_simd_reg_masks;
> >> +}
> >> +
> >> +const struct sample_reg *arch__sample_pred_reg_masks(void)
> >> +{
> >> + return sample_pred_reg_masks;
> >> +}
> >> +
> >> +static bool x86_intr_simd_updated;
> >> +static u64 x86_intr_simd_reg_mask;
> >> +static u64 x86_intr_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
> >> +static u16 x86_intr_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
> > Could we add some comments? I can kind of figure out that 'updated' is a
> > check for lazy initialization and what the masks are; 'qwords' is an odd
> > one. The comment could also point out that SIMD here doesn't mean the
> > machine supports SIMD, but that SIMD registers are supported in perf
> > events.
>
> Sure.
>
>
> >
> >> +static bool x86_user_simd_updated;
> >> +static u64 x86_user_simd_reg_mask;
> >> +static u64 x86_user_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
> >> +static u16 x86_user_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
> >> +
> >> +static bool x86_intr_pred_updated;
> >> +static u64 x86_intr_pred_reg_mask;
> >> +static u64 x86_intr_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
> >> +static u16 x86_intr_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
> >> +static bool x86_user_pred_updated;
> >> +static u64 x86_user_pred_reg_mask;
> >> +static u64 x86_user_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
> >> +static u16 x86_user_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
> >> +
> >> +static uint64_t __arch__simd_reg_mask(u64 sample_type)
> >> +{
> >> + const struct sample_reg *r = NULL;
> >> + bool supported;
> >> + u64 mask = 0;
> >> + int reg;
> >> +
> >> + if (!has_cap_simd_regs())
> >> + return 0;
> >> +
> >> + if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_simd_updated)
> >> + return x86_intr_simd_reg_mask;
> >> +
> >> + if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_simd_updated)
> >> + return x86_user_simd_reg_mask;
> >> +
> >> + for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> >> + supported = false;
> >> +
> >> + if (!r->mask)
> >> + continue;
> >> + reg = fls64(r->mask) - 1;
> >> +
> >> + if (reg >= PERF_REG_X86_MAX_SIMD_REGS)
> >> + break;
> >> + if (sample_type == PERF_SAMPLE_REGS_INTR)
> >> + supported = __arch_simd_reg_mask(sample_type, reg,
> >> + &x86_intr_simd_mask[reg],
> >> + &x86_intr_simd_qwords[reg]);
> >> + else if (sample_type == PERF_SAMPLE_REGS_USER)
> >> + supported = __arch_simd_reg_mask(sample_type, reg,
> >> + &x86_user_simd_mask[reg],
> >> + &x86_user_simd_qwords[reg]);
> >> + if (supported)
> >> + mask |= BIT_ULL(reg);
> >> + }
> >> +
> >> + if (sample_type == PERF_SAMPLE_REGS_INTR) {
> >> + x86_intr_simd_reg_mask = mask;
> >> + x86_intr_simd_updated = true;
> >> + } else {
> >> + x86_user_simd_reg_mask = mask;
> >> + x86_user_simd_updated = true;
> >> + }
> >> +
> >> + return mask;
> >> +}
> >> +
> >> +static uint64_t __arch__pred_reg_mask(u64 sample_type)
> >> +{
> >> + const struct sample_reg *r = NULL;
> >> + bool supported;
> >> + u64 mask = 0;
> >> + int reg;
> >> +
> >> + if (!has_cap_simd_regs())
> >> + return 0;
> >> +
> >> + if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_pred_updated)
> >> + return x86_intr_pred_reg_mask;
> >> +
> >> + if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_pred_updated)
> >> + return x86_user_pred_reg_mask;
> >> +
> >> + for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> >> + supported = false;
> >> +
> >> + if (!r->mask)
> >> + continue;
> >> + reg = fls64(r->mask) - 1;
> >> +
> >> + if (reg >= PERF_REG_X86_MAX_PRED_REGS)
> >> + break;
> >> + if (sample_type == PERF_SAMPLE_REGS_INTR)
> >> + supported = __arch_pred_reg_mask(sample_type, reg,
> >> + &x86_intr_pred_mask[reg],
> >> + &x86_intr_pred_qwords[reg]);
> >> + else if (sample_type == PERF_SAMPLE_REGS_USER)
> >> + supported = __arch_pred_reg_mask(sample_type, reg,
> >> + &x86_user_pred_mask[reg],
> >> + &x86_user_pred_qwords[reg]);
> >> + if (supported)
> >> + mask |= BIT_ULL(reg);
> >> + }
> >> +
> >> + if (sample_type == PERF_SAMPLE_REGS_INTR) {
> >> + x86_intr_pred_reg_mask = mask;
> >> + x86_intr_pred_updated = true;
> >> + } else {
> >> + x86_user_pred_reg_mask = mask;
> >> + x86_user_pred_updated = true;
> >> + }
> >> +
> >> + return mask;
> >> +}
> > This feels repetitive with __arch__simd_reg_mask; could they be
> > refactored together?
>
> Hmm, it looks like we can extract the for loop into a common function. The
> other parts are hard to generalize since they manipulate different
> variables. If we tried to generalize them, we would have to introduce lots
> of "if ... else" branches, which would make the code hard to read.
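> Something like this for the shared loop, just as a sketch (the probe
> callback would be __arch_simd_reg_mask() or __arch_pred_reg_mask(), the
> arrays the intr or user ones; the helper name is illustrative):
>
> 	static u64 reg_mask_scan(u64 sample_type, const struct sample_reg *regs,
> 				 int max_regs, u64 *masks, u16 *qwords,
> 				 bool (*probe)(u64, int, uint64_t *, u16 *))
> 	{
> 		u64 mask = 0;
>
> 		for (; regs->name; regs++) {
> 			int reg;
>
> 			if (!regs->mask)
> 				continue;
> 			reg = fls64(regs->mask) - 1;
> 			if (reg >= max_regs)
> 				break;
> 			if (probe(sample_type, reg, &masks[reg], &qwords[reg]))
> 				mask |= BIT_ULL(reg);
> 		}
> 		return mask;
> 	}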
>
>
> >
> >> +
> >> +uint64_t arch__intr_simd_reg_mask(void)
> >> +{
> >> + return __arch__simd_reg_mask(PERF_SAMPLE_REGS_INTR);
> >> +}
> >> +
> >> +uint64_t arch__user_simd_reg_mask(void)
> >> +{
> >> + return __arch__simd_reg_mask(PERF_SAMPLE_REGS_USER);
> >> +}
> >> +
> >> +uint64_t arch__intr_pred_reg_mask(void)
> >> +{
> >> + return __arch__pred_reg_mask(PERF_SAMPLE_REGS_INTR);
> >> +}
> >> +
> >> +uint64_t arch__user_pred_reg_mask(void)
> >> +{
> >> + return __arch__pred_reg_mask(PERF_SAMPLE_REGS_USER);
> >> +}
> >> +
> >> +static uint64_t arch__simd_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
> >> +{
> >> + uint64_t mask = 0;
> >> +
> >> + *qwords = 0;
> >> + if (reg < PERF_REG_X86_MAX_SIMD_REGS) {
> >> + if (intr) {
> >> + *qwords = x86_intr_simd_qwords[reg];
> >> + mask = x86_intr_simd_mask[reg];
> >> + } else {
> >> + *qwords = x86_user_simd_qwords[reg];
> >> + mask = x86_user_simd_mask[reg];
> >> + }
> >> + }
> >> +
> >> + return mask;
> >> +}
> >> +
> >> +static uint64_t arch__pred_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
> >> +{
> >> + uint64_t mask = 0;
> >> +
> >> + *qwords = 0;
> >> + if (reg < PERF_REG_X86_MAX_PRED_REGS) {
> >> + if (intr) {
> >> + *qwords = x86_intr_pred_qwords[reg];
> >> + mask = x86_intr_pred_mask[reg];
> >> + } else {
> >> + *qwords = x86_user_pred_qwords[reg];
> >> + mask = x86_user_pred_mask[reg];
> >> + }
> >> + }
> >> +
> >> + return mask;
> >> +}
> >> +
> >> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords)
> >> +{
> >> + if (!x86_intr_simd_updated)
> >> + arch__intr_simd_reg_mask();
> >> + return arch__simd_reg_bitmap_qwords(reg, qwords, true);
> >> +}
> >> +
> >> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords)
> >> +{
> >> + if (!x86_user_simd_updated)
> >> + arch__user_simd_reg_mask();
> >> + return arch__simd_reg_bitmap_qwords(reg, qwords, false);
> >> +}
> >> +
> >> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords)
> >> +{
> >> + if (!x86_intr_pred_updated)
> >> + arch__intr_pred_reg_mask();
> >> + return arch__pred_reg_bitmap_qwords(reg, qwords, true);
> >> +}
> >> +
> >> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords)
> >> +{
> >> + if (!x86_user_pred_updated)
> >> + arch__user_pred_reg_mask();
> >> + return arch__pred_reg_bitmap_qwords(reg, qwords, false);
> >> +}
> >> +
> >> const struct sample_reg *arch__sample_reg_masks(void)
> >> {
> >> + if (has_cap_simd_regs())
> >> + return sample_reg_masks_ext;
> >> return sample_reg_masks;
> >> }
> >>
> >> -uint64_t arch__intr_reg_mask(void)
> >> +static uint64_t __arch__reg_mask(u64 sample_type, u64 mask, bool has_simd_regs)
> >> {
> >> struct perf_event_attr attr = {
> >> - .type = PERF_TYPE_HARDWARE,
> >> - .config = PERF_COUNT_HW_CPU_CYCLES,
> >> - .sample_type = PERF_SAMPLE_REGS_INTR,
> >> - .sample_regs_intr = PERF_REG_EXTENDED_MASK,
> >> - .precise_ip = 1,
> >> - .disabled = 1,
> >> - .exclude_kernel = 1,
> >> + .type = PERF_TYPE_HARDWARE,
> >> + .config = PERF_COUNT_HW_CPU_CYCLES,
> >> + .sample_type = sample_type,
> >> + .precise_ip = 1,
> >> + .disabled = 1,
> >> + .exclude_kernel = 1,
> >> + .sample_simd_regs_enabled = has_simd_regs,
> >> };
> >> int fd;
> >> /*
> >> * In an unnamed union, init it here to build on older gcc versions
> >> */
> >> attr.sample_period = 1;
> >> + if (sample_type == PERF_SAMPLE_REGS_INTR)
> >> + attr.sample_regs_intr = mask;
> >> + else
> >> + attr.sample_regs_user = mask;
> >>
> >> if (perf_pmus__num_core_pmus() > 1) {
> >> struct perf_pmu *pmu = NULL;
> >> @@ -318,13 +738,41 @@ uint64_t arch__intr_reg_mask(void)
> >> fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
> >> if (fd != -1) {
> >> close(fd);
> >> - return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
> >> + return mask;
> >> }
> >>
> >> - return PERF_REGS_MASK;
> >> + return 0;
> >> +}
> >> +
> >> +uint64_t arch__intr_reg_mask(void)
> >> +{
> >> + uint64_t mask = PERF_REGS_MASK;
> >> +
> >> + if (has_cap_simd_regs()) {
> >> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
> >> + GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
> >> + true);
> > It's nice to label constant arguments like this with something like:
> > /*has_simd_regs=*/true);
> >
> > Tools like clang-tidy even try to enforce that the argument names match the comments.
>
> Sure.
>
>
> >
> >> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
> >> + BIT_ULL(PERF_REG_X86_SSP),
> >> + true);
> >> + } else
> >> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR, PERF_REG_EXTENDED_MASK, false);
> >> +
> >> + return mask;
> >> }
> >>
> >> uint64_t arch__user_reg_mask(void)
> >> {
> >> - return PERF_REGS_MASK;
> >> + uint64_t mask = PERF_REGS_MASK;
> >> +
> >> + if (has_cap_simd_regs()) {
> >> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
> >> + GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
> >> + true);
> >> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
> >> + BIT_ULL(PERF_REG_X86_SSP),
> >> + true);
> >> + }
> >> +
> >> + return mask;
> > The code is repetitive here; could we refactor it into a single function,
> > passing in a user or intr value?
>
> Sure. I'll extract the common part.
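> A sketch of what the extracted common part could look like (keeping the
> XMM fallback intr-only, as today; the name is illustrative):
>
> 	static uint64_t x86_reg_mask(u64 sample_type)
> 	{
> 		uint64_t mask = PERF_REGS_MASK;
>
> 		if (has_cap_simd_regs()) {
> 			mask |= __arch__reg_mask(sample_type,
> 						 GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
> 						 /*has_simd_regs=*/true);
> 			mask |= __arch__reg_mask(sample_type,
> 						 BIT_ULL(PERF_REG_X86_SSP),
> 						 /*has_simd_regs=*/true);
> 		} else if (sample_type == PERF_SAMPLE_REGS_INTR) {
> 			mask |= __arch__reg_mask(sample_type, PERF_REG_EXTENDED_MASK,
> 						 /*has_simd_regs=*/false);
> 		}
> 		return mask;
> 	}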
>
>
> >
> >> }
> >> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> >> index 56ebefd075f2..5d1d90cf9488 100644
> >> --- a/tools/perf/util/evsel.c
> >> +++ b/tools/perf/util/evsel.c
> >> @@ -1461,12 +1461,39 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
> >> if (opts->sample_intr_regs && !evsel->no_aux_samples &&
> >> !evsel__is_dummy_event(evsel)) {
> >> attr->sample_regs_intr = opts->sample_intr_regs;
> >> + attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_intr);
> >> + evsel__set_sample_bit(evsel, REGS_INTR);
> >> + }
> >> +
> >> + if ((opts->sample_intr_vec_regs || opts->sample_intr_pred_regs) &&
> >> + !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
> >> + /* A non-zero pred qwords implies the set of SIMD registers is used */
> >> + if (opts->sample_pred_regs_qwords)
> >> + attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
> >> + else
> >> + attr->sample_simd_pred_reg_qwords = 1;
> >> + attr->sample_simd_vec_reg_intr = opts->sample_intr_vec_regs;
> >> + attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
> >> + attr->sample_simd_pred_reg_intr = opts->sample_intr_pred_regs;
> >> evsel__set_sample_bit(evsel, REGS_INTR);
> >> }
> >>
> >> if (opts->sample_user_regs && !evsel->no_aux_samples &&
> >> !evsel__is_dummy_event(evsel)) {
> >> attr->sample_regs_user |= opts->sample_user_regs;
> >> + attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_user);
> >> + evsel__set_sample_bit(evsel, REGS_USER);
> >> + }
> >> +
> >> + if ((opts->sample_user_vec_regs || opts->sample_user_pred_regs) &&
> >> + !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
> >> + if (opts->sample_pred_regs_qwords)
> >> + attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
> >> + else
> >> + attr->sample_simd_pred_reg_qwords = 1;
> >> + attr->sample_simd_vec_reg_user = opts->sample_user_vec_regs;
> >> + attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
> >> + attr->sample_simd_pred_reg_user = opts->sample_user_pred_regs;
> >> evsel__set_sample_bit(evsel, REGS_USER);
> >> }
> >>
> >> diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
> >> index cda1c620968e..0bd100392889 100644
> >> --- a/tools/perf/util/parse-regs-options.c
> >> +++ b/tools/perf/util/parse-regs-options.c
> >> @@ -4,19 +4,139 @@
> >> #include <stdint.h>
> >> #include <string.h>
> >> #include <stdio.h>
> >> +#include <linux/bitops.h>
> >> #include "util/debug.h"
> >> #include <subcmd/parse-options.h>
> >> #include "util/perf_regs.h"
> >> #include "util/parse-regs-options.h"
> >> +#include "record.h"
> >> +
> >> +static void __print_simd_regs(bool intr, uint64_t simd_mask)
> >> +{
> >> + const struct sample_reg *r = NULL;
> >> + uint64_t bitmap = 0;
> >> + u16 qwords = 0;
> >> + int reg_idx;
> >> +
> >> + if (!simd_mask)
> >> + return;
> >> +
> >> + for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> >> + if (!(r->mask & simd_mask))
> >> + continue;
> >> + reg_idx = fls64(r->mask) - 1;
> >> + if (intr)
> >> + bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >> + else
> >> + bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >> + if (bitmap)
> >> + fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
> >> + }
> >> +}
> >> +
> >> +static void __print_pred_regs(bool intr, uint64_t pred_mask)
> >> +{
> >> + const struct sample_reg *r = NULL;
> >> + uint64_t bitmap = 0;
> >> + u16 qwords = 0;
> >> + int reg_idx;
> >> +
> >> + if (!pred_mask)
> >> + return;
> >> +
> >> + for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> >> + if (!(r->mask & pred_mask))
> >> + continue;
> >> + reg_idx = fls64(r->mask) - 1;
> >> + if (intr)
> >> + bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >> + else
> >> + bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >> + if (bitmap)
> >> + fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
> >> + }
> >> +}
> >> +
> >> +static bool __parse_simd_regs(struct record_opts *opts, char *s, bool intr)
> >> +{
> >> + const struct sample_reg *r = NULL;
> >> + bool matched = false;
> >> + uint64_t bitmap = 0;
> >> + u16 qwords = 0;
> >> + int reg_idx;
> >> +
> >> + for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> >> + if (strcasecmp(s, r->name))
> >> + continue;
> >> + if (!fls64(r->mask))
> >> + continue;
> >> + reg_idx = fls64(r->mask) - 1;
> >> + if (intr)
> >> + bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >> + else
> >> + bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >> + matched = true;
> >> + break;
> >> + }
> >> +
> >> + /* Just need the highest qwords */
> > I'm not following here. Does the bitmap need to handle gaps?
>
> Currently no. In theory, the kernel allows user space to sample only a
> subset of the SIMD registers, e.g., 0xff or 0xf0f for the XMM registers (HW
> supports 16 XMM registers on x86_64), but the tool doesn't support that, to
> avoid introducing too much complexity in perf tools. Moreover, I don't
> think end users have such a requirement. In most cases, users only know
> which kinds of SIMD registers their programs use; they usually don't know
> or care which exact SIMD register is used.
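> For instance, these theoretical subset configurations are something the
> kernel would accept but the tool deliberately never generates (values
> illustrative):
>
> 	attr.sample_simd_vec_reg_intr = 0xff;	/* XMM0-XMM7 only */
> or
> 	attr.sample_simd_vec_reg_intr = 0xf0f;	/* XMM0-3 and XMM8-11, with a gap */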
>
>
> >
> >> + if (qwords > opts->sample_vec_regs_qwords) {
> >> + opts->sample_vec_regs_qwords = qwords;
> >> + if (intr)
> >> + opts->sample_intr_vec_regs = bitmap;
> >> + else
> >> + opts->sample_user_vec_regs = bitmap;
> >> + }
> >> +
> >> + return matched;
> >> +}
> >> +
> >> +static bool __parse_pred_regs(struct record_opts *opts, char *s, bool intr)
> >> +{
> >> + const struct sample_reg *r = NULL;
> >> + bool matched = false;
> >> + uint64_t bitmap = 0;
> >> + u16 qwords = 0;
> >> + int reg_idx;
> >> +
> >> + for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> >> + if (strcasecmp(s, r->name))
> >> + continue;
> >> + if (!fls64(r->mask))
> >> + continue;
> >> + reg_idx = fls64(r->mask) - 1;
> >> + if (intr)
> >> + bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >> + else
> >> + bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >> + matched = true;
> >> + break;
> >> + }
> >> +
> >> + /* Just need the highest qwords */
> > Again repetitive; could we have a single function?
>
> Yes, I suppose the for loop at least can be extracted into a common function.
>
>
> >
> >> + if (qwords > opts->sample_pred_regs_qwords) {
> >> + opts->sample_pred_regs_qwords = qwords;
> >> + if (intr)
> >> + opts->sample_intr_pred_regs = bitmap;
> >> + else
> >> + opts->sample_user_pred_regs = bitmap;
> >> + }
> >> +
> >> + return matched;
> >> +}
> >>
> >> static int
> >> __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >> {
> >> uint64_t *mode = (uint64_t *)opt->value;
> >> const struct sample_reg *r = NULL;
> >> + struct record_opts *opts;
> >> char *s, *os = NULL, *p;
> >> - int ret = -1;
> >> + bool has_simd_regs = false;
> >> uint64_t mask;
> >> + uint64_t simd_mask;
> >> + uint64_t pred_mask;
> >> + int ret = -1;
> >>
> >> if (unset)
> >> return 0;
> >> @@ -27,10 +147,17 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >> if (*mode)
> >> return -1;
> >>
> >> - if (intr)
> >> + if (intr) {
> >> + opts = container_of(opt->value, struct record_opts, sample_intr_regs);
> >> mask = arch__intr_reg_mask();
> >> - else
> >> + simd_mask = arch__intr_simd_reg_mask();
> >> + pred_mask = arch__intr_pred_reg_mask();
> >> + } else {
> >> + opts = container_of(opt->value, struct record_opts, sample_user_regs);
> >> mask = arch__user_reg_mask();
> >> + simd_mask = arch__user_simd_reg_mask();
> >> + pred_mask = arch__user_pred_reg_mask();
> >> + }
> >>
> >> /* str may be NULL in case no arg is passed to -I */
> >> if (str) {
> >> @@ -50,10 +177,24 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >> if (r->mask & mask)
> >> fprintf(stderr, "%s ", r->name);
> >> }
> >> + __print_simd_regs(intr, simd_mask);
> >> + __print_pred_regs(intr, pred_mask);
> >> fputc('\n', stderr);
> >> /* just printing available regs */
> >> goto error;
> >> }
> >> +
> >> + if (simd_mask) {
> >> + has_simd_regs = __parse_simd_regs(opts, s, intr);
> >> + if (has_simd_regs)
> >> + goto next;
> >> + }
> >> + if (pred_mask) {
> >> + has_simd_regs = __parse_pred_regs(opts, s, intr);
> >> + if (has_simd_regs)
> >> + goto next;
> >> + }
> >> +
> >> for (r = arch__sample_reg_masks(); r->name; r++) {
> >> if ((r->mask & mask) && !strcasecmp(s, r->name))
> >> break;
> >> @@ -65,7 +206,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >> }
> >>
> >> *mode |= r->mask;
> >> -
> >> +next:
> >> if (!p)
> >> break;
> >>
> >> @@ -75,7 +216,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >> ret = 0;
> >>
> >> /* default to all possible regs */
> >> - if (*mode == 0)
> >> + if (*mode == 0 && !has_simd_regs)
> >> *mode = mask;
> >> error:
> >> free(os);
> >> diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
> >> index 66b666d9ce64..fb0366d050cf 100644
> >> --- a/tools/perf/util/perf_event_attr_fprintf.c
> >> +++ b/tools/perf/util/perf_event_attr_fprintf.c
> >> @@ -360,6 +360,12 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
> >> PRINT_ATTRf(aux_start_paused, p_unsigned);
> >> PRINT_ATTRf(aux_pause, p_unsigned);
> >> PRINT_ATTRf(aux_resume, p_unsigned);
> >> + PRINT_ATTRf(sample_simd_pred_reg_qwords, p_unsigned);
> >> + PRINT_ATTRf(sample_simd_pred_reg_intr, p_hex);
> >> + PRINT_ATTRf(sample_simd_pred_reg_user, p_hex);
> >> + PRINT_ATTRf(sample_simd_vec_reg_qwords, p_unsigned);
> >> + PRINT_ATTRf(sample_simd_vec_reg_intr, p_hex);
> >> + PRINT_ATTRf(sample_simd_vec_reg_user, p_hex);
> >>
> >> return ret;
> >> }
> >> diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
> >> index 44b90bbf2d07..e8a9fabc92e6 100644
> >> --- a/tools/perf/util/perf_regs.c
> >> +++ b/tools/perf/util/perf_regs.c
> >> @@ -11,6 +11,11 @@ int __weak arch_sdt_arg_parse_op(char *old_op __maybe_unused,
> >> return SDT_ARG_SKIP;
> >> }
> >>
> >> +bool __weak arch_has_simd_regs(u64 mask __maybe_unused)
> >> +{
> >> + return false;
> >> +}
> >> +
> >> uint64_t __weak arch__intr_reg_mask(void)
> >> {
> >> return 0;
> >> @@ -21,6 +26,50 @@ uint64_t __weak arch__user_reg_mask(void)
> >> return 0;
> >> }
> >>
> >> +uint64_t __weak arch__intr_simd_reg_mask(void)
> >> +{
> >> + return 0;
> >> +}
> >> +
> >> +uint64_t __weak arch__user_simd_reg_mask(void)
> >> +{
> >> + return 0;
> >> +}
> >> +
> >> +uint64_t __weak arch__intr_pred_reg_mask(void)
> >> +{
> >> + return 0;
> >> +}
> >> +
> >> +uint64_t __weak arch__user_pred_reg_mask(void)
> >> +{
> >> + return 0;
> >> +}
> >> +
> >> +uint64_t __weak arch__intr_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> >> +{
> >> + *qwords = 0;
> >> + return 0;
> >> +}
> >> +
> >> +uint64_t __weak arch__user_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> >> +{
> >> + *qwords = 0;
> >> + return 0;
> >> +}
> >> +
> >> +uint64_t __weak arch__intr_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> >> +{
> >> + *qwords = 0;
> >> + return 0;
> >> +}
> >> +
> >> +uint64_t __weak arch__user_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> >> +{
> >> + *qwords = 0;
> >> + return 0;
> >> +}
> >> +
> >> static const struct sample_reg sample_reg_masks[] = {
> >> SMPL_REG_END
> >> };
> >> @@ -30,6 +79,16 @@ const struct sample_reg * __weak arch__sample_reg_masks(void)
> >> return sample_reg_masks;
> >> }
> >>
> >> +const struct sample_reg * __weak arch__sample_simd_reg_masks(void)
> >> +{
> >> + return sample_reg_masks;
> >> +}
> >> +
> >> +const struct sample_reg * __weak arch__sample_pred_reg_masks(void)
> >> +{
> >> + return sample_reg_masks;
> >> +}
> > Thinking out loud. I wonder if there is a way to hide the weak
> > functions. It seems the support is tied to PMUs, particularly core
> > PMUs, perhaps we can push things into pmu and arch pmu code. Then we
> > ask the PMU to parse the register strings, set up the perf_event_attr,
> > etc. I'm somewhat scared these functions will be used on the report
> > rather than record side of things, thereby breaking perf.data support
> > when the host kernel does or doesn't have the SIMD support.
>
> Ian, I don't quite follow you.
>
> I don't quite understand what we should do to "push things into pmu and
> arch pmu code". The current SIMD register support follows the same approach
> as the general register support. If we intend to change the approach
> entirely, we'd better do that in an independent patch set.
>
> Why would these functions break the perf.data report? perf-report checks
> whether the PERF_SAMPLE_REGS_ABI_SIMD flag is set for each record; only
> when the flag is set (indicating that SIMD register data is appended to the
> record) does perf-report try to parse the SIMD register data.
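> As a sketch of that consumer side, following the uapi layout quoted in the
> header-sync patch (nr_vectors/vector_qwords/nr_pred/pred_qwords followed by
> data[]; the struct and function names here are illustrative):
>
> 	struct simd_regs_hdr {
> 		__u16 nr_vectors;
> 		__u16 vector_qwords;
> 		__u16 nr_pred;
> 		__u16 pred_qwords;
> 	};
>
> 	static const __u64 *skip_simd_block(const __u64 *p, __u64 abi)
> 	{
> 		if (abi & PERF_SAMPLE_REGS_ABI_SIMD) {
> 			const struct simd_regs_hdr *h = (const void *)p;
>
> 			p++;	/* the four __u16 fields occupy one __u64 */
> 			p += h->nr_vectors * h->vector_qwords +
> 			     h->nr_pred * h->pred_qwords;
> 		}
> 		return p;	/* now positioned past the SIMD data, if any */
> 	}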
Thanks Dapeng, sorry I wasn't clear. So, I've landed cleanups to
remove weak symbols like:
https://lore.kernel.org/lkml/20250724163302.596743-21-irogers@google.com/#t
For these patches, what I'm imagining is that there is a Nova Lake
generated perf.data file. Using perf report, script, etc. on the Nova
Lake should expose all of the same mask, qword, etc. values as when
the perf.data was generated, and so things will work. If the perf.data
file was taken to, say, my Alderlake, then what will happen? Generally,
using the arch directory and weak symbols is a code smell that
cross-platform things are going to break - there should be sufficient
data in the event and the perf_event_attr to fully decode what's going
on. Sometimes tying things to a PMU name can avoid the use of the arch
directory. We were able to avoid the arch directory to a good extent
for the TPEBS code, even though it is a very modern Intel feature.
Thanks,
Ian
> >
> > Thanks,
> > Ian
> >
> >> +
> >> const char *perf_reg_name(int id, const char *arch)
> >> {
> >> const char *reg_name = NULL;
> >> diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
> >> index f2d0736d65cc..bce9c4cfd1bf 100644
> >> --- a/tools/perf/util/perf_regs.h
> >> +++ b/tools/perf/util/perf_regs.h
> >> @@ -24,9 +24,20 @@ enum {
> >> };
> >>
> >> int arch_sdt_arg_parse_op(char *old_op, char **new_op);
> >> +bool arch_has_simd_regs(u64 mask);
> >> uint64_t arch__intr_reg_mask(void);
> >> uint64_t arch__user_reg_mask(void);
> >> const struct sample_reg *arch__sample_reg_masks(void);
> >> +const struct sample_reg *arch__sample_simd_reg_masks(void);
> >> +const struct sample_reg *arch__sample_pred_reg_masks(void);
> >> +uint64_t arch__intr_simd_reg_mask(void);
> >> +uint64_t arch__user_simd_reg_mask(void);
> >> +uint64_t arch__intr_pred_reg_mask(void);
> >> +uint64_t arch__user_pred_reg_mask(void);
> >> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>
> >> const char *perf_reg_name(int id, const char *arch);
> >> int perf_reg_value(u64 *valp, struct regs_dump *regs, int id);
> >> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
> >> index ea3a6c4657ee..825ffb4cc53f 100644
> >> --- a/tools/perf/util/record.h
> >> +++ b/tools/perf/util/record.h
> >> @@ -59,7 +59,13 @@ struct record_opts {
> >> unsigned int user_freq;
> >> u64 branch_stack;
> >> u64 sample_intr_regs;
> >> + u64 sample_intr_vec_regs;
> >> u64 sample_user_regs;
> >> + u64 sample_user_vec_regs;
> >> + u16 sample_pred_regs_qwords;
> >> + u16 sample_vec_regs_qwords;
> >> + u16 sample_intr_pred_regs;
> >> + u16 sample_user_pred_regs;
> >> u64 default_interval;
> >> u64 user_interval;
> >> size_t auxtrace_snapshot_size;
> >> --
> >> 2.34.1
> >>
* Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
2025-12-04 7:49 ` Ian Rogers
@ 2025-12-04 9:20 ` Mi, Dapeng
2025-12-04 16:16 ` Ian Rogers
0 siblings, 1 reply; 55+ messages in thread
From: Mi, Dapeng @ 2025-12-04 9:20 UTC (permalink / raw)
To: Ian Rogers
Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang
On 12/4/2025 3:49 PM, Ian Rogers wrote:
> On Wed, Dec 3, 2025 at 6:58 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>
>> On 12/4/2025 8:17 AM, Ian Rogers wrote:
>>> On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
>>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>>
>>>> This patch adds support for the newly introduced SIMD register sampling
>>>> format by adding the following functions:
>>>>
>>>> uint64_t arch__intr_simd_reg_mask(void);
>>>> uint64_t arch__user_simd_reg_mask(void);
>>>> uint64_t arch__intr_pred_reg_mask(void);
>>>> uint64_t arch__user_pred_reg_mask(void);
>>>> uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>> uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>> uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>> uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>
>>>> The arch__{intr|user}_simd_reg_mask() functions retrieve the bitmap of
>>>> supported SIMD registers, such as XMM/YMM/ZMM on x86 platforms.
>>>>
>>>> The arch__{intr|user}_pred_reg_mask() functions retrieve the bitmap of
>>>> supported PRED registers, such as OPMASK on x86 platforms.
>>>>
>>>> The arch__{intr|user}_simd_reg_bitmap_qwords() functions provide the
>>>> exact bitmap and number of qwords for a specific type of SIMD register.
>>>> For example, for XMM registers on x86 platforms, the returned bitmap is
>>>> 0xffff (XMM0 ~ XMM15) and the qwords number is 2 (128 bits for each XMM).
>>>>
>>>> The arch__{intr|user}_pred_reg_bitmap_qwords() functions provide the
>>>> exact bitmap and number of qwords for a specific type of PRED register.
>>>> For example, for OPMASK registers on x86 platforms, the returned bitmap
>>>> is 0xff (OPMASK0 ~ OPMASK7) and the qwords number is 1 (64 bits for each
>>>> OPMASK).
>>>>
>>>> Additionally, the function __parse_regs() is enhanced to support parsing
>>>> these newly introduced SIMD registers. Currently, each type of register
>>>> can only be sampled collectively; sampling a specific SIMD register is
>>>> not supported. For example, all XMM registers are sampled together rather
>>>> than sampling only XMM0.
>>>>
>>>> When multiple overlapping register types, such as XMM and YMM, are
>>>> sampled simultaneously, only the superset (YMM registers) is sampled.
>>>>
>>>> With this patch, all supported sampling registers on x86 platforms are
>>>> displayed as follows.
>>>>
>>>> $perf record -I?
>>>> available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>>>> R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>>>> R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>>>>
>>>> $perf record --user-regs=?
>>>> available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>>>> R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>>>> R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>>>>
>>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>>>> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>>> ---
>>>> tools/perf/arch/x86/util/perf_regs.c | 470 +++++++++++++++++++++-
>>>> tools/perf/util/evsel.c | 27 ++
>>>> tools/perf/util/parse-regs-options.c | 151 ++++++-
>>>> tools/perf/util/perf_event_attr_fprintf.c | 6 +
>>>> tools/perf/util/perf_regs.c | 59 +++
>>>> tools/perf/util/perf_regs.h | 11 +
>>>> tools/perf/util/record.h | 6 +
>>>> 7 files changed, 714 insertions(+), 16 deletions(-)
>>>>
>>>> diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
>>>> index 12fd93f04802..db41430f3b07 100644
>>>> --- a/tools/perf/arch/x86/util/perf_regs.c
>>>> +++ b/tools/perf/arch/x86/util/perf_regs.c
>>>> @@ -13,6 +13,49 @@
>>>> #include "../../../util/pmu.h"
>>>> #include "../../../util/pmus.h"
>>>>
>>>> +static const struct sample_reg sample_reg_masks_ext[] = {
>>>> + SMPL_REG(AX, PERF_REG_X86_AX),
>>>> + SMPL_REG(BX, PERF_REG_X86_BX),
>>>> + SMPL_REG(CX, PERF_REG_X86_CX),
>>>> + SMPL_REG(DX, PERF_REG_X86_DX),
>>>> + SMPL_REG(SI, PERF_REG_X86_SI),
>>>> + SMPL_REG(DI, PERF_REG_X86_DI),
>>>> + SMPL_REG(BP, PERF_REG_X86_BP),
>>>> + SMPL_REG(SP, PERF_REG_X86_SP),
>>>> + SMPL_REG(IP, PERF_REG_X86_IP),
>>>> + SMPL_REG(FLAGS, PERF_REG_X86_FLAGS),
>>>> + SMPL_REG(CS, PERF_REG_X86_CS),
>>>> + SMPL_REG(SS, PERF_REG_X86_SS),
>>>> +#ifdef HAVE_ARCH_X86_64_SUPPORT
>>>> + SMPL_REG(R8, PERF_REG_X86_R8),
>>>> + SMPL_REG(R9, PERF_REG_X86_R9),
>>>> + SMPL_REG(R10, PERF_REG_X86_R10),
>>>> + SMPL_REG(R11, PERF_REG_X86_R11),
>>>> + SMPL_REG(R12, PERF_REG_X86_R12),
>>>> + SMPL_REG(R13, PERF_REG_X86_R13),
>>>> + SMPL_REG(R14, PERF_REG_X86_R14),
>>>> + SMPL_REG(R15, PERF_REG_X86_R15),
>>>> + SMPL_REG(R16, PERF_REG_X86_R16),
>>>> + SMPL_REG(R17, PERF_REG_X86_R17),
>>>> + SMPL_REG(R18, PERF_REG_X86_R18),
>>>> + SMPL_REG(R19, PERF_REG_X86_R19),
>>>> + SMPL_REG(R20, PERF_REG_X86_R20),
>>>> + SMPL_REG(R21, PERF_REG_X86_R21),
>>>> + SMPL_REG(R22, PERF_REG_X86_R22),
>>>> + SMPL_REG(R23, PERF_REG_X86_R23),
>>>> + SMPL_REG(R24, PERF_REG_X86_R24),
>>>> + SMPL_REG(R25, PERF_REG_X86_R25),
>>>> + SMPL_REG(R26, PERF_REG_X86_R26),
>>>> + SMPL_REG(R27, PERF_REG_X86_R27),
>>>> + SMPL_REG(R28, PERF_REG_X86_R28),
>>>> + SMPL_REG(R29, PERF_REG_X86_R29),
>>>> + SMPL_REG(R30, PERF_REG_X86_R30),
>>>> + SMPL_REG(R31, PERF_REG_X86_R31),
>>>> + SMPL_REG(SSP, PERF_REG_X86_SSP),
>>>> +#endif
>>>> + SMPL_REG_END
>>>> +};
>>>> +
>>>> static const struct sample_reg sample_reg_masks[] = {
>>>> SMPL_REG(AX, PERF_REG_X86_AX),
>>>> SMPL_REG(BX, PERF_REG_X86_BX),
>>>> @@ -276,27 +319,404 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
>>>> return SDT_ARG_VALID;
>>>> }
>>>>
>>>> +static bool support_simd_reg(u64 sample_type, u16 qwords, u64 mask, bool pred)
>>> To make the code easier to read, it'd be nice to document sample_type,
>>> qwords and mask here.
>> Sure.
>>
>>
>>>> +{
>>>> + struct perf_event_attr attr = {
>>>> + .type = PERF_TYPE_HARDWARE,
>>>> + .config = PERF_COUNT_HW_CPU_CYCLES,
>>>> + .sample_type = sample_type,
>>>> + .disabled = 1,
>>>> + .exclude_kernel = 1,
>>>> + .sample_simd_regs_enabled = 1,
>>>> + };
>>>> + int fd;
>>>> +
>>>> + attr.sample_period = 1;
>>>> +
>>>> + if (!pred) {
>>>> + attr.sample_simd_vec_reg_qwords = qwords;
>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>> + attr.sample_simd_vec_reg_intr = mask;
>>>> + else
>>>> + attr.sample_simd_vec_reg_user = mask;
>>>> + } else {
>>>> + attr.sample_simd_pred_reg_qwords = PERF_X86_OPMASK_QWORDS;
>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>> + attr.sample_simd_pred_reg_intr = PERF_X86_SIMD_PRED_MASK;
>>>> + else
>>>> + attr.sample_simd_pred_reg_user = PERF_X86_SIMD_PRED_MASK;
>>>> + }
>>>> +
>>>> + if (perf_pmus__num_core_pmus() > 1) {
>>>> + struct perf_pmu *pmu = NULL;
>>>> + __u64 type = PERF_TYPE_RAW;
>>> It should be okay to do:
>>> __u64 type = perf_pmus__find_core_pmu()->type
>>> rather than have the whole loop below.
>> Sure. Thanks.
>>
>>
>>>> +
>>>> + /*
>>>> + * The same register set is supported among different hybrid PMUs.
>>>> + * Only check the first available one.
>>>> + */
>>>> + while ((pmu = perf_pmus__scan_core(pmu)) != NULL) {
>>>> + type = pmu->type;
>>>> + break;
>>>> + }
>>>> + attr.config |= type << PERF_PMU_TYPE_SHIFT;
>>>> + }
>>>> +
>>>> + event_attr_init(&attr);
>>>> +
>>>> + fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
>>>> + if (fd != -1) {
>>>> + close(fd);
>>>> + return true;
>>>> + }
>>>> +
>>>> + return false;
>>>> +}
>>>> +
>>>> +static bool __arch_simd_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
>>>> +{
>>>> + bool supported = false;
>>>> + u64 bits;
>>>> +
>>>> + *mask = 0;
>>>> + *qwords = 0;
>>>> +
>>>> + switch (reg) {
>>>> + case PERF_REG_X86_XMM:
>>>> + bits = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
>>>> + supported = support_simd_reg(sample_type, PERF_X86_XMM_QWORDS, bits, false);
>>>> + if (supported) {
>>>> + *mask = bits;
>>>> + *qwords = PERF_X86_XMM_QWORDS;
>>>> + }
>>>> + break;
>>>> + case PERF_REG_X86_YMM:
>>>> + bits = BIT_ULL(PERF_X86_SIMD_YMM_REGS) - 1;
>>>> + supported = support_simd_reg(sample_type, PERF_X86_YMM_QWORDS, bits, false);
>>>> + if (supported) {
>>>> + *mask = bits;
>>>> + *qwords = PERF_X86_YMM_QWORDS;
>>>> + }
>>>> + break;
>>>> + case PERF_REG_X86_ZMM:
>>>> + bits = BIT_ULL(PERF_X86_SIMD_ZMM_REGS) - 1;
>>>> + supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
>>>> + if (supported) {
>>>> + *mask = bits;
>>>> + *qwords = PERF_X86_ZMM_QWORDS;
>>>> + break;
>>>> + }
>>>> +
>>>> + bits = BIT_ULL(PERF_X86_SIMD_ZMMH_REGS) - 1;
>>>> + supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
>>>> + if (supported) {
>>>> + *mask = bits;
>>>> + *qwords = PERF_X86_ZMMH_QWORDS;
>>>> + }
>>>> + break;
>>>> + default:
>>>> + break;
>>>> + }
>>>> +
>>>> + return supported;
>>>> +}
>>>> +
>>>> +static bool __arch_pred_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
>>>> +{
>>>> + bool supported = false;
>>>> + u64 bits;
>>>> +
>>>> + *mask = 0;
>>>> + *qwords = 0;
>>>> +
>>>> + switch (reg) {
>>>> + case PERF_REG_X86_OPMASK:
>>>> + bits = BIT_ULL(PERF_X86_SIMD_OPMASK_REGS) - 1;
>>>> + supported = support_simd_reg(sample_type, PERF_X86_OPMASK_QWORDS, bits, true);
>>>> + if (supported) {
>>>> + *mask = bits;
>>>> + *qwords = PERF_X86_OPMASK_QWORDS;
>>>> + }
>>>> + break;
>>>> + default:
>>>> + break;
>>>> + }
>>>> +
>>>> + return supported;
>>>> +}
>>>> +
>>>> +static bool has_cap_simd_regs(void)
>>>> +{
>>>> + uint64_t mask = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
>>>> + u16 qwords = PERF_X86_XMM_QWORDS;
>>>> + static bool has_cap_simd_regs;
>>>> + static bool cached;
>>>> +
>>>> + if (cached)
>>>> + return has_cap_simd_regs;
>>>> +
>>>> + has_cap_simd_regs = __arch_simd_reg_mask(PERF_SAMPLE_REGS_INTR,
>>>> + PERF_REG_X86_XMM, &mask, &qwords);
>>>> + has_cap_simd_regs |= __arch_simd_reg_mask(PERF_SAMPLE_REGS_USER,
>>>> + PERF_REG_X86_XMM, &mask, &qwords);
>>>> + cached = true;
>>>> +
>>>> + return has_cap_simd_regs;
>>>> +}
>>>> +
>>>> +bool arch_has_simd_regs(u64 mask)
>>>> +{
>>>> + return has_cap_simd_regs() &&
>>>> + mask & GENMASK_ULL(PERF_REG_X86_SSP, PERF_REG_X86_R16);
>>>> +}
>>>> +
>>>> +static const struct sample_reg sample_simd_reg_masks[] = {
>>>> + SMPL_REG(XMM, PERF_REG_X86_XMM),
>>>> + SMPL_REG(YMM, PERF_REG_X86_YMM),
>>>> + SMPL_REG(ZMM, PERF_REG_X86_ZMM),
>>>> + SMPL_REG_END
>>>> +};
>>>> +
>>>> +static const struct sample_reg sample_pred_reg_masks[] = {
>>>> + SMPL_REG(OPMASK, PERF_REG_X86_OPMASK),
>>>> + SMPL_REG_END
>>>> +};
>>>> +
>>>> +const struct sample_reg *arch__sample_simd_reg_masks(void)
>>>> +{
>>>> + return sample_simd_reg_masks;
>>>> +}
>>>> +
>>>> +const struct sample_reg *arch__sample_pred_reg_masks(void)
>>>> +{
>>>> + return sample_pred_reg_masks;
>>>> +}
>>>> +
>>>> +static bool x86_intr_simd_updated;
>>>> +static u64 x86_intr_simd_reg_mask;
>>>> +static u64 x86_intr_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
>>>> +static u16 x86_intr_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
>>> Could we add some comments? I can kind of figure out that "updated" is
>>> a check for lazy initialization and what the masks are; "qwords" is an
>>> odd one. The comment could also point out that SIMD here doesn't mean
>>> the machine supports SIMD, but that SIMD registers are supported in
>>> perf events.
>> Sure.
>>
>>
>>>> +static bool x86_user_simd_updated;
>>>> +static u64 x86_user_simd_reg_mask;
>>>> +static u64 x86_user_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
>>>> +static u16 x86_user_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
>>>> +
>>>> +static bool x86_intr_pred_updated;
>>>> +static u64 x86_intr_pred_reg_mask;
>>>> +static u64 x86_intr_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
>>>> +static u16 x86_intr_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
>>>> +static bool x86_user_pred_updated;
>>>> +static u64 x86_user_pred_reg_mask;
>>>> +static u64 x86_user_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
>>>> +static u16 x86_user_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
>>>> +
>>>> +static uint64_t __arch__simd_reg_mask(u64 sample_type)
>>>> +{
>>>> + const struct sample_reg *r = NULL;
>>>> + bool supported;
>>>> + u64 mask = 0;
>>>> + int reg;
>>>> +
>>>> + if (!has_cap_simd_regs())
>>>> + return 0;
>>>> +
>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_simd_updated)
>>>> + return x86_intr_simd_reg_mask;
>>>> +
>>>> + if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_simd_updated)
>>>> + return x86_user_simd_reg_mask;
>>>> +
>>>> + for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>>>> + supported = false;
>>>> +
>>>> + if (!r->mask)
>>>> + continue;
>>>> + reg = fls64(r->mask) - 1;
>>>> +
>>>> + if (reg >= PERF_REG_X86_MAX_SIMD_REGS)
>>>> + break;
>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>> + supported = __arch_simd_reg_mask(sample_type, reg,
>>>> + &x86_intr_simd_mask[reg],
>>>> + &x86_intr_simd_qwords[reg]);
>>>> + else if (sample_type == PERF_SAMPLE_REGS_USER)
>>>> + supported = __arch_simd_reg_mask(sample_type, reg,
>>>> + &x86_user_simd_mask[reg],
>>>> + &x86_user_simd_qwords[reg]);
>>>> + if (supported)
>>>> + mask |= BIT_ULL(reg);
>>>> + }
>>>> +
>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR) {
>>>> + x86_intr_simd_reg_mask = mask;
>>>> + x86_intr_simd_updated = true;
>>>> + } else {
>>>> + x86_user_simd_reg_mask = mask;
>>>> + x86_user_simd_updated = true;
>>>> + }
>>>> +
>>>> + return mask;
>>>> +}
>>>> +
>>>> +static uint64_t __arch__pred_reg_mask(u64 sample_type)
>>>> +{
>>>> + const struct sample_reg *r = NULL;
>>>> + bool supported;
>>>> + u64 mask = 0;
>>>> + int reg;
>>>> +
>>>> + if (!has_cap_simd_regs())
>>>> + return 0;
>>>> +
>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_pred_updated)
>>>> + return x86_intr_pred_reg_mask;
>>>> +
>>>> + if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_pred_updated)
>>>> + return x86_user_pred_reg_mask;
>>>> +
>>>> + for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>>>> + supported = false;
>>>> +
>>>> + if (!r->mask)
>>>> + continue;
>>>> + reg = fls64(r->mask) - 1;
>>>> +
>>>> + if (reg >= PERF_REG_X86_MAX_PRED_REGS)
>>>> + break;
>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>> + supported = __arch_pred_reg_mask(sample_type, reg,
>>>> + &x86_intr_pred_mask[reg],
>>>> + &x86_intr_pred_qwords[reg]);
>>>> + else if (sample_type == PERF_SAMPLE_REGS_USER)
>>>> + supported = __arch_pred_reg_mask(sample_type, reg,
>>>> + &x86_user_pred_mask[reg],
>>>> + &x86_user_pred_qwords[reg]);
>>>> + if (supported)
>>>> + mask |= BIT_ULL(reg);
>>>> + }
>>>> +
>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR) {
>>>> + x86_intr_pred_reg_mask = mask;
>>>> + x86_intr_pred_updated = true;
>>>> + } else {
>>>> + x86_user_pred_reg_mask = mask;
>>>> + x86_user_pred_updated = true;
>>>> + }
>>>> +
>>>> + return mask;
>>>> +}
>>> This feels repetitive with __arch__simd_reg_mask, could they be
>>> refactored together?
>> hmm, it looks like we can extract the for loop into a common function.
>> The other parts are hard to generalize since they manipulate different
>> variables. If we want to generalize them, we have to introduce lots of
>> "if ... else" branches, and that would make the code hard to read.
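>>
>> Roughly, only the loop itself seems extractable -- an untested sketch,
>> with the arrays and the probe callback (__arch_simd_reg_mask() or
>> __arch_pred_reg_mask()) chosen by the caller:
>>
>> static u64 __reg_mask_common(const struct sample_reg *tbl, int max_regs,
>> 			     u64 sample_type, u64 *masks, u16 *qwords,
>> 			     bool (*probe)(u64, int, uint64_t *, u16 *))
>> {
>> 	const struct sample_reg *r;
>> 	u64 mask = 0;
>> 	int reg;
>>
>> 	for (r = tbl; r->name; r++) {
>> 		if (!r->mask)
>> 			continue;
>> 		reg = fls64(r->mask) - 1;
>> 		if (reg >= max_regs)
>> 			break;
>> 		/* per-register mask/qwords land in the caller's arrays */
>> 		if (probe(sample_type, reg, &masks[reg], &qwords[reg]))
>> 			mask |= BIT_ULL(reg);
>> 	}
>>
>> 	return mask;
>> }
>>
>> That would keep the intr/user "if ... else" noise in the thin wrappers.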
>>
>>
>>>> +
>>>> +uint64_t arch__intr_simd_reg_mask(void)
>>>> +{
>>>> + return __arch__simd_reg_mask(PERF_SAMPLE_REGS_INTR);
>>>> +}
>>>> +
>>>> +uint64_t arch__user_simd_reg_mask(void)
>>>> +{
>>>> + return __arch__simd_reg_mask(PERF_SAMPLE_REGS_USER);
>>>> +}
>>>> +
>>>> +uint64_t arch__intr_pred_reg_mask(void)
>>>> +{
>>>> + return __arch__pred_reg_mask(PERF_SAMPLE_REGS_INTR);
>>>> +}
>>>> +
>>>> +uint64_t arch__user_pred_reg_mask(void)
>>>> +{
>>>> + return __arch__pred_reg_mask(PERF_SAMPLE_REGS_USER);
>>>> +}
>>>> +
>>>> +static uint64_t arch__simd_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
>>>> +{
>>>> + uint64_t mask = 0;
>>>> +
>>>> + *qwords = 0;
>>>> + if (reg < PERF_REG_X86_MAX_SIMD_REGS) {
>>>> + if (intr) {
>>>> + *qwords = x86_intr_simd_qwords[reg];
>>>> + mask = x86_intr_simd_mask[reg];
>>>> + } else {
>>>> + *qwords = x86_user_simd_qwords[reg];
>>>> + mask = x86_user_simd_mask[reg];
>>>> + }
>>>> + }
>>>> +
>>>> + return mask;
>>>> +}
>>>> +
>>>> +static uint64_t arch__pred_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
>>>> +{
>>>> + uint64_t mask = 0;
>>>> +
>>>> + *qwords = 0;
>>>> + if (reg < PERF_REG_X86_MAX_PRED_REGS) {
>>>> + if (intr) {
>>>> + *qwords = x86_intr_pred_qwords[reg];
>>>> + mask = x86_intr_pred_mask[reg];
>>>> + } else {
>>>> + *qwords = x86_user_pred_qwords[reg];
>>>> + mask = x86_user_pred_mask[reg];
>>>> + }
>>>> + }
>>>> +
>>>> + return mask;
>>>> +}
>>>> +
>>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords)
>>>> +{
>>>> + if (!x86_intr_simd_updated)
>>>> + arch__intr_simd_reg_mask();
>>>> + return arch__simd_reg_bitmap_qwords(reg, qwords, true);
>>>> +}
>>>> +
>>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords)
>>>> +{
>>>> + if (!x86_user_simd_updated)
>>>> + arch__user_simd_reg_mask();
>>>> + return arch__simd_reg_bitmap_qwords(reg, qwords, false);
>>>> +}
>>>> +
>>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords)
>>>> +{
>>>> + if (!x86_intr_pred_updated)
>>>> + arch__intr_pred_reg_mask();
>>>> + return arch__pred_reg_bitmap_qwords(reg, qwords, true);
>>>> +}
>>>> +
>>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords)
>>>> +{
>>>> + if (!x86_user_pred_updated)
>>>> + arch__user_pred_reg_mask();
>>>> + return arch__pred_reg_bitmap_qwords(reg, qwords, false);
>>>> +}
>>>> +
>>>> const struct sample_reg *arch__sample_reg_masks(void)
>>>> {
>>>> + if (has_cap_simd_regs())
>>>> + return sample_reg_masks_ext;
>>>> return sample_reg_masks;
>>>> }
>>>>
>>>> -uint64_t arch__intr_reg_mask(void)
>>>> +static uint64_t __arch__reg_mask(u64 sample_type, u64 mask, bool has_simd_regs)
>>>> {
>>>> struct perf_event_attr attr = {
>>>> - .type = PERF_TYPE_HARDWARE,
>>>> - .config = PERF_COUNT_HW_CPU_CYCLES,
>>>> - .sample_type = PERF_SAMPLE_REGS_INTR,
>>>> - .sample_regs_intr = PERF_REG_EXTENDED_MASK,
>>>> - .precise_ip = 1,
>>>> - .disabled = 1,
>>>> - .exclude_kernel = 1,
>>>> + .type = PERF_TYPE_HARDWARE,
>>>> + .config = PERF_COUNT_HW_CPU_CYCLES,
>>>> + .sample_type = sample_type,
>>>> + .precise_ip = 1,
>>>> + .disabled = 1,
>>>> + .exclude_kernel = 1,
>>>> + .sample_simd_regs_enabled = has_simd_regs,
>>>> };
>>>> int fd;
>>>> /*
>>>> * In an unnamed union, init it here to build on older gcc versions
>>>> */
>>>> attr.sample_period = 1;
>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>> + attr.sample_regs_intr = mask;
>>>> + else
>>>> + attr.sample_regs_user = mask;
>>>>
>>>> if (perf_pmus__num_core_pmus() > 1) {
>>>> struct perf_pmu *pmu = NULL;
>>>> @@ -318,13 +738,41 @@ uint64_t arch__intr_reg_mask(void)
>>>> fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
>>>> if (fd != -1) {
>>>> close(fd);
>>>> - return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
>>>> + return mask;
>>>> }
>>>>
>>>> - return PERF_REGS_MASK;
>>>> + return 0;
>>>> +}
>>>> +
>>>> +uint64_t arch__intr_reg_mask(void)
>>>> +{
>>>> + uint64_t mask = PERF_REGS_MASK;
>>>> +
>>>> + if (has_cap_simd_regs()) {
>>>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
>>>> + GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
>>>> + true);
>>> It's nice to label constant arguments like this with something like:
>>> /*has_simd_regs=*/true);
>>>
>>> Tools like clang-tidy even try to enforce that the argument names match
>>> the comments.
>> Sure.
>>
>>
>>>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
>>>> + BIT_ULL(PERF_REG_X86_SSP),
>>>> + true);
>>>> + } else
>>>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR, PERF_REG_EXTENDED_MASK, false);
>>>> +
>>>> + return mask;
>>>> }
>>>>
>>>> uint64_t arch__user_reg_mask(void)
>>>> {
>>>> - return PERF_REGS_MASK;
>>>> + uint64_t mask = PERF_REGS_MASK;
>>>> +
>>>> + if (has_cap_simd_regs()) {
>>>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
>>>> + GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
>>>> + true);
>>>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
>>>> + BIT_ULL(PERF_REG_X86_SSP),
>>>> + true);
>>>> + }
>>>> +
>>>> + return mask;
>>> The code is repetitive here, could we refactor into a single function
>>> passing in a user or intr value?
>> Sure. Would extract the common part.
>>
>>
>>>> }
>>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>>>> index 56ebefd075f2..5d1d90cf9488 100644
>>>> --- a/tools/perf/util/evsel.c
>>>> +++ b/tools/perf/util/evsel.c
>>>> @@ -1461,12 +1461,39 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
>>>> if (opts->sample_intr_regs && !evsel->no_aux_samples &&
>>>> !evsel__is_dummy_event(evsel)) {
>>>> attr->sample_regs_intr = opts->sample_intr_regs;
>>>> + attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_intr);
>>>> + evsel__set_sample_bit(evsel, REGS_INTR);
>>>> + }
>>>> +
>>>> + if ((opts->sample_intr_vec_regs || opts->sample_intr_pred_regs) &&
>>>> + !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
>>>> + /* A non-zero pred qwords implies that the SIMD register set is used */
>>>> + if (opts->sample_pred_regs_qwords)
>>>> + attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
>>>> + else
>>>> + attr->sample_simd_pred_reg_qwords = 1;
>>>> + attr->sample_simd_vec_reg_intr = opts->sample_intr_vec_regs;
>>>> + attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
>>>> + attr->sample_simd_pred_reg_intr = opts->sample_intr_pred_regs;
>>>> evsel__set_sample_bit(evsel, REGS_INTR);
>>>> }
>>>>
>>>> if (opts->sample_user_regs && !evsel->no_aux_samples &&
>>>> !evsel__is_dummy_event(evsel)) {
>>>> attr->sample_regs_user |= opts->sample_user_regs;
>>>> + attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_user);
>>>> + evsel__set_sample_bit(evsel, REGS_USER);
>>>> + }
>>>> +
>>>> + if ((opts->sample_user_vec_regs || opts->sample_user_pred_regs) &&
>>>> + !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
>>>> + if (opts->sample_pred_regs_qwords)
>>>> + attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
>>>> + else
>>>> + attr->sample_simd_pred_reg_qwords = 1;
>>>> + attr->sample_simd_vec_reg_user = opts->sample_user_vec_regs;
>>>> + attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
>>>> + attr->sample_simd_pred_reg_user = opts->sample_user_pred_regs;
>>>> evsel__set_sample_bit(evsel, REGS_USER);
>>>> }
>>>>
>>>> diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
>>>> index cda1c620968e..0bd100392889 100644
>>>> --- a/tools/perf/util/parse-regs-options.c
>>>> +++ b/tools/perf/util/parse-regs-options.c
>>>> @@ -4,19 +4,139 @@
>>>> #include <stdint.h>
>>>> #include <string.h>
>>>> #include <stdio.h>
>>>> +#include <linux/bitops.h>
>>>> #include "util/debug.h"
>>>> #include <subcmd/parse-options.h>
>>>> #include "util/perf_regs.h"
>>>> #include "util/parse-regs-options.h"
>>>> +#include "record.h"
>>>> +
>>>> +static void __print_simd_regs(bool intr, uint64_t simd_mask)
>>>> +{
>>>> + const struct sample_reg *r = NULL;
>>>> + uint64_t bitmap = 0;
>>>> + u16 qwords = 0;
>>>> + int reg_idx;
>>>> +
>>>> + if (!simd_mask)
>>>> + return;
>>>> +
>>>> + for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>>>> + if (!(r->mask & simd_mask))
>>>> + continue;
>>>> + reg_idx = fls64(r->mask) - 1;
>>>> + if (intr)
>>>> + bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>> + else
>>>> + bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>> + if (bitmap)
>>>> + fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
>>>> + }
>>>> +}
>>>> +
>>>> +static void __print_pred_regs(bool intr, uint64_t pred_mask)
>>>> +{
>>>> + const struct sample_reg *r = NULL;
>>>> + uint64_t bitmap = 0;
>>>> + u16 qwords = 0;
>>>> + int reg_idx;
>>>> +
>>>> + if (!pred_mask)
>>>> + return;
>>>> +
>>>> + for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>>>> + if (!(r->mask & pred_mask))
>>>> + continue;
>>>> + reg_idx = fls64(r->mask) - 1;
>>>> + if (intr)
>>>> + bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>> + else
>>>> + bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>> + if (bitmap)
>>>> + fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
>>>> + }
>>>> +}
>>>> +
>>>> +static bool __parse_simd_regs(struct record_opts *opts, char *s, bool intr)
>>>> +{
>>>> + const struct sample_reg *r = NULL;
>>>> + bool matched = false;
>>>> + uint64_t bitmap = 0;
>>>> + u16 qwords = 0;
>>>> + int reg_idx;
>>>> +
>>>> + for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>>>> + if (strcasecmp(s, r->name))
>>>> + continue;
>>>> + if (!fls64(r->mask))
>>>> + continue;
>>>> + reg_idx = fls64(r->mask) - 1;
>>>> + if (intr)
>>>> + bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>> + else
>>>> + bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>> + matched = true;
>>>> + break;
>>>> + }
>>>> +
>>>> + /* Just need the highest qwords */
>>> I'm not following here. Does the bitmap need to handle gaps?
>> Currently no. In theory, the kernel allows user space to sample only a
>> subset of the SIMD registers, e.g., a mask of 0xff (XMM0-XMM7) or 0xf0f
>> for the XMM registers (HW supports 16 XMM registers on x86-64), but
>> that isn't supported, to avoid introducing too much complexity in the
>> perf tools. Moreover, I don't think end users have such a requirement.
>> In most cases, users only know which kinds of SIMD registers their
>> programs use but usually don't know or care about which exact SIMD
>> register is used.
>>
>>
>>>> + if (qwords > opts->sample_vec_regs_qwords) {
>>>> + opts->sample_vec_regs_qwords = qwords;
>>>> + if (intr)
>>>> + opts->sample_intr_vec_regs = bitmap;
>>>> + else
>>>> + opts->sample_user_vec_regs = bitmap;
>>>> + }
>>>> +
>>>> + return matched;
>>>> +}
>>>> +
>>>> +static bool __parse_pred_regs(struct record_opts *opts, char *s, bool intr)
>>>> +{
>>>> + const struct sample_reg *r = NULL;
>>>> + bool matched = false;
>>>> + uint64_t bitmap = 0;
>>>> + u16 qwords = 0;
>>>> + int reg_idx;
>>>> +
>>>> + for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>>>> + if (strcasecmp(s, r->name))
>>>> + continue;
>>>> + if (!fls64(r->mask))
>>>> + continue;
>>>> + reg_idx = fls64(r->mask) - 1;
>>>> + if (intr)
>>>> + bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>> + else
>>>> + bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>> + matched = true;
>>>> + break;
>>>> + }
>>>> +
>>>> + /* Just need the highest qwords */
>>> Again repetitive, could we have a single function?
>> Yes, I suppose the for loop at least can be extracted as a common function.
>>
>>
>>>> + if (qwords > opts->sample_pred_regs_qwords) {
>>>> + opts->sample_pred_regs_qwords = qwords;
>>>> + if (intr)
>>>> + opts->sample_intr_pred_regs = bitmap;
>>>> + else
>>>> + opts->sample_user_pred_regs = bitmap;
>>>> + }
>>>> +
>>>> + return matched;
>>>> +}
>>>>
>>>> static int
>>>> __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>> {
>>>> uint64_t *mode = (uint64_t *)opt->value;
>>>> const struct sample_reg *r = NULL;
>>>> + struct record_opts *opts;
>>>> char *s, *os = NULL, *p;
>>>> - int ret = -1;
>>>> + bool has_simd_regs = false;
>>>> uint64_t mask;
>>>> + uint64_t simd_mask;
>>>> + uint64_t pred_mask;
>>>> + int ret = -1;
>>>>
>>>> if (unset)
>>>> return 0;
>>>> @@ -27,10 +147,17 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>> if (*mode)
>>>> return -1;
>>>>
>>>> - if (intr)
>>>> + if (intr) {
>>>> + opts = container_of(opt->value, struct record_opts, sample_intr_regs);
>>>> mask = arch__intr_reg_mask();
>>>> - else
>>>> + simd_mask = arch__intr_simd_reg_mask();
>>>> + pred_mask = arch__intr_pred_reg_mask();
>>>> + } else {
>>>> + opts = container_of(opt->value, struct record_opts, sample_user_regs);
>>>> mask = arch__user_reg_mask();
>>>> + simd_mask = arch__user_simd_reg_mask();
>>>> + pred_mask = arch__user_pred_reg_mask();
>>>> + }
>>>>
>>>> /* str may be NULL in case no arg is passed to -I */
>>>> if (str) {
>>>> @@ -50,10 +177,24 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>> if (r->mask & mask)
>>>> fprintf(stderr, "%s ", r->name);
>>>> }
>>>> + __print_simd_regs(intr, simd_mask);
>>>> + __print_pred_regs(intr, pred_mask);
>>>> fputc('\n', stderr);
>>>> /* just printing available regs */
>>>> goto error;
>>>> }
>>>> +
>>>> + if (simd_mask) {
>>>> + has_simd_regs = __parse_simd_regs(opts, s, intr);
>>>> + if (has_simd_regs)
>>>> + goto next;
>>>> + }
>>>> + if (pred_mask) {
>>>> + has_simd_regs = __parse_pred_regs(opts, s, intr);
>>>> + if (has_simd_regs)
>>>> + goto next;
>>>> + }
>>>> +
>>>> for (r = arch__sample_reg_masks(); r->name; r++) {
>>>> if ((r->mask & mask) && !strcasecmp(s, r->name))
>>>> break;
>>>> @@ -65,7 +206,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>> }
>>>>
>>>> *mode |= r->mask;
>>>> -
>>>> +next:
>>>> if (!p)
>>>> break;
>>>>
>>>> @@ -75,7 +216,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>> ret = 0;
>>>>
>>>> /* default to all possible regs */
>>>> - if (*mode == 0)
>>>> + if (*mode == 0 && !has_simd_regs)
>>>> *mode = mask;
>>>> error:
>>>> free(os);
>>>> diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
>>>> index 66b666d9ce64..fb0366d050cf 100644
>>>> --- a/tools/perf/util/perf_event_attr_fprintf.c
>>>> +++ b/tools/perf/util/perf_event_attr_fprintf.c
>>>> @@ -360,6 +360,12 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
>>>> PRINT_ATTRf(aux_start_paused, p_unsigned);
>>>> PRINT_ATTRf(aux_pause, p_unsigned);
>>>> PRINT_ATTRf(aux_resume, p_unsigned);
>>>> + PRINT_ATTRf(sample_simd_pred_reg_qwords, p_unsigned);
>>>> + PRINT_ATTRf(sample_simd_pred_reg_intr, p_hex);
>>>> + PRINT_ATTRf(sample_simd_pred_reg_user, p_hex);
>>>> + PRINT_ATTRf(sample_simd_vec_reg_qwords, p_unsigned);
>>>> + PRINT_ATTRf(sample_simd_vec_reg_intr, p_hex);
>>>> + PRINT_ATTRf(sample_simd_vec_reg_user, p_hex);
>>>>
>>>> return ret;
>>>> }
>>>> diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
>>>> index 44b90bbf2d07..e8a9fabc92e6 100644
>>>> --- a/tools/perf/util/perf_regs.c
>>>> +++ b/tools/perf/util/perf_regs.c
>>>> @@ -11,6 +11,11 @@ int __weak arch_sdt_arg_parse_op(char *old_op __maybe_unused,
>>>> return SDT_ARG_SKIP;
>>>> }
>>>>
>>>> +bool __weak arch_has_simd_regs(u64 mask __maybe_unused)
>>>> +{
>>>> + return false;
>>>> +}
>>>> +
>>>> uint64_t __weak arch__intr_reg_mask(void)
>>>> {
>>>> return 0;
>>>> @@ -21,6 +26,50 @@ uint64_t __weak arch__user_reg_mask(void)
>>>> return 0;
>>>> }
>>>>
>>>> +uint64_t __weak arch__intr_simd_reg_mask(void)
>>>> +{
>>>> + return 0;
>>>> +}
>>>> +
>>>> +uint64_t __weak arch__user_simd_reg_mask(void)
>>>> +{
>>>> + return 0;
>>>> +}
>>>> +
>>>> +uint64_t __weak arch__intr_pred_reg_mask(void)
>>>> +{
>>>> + return 0;
>>>> +}
>>>> +
>>>> +uint64_t __weak arch__user_pred_reg_mask(void)
>>>> +{
>>>> + return 0;
>>>> +}
>>>> +
>>>> +uint64_t __weak arch__intr_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>>>> +{
>>>> + *qwords = 0;
>>>> + return 0;
>>>> +}
>>>> +
>>>> +uint64_t __weak arch__user_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>>>> +{
>>>> + *qwords = 0;
>>>> + return 0;
>>>> +}
>>>> +
>>>> +uint64_t __weak arch__intr_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>>>> +{
>>>> + *qwords = 0;
>>>> + return 0;
>>>> +}
>>>> +
>>>> +uint64_t __weak arch__user_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>>>> +{
>>>> + *qwords = 0;
>>>> + return 0;
>>>> +}
>>>> +
>>>> static const struct sample_reg sample_reg_masks[] = {
>>>> SMPL_REG_END
>>>> };
>>>> @@ -30,6 +79,16 @@ const struct sample_reg * __weak arch__sample_reg_masks(void)
>>>> return sample_reg_masks;
>>>> }
>>>>
>>>> +const struct sample_reg * __weak arch__sample_simd_reg_masks(void)
>>>> +{
>>>> + return sample_reg_masks;
>>>> +}
>>>> +
>>>> +const struct sample_reg * __weak arch__sample_pred_reg_masks(void)
>>>> +{
>>>> + return sample_reg_masks;
>>>> +}
>>> Thinking out loud. I wonder if there is a way to hide the weak
>>> functions. It seems the support is tied to PMUs, particularly core
>>> PMUs, perhaps we can push things into pmu and arch pmu code. Then we
>>> ask the PMU to parse the register strings, set up the perf_event_attr,
>>> etc. I'm somewhat scared these functions will be used on the report
>>> rather than record side of things, thereby breaking perf.data support
>>> when the host kernel does or doesn't have the SIMD support.
>> Ian, I don't quite follow.
>>
>> I don't quite understand how we should "push things into pmu and arch
>> pmu code". The current SIMD register support follows the same approach
>> as the general register support. If we intend to change the approach
>> entirely, we'd better do that in an independent patch-set.
>>
>> Why would these functions break the perf.data report? perf-report
>> checks whether the PERF_SAMPLE_REGS_ABI_SIMD flag is set for each
>> record; only when the flag is set (indicating that SIMD register data
>> is appended to the record) does perf-report try to parse the SIMD
>> register data.
> Thanks Dapeng, sorry I wasn't clear. So, I've landed clean ups to
> remove weak symbols like:
> https://lore.kernel.org/lkml/20250724163302.596743-21-irogers@google.com/#t
>
> For these patches what I'm imagining is that there is a Nova Lake
> generated perf.data file. Using perf report, script, etc. on the Nova
> Lake should expose all of the same mask, qword, etc. values as when
> the perf.data was generated and so things will work. If the perf.data
> file was taken to, say, my Alderlake, then what will happen? Generally
> using the arch directory and weak symbols is a code smell that cross
> platform things are going to break - there should be sufficient data
> in the event and the perf_event_attr to fully decode what's going on.
> Sometimes tying things to a PMU name can avoid the use of the arch
> directory. We were able to avoid the arch directory to a good extent
> for the TPEBS code, even though it is a very modern Intel feature.
I see.
But the sampling support for SIMD registers is different from the sample
weight processing in the patch
https://lore.kernel.org/lkml/20250724163302.596743-21-irogers@google.com/#t.
Each arch may support different kinds of SIMD registers, and furthermore
each kind of SIMD register may have a different register count and
register width. It's quite hard to come up with common functions or
fields to represent the names and attributes of these arch-specific SIMD
registers. This arch-specific information can only be provided by the
arch-specific code, so it looks like the __weak functions are still the
easiest way to implement this.
I don't think the perf.data parsing would break when moving from one
platform to another (same arch), e.g., from Nova Lake to Alder Lake. To
indicate the presence of SIMD registers in the record data, a new ABI
flag "PERF_SAMPLE_REGS_ABI_SIMD" is introduced. If the perf tool on the
second platform is new enough to recognize this new flag, the SIMD
register data is parsed correctly. Even if the perf tool is old and has
no support for SIMD registers, the SIMD register data is just silently
ignored and should not break the parsing.
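As a rough illustration of the consumer side (an untested sketch: the
parser type, the helpers, and the flag's bit position are made up here,
only the PERF_SAMPLE_REGS_ABI_SIMD name comes from this series; bounds
checks omitted):

	#include <stdint.h>
	#include <stddef.h>

	#define PERF_SAMPLE_REGS_ABI_SIMD  (1ULL << 2)	/* bit is illustrative */

	struct parser {
		const uint64_t *pos;	/* cursor into the sample payload */
		size_t left;		/* remaining u64 slots in the record */
	};

	static uint64_t parse_u64(struct parser *p)
	{
		p->left--;
		return *p->pos++;
	}

	static void parse_sample_regs(struct parser *p, int n_gprs,
				      size_t simd_qwords)
	{
		uint64_t abi = parse_u64(p);
		int i;

		for (i = 0; i < n_gprs; i++)
			parse_u64(p);		/* general purpose registers */

		/*
		 * An old tool never tests the new flag and stops here; the
		 * trailing SIMD payload stays within the record's size, so
		 * (as said above) it is skipped rather than misparsed.
		 */
		if (!(abi & PERF_SAMPLE_REGS_ABI_SIMD))
			return;

		while (simd_qwords--)
			parse_u64(p);		/* decode the SIMD registers */
	}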
>
> Thanks,
> Ian
>
>
>
>>> Thanks,
>>> Ian
>>>
>>>> +
>>>> const char *perf_reg_name(int id, const char *arch)
>>>> {
>>>> const char *reg_name = NULL;
>>>> diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
>>>> index f2d0736d65cc..bce9c4cfd1bf 100644
>>>> --- a/tools/perf/util/perf_regs.h
>>>> +++ b/tools/perf/util/perf_regs.h
>>>> @@ -24,9 +24,20 @@ enum {
>>>> };
>>>>
>>>> int arch_sdt_arg_parse_op(char *old_op, char **new_op);
>>>> +bool arch_has_simd_regs(u64 mask);
>>>> uint64_t arch__intr_reg_mask(void);
>>>> uint64_t arch__user_reg_mask(void);
>>>> const struct sample_reg *arch__sample_reg_masks(void);
>>>> +const struct sample_reg *arch__sample_simd_reg_masks(void);
>>>> +const struct sample_reg *arch__sample_pred_reg_masks(void);
>>>> +uint64_t arch__intr_simd_reg_mask(void);
>>>> +uint64_t arch__user_simd_reg_mask(void);
>>>> +uint64_t arch__intr_pred_reg_mask(void);
>>>> +uint64_t arch__user_pred_reg_mask(void);
>>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>
>>>> const char *perf_reg_name(int id, const char *arch);
>>>> int perf_reg_value(u64 *valp, struct regs_dump *regs, int id);
>>>> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
>>>> index ea3a6c4657ee..825ffb4cc53f 100644
>>>> --- a/tools/perf/util/record.h
>>>> +++ b/tools/perf/util/record.h
>>>> @@ -59,7 +59,13 @@ struct record_opts {
>>>> unsigned int user_freq;
>>>> u64 branch_stack;
>>>> u64 sample_intr_regs;
>>>> + u64 sample_intr_vec_regs;
>>>> u64 sample_user_regs;
>>>> + u64 sample_user_vec_regs;
>>>> + u16 sample_pred_regs_qwords;
>>>> + u16 sample_vec_regs_qwords;
>>>> + u16 sample_intr_pred_regs;
>>>> + u16 sample_user_pred_regs;
>>>> u64 default_interval;
>>>> u64 user_interval;
>>>> size_t auxtrace_snapshot_size;
>>>> --
>>>> 2.34.1
>>>>
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [Patch v5 06/19] perf/x86: Add support for XMM registers in non-PEBS and REGS_USER
2025-12-03 6:54 ` [Patch v5 06/19] perf/x86: Add support for XMM registers in non-PEBS and REGS_USER Dapeng Mi
@ 2025-12-04 15:17 ` Peter Zijlstra
2025-12-04 15:47 ` Peter Zijlstra
2025-12-04 18:59 ` Dave Hansen
0 siblings, 2 replies; 55+ messages in thread
From: Peter Zijlstra @ 2025-12-04 15:17 UTC (permalink / raw)
To: Dapeng Mi
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang
On Wed, Dec 03, 2025 at 02:54:47PM +0800, Dapeng Mi wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
>
> While collecting XMM registers in a PEBS record has been supported since
> Icelake, non-PEBS events have lacked this feature. By leveraging the
> xsaves instruction, it is now possible to snapshot XMM registers for
> non-PEBS events, completing the feature set.
>
> To utilize the xsaves instruction, a 64-byte aligned buffer is required.
> A per-CPU ext_regs_buf is added to store SIMD and other registers, with
> the buffer size being approximately 2K. The buffer is allocated using
> kzalloc_node(), ensuring natural alignment and 64-byte alignment for all
> kmalloc() allocations with powers of 2.
>
> The XMM sampling support is extended for both REGS_USER and REGS_INTR.
> For REGS_USER, perf_get_regs_user() returns the registers from
> task_pt_regs(current), which is a pt_regs structure. It needs to be
> copied to user space secific x86_user_regs structure since kernel may
> modify pt_regs structure later.
>
> For PEBS, XMM registers are retrieved from PEBS records.
>
> In cases where userspace tasks are trapped within kernel mode (e.g.,
> during a syscall) when an NMI arrives, pt_regs information can still be
> retrieved from task_pt_regs(). However, capturing SIMD and other
> xsave-based registers in this scenario is challenging. Therefore,
> snapshots for these registers are omitted in such cases.
>
> The reasons are:
> - Profiling a userspace task that requires SIMD/eGPR registers typically
> involves NMIs hitting userspace, not kernel mode.
> - Although it is possible to retrieve values when the TIF_NEED_FPU_LOAD
> flag is set, the complexity introduced to handle this uncommon case in
> the critical path is not justified.
> - Additionally, checking the TIF_NEED_FPU_LOAD flag alone is insufficient.
> Some corner cases, such as an NMI occurring just after the flag switches
> but still in kernel mode, cannot be handled.
Urgh.. Dave, Thomas, is there any reason we could not set
TIF_NEED_FPU_LOAD *after* doing the XSAVE (clearing is already done
after restore)?
That way, when an NMI sees TIF_NEED_FPU_LOAD it knows the task copy is
consistent.
I'm not at all sure this is complex, it just needs a little care.
And then there is the deferred thing, just like unwind, we can defer
REGS_USER/STACK_USER much the same, except someone went and built all
that deferred stuff with unwind all tangled into it :/
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [Patch v5 06/19] perf/x86: Add support for XMM registers in non-PEBS and REGS_USER
2025-12-04 15:17 ` Peter Zijlstra
@ 2025-12-04 15:47 ` Peter Zijlstra
2025-12-05 6:37 ` Mi, Dapeng
2025-12-04 18:59 ` Dave Hansen
1 sibling, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2025-12-04 15:47 UTC (permalink / raw)
To: Dapeng Mi
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang
On Thu, Dec 04, 2025 at 04:17:35PM +0100, Peter Zijlstra wrote:
> On Wed, Dec 03, 2025 at 02:54:47PM +0800, Dapeng Mi wrote:
> > From: Kan Liang <kan.liang@linux.intel.com>
> >
> > While collecting XMM registers in a PEBS record has been supported since
> > Icelake, non-PEBS events have lacked this feature. By leveraging the
> > xsaves instruction, it is now possible to snapshot XMM registers for
> > non-PEBS events, completing the feature set.
> >
> > To utilize the xsaves instruction, a 64-byte aligned buffer is required.
> > A per-CPU ext_regs_buf is added to store SIMD and other registers, with
> > the buffer size being approximately 2K. The buffer is allocated using
> > kzalloc_node(), ensuring natural alignment and 64-byte alignment for all
> > kmalloc() allocations with powers of 2.
> >
> > The XMM sampling support is extended for both REGS_USER and REGS_INTR.
> > For REGS_USER, perf_get_regs_user() returns the registers from
> > task_pt_regs(current), which is a pt_regs structure. It needs to be
> > copied to the user space specific x86_user_regs structure since the
> > kernel may modify the pt_regs structure later.
> >
> > For PEBS, XMM registers are retrieved from PEBS records.
> >
> > In cases where userspace tasks are trapped within kernel mode (e.g.,
> > during a syscall) when an NMI arrives, pt_regs information can still be
> > retrieved from task_pt_regs(). However, capturing SIMD and other
> > xsave-based registers in this scenario is challenging. Therefore,
> > snapshots for these registers are omitted in such cases.
> >
> > The reasons are:
> > - Profiling a userspace task that requires SIMD/eGPR registers typically
> > involves NMIs hitting userspace, not kernel mode.
> > - Although it is possible to retrieve values when the TIF_NEED_FPU_LOAD
> > flag is set, the complexity introduced to handle this uncommon case in
> > the critical path is not justified.
> > - Additionally, checking the TIF_NEED_FPU_LOAD flag alone is insufficient.
> > Some corner cases, such as an NMI occurring just after the flag switches
> > but still in kernel mode, cannot be handled.
>
> Urgh.. Dave, Thomas, is there any reason we could not set
> TIF_NEED_FPU_LOAD *after* doing the XSAVE (clearing is already done
> after restore)?
>
> That way, when an NMI sees TIF_NEED_FPU_LOAD it knows the task copy is
> consistent.
>
> I'm not at all sure this is complex, it just needs a little care.
>
> And then there is the deferred thing, just like unwind, we can defer
> REGS_USER/STACK_USER much the same, except someone went and built all
> that deferred stuff with unwind all tangled into it :/
With something like the below, the NMI could do something like:
	struct xregs_state *xr = NULL;

	/*
	 * fpu code does:
	 *   XSAVE
	 *   set_thread_flag(TIF_NEED_FPU_LOAD)
	 *   ...
	 *   XRSTOR
	 *   clear_thread_flag(TIF_NEED_FPU_LOAD)
	 *
	 * therefore, when TIF_NEED_FPU_LOAD, the task fpu state holds a
	 * whole copy.
	 */
	if (test_thread_flag(TIF_NEED_FPU_LOAD)) {
		struct fpu *fpu = x86_task_fpu(current);
		/*
		 * If __task_fpstate is set, it holds the right pointer,
		 * otherwise fpstate will.
		 */
		struct fpstate *fps = READ_ONCE(fpu->__task_fpstate);

		if (!fps)
			fps = fpu->fpstate;
		xr = &fps->regs.xregs_state;
	} else {
		/* like fpu_sync_fpstate(), except NMI local */
		xsave_nmi(xr, mask);
	}

	// frob xr into perf data
Or did I miss something? I've not looked at this very long and the above
was very vague on the actual issues.
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index da233f20ae6f..0f91a0d7e799 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -359,18 +359,22 @@ int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest)
 	struct fpstate *cur_fps = fpu->fpstate;
 
 	fpregs_lock();
-	if (!cur_fps->is_confidential && !test_thread_flag(TIF_NEED_FPU_LOAD))
+	if (!cur_fps->is_confidential && !test_thread_flag(TIF_NEED_FPU_LOAD)) {
 		save_fpregs_to_fpstate(fpu);
+		set_thread_flag(TIF_NEED_FPU_LOAD);
+	}
 
 	/* Swap fpstate */
 	if (enter_guest) {
-		fpu->__task_fpstate = cur_fps;
+		WRITE_ONCE(fpu->__task_fpstate, cur_fps);
+		barrier();
 		fpu->fpstate = guest_fps;
 		guest_fps->in_use = true;
 	} else {
 		guest_fps->in_use = false;
 		fpu->fpstate = fpu->__task_fpstate;
-		fpu->__task_fpstate = NULL;
+		barrier();
+		WRITE_ONCE(fpu->__task_fpstate, NULL);
 	}
 
 	cur_fps = fpu->fpstate;
@@ -456,8 +460,8 @@ void kernel_fpu_begin_mask(unsigned int kfpu_mask)
 
 	if (!(current->flags & (PF_KTHREAD | PF_USER_WORKER)) &&
 	    !test_thread_flag(TIF_NEED_FPU_LOAD)) {
-		set_thread_flag(TIF_NEED_FPU_LOAD);
 		save_fpregs_to_fpstate(x86_task_fpu(current));
+		set_thread_flag(TIF_NEED_FPU_LOAD);
 	}
 	__cpu_invalidate_fpregs_state();
^ permalink raw reply related [flat|nested] 55+ messages in thread
* Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
2025-12-04 9:20 ` Mi, Dapeng
@ 2025-12-04 16:16 ` Ian Rogers
2025-12-05 4:00 ` Mi, Dapeng
0 siblings, 1 reply; 55+ messages in thread
From: Ian Rogers @ 2025-12-04 16:16 UTC (permalink / raw)
To: Mi, Dapeng
Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang
On Thu, Dec 4, 2025 at 1:20 AM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>
>
> On 12/4/2025 3:49 PM, Ian Rogers wrote:
> > On Wed, Dec 3, 2025 at 6:58 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> >>
> >> On 12/4/2025 8:17 AM, Ian Rogers wrote:
> >>> On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
> >>>> From: Kan Liang <kan.liang@linux.intel.com>
> >>>>
> >>>> This patch adds support for the newly introduced SIMD register sampling
> >>>> format by adding the following functions:
> >>>>
> >>>> uint64_t arch__intr_simd_reg_mask(void);
> >>>> uint64_t arch__user_simd_reg_mask(void);
> >>>> uint64_t arch__intr_pred_reg_mask(void);
> >>>> uint64_t arch__user_pred_reg_mask(void);
> >>>> uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>> uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>> uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>> uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>
> >>>> The arch__{intr|user}_simd_reg_mask() functions retrieve the bitmap of
> >>>> supported SIMD registers, such as XMM/YMM/ZMM on x86 platforms.
> >>>>
> >>>> The arch__{intr|user}_pred_reg_mask() functions retrieve the bitmap of
> >>>> supported PRED registers, such as OPMASK on x86 platforms.
> >>>>
> >>>> The arch__{intr|user}_simd_reg_bitmap_qwords() functions provide the
> >>>> exact bitmap and number of qwords for a specific type of SIMD register.
> >>>> For example, for XMM registers on x86 platforms, the returned bitmap is
> >>>> 0xffff (XMM0 ~ XMM15) and the qwords number is 2 (128 bits for each XMM).
> >>>>
> >>>> The arch__{intr|user}_pred_reg_bitmap_qwords() functions provide the
> >>>> exact bitmap and number of qwords for a specific type of PRED register.
> >>>> For example, for OPMASK registers on x86 platforms, the returned bitmap
> >>>> is 0xff (OPMASK0 ~ OPMASK7) and the qwords number is 1 (64 bits for each
> >>>> OPMASK).
> >>>>
> >>>> Additionally, the function __parse_regs() is enhanced to support parsing
> >>>> these newly introduced SIMD registers. Currently, each type of register
> >>>> can only be sampled collectively; sampling a specific SIMD register is
> >>>> not supported. For example, all XMM registers are sampled together rather
> >>>> than sampling only XMM0.
> >>>>
> >>>> When multiple overlapping register types, such as XMM and YMM, are
> >>>> sampled simultaneously, only the superset (YMM registers) is sampled.
> >>>>
> >>>> With this patch, all supported sampling registers on x86 platforms are
> >>>> displayed as follows.
> >>>>
> >>>> $perf record -I?
> >>>> available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
> >>>> R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
> >>>> R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
> >>>>
> >>>> $perf record --user-regs=?
> >>>> available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
> >>>> R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
> >>>> R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
> >>>>
> >>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> >>>> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >>>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >>>> ---
> >>>> tools/perf/arch/x86/util/perf_regs.c | 470 +++++++++++++++++++++-
> >>>> tools/perf/util/evsel.c | 27 ++
> >>>> tools/perf/util/parse-regs-options.c | 151 ++++++-
> >>>> tools/perf/util/perf_event_attr_fprintf.c | 6 +
> >>>> tools/perf/util/perf_regs.c | 59 +++
> >>>> tools/perf/util/perf_regs.h | 11 +
> >>>> tools/perf/util/record.h | 6 +
> >>>> 7 files changed, 714 insertions(+), 16 deletions(-)
> >>>>
> >>>> diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
> >>>> index 12fd93f04802..db41430f3b07 100644
> >>>> --- a/tools/perf/arch/x86/util/perf_regs.c
> >>>> +++ b/tools/perf/arch/x86/util/perf_regs.c
> >>>> @@ -13,6 +13,49 @@
> >>>> #include "../../../util/pmu.h"
> >>>> #include "../../../util/pmus.h"
> >>>>
> >>>> +static const struct sample_reg sample_reg_masks_ext[] = {
> >>>> + SMPL_REG(AX, PERF_REG_X86_AX),
> >>>> + SMPL_REG(BX, PERF_REG_X86_BX),
> >>>> + SMPL_REG(CX, PERF_REG_X86_CX),
> >>>> + SMPL_REG(DX, PERF_REG_X86_DX),
> >>>> + SMPL_REG(SI, PERF_REG_X86_SI),
> >>>> + SMPL_REG(DI, PERF_REG_X86_DI),
> >>>> + SMPL_REG(BP, PERF_REG_X86_BP),
> >>>> + SMPL_REG(SP, PERF_REG_X86_SP),
> >>>> + SMPL_REG(IP, PERF_REG_X86_IP),
> >>>> + SMPL_REG(FLAGS, PERF_REG_X86_FLAGS),
> >>>> + SMPL_REG(CS, PERF_REG_X86_CS),
> >>>> + SMPL_REG(SS, PERF_REG_X86_SS),
> >>>> +#ifdef HAVE_ARCH_X86_64_SUPPORT
> >>>> + SMPL_REG(R8, PERF_REG_X86_R8),
> >>>> + SMPL_REG(R9, PERF_REG_X86_R9),
> >>>> + SMPL_REG(R10, PERF_REG_X86_R10),
> >>>> + SMPL_REG(R11, PERF_REG_X86_R11),
> >>>> + SMPL_REG(R12, PERF_REG_X86_R12),
> >>>> + SMPL_REG(R13, PERF_REG_X86_R13),
> >>>> + SMPL_REG(R14, PERF_REG_X86_R14),
> >>>> + SMPL_REG(R15, PERF_REG_X86_R15),
> >>>> + SMPL_REG(R16, PERF_REG_X86_R16),
> >>>> + SMPL_REG(R17, PERF_REG_X86_R17),
> >>>> + SMPL_REG(R18, PERF_REG_X86_R18),
> >>>> + SMPL_REG(R19, PERF_REG_X86_R19),
> >>>> + SMPL_REG(R20, PERF_REG_X86_R20),
> >>>> + SMPL_REG(R21, PERF_REG_X86_R21),
> >>>> + SMPL_REG(R22, PERF_REG_X86_R22),
> >>>> + SMPL_REG(R23, PERF_REG_X86_R23),
> >>>> + SMPL_REG(R24, PERF_REG_X86_R24),
> >>>> + SMPL_REG(R25, PERF_REG_X86_R25),
> >>>> + SMPL_REG(R26, PERF_REG_X86_R26),
> >>>> + SMPL_REG(R27, PERF_REG_X86_R27),
> >>>> + SMPL_REG(R28, PERF_REG_X86_R28),
> >>>> + SMPL_REG(R29, PERF_REG_X86_R29),
> >>>> + SMPL_REG(R30, PERF_REG_X86_R30),
> >>>> + SMPL_REG(R31, PERF_REG_X86_R31),
> >>>> + SMPL_REG(SSP, PERF_REG_X86_SSP),
> >>>> +#endif
> >>>> + SMPL_REG_END
> >>>> +};
> >>>> +
> >>>> static const struct sample_reg sample_reg_masks[] = {
> >>>> SMPL_REG(AX, PERF_REG_X86_AX),
> >>>> SMPL_REG(BX, PERF_REG_X86_BX),
> >>>> @@ -276,27 +319,404 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
> >>>> return SDT_ARG_VALID;
> >>>> }
> >>>>
> >>>> +static bool support_simd_reg(u64 sample_type, u16 qwords, u64 mask, bool pred)
> >>> To make the code easier to read, it'd be nice to document sample_type,
> >>> qwords and mask here.
> >> Sure.
> >>
> >>
> >>>> +{
> >>>> + struct perf_event_attr attr = {
> >>>> + .type = PERF_TYPE_HARDWARE,
> >>>> + .config = PERF_COUNT_HW_CPU_CYCLES,
> >>>> + .sample_type = sample_type,
> >>>> + .disabled = 1,
> >>>> + .exclude_kernel = 1,
> >>>> + .sample_simd_regs_enabled = 1,
> >>>> + };
> >>>> + int fd;
> >>>> +
> >>>> + attr.sample_period = 1;
> >>>> +
> >>>> + if (!pred) {
> >>>> + attr.sample_simd_vec_reg_qwords = qwords;
> >>>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>> + attr.sample_simd_vec_reg_intr = mask;
> >>>> + else
> >>>> + attr.sample_simd_vec_reg_user = mask;
> >>>> + } else {
> >>>> + attr.sample_simd_pred_reg_qwords = PERF_X86_OPMASK_QWORDS;
> >>>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>> + attr.sample_simd_pred_reg_intr = PERF_X86_SIMD_PRED_MASK;
> >>>> + else
> >>>> + attr.sample_simd_pred_reg_user = PERF_X86_SIMD_PRED_MASK;
> >>>> + }
> >>>> +
> >>>> + if (perf_pmus__num_core_pmus() > 1) {
> >>>> + struct perf_pmu *pmu = NULL;
> >>>> + __u64 type = PERF_TYPE_RAW;
> >>> It should be okay to do:
> >>> __u64 type = perf_pmus__find_core_pmu()->type
> >>> rather than have the whole loop below.
> >> Sure. Thanks.
> >>
> >>
> >>>> +
> >>>> + /*
> >>>> + * The same register set is supported among different hybrid PMUs.
> >>>> + * Only check the first available one.
> >>>> + */
> >>>> + while ((pmu = perf_pmus__scan_core(pmu)) != NULL) {
> >>>> + type = pmu->type;
> >>>> + break;
> >>>> + }
> >>>> + attr.config |= type << PERF_PMU_TYPE_SHIFT;
> >>>> + }
> >>>> +
> >>>> + event_attr_init(&attr);
> >>>> +
> >>>> + fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
> >>>> + if (fd != -1) {
> >>>> + close(fd);
> >>>> + return true;
> >>>> + }
> >>>> +
> >>>> + return false;
> >>>> +}
> >>>> +
> >>>> +static bool __arch_simd_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
> >>>> +{
> >>>> + bool supported = false;
> >>>> + u64 bits;
> >>>> +
> >>>> + *mask = 0;
> >>>> + *qwords = 0;
> >>>> +
> >>>> + switch (reg) {
> >>>> + case PERF_REG_X86_XMM:
> >>>> + bits = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
> >>>> + supported = support_simd_reg(sample_type, PERF_X86_XMM_QWORDS, bits, false);
> >>>> + if (supported) {
> >>>> + *mask = bits;
> >>>> + *qwords = PERF_X86_XMM_QWORDS;
> >>>> + }
> >>>> + break;
> >>>> + case PERF_REG_X86_YMM:
> >>>> + bits = BIT_ULL(PERF_X86_SIMD_YMM_REGS) - 1;
> >>>> + supported = support_simd_reg(sample_type, PERF_X86_YMM_QWORDS, bits, false);
> >>>> + if (supported) {
> >>>> + *mask = bits;
> >>>> + *qwords = PERF_X86_YMM_QWORDS;
> >>>> + }
> >>>> + break;
> >>>> + case PERF_REG_X86_ZMM:
> >>>> + bits = BIT_ULL(PERF_X86_SIMD_ZMM_REGS) - 1;
> >>>> + supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
> >>>> + if (supported) {
> >>>> + *mask = bits;
> >>>> + *qwords = PERF_X86_ZMM_QWORDS;
> >>>> + break;
> >>>> + }
> >>>> +
> >>>> + bits = BIT_ULL(PERF_X86_SIMD_ZMMH_REGS) - 1;
> >>>> + supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
> >>>> + if (supported) {
> >>>> + *mask = bits;
> >>>> + *qwords = PERF_X86_ZMMH_QWORDS;
> >>>> + }
> >>>> + break;
> >>>> + default:
> >>>> + break;
> >>>> + }
> >>>> +
> >>>> + return supported;
> >>>> +}
> >>>> +
> >>>> +static bool __arch_pred_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
> >>>> +{
> >>>> + bool supported = false;
> >>>> + u64 bits;
> >>>> +
> >>>> + *mask = 0;
> >>>> + *qwords = 0;
> >>>> +
> >>>> + switch (reg) {
> >>>> + case PERF_REG_X86_OPMASK:
> >>>> + bits = BIT_ULL(PERF_X86_SIMD_OPMASK_REGS) - 1;
> >>>> + supported = support_simd_reg(sample_type, PERF_X86_OPMASK_QWORDS, bits, true);
> >>>> + if (supported) {
> >>>> + *mask = bits;
> >>>> + *qwords = PERF_X86_OPMASK_QWORDS;
> >>>> + }
> >>>> + break;
> >>>> + default:
> >>>> + break;
> >>>> + }
> >>>> +
> >>>> + return supported;
> >>>> +}
> >>>> +
> >>>> +static bool has_cap_simd_regs(void)
> >>>> +{
> >>>> + uint64_t mask = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
> >>>> + u16 qwords = PERF_X86_XMM_QWORDS;
> >>>> + static bool has_cap_simd_regs;
> >>>> + static bool cached;
> >>>> +
> >>>> + if (cached)
> >>>> + return has_cap_simd_regs;
> >>>> +
> >>>> + has_cap_simd_regs = __arch_simd_reg_mask(PERF_SAMPLE_REGS_INTR,
> >>>> + PERF_REG_X86_XMM, &mask, &qwords);
> >>>> + has_cap_simd_regs |= __arch_simd_reg_mask(PERF_SAMPLE_REGS_USER,
> >>>> + PERF_REG_X86_XMM, &mask, &qwords);
> >>>> + cached = true;
> >>>> +
> >>>> + return has_cap_simd_regs;
> >>>> +}
> >>>> +
> >>>> +bool arch_has_simd_regs(u64 mask)
> >>>> +{
> >>>> + return has_cap_simd_regs() &&
> >>>> + mask & GENMASK_ULL(PERF_REG_X86_SSP, PERF_REG_X86_R16);
> >>>> +}
> >>>> +
> >>>> +static const struct sample_reg sample_simd_reg_masks[] = {
> >>>> + SMPL_REG(XMM, PERF_REG_X86_XMM),
> >>>> + SMPL_REG(YMM, PERF_REG_X86_YMM),
> >>>> + SMPL_REG(ZMM, PERF_REG_X86_ZMM),
> >>>> + SMPL_REG_END
> >>>> +};
> >>>> +
> >>>> +static const struct sample_reg sample_pred_reg_masks[] = {
> >>>> + SMPL_REG(OPMASK, PERF_REG_X86_OPMASK),
> >>>> + SMPL_REG_END
> >>>> +};
> >>>> +
> >>>> +const struct sample_reg *arch__sample_simd_reg_masks(void)
> >>>> +{
> >>>> + return sample_simd_reg_masks;
> >>>> +}
> >>>> +
> >>>> +const struct sample_reg *arch__sample_pred_reg_masks(void)
> >>>> +{
> >>>> + return sample_pred_reg_masks;
> >>>> +}
> >>>> +
> >>>> +static bool x86_intr_simd_updated;
> >>>> +static u64 x86_intr_simd_reg_mask;
> >>>> +static u64 x86_intr_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
> >>>> +static u16 x86_intr_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
> >>> Could we add some comments? I can kind of figure out that "updated" is
> >>> a check for lazy initialization and what the masks are; "qwords" is an
> >>> odd one. The comment could also point out that SIMD here doesn't mean
> >>> the machine supports SIMD, but that SIMD registers are supported in
> >>> perf events.
> >> Sure.
> >>
> >>
> >>>> +static bool x86_user_simd_updated;
> >>>> +static u64 x86_user_simd_reg_mask;
> >>>> +static u64 x86_user_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
> >>>> +static u16 x86_user_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
> >>>> +
> >>>> +static bool x86_intr_pred_updated;
> >>>> +static u64 x86_intr_pred_reg_mask;
> >>>> +static u64 x86_intr_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
> >>>> +static u16 x86_intr_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
> >>>> +static bool x86_user_pred_updated;
> >>>> +static u64 x86_user_pred_reg_mask;
> >>>> +static u64 x86_user_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
> >>>> +static u16 x86_user_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
> >>>> +
> >>>> +static uint64_t __arch__simd_reg_mask(u64 sample_type)
> >>>> +{
> >>>> + const struct sample_reg *r = NULL;
> >>>> + bool supported;
> >>>> + u64 mask = 0;
> >>>> + int reg;
> >>>> +
> >>>> + if (!has_cap_simd_regs())
> >>>> + return 0;
> >>>> +
> >>>> + if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_simd_updated)
> >>>> + return x86_intr_simd_reg_mask;
> >>>> +
> >>>> + if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_simd_updated)
> >>>> + return x86_user_simd_reg_mask;
> >>>> +
> >>>> + for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> >>>> + supported = false;
> >>>> +
> >>>> + if (!r->mask)
> >>>> + continue;
> >>>> + reg = fls64(r->mask) - 1;
> >>>> +
> >>>> + if (reg >= PERF_REG_X86_MAX_SIMD_REGS)
> >>>> + break;
> >>>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>> + supported = __arch_simd_reg_mask(sample_type, reg,
> >>>> + &x86_intr_simd_mask[reg],
> >>>> + &x86_intr_simd_qwords[reg]);
> >>>> + else if (sample_type == PERF_SAMPLE_REGS_USER)
> >>>> + supported = __arch_simd_reg_mask(sample_type, reg,
> >>>> + &x86_user_simd_mask[reg],
> >>>> + &x86_user_simd_qwords[reg]);
> >>>> + if (supported)
> >>>> + mask |= BIT_ULL(reg);
> >>>> + }
> >>>> +
> >>>> + if (sample_type == PERF_SAMPLE_REGS_INTR) {
> >>>> + x86_intr_simd_reg_mask = mask;
> >>>> + x86_intr_simd_updated = true;
> >>>> + } else {
> >>>> + x86_user_simd_reg_mask = mask;
> >>>> + x86_user_simd_updated = true;
> >>>> + }
> >>>> +
> >>>> + return mask;
> >>>> +}
> >>>> +
> >>>> +static uint64_t __arch__pred_reg_mask(u64 sample_type)
> >>>> +{
> >>>> + const struct sample_reg *r = NULL;
> >>>> + bool supported;
> >>>> + u64 mask = 0;
> >>>> + int reg;
> >>>> +
> >>>> + if (!has_cap_simd_regs())
> >>>> + return 0;
> >>>> +
> >>>> + if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_pred_updated)
> >>>> + return x86_intr_pred_reg_mask;
> >>>> +
> >>>> + if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_pred_updated)
> >>>> + return x86_user_pred_reg_mask;
> >>>> +
> >>>> + for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> >>>> + supported = false;
> >>>> +
> >>>> + if (!r->mask)
> >>>> + continue;
> >>>> + reg = fls64(r->mask) - 1;
> >>>> +
> >>>> + if (reg >= PERF_REG_X86_MAX_PRED_REGS)
> >>>> + break;
> >>>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>> + supported = __arch_pred_reg_mask(sample_type, reg,
> >>>> + &x86_intr_pred_mask[reg],
> >>>> + &x86_intr_pred_qwords[reg]);
> >>>> + else if (sample_type == PERF_SAMPLE_REGS_USER)
> >>>> + supported = __arch_pred_reg_mask(sample_type, reg,
> >>>> + &x86_user_pred_mask[reg],
> >>>> + &x86_user_pred_qwords[reg]);
> >>>> + if (supported)
> >>>> + mask |= BIT_ULL(reg);
> >>>> + }
> >>>> +
> >>>> + if (sample_type == PERF_SAMPLE_REGS_INTR) {
> >>>> + x86_intr_pred_reg_mask = mask;
> >>>> + x86_intr_pred_updated = true;
> >>>> + } else {
> >>>> + x86_user_pred_reg_mask = mask;
> >>>> + x86_user_pred_updated = true;
> >>>> + }
> >>>> +
> >>>> + return mask;
> >>>> +}
> >>> This feels repetitive with __arch__simd_reg_mask, could they be
> >>> refactored together?
> >> hmm, it looks like we can extract the for loop into a common function.
> >> The other parts are hard to generalize since they manipulate different
> >> variables. If we want to generalize them, we have to introduce lots of
> >> "if ... else" branches, and that would make the code hard to read.
> >>
> >>
> >>>> +
> >>>> +uint64_t arch__intr_simd_reg_mask(void)
> >>>> +{
> >>>> + return __arch__simd_reg_mask(PERF_SAMPLE_REGS_INTR);
> >>>> +}
> >>>> +
> >>>> +uint64_t arch__user_simd_reg_mask(void)
> >>>> +{
> >>>> + return __arch__simd_reg_mask(PERF_SAMPLE_REGS_USER);
> >>>> +}
> >>>> +
> >>>> +uint64_t arch__intr_pred_reg_mask(void)
> >>>> +{
> >>>> + return __arch__pred_reg_mask(PERF_SAMPLE_REGS_INTR);
> >>>> +}
> >>>> +
> >>>> +uint64_t arch__user_pred_reg_mask(void)
> >>>> +{
> >>>> + return __arch__pred_reg_mask(PERF_SAMPLE_REGS_USER);
> >>>> +}
> >>>> +
> >>>> +static uint64_t arch__simd_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
> >>>> +{
> >>>> + uint64_t mask = 0;
> >>>> +
> >>>> + *qwords = 0;
> >>>> + if (reg < PERF_REG_X86_MAX_SIMD_REGS) {
> >>>> + if (intr) {
> >>>> + *qwords = x86_intr_simd_qwords[reg];
> >>>> + mask = x86_intr_simd_mask[reg];
> >>>> + } else {
> >>>> + *qwords = x86_user_simd_qwords[reg];
> >>>> + mask = x86_user_simd_mask[reg];
> >>>> + }
> >>>> + }
> >>>> +
> >>>> + return mask;
> >>>> +}
> >>>> +
> >>>> +static uint64_t arch__pred_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
> >>>> +{
> >>>> + uint64_t mask = 0;
> >>>> +
> >>>> + *qwords = 0;
> >>>> + if (reg < PERF_REG_X86_MAX_PRED_REGS) {
> >>>> + if (intr) {
> >>>> + *qwords = x86_intr_pred_qwords[reg];
> >>>> + mask = x86_intr_pred_mask[reg];
> >>>> + } else {
> >>>> + *qwords = x86_user_pred_qwords[reg];
> >>>> + mask = x86_user_pred_mask[reg];
> >>>> + }
> >>>> + }
> >>>> +
> >>>> + return mask;
> >>>> +}
> >>>> +
> >>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords)
> >>>> +{
> >>>> + if (!x86_intr_simd_updated)
> >>>> + arch__intr_simd_reg_mask();
> >>>> + return arch__simd_reg_bitmap_qwords(reg, qwords, true);
> >>>> +}
> >>>> +
> >>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords)
> >>>> +{
> >>>> + if (!x86_user_simd_updated)
> >>>> + arch__user_simd_reg_mask();
> >>>> + return arch__simd_reg_bitmap_qwords(reg, qwords, false);
> >>>> +}
> >>>> +
> >>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords)
> >>>> +{
> >>>> + if (!x86_intr_pred_updated)
> >>>> + arch__intr_pred_reg_mask();
> >>>> + return arch__pred_reg_bitmap_qwords(reg, qwords, true);
> >>>> +}
> >>>> +
> >>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords)
> >>>> +{
> >>>> + if (!x86_user_pred_updated)
> >>>> + arch__user_pred_reg_mask();
> >>>> + return arch__pred_reg_bitmap_qwords(reg, qwords, false);
> >>>> +}
> >>>> +
> >>>> const struct sample_reg *arch__sample_reg_masks(void)
> >>>> {
> >>>> + if (has_cap_simd_regs())
> >>>> + return sample_reg_masks_ext;
> >>>> return sample_reg_masks;
> >>>> }
> >>>>
> >>>> -uint64_t arch__intr_reg_mask(void)
> >>>> +static uint64_t __arch__reg_mask(u64 sample_type, u64 mask, bool has_simd_regs)
> >>>> {
> >>>> struct perf_event_attr attr = {
> >>>> - .type = PERF_TYPE_HARDWARE,
> >>>> - .config = PERF_COUNT_HW_CPU_CYCLES,
> >>>> - .sample_type = PERF_SAMPLE_REGS_INTR,
> >>>> - .sample_regs_intr = PERF_REG_EXTENDED_MASK,
> >>>> - .precise_ip = 1,
> >>>> - .disabled = 1,
> >>>> - .exclude_kernel = 1,
> >>>> + .type = PERF_TYPE_HARDWARE,
> >>>> + .config = PERF_COUNT_HW_CPU_CYCLES,
> >>>> + .sample_type = sample_type,
> >>>> + .precise_ip = 1,
> >>>> + .disabled = 1,
> >>>> + .exclude_kernel = 1,
> >>>> + .sample_simd_regs_enabled = has_simd_regs,
> >>>> };
> >>>> int fd;
> >>>> /*
> >>>> * In an unnamed union, init it here to build on older gcc versions
> >>>> */
> >>>> attr.sample_period = 1;
> >>>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>> + attr.sample_regs_intr = mask;
> >>>> + else
> >>>> + attr.sample_regs_user = mask;
> >>>>
> >>>> if (perf_pmus__num_core_pmus() > 1) {
> >>>> struct perf_pmu *pmu = NULL;
> >>>> @@ -318,13 +738,41 @@ uint64_t arch__intr_reg_mask(void)
> >>>> fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
> >>>> if (fd != -1) {
> >>>> close(fd);
> >>>> - return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
> >>>> + return mask;
> >>>> }
> >>>>
> >>>> - return PERF_REGS_MASK;
> >>>> + return 0;
> >>>> +}
> >>>> +
> >>>> +uint64_t arch__intr_reg_mask(void)
> >>>> +{
> >>>> + uint64_t mask = PERF_REGS_MASK;
> >>>> +
> >>>> + if (has_cap_simd_regs()) {
> >>>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
> >>>> + GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
> >>>> + true);
> >>> It's nice to label constant arguments like this with something like:
> >>> /*has_simd_regs=*/true);
> >>>
> >>> Tools like clang-tidy even try to enforce the argument names match the comments.
> >> Sure.
> >>
> >>
> >>>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
> >>>> + BIT_ULL(PERF_REG_X86_SSP),
> >>>> + true);
> >>>> + } else {
> >>>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR, PERF_REG_EXTENDED_MASK, false);
> >>>> + }
> >>>> +
> >>>> + return mask;
> >>>> }
> >>>>
> >>>> uint64_t arch__user_reg_mask(void)
> >>>> {
> >>>> - return PERF_REGS_MASK;
> >>>> + uint64_t mask = PERF_REGS_MASK;
> >>>> +
> >>>> + if (has_cap_simd_regs()) {
> >>>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
> >>>> + GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
> >>>> + true);
> >>>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
> >>>> + BIT_ULL(PERF_REG_X86_SSP),
> >>>> + true);
> >>>> + }
> >>>> +
> >>>> + return mask;
> >>> The code is repetitive here, could we refactor into a single function
> >>> passing in a user or instr value?
> >> Sure. I'll extract the common part.
> >>
> >>
> >>>> }
> >>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> >>>> index 56ebefd075f2..5d1d90cf9488 100644
> >>>> --- a/tools/perf/util/evsel.c
> >>>> +++ b/tools/perf/util/evsel.c
> >>>> @@ -1461,12 +1461,39 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
> >>>> if (opts->sample_intr_regs && !evsel->no_aux_samples &&
> >>>> !evsel__is_dummy_event(evsel)) {
> >>>> attr->sample_regs_intr = opts->sample_intr_regs;
> >>>> + attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_intr);
> >>>> + evsel__set_sample_bit(evsel, REGS_INTR);
> >>>> + }
> >>>> +
> >>>> + if ((opts->sample_intr_vec_regs || opts->sample_intr_pred_regs) &&
> >>>> + !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
> >>>> + /* A non-zero pred qwords implies the set of SIMD registers is used */
> >>>> + if (opts->sample_pred_regs_qwords)
> >>>> + attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
> >>>> + else
> >>>> + attr->sample_simd_pred_reg_qwords = 1;
> >>>> + attr->sample_simd_vec_reg_intr = opts->sample_intr_vec_regs;
> >>>> + attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
> >>>> + attr->sample_simd_pred_reg_intr = opts->sample_intr_pred_regs;
> >>>> evsel__set_sample_bit(evsel, REGS_INTR);
> >>>> }
> >>>>
> >>>> if (opts->sample_user_regs && !evsel->no_aux_samples &&
> >>>> !evsel__is_dummy_event(evsel)) {
> >>>> attr->sample_regs_user |= opts->sample_user_regs;
> >>>> + attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_user);
> >>>> + evsel__set_sample_bit(evsel, REGS_USER);
> >>>> + }
> >>>> +
> >>>> + if ((opts->sample_user_vec_regs || opts->sample_user_pred_regs) &&
> >>>> + !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
> >>>> + if (opts->sample_pred_regs_qwords)
> >>>> + attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
> >>>> + else
> >>>> + attr->sample_simd_pred_reg_qwords = 1;
> >>>> + attr->sample_simd_vec_reg_user = opts->sample_user_vec_regs;
> >>>> + attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
> >>>> + attr->sample_simd_pred_reg_user = opts->sample_user_pred_regs;
> >>>> evsel__set_sample_bit(evsel, REGS_USER);
> >>>> }
> >>>>
> >>>> diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
> >>>> index cda1c620968e..0bd100392889 100644
> >>>> --- a/tools/perf/util/parse-regs-options.c
> >>>> +++ b/tools/perf/util/parse-regs-options.c
> >>>> @@ -4,19 +4,139 @@
> >>>> #include <stdint.h>
> >>>> #include <string.h>
> >>>> #include <stdio.h>
> >>>> +#include <linux/bitops.h>
> >>>> #include "util/debug.h"
> >>>> #include <subcmd/parse-options.h>
> >>>> #include "util/perf_regs.h"
> >>>> #include "util/parse-regs-options.h"
> >>>> +#include "record.h"
> >>>> +
> >>>> +static void __print_simd_regs(bool intr, uint64_t simd_mask)
> >>>> +{
> >>>> + const struct sample_reg *r = NULL;
> >>>> + uint64_t bitmap = 0;
> >>>> + u16 qwords = 0;
> >>>> + int reg_idx;
> >>>> +
> >>>> + if (!simd_mask)
> >>>> + return;
> >>>> +
> >>>> + for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> >>>> + if (!(r->mask & simd_mask))
> >>>> + continue;
> >>>> + reg_idx = fls64(r->mask) - 1;
> >>>> + if (intr)
> >>>> + bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >>>> + else
> >>>> + bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >>>> + if (bitmap)
> >>>> + fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
> >>>> + }
> >>>> +}
> >>>> +
> >>>> +static void __print_pred_regs(bool intr, uint64_t pred_mask)
> >>>> +{
> >>>> + const struct sample_reg *r = NULL;
> >>>> + uint64_t bitmap = 0;
> >>>> + u16 qwords = 0;
> >>>> + int reg_idx;
> >>>> +
> >>>> + if (!pred_mask)
> >>>> + return;
> >>>> +
> >>>> + for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> >>>> + if (!(r->mask & pred_mask))
> >>>> + continue;
> >>>> + reg_idx = fls64(r->mask) - 1;
> >>>> + if (intr)
> >>>> + bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >>>> + else
> >>>> + bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >>>> + if (bitmap)
> >>>> + fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
> >>>> + }
> >>>> +}
> >>>> +
> >>>> +static bool __parse_simd_regs(struct record_opts *opts, char *s, bool intr)
> >>>> +{
> >>>> + const struct sample_reg *r = NULL;
> >>>> + bool matched = false;
> >>>> + uint64_t bitmap = 0;
> >>>> + u16 qwords = 0;
> >>>> + int reg_idx;
> >>>> +
> >>>> + for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> >>>> + if (strcasecmp(s, r->name))
> >>>> + continue;
> >>>> + if (!fls64(r->mask))
> >>>> + continue;
> >>>> + reg_idx = fls64(r->mask) - 1;
> >>>> + if (intr)
> >>>> + bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >>>> + else
> >>>> + bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >>>> + matched = true;
> >>>> + break;
> >>>> + }
> >>>> +
> >>>> + /* Just need the highest qwords */
> >>> I'm not following here. Does the bitmap need to handle gaps?
> >> Currently no. In theory, the kernel allows user space to sample only a
> >> subset of SIMD registers, e.g., 0xff or 0xf0f for the XMM registers (HW
> >> supports 16 XMM registers on x86_64), but the perf tool doesn't support
> >> that, to avoid introducing too much complexity. Moreover, I don't think
> >> end users have such a requirement. In most cases, users only know which
> >> kinds of SIMD registers their programs use; they usually don't know or
> >> care exactly which SIMD register is used, so each register type is only
> >> sampled collectively.
> >>
> >>
> >>>> + if (qwords > opts->sample_vec_regs_qwords) {
> >>>> + opts->sample_vec_regs_qwords = qwords;
> >>>> + if (intr)
> >>>> + opts->sample_intr_vec_regs = bitmap;
> >>>> + else
> >>>> + opts->sample_user_vec_regs = bitmap;
> >>>> + }
> >>>> +
> >>>> + return matched;
> >>>> +}
> >>>> +
> >>>> +static bool __parse_pred_regs(struct record_opts *opts, char *s, bool intr)
> >>>> +{
> >>>> + const struct sample_reg *r = NULL;
> >>>> + bool matched = false;
> >>>> + uint64_t bitmap = 0;
> >>>> + u16 qwords = 0;
> >>>> + int reg_idx;
> >>>> +
> >>>> + for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> >>>> + if (strcasecmp(s, r->name))
> >>>> + continue;
> >>>> + if (!fls64(r->mask))
> >>>> + continue;
> >>>> + reg_idx = fls64(r->mask) - 1;
> >>>> + if (intr)
> >>>> + bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >>>> + else
> >>>> + bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >>>> + matched = true;
> >>>> + break;
> >>>> + }
> >>>> +
> >>>> + /* Just need the highest qwords */
> >>> Again repetitive, could we have a single function?
> >> Yes, at least the for loop can be extracted into a common function.
> >>
> >>
> >>>> + if (qwords > opts->sample_pred_regs_qwords) {
> >>>> + opts->sample_pred_regs_qwords = qwords;
> >>>> + if (intr)
> >>>> + opts->sample_intr_pred_regs = bitmap;
> >>>> + else
> >>>> + opts->sample_user_pred_regs = bitmap;
> >>>> + }
> >>>> +
> >>>> + return matched;
> >>>> +}
> >>>>
> >>>> static int
> >>>> __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>> {
> >>>> uint64_t *mode = (uint64_t *)opt->value;
> >>>> const struct sample_reg *r = NULL;
> >>>> + struct record_opts *opts;
> >>>> char *s, *os = NULL, *p;
> >>>> - int ret = -1;
> >>>> + bool has_simd_regs = false;
> >>>> uint64_t mask;
> >>>> + uint64_t simd_mask;
> >>>> + uint64_t pred_mask;
> >>>> + int ret = -1;
> >>>>
> >>>> if (unset)
> >>>> return 0;
> >>>> @@ -27,10 +147,17 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>> if (*mode)
> >>>> return -1;
> >>>>
> >>>> - if (intr)
> >>>> + if (intr) {
> >>>> + opts = container_of(opt->value, struct record_opts, sample_intr_regs);
> >>>> mask = arch__intr_reg_mask();
> >>>> - else
> >>>> + simd_mask = arch__intr_simd_reg_mask();
> >>>> + pred_mask = arch__intr_pred_reg_mask();
> >>>> + } else {
> >>>> + opts = container_of(opt->value, struct record_opts, sample_user_regs);
> >>>> mask = arch__user_reg_mask();
> >>>> + simd_mask = arch__user_simd_reg_mask();
> >>>> + pred_mask = arch__user_pred_reg_mask();
> >>>> + }
> >>>>
> >>>> /* str may be NULL in case no arg is passed to -I */
> >>>> if (str) {
> >>>> @@ -50,10 +177,24 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>> if (r->mask & mask)
> >>>> fprintf(stderr, "%s ", r->name);
> >>>> }
> >>>> + __print_simd_regs(intr, simd_mask);
> >>>> + __print_pred_regs(intr, pred_mask);
> >>>> fputc('\n', stderr);
> >>>> /* just printing available regs */
> >>>> goto error;
> >>>> }
> >>>> +
> >>>> + if (simd_mask) {
> >>>> + has_simd_regs = __parse_simd_regs(opts, s, intr);
> >>>> + if (has_simd_regs)
> >>>> + goto next;
> >>>> + }
> >>>> + if (pred_mask) {
> >>>> + has_simd_regs = __parse_pred_regs(opts, s, intr);
> >>>> + if (has_simd_regs)
> >>>> + goto next;
> >>>> + }
> >>>> +
> >>>> for (r = arch__sample_reg_masks(); r->name; r++) {
> >>>> if ((r->mask & mask) && !strcasecmp(s, r->name))
> >>>> break;
> >>>> @@ -65,7 +206,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>> }
> >>>>
> >>>> *mode |= r->mask;
> >>>> -
> >>>> +next:
> >>>> if (!p)
> >>>> break;
> >>>>
> >>>> @@ -75,7 +216,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>> ret = 0;
> >>>>
> >>>> /* default to all possible regs */
> >>>> - if (*mode == 0)
> >>>> + if (*mode == 0 && !has_simd_regs)
> >>>> *mode = mask;
> >>>> error:
> >>>> free(os);
> >>>> diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
> >>>> index 66b666d9ce64..fb0366d050cf 100644
> >>>> --- a/tools/perf/util/perf_event_attr_fprintf.c
> >>>> +++ b/tools/perf/util/perf_event_attr_fprintf.c
> >>>> @@ -360,6 +360,12 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
> >>>> PRINT_ATTRf(aux_start_paused, p_unsigned);
> >>>> PRINT_ATTRf(aux_pause, p_unsigned);
> >>>> PRINT_ATTRf(aux_resume, p_unsigned);
> >>>> + PRINT_ATTRf(sample_simd_pred_reg_qwords, p_unsigned);
> >>>> + PRINT_ATTRf(sample_simd_pred_reg_intr, p_hex);
> >>>> + PRINT_ATTRf(sample_simd_pred_reg_user, p_hex);
> >>>> + PRINT_ATTRf(sample_simd_vec_reg_qwords, p_unsigned);
> >>>> + PRINT_ATTRf(sample_simd_vec_reg_intr, p_hex);
> >>>> + PRINT_ATTRf(sample_simd_vec_reg_user, p_hex);
> >>>>
> >>>> return ret;
> >>>> }
> >>>> diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
> >>>> index 44b90bbf2d07..e8a9fabc92e6 100644
> >>>> --- a/tools/perf/util/perf_regs.c
> >>>> +++ b/tools/perf/util/perf_regs.c
> >>>> @@ -11,6 +11,11 @@ int __weak arch_sdt_arg_parse_op(char *old_op __maybe_unused,
> >>>> return SDT_ARG_SKIP;
> >>>> }
> >>>>
> >>>> +bool __weak arch_has_simd_regs(u64 mask __maybe_unused)
> >>>> +{
> >>>> + return false;
> >>>> +}
> >>>> +
> >>>> uint64_t __weak arch__intr_reg_mask(void)
> >>>> {
> >>>> return 0;
> >>>> @@ -21,6 +26,50 @@ uint64_t __weak arch__user_reg_mask(void)
> >>>> return 0;
> >>>> }
> >>>>
> >>>> +uint64_t __weak arch__intr_simd_reg_mask(void)
> >>>> +{
> >>>> + return 0;
> >>>> +}
> >>>> +
> >>>> +uint64_t __weak arch__user_simd_reg_mask(void)
> >>>> +{
> >>>> + return 0;
> >>>> +}
> >>>> +
> >>>> +uint64_t __weak arch__intr_pred_reg_mask(void)
> >>>> +{
> >>>> + return 0;
> >>>> +}
> >>>> +
> >>>> +uint64_t __weak arch__user_pred_reg_mask(void)
> >>>> +{
> >>>> + return 0;
> >>>> +}
> >>>> +
> >>>> +uint64_t __weak arch__intr_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> >>>> +{
> >>>> + *qwords = 0;
> >>>> + return 0;
> >>>> +}
> >>>> +
> >>>> +uint64_t __weak arch__user_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> >>>> +{
> >>>> + *qwords = 0;
> >>>> + return 0;
> >>>> +}
> >>>> +
> >>>> +uint64_t __weak arch__intr_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> >>>> +{
> >>>> + *qwords = 0;
> >>>> + return 0;
> >>>> +}
> >>>> +
> >>>> +uint64_t __weak arch__user_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> >>>> +{
> >>>> + *qwords = 0;
> >>>> + return 0;
> >>>> +}
> >>>> +
> >>>> static const struct sample_reg sample_reg_masks[] = {
> >>>> SMPL_REG_END
> >>>> };
> >>>> @@ -30,6 +79,16 @@ const struct sample_reg * __weak arch__sample_reg_masks(void)
> >>>> return sample_reg_masks;
> >>>> }
> >>>>
> >>>> +const struct sample_reg * __weak arch__sample_simd_reg_masks(void)
> >>>> +{
> >>>> + return sample_reg_masks;
> >>>> +}
> >>>> +
> >>>> +const struct sample_reg * __weak arch__sample_pred_reg_masks(void)
> >>>> +{
> >>>> + return sample_reg_masks;
> >>>> +}
> >>> Thinking out loud. I wonder if there is a way to hide the weak
> >>> functions. It seems the support is tied to PMUs, particularly core
> >>> PMUs, perhaps we can push things into pmu and arch pmu code. Then we
> >>> ask the PMU to parse the register strings, set up the perf_event_attr,
> >>> etc. I'm somewhat scared these functions will be used on the report
> >>> rather than record side of things, thereby breaking perf.data support
> >>> when the host kernel does or doesn't have the SIMD support.
> >> Ian, I don't quite follow you.
> >>
> >> I don't quite understand what we should do to "push things into pmu and
> >> arch pmu code". The current SIMD register support follows the same
> >> approach as the general register support. If we intend to change the
> >> approach entirely, we'd better do it in an independent patch-set.
> >>
> >> Why would these functions break perf.data reporting? perf-report checks
> >> whether the PERF_SAMPLE_REGS_ABI_SIMD flag is set for each record; only
> >> when the flag is set (indicating SIMD register data is appended to the
> >> record) does perf-report try to parse the SIMD register data.
> > Thanks Dapeng, sorry I wasn't clear. So, I've landed clean ups to
> > remove weak symbols like:
> > https://lore.kernel.org/lkml/20250724163302.596743-21-irogers@google.com/#t
> >
> > For these patches what I'm imagining is that there is a Nova Lake
> > generated perf.data file. Using perf report, script, etc. on the Nova
> > Lake should expose all of the same mask, qword, etc. values as when
> > the perf.data was generated and so things will work. If the perf.data
> > file was taken to, say, my Alderlake, then what will happen? Generally
> > using the arch directory and weak symbols is a code smell that
> > cross-platform things are going to break - there should be sufficient data
> > in the event and the perf_event_attr to fully decode what's going on.
> > Sometimes tying things to a PMU name can avoid the use of the arch
> > directory. We were able to avoid the arch directory to a good extent
> > for the TPEBS code, even though it is a very modern Intel feature.
>
> I see.
>
> But the sampling support for SIMD registers is different from the sample
> weight processing in the patch
> https://lore.kernel.org/lkml/20250724163302.596743-21-irogers@google.com/#t.
> Each arch may support different kinds of SIMD registers, and furthermore
> each kind of SIMD register may have a different register count and
> register width. It's quite hard to come up with common functions or
> fields to represent the names and attributes of these arch-specific SIMD
> registers. This arch-specific information can only be provided by the
> arch-specific code, so the __weak functions still look like the easiest
> way to implement this.
>
> I don't think the perf.data parsing would break when moving from one
> platform to another (same arch), e.g., from Nova Lake to Alder Lake.
> To indicate the presence of SIMD registers in the record data, a new ABI
> flag "PERF_SAMPLE_REGS_ABI_SIMD" is introduced. If the perf tool on the
> 2nd platform is new enough to recognize this new flag, the SIMD register
> data is parsed correctly. Even if the perf tool is old and has no SIMD
> register support, the SIMD register data is just silently ignored and
> should not break the parsing.
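>
> Roughly, the reader side only needs something like this (a sketch, not
> the exact perf code; the SIMD payload layout is whatever the new ABI
> defines):
>
> 	u64 abi = *array++;
>
> 	if (abi & PERF_SAMPLE_REGS_ABI_SIMD) {
> 		/* new tool: parse the appended vec/pred register data */
> 	}
> 	/*
> 	 * An old tool doesn't know the flag; the extra payload is skipped
> 	 * via the record size and parsing continues.
> 	 */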
That's good to know. I'm confused then why these functions can't just
be within the arch directory? For example, we don't expose the
intel-pt PMU code in the common code except for the parsing parts. A
lot of that is handled by the default perf_event_attr initialization
that every PMU can have its own variant of:
https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/pmu.h?h=perf-tools-next#n123
Perhaps this is all just evidence of tech debt in the perf_regs.c code
:-/ The bit that's relevant to the patch here is that I think this is
adding to the tech debt problem as 11 more functions are added to
perf_regs.h.
Thanks,
Ian
> >
> > Thanks,
> > Ian
> >
> >
> >
> >>> Thanks,
> >>> Ian
> >>>
> >>>> +
> >>>> const char *perf_reg_name(int id, const char *arch)
> >>>> {
> >>>> const char *reg_name = NULL;
> >>>> diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
> >>>> index f2d0736d65cc..bce9c4cfd1bf 100644
> >>>> --- a/tools/perf/util/perf_regs.h
> >>>> +++ b/tools/perf/util/perf_regs.h
> >>>> @@ -24,9 +24,20 @@ enum {
> >>>> };
> >>>>
> >>>> int arch_sdt_arg_parse_op(char *old_op, char **new_op);
> >>>> +bool arch_has_simd_regs(u64 mask);
> >>>> uint64_t arch__intr_reg_mask(void);
> >>>> uint64_t arch__user_reg_mask(void);
> >>>> const struct sample_reg *arch__sample_reg_masks(void);
> >>>> +const struct sample_reg *arch__sample_simd_reg_masks(void);
> >>>> +const struct sample_reg *arch__sample_pred_reg_masks(void);
> >>>> +uint64_t arch__intr_simd_reg_mask(void);
> >>>> +uint64_t arch__user_simd_reg_mask(void);
> >>>> +uint64_t arch__intr_pred_reg_mask(void);
> >>>> +uint64_t arch__user_pred_reg_mask(void);
> >>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>
> >>>> const char *perf_reg_name(int id, const char *arch);
> >>>> int perf_reg_value(u64 *valp, struct regs_dump *regs, int id);
> >>>> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
> >>>> index ea3a6c4657ee..825ffb4cc53f 100644
> >>>> --- a/tools/perf/util/record.h
> >>>> +++ b/tools/perf/util/record.h
> >>>> @@ -59,7 +59,13 @@ struct record_opts {
> >>>> unsigned int user_freq;
> >>>> u64 branch_stack;
> >>>> u64 sample_intr_regs;
> >>>> + u64 sample_intr_vec_regs;
> >>>> u64 sample_user_regs;
> >>>> + u64 sample_user_vec_regs;
> >>>> + u16 sample_pred_regs_qwords;
> >>>> + u16 sample_vec_regs_qwords;
> >>>> + u16 sample_intr_pred_regs;
> >>>> + u16 sample_user_pred_regs;
> >>>> u64 default_interval;
> >>>> u64 user_interval;
> >>>> size_t auxtrace_snapshot_size;
> >>>> --
> >>>> 2.34.1
> >>>>
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [Patch v5 06/19] perf/x86: Add support for XMM registers in non-PEBS and REGS_USER
2025-12-04 15:17 ` Peter Zijlstra
2025-12-04 15:47 ` Peter Zijlstra
@ 2025-12-04 18:59 ` Dave Hansen
2025-12-05 8:42 ` Peter Zijlstra
1 sibling, 1 reply; 55+ messages in thread
From: Dave Hansen @ 2025-12-04 18:59 UTC (permalink / raw)
To: Peter Zijlstra, Dapeng Mi
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang
[-- Attachment #1: Type: text/plain, Size: 674 bytes --]
On 12/4/25 07:17, Peter Zijlstra wrote:
>> - Additionally, checking the TIF_NEED_FPU_LOAD flag alone is insufficient.
>> Some corner cases, such as an NMI occurring just after the flag switches
>> but still in kernel mode, cannot be handled.
> Urgh.. Dave, Thomas, is there any reason we could not set
> TIF_NEED_FPU_LOAD *after* doing the XSAVE (clearing is already done
> after restore).
>
> That way, when an NMI sees TIF_NEED_FPU_LOAD it knows the task copy is
> consistent.
Something like the attached patch?
I think that would be just fine. save_fpregs_to_fpstate() doesn't
actually change the need for TIF_NEED_FPU_LOAD, so I don't think the
ordering matters.
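
With that ordering, the NMI side can rely on Peter's rule above: if it
sees TIF_NEED_FPU_LOAD set, the task copy is complete. Roughly (a sketch
of the idea, not the actual perf patch):

	if (test_thread_flag(TIF_NEED_FPU_LOAD)) {
		/* XSAVE already ran; read the copy in x86_task_fpu(current) */
	} else {
		/* registers are live; snapshot them (e.g. XSAVES) instead */
	}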
[-- Attachment #2: tif-after-xsave.patch --]
[-- Type: text/x-patch, Size: 634 bytes --]
diff --git a/arch/x86/include/asm/fpu/sched.h b/arch/x86/include/asm/fpu/sched.h
index 89004f4ca208..2d57a7bf5406 100644
--- a/arch/x86/include/asm/fpu/sched.h
+++ b/arch/x86/include/asm/fpu/sched.h
@@ -36,8 +36,8 @@ static inline void switch_fpu(struct task_struct *old, int cpu)
 	    !(old->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		struct fpu *old_fpu = x86_task_fpu(old);
 
-		set_tsk_thread_flag(old, TIF_NEED_FPU_LOAD);
 		save_fpregs_to_fpstate(old_fpu);
+		set_tsk_thread_flag(old, TIF_NEED_FPU_LOAD);
 		/*
 		 * The save operation preserved register state, so the
 		 * fpu_fpregs_owner_ctx is still @old_fpu. Store the
^ permalink raw reply related [flat|nested] 55+ messages in thread
* Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
2025-12-04 16:16 ` Ian Rogers
@ 2025-12-05 4:00 ` Mi, Dapeng
2025-12-05 6:38 ` Ian Rogers
0 siblings, 1 reply; 55+ messages in thread
From: Mi, Dapeng @ 2025-12-05 4:00 UTC (permalink / raw)
To: Ian Rogers
Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang
On 12/5/2025 12:16 AM, Ian Rogers wrote:
> On Thu, Dec 4, 2025 at 1:20 AM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>
>> On 12/4/2025 3:49 PM, Ian Rogers wrote:
>>> On Wed, Dec 3, 2025 at 6:58 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>>> On 12/4/2025 8:17 AM, Ian Rogers wrote:
>>>>> On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
>>>>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>>>>
>>>>>> This patch adds support for the newly introduced SIMD register sampling
>>>>>> format by adding the following functions:
>>>>>>
>>>>>> uint64_t arch__intr_simd_reg_mask(void);
>>>>>> uint64_t arch__user_simd_reg_mask(void);
>>>>>> uint64_t arch__intr_pred_reg_mask(void);
>>>>>> uint64_t arch__user_pred_reg_mask(void);
>>>>>> uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>> uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>> uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>> uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>
>>>>>> The arch__{intr|user}_simd_reg_mask() functions retrieve the bitmap of
>>>>>> supported SIMD registers, such as XMM/YMM/ZMM on x86 platforms.
>>>>>>
>>>>>> The arch__{intr|user}_pred_reg_mask() functions retrieve the bitmap of
>>>>>> supported PRED registers, such as OPMASK on x86 platforms.
>>>>>>
>>>>>> The arch__{intr|user}_simd_reg_bitmap_qwords() functions provide the
>>>>>> exact bitmap and number of qwords for a specific type of SIMD register.
>>>>>> For example, for XMM registers on x86 platforms, the returned bitmap is
>>>>>> 0xffff (XMM0 ~ XMM15) and the qwords number is 2 (128 bits for each XMM).
>>>>>>
>>>>>> The arch__{intr|user}_pred_reg_bitmap_qwords() functions provide the
>>>>>> exact bitmap and number of qwords for a specific type of PRED register.
>>>>>> For example, for OPMASK registers on x86 platforms, the returned bitmap
>>>>>> is 0xff (OPMASK0 ~ OPMASK7) and the qwords number is 1 (64 bits for each
>>>>>> OPMASK).
>>>>>>
>>>>>> Additionally, the function __parse_regs() is enhanced to support parsing
>>>>>> these newly introduced SIMD registers. Currently, each type of register
>>>>>> can only be sampled collectively; sampling a specific SIMD register is
>>>>>> not supported. For example, all XMM registers are sampled together rather
>>>>>> than sampling only XMM0.
>>>>>>
>>>>>> When multiple overlapping register types, such as XMM and YMM, are
>>>>>> sampled simultaneously, only the superset (YMM registers) is sampled.
>>>>>>
>>>>>> With this patch, all supported sampling registers on x86 platforms are
>>>>>> displayed as follows.
>>>>>>
>>>>>> $perf record -I?
>>>>>> available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>>>>>> R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>>>>>> R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>>>>>>
>>>>>> $perf record --user-regs=?
>>>>>> available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>>>>>> R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>>>>>> R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>>>>>>
>>>>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>>>>>> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>>>>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>>>>> ---
>>>>>> tools/perf/arch/x86/util/perf_regs.c | 470 +++++++++++++++++++++-
>>>>>> tools/perf/util/evsel.c | 27 ++
>>>>>> tools/perf/util/parse-regs-options.c | 151 ++++++-
>>>>>> tools/perf/util/perf_event_attr_fprintf.c | 6 +
>>>>>> tools/perf/util/perf_regs.c | 59 +++
>>>>>> tools/perf/util/perf_regs.h | 11 +
>>>>>> tools/perf/util/record.h | 6 +
>>>>>> 7 files changed, 714 insertions(+), 16 deletions(-)
>>>>>>
>>>>>> diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
>>>>>> index 12fd93f04802..db41430f3b07 100644
>>>>>> --- a/tools/perf/arch/x86/util/perf_regs.c
>>>>>> +++ b/tools/perf/arch/x86/util/perf_regs.c
>>>>>> @@ -13,6 +13,49 @@
>>>>>> #include "../../../util/pmu.h"
>>>>>> #include "../../../util/pmus.h"
>>>>>>
>>>>>> +static const struct sample_reg sample_reg_masks_ext[] = {
>>>>>> + SMPL_REG(AX, PERF_REG_X86_AX),
>>>>>> + SMPL_REG(BX, PERF_REG_X86_BX),
>>>>>> + SMPL_REG(CX, PERF_REG_X86_CX),
>>>>>> + SMPL_REG(DX, PERF_REG_X86_DX),
>>>>>> + SMPL_REG(SI, PERF_REG_X86_SI),
>>>>>> + SMPL_REG(DI, PERF_REG_X86_DI),
>>>>>> + SMPL_REG(BP, PERF_REG_X86_BP),
>>>>>> + SMPL_REG(SP, PERF_REG_X86_SP),
>>>>>> + SMPL_REG(IP, PERF_REG_X86_IP),
>>>>>> + SMPL_REG(FLAGS, PERF_REG_X86_FLAGS),
>>>>>> + SMPL_REG(CS, PERF_REG_X86_CS),
>>>>>> + SMPL_REG(SS, PERF_REG_X86_SS),
>>>>>> +#ifdef HAVE_ARCH_X86_64_SUPPORT
>>>>>> + SMPL_REG(R8, PERF_REG_X86_R8),
>>>>>> + SMPL_REG(R9, PERF_REG_X86_R9),
>>>>>> + SMPL_REG(R10, PERF_REG_X86_R10),
>>>>>> + SMPL_REG(R11, PERF_REG_X86_R11),
>>>>>> + SMPL_REG(R12, PERF_REG_X86_R12),
>>>>>> + SMPL_REG(R13, PERF_REG_X86_R13),
>>>>>> + SMPL_REG(R14, PERF_REG_X86_R14),
>>>>>> + SMPL_REG(R15, PERF_REG_X86_R15),
>>>>>> + SMPL_REG(R16, PERF_REG_X86_R16),
>>>>>> + SMPL_REG(R17, PERF_REG_X86_R17),
>>>>>> + SMPL_REG(R18, PERF_REG_X86_R18),
>>>>>> + SMPL_REG(R19, PERF_REG_X86_R19),
>>>>>> + SMPL_REG(R20, PERF_REG_X86_R20),
>>>>>> + SMPL_REG(R21, PERF_REG_X86_R21),
>>>>>> + SMPL_REG(R22, PERF_REG_X86_R22),
>>>>>> + SMPL_REG(R23, PERF_REG_X86_R23),
>>>>>> + SMPL_REG(R24, PERF_REG_X86_R24),
>>>>>> + SMPL_REG(R25, PERF_REG_X86_R25),
>>>>>> + SMPL_REG(R26, PERF_REG_X86_R26),
>>>>>> + SMPL_REG(R27, PERF_REG_X86_R27),
>>>>>> + SMPL_REG(R28, PERF_REG_X86_R28),
>>>>>> + SMPL_REG(R29, PERF_REG_X86_R29),
>>>>>> + SMPL_REG(R30, PERF_REG_X86_R30),
>>>>>> + SMPL_REG(R31, PERF_REG_X86_R31),
>>>>>> + SMPL_REG(SSP, PERF_REG_X86_SSP),
>>>>>> +#endif
>>>>>> + SMPL_REG_END
>>>>>> +};
>>>>>> +
>>>>>> static const struct sample_reg sample_reg_masks[] = {
>>>>>> SMPL_REG(AX, PERF_REG_X86_AX),
>>>>>> SMPL_REG(BX, PERF_REG_X86_BX),
>>>>>> @@ -276,27 +319,404 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
>>>>>> return SDT_ARG_VALID;
>>>>>> }
>>>>>>
>>>>>> +static bool support_simd_reg(u64 sample_type, u16 qwords, u64 mask, bool pred)
>>>>> To make the code easier to read, it'd be nice to document sample_type,
>>>>> qwords and mask here.
>>>> Sure.
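>>>>
>>>> Something like this (exact wording TBD):
>>>>
>>>> /*
>>>>  * support_simd_reg - probe whether perf_event_open() accepts this config
>>>>  * @sample_type: PERF_SAMPLE_REGS_INTR or PERF_SAMPLE_REGS_USER
>>>>  * @qwords: width of each sampled register, in 64-bit words
>>>>  * @mask: bitmap of the registers of this type to sample
>>>>  * @pred: true to probe predicate (OPMASK) regs, false for vector regs
>>>>  */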
>>>>
>>>>
>>>>>> +{
>>>>>> + struct perf_event_attr attr = {
>>>>>> + .type = PERF_TYPE_HARDWARE,
>>>>>> + .config = PERF_COUNT_HW_CPU_CYCLES,
>>>>>> + .sample_type = sample_type,
>>>>>> + .disabled = 1,
>>>>>> + .exclude_kernel = 1,
>>>>>> + .sample_simd_regs_enabled = 1,
>>>>>> + };
>>>>>> + int fd;
>>>>>> +
>>>>>> + attr.sample_period = 1;
>>>>>> +
>>>>>> + if (!pred) {
>>>>>> + attr.sample_simd_vec_reg_qwords = qwords;
>>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>> + attr.sample_simd_vec_reg_intr = mask;
>>>>>> + else
>>>>>> + attr.sample_simd_vec_reg_user = mask;
>>>>>> + } else {
>>>>>> + attr.sample_simd_pred_reg_qwords = PERF_X86_OPMASK_QWORDS;
>>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>> + attr.sample_simd_pred_reg_intr = PERF_X86_SIMD_PRED_MASK;
>>>>>> + else
>>>>>> + attr.sample_simd_pred_reg_user = PERF_X86_SIMD_PRED_MASK;
>>>>>> + }
>>>>>> +
>>>>>> + if (perf_pmus__num_core_pmus() > 1) {
>>>>>> + struct perf_pmu *pmu = NULL;
>>>>>> + __u64 type = PERF_TYPE_RAW;
>>>>> It should be okay to do:
>>>>> __u64 type = perf_pmus__find_core_pmu()->type
>>>>> rather than have the whole loop below.
>>>> Sure. Thanks.
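>>>>
>>>> i.e. something like (assuming perf_pmus__find_core_pmu() is non-NULL
>>>> whenever perf_pmus__num_core_pmus() > 1):
>>>>
>>>> 	if (perf_pmus__num_core_pmus() > 1)
>>>> 		attr.config |= (__u64)perf_pmus__find_core_pmu()->type <<
>>>> 			       PERF_PMU_TYPE_SHIFT;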
>>>>
>>>>
>>>>>> +
>>>>>> + /*
>>>>>> + * The same register set is supported among different hybrid PMUs.
>>>>>> + * Only check the first available one.
>>>>>> + */
>>>>>> + while ((pmu = perf_pmus__scan_core(pmu)) != NULL) {
>>>>>> + type = pmu->type;
>>>>>> + break;
>>>>>> + }
>>>>>> + attr.config |= type << PERF_PMU_TYPE_SHIFT;
>>>>>> + }
>>>>>> +
>>>>>> + event_attr_init(&attr);
>>>>>> +
>>>>>> + fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
>>>>>> + if (fd != -1) {
>>>>>> + close(fd);
>>>>>> + return true;
>>>>>> + }
>>>>>> +
>>>>>> + return false;
>>>>>> +}
>>>>>> +
>>>>>> +static bool __arch_simd_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
>>>>>> +{
>>>>>> + bool supported = false;
>>>>>> + u64 bits;
>>>>>> +
>>>>>> + *mask = 0;
>>>>>> + *qwords = 0;
>>>>>> +
>>>>>> + switch (reg) {
>>>>>> + case PERF_REG_X86_XMM:
>>>>>> + bits = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
>>>>>> + supported = support_simd_reg(sample_type, PERF_X86_XMM_QWORDS, bits, false);
>>>>>> + if (supported) {
>>>>>> + *mask = bits;
>>>>>> + *qwords = PERF_X86_XMM_QWORDS;
>>>>>> + }
>>>>>> + break;
>>>>>> + case PERF_REG_X86_YMM:
>>>>>> + bits = BIT_ULL(PERF_X86_SIMD_YMM_REGS) - 1;
>>>>>> + supported = support_simd_reg(sample_type, PERF_X86_YMM_QWORDS, bits, false);
>>>>>> + if (supported) {
>>>>>> + *mask = bits;
>>>>>> + *qwords = PERF_X86_YMM_QWORDS;
>>>>>> + }
>>>>>> + break;
>>>>>> + case PERF_REG_X86_ZMM:
>>>>>> + bits = BIT_ULL(PERF_X86_SIMD_ZMM_REGS) - 1;
>>>>>> + supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
>>>>>> + if (supported) {
>>>>>> + *mask = bits;
>>>>>> + *qwords = PERF_X86_ZMM_QWORDS;
>>>>>> + break;
>>>>>> + }
>>>>>> +
>>>>>> + bits = BIT_ULL(PERF_X86_SIMD_ZMMH_REGS) - 1;
>>>>>> + supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
>>>>>> + if (supported) {
>>>>>> + *mask = bits;
>>>>>> + *qwords = PERF_X86_ZMMH_QWORDS;
>>>>>> + }
>>>>>> + break;
>>>>>> + default:
>>>>>> + break;
>>>>>> + }
>>>>>> +
>>>>>> + return supported;
>>>>>> +}
>>>>>> +
>>>>>> +static bool __arch_pred_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
>>>>>> +{
>>>>>> + bool supported = false;
>>>>>> + u64 bits;
>>>>>> +
>>>>>> + *mask = 0;
>>>>>> + *qwords = 0;
>>>>>> +
>>>>>> + switch (reg) {
>>>>>> + case PERF_REG_X86_OPMASK:
>>>>>> + bits = BIT_ULL(PERF_X86_SIMD_OPMASK_REGS) - 1;
>>>>>> + supported = support_simd_reg(sample_type, PERF_X86_OPMASK_QWORDS, bits, true);
>>>>>> + if (supported) {
>>>>>> + *mask = bits;
>>>>>> + *qwords = PERF_X86_OPMASK_QWORDS;
>>>>>> + }
>>>>>> + break;
>>>>>> + default:
>>>>>> + break;
>>>>>> + }
>>>>>> +
>>>>>> + return supported;
>>>>>> +}
>>>>>> +
>>>>>> +static bool has_cap_simd_regs(void)
>>>>>> +{
>>>>>> + uint64_t mask = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
>>>>>> + u16 qwords = PERF_X86_XMM_QWORDS;
>>>>>> + static bool has_cap_simd_regs;
>>>>>> + static bool cached;
>>>>>> +
>>>>>> + if (cached)
>>>>>> + return has_cap_simd_regs;
>>>>>> +
>>>>>> + has_cap_simd_regs = __arch_simd_reg_mask(PERF_SAMPLE_REGS_INTR,
>>>>>> + PERF_REG_X86_XMM, &mask, &qwords);
>>>>>> + has_cap_simd_regs |= __arch_simd_reg_mask(PERF_SAMPLE_REGS_USER,
>>>>>> + PERF_REG_X86_XMM, &mask, &qwords);
>>>>>> + cached = true;
>>>>>> +
>>>>>> + return has_cap_simd_regs;
>>>>>> +}
>>>>>> +
>>>>>> +bool arch_has_simd_regs(u64 mask)
>>>>>> +{
>>>>>> + return has_cap_simd_regs() &&
>>>>>> + mask & GENMASK_ULL(PERF_REG_X86_SSP, PERF_REG_X86_R16);
>>>>>> +}
>>>>>> +
>>>>>> +static const struct sample_reg sample_simd_reg_masks[] = {
>>>>>> + SMPL_REG(XMM, PERF_REG_X86_XMM),
>>>>>> + SMPL_REG(YMM, PERF_REG_X86_YMM),
>>>>>> + SMPL_REG(ZMM, PERF_REG_X86_ZMM),
>>>>>> + SMPL_REG_END
>>>>>> +};
>>>>>> +
>>>>>> +static const struct sample_reg sample_pred_reg_masks[] = {
>>>>>> + SMPL_REG(OPMASK, PERF_REG_X86_OPMASK),
>>>>>> + SMPL_REG_END
>>>>>> +};
>>>>>> +
>>>>>> +const struct sample_reg *arch__sample_simd_reg_masks(void)
>>>>>> +{
>>>>>> + return sample_simd_reg_masks;
>>>>>> +}
>>>>>> +
>>>>>> +const struct sample_reg *arch__sample_pred_reg_masks(void)
>>>>>> +{
>>>>>> + return sample_pred_reg_masks;
>>>>>> +}
>>>>>> +
>>>>>> +static bool x86_intr_simd_updated;
>>>>>> +static u64 x86_intr_simd_reg_mask;
>>>>>> +static u64 x86_intr_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
>>>>>> +static u16 x86_intr_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
>>>>> Could we add some comments? I can kind of figure out that "updated" is
>>>>> a check for lazy initialization and what the masks are; qwords is an
>>>>> odd one. The comment could also point out that SIMD here doesn't mean
>>>>> the machine supports SIMD, but that SIMD registers are supported in
>>>>> perf events.
>>>> Sure.
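>>>>
>>>> Draft comment (wording to be refined):
>>>>
>>>> /*
>>>>  * Lazily-initialized caches of what the running kernel accepts for
>>>>  * PERF_SAMPLE_REGS_INTR. x86_intr_simd_updated guards initialization;
>>>>  * for each SIMD register type, mask[] holds the sampleable register
>>>>  * bitmap and qwords[] the per-register width in 64-bit words. A set
>>>>  * bit means perf_event_open() accepted the register, not merely that
>>>>  * the CPU supports the SIMD extension.
>>>>  */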
>>>>
>>>>
>>>>>> +static bool x86_user_simd_updated;
>>>>>> +static u64 x86_user_simd_reg_mask;
>>>>>> +static u64 x86_user_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
>>>>>> +static u16 x86_user_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
>>>>>> +
>>>>>> +static bool x86_intr_pred_updated;
>>>>>> +static u64 x86_intr_pred_reg_mask;
>>>>>> +static u64 x86_intr_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
>>>>>> +static u16 x86_intr_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
>>>>>> +static bool x86_user_pred_updated;
>>>>>> +static u64 x86_user_pred_reg_mask;
>>>>>> +static u64 x86_user_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
>>>>>> +static u16 x86_user_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
>>>>>> +
>>>>>> +static uint64_t __arch__simd_reg_mask(u64 sample_type)
>>>>>> +{
>>>>>> + const struct sample_reg *r = NULL;
>>>>>> + bool supported;
>>>>>> + u64 mask = 0;
>>>>>> + int reg;
>>>>>> +
>>>>>> + if (!has_cap_simd_regs())
>>>>>> + return 0;
>>>>>> +
>>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_simd_updated)
>>>>>> + return x86_intr_simd_reg_mask;
>>>>>> +
>>>>>> + if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_simd_updated)
>>>>>> + return x86_user_simd_reg_mask;
>>>>>> +
>>>>>> + for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>>>>>> + supported = false;
>>>>>> +
>>>>>> + if (!r->mask)
>>>>>> + continue;
>>>>>> + reg = fls64(r->mask) - 1;
>>>>>> +
>>>>>> + if (reg >= PERF_REG_X86_MAX_SIMD_REGS)
>>>>>> + break;
>>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>> + supported = __arch_simd_reg_mask(sample_type, reg,
>>>>>> + &x86_intr_simd_mask[reg],
>>>>>> + &x86_intr_simd_qwords[reg]);
>>>>>> + else if (sample_type == PERF_SAMPLE_REGS_USER)
>>>>>> + supported = __arch_simd_reg_mask(sample_type, reg,
>>>>>> + &x86_user_simd_mask[reg],
>>>>>> + &x86_user_simd_qwords[reg]);
>>>>>> + if (supported)
>>>>>> + mask |= BIT_ULL(reg);
>>>>>> + }
>>>>>> +
>>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR) {
>>>>>> + x86_intr_simd_reg_mask = mask;
>>>>>> + x86_intr_simd_updated = true;
>>>>>> + } else {
>>>>>> + x86_user_simd_reg_mask = mask;
>>>>>> + x86_user_simd_updated = true;
>>>>>> + }
>>>>>> +
>>>>>> + return mask;
>>>>>> +}
>>>>>> +
>>>>>> +static uint64_t __arch__pred_reg_mask(u64 sample_type)
>>>>>> +{
>>>>>> + const struct sample_reg *r = NULL;
>>>>>> + bool supported;
>>>>>> + u64 mask = 0;
>>>>>> + int reg;
>>>>>> +
>>>>>> + if (!has_cap_simd_regs())
>>>>>> + return 0;
>>>>>> +
>>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_pred_updated)
>>>>>> + return x86_intr_pred_reg_mask;
>>>>>> +
>>>>>> + if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_pred_updated)
>>>>>> + return x86_user_pred_reg_mask;
>>>>>> +
>>>>>> + for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>>>>>> + supported = false;
>>>>>> +
>>>>>> + if (!r->mask)
>>>>>> + continue;
>>>>>> + reg = fls64(r->mask) - 1;
>>>>>> +
>>>>>> + if (reg >= PERF_REG_X86_MAX_PRED_REGS)
>>>>>> + break;
>>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>> + supported = __arch_pred_reg_mask(sample_type, reg,
>>>>>> + &x86_intr_pred_mask[reg],
>>>>>> + &x86_intr_pred_qwords[reg]);
>>>>>> + else if (sample_type == PERF_SAMPLE_REGS_USER)
>>>>>> + supported = __arch_pred_reg_mask(sample_type, reg,
>>>>>> + &x86_user_pred_mask[reg],
>>>>>> + &x86_user_pred_qwords[reg]);
>>>>>> + if (supported)
>>>>>> + mask |= BIT_ULL(reg);
>>>>>> + }
>>>>>> +
>>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR) {
>>>>>> + x86_intr_pred_reg_mask = mask;
>>>>>> + x86_intr_pred_updated = true;
>>>>>> + } else {
>>>>>> + x86_user_pred_reg_mask = mask;
>>>>>> + x86_user_pred_updated = true;
>>>>>> + }
>>>>>> +
>>>>>> + return mask;
>>>>>> +}
>>>>> This feels repetitive with __arch__simd_reg_mask, could they be
>>>>> refactored together?
>>>> hmm, it looks like we can extract the for loop into a common function.
>>>> The other parts are hard to generalize since they manipulate different
>>>> variables. If we want to generalize them, we have to introduce lots of
>>>> "if ... else" branches, and that would make the code hard to read.
>>>>
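>>>> Maybe like this (untested sketch), passing the per-type state in:
>>>>
>>>> static u64 __probe_reg_masks(u64 sample_type, const struct sample_reg *regs,
>>>> 			     int max_regs, u64 *masks, u16 *qwords,
>>>> 			     bool (*probe)(u64 sample_type, int reg,
>>>> 					   uint64_t *mask, u16 *qwords))
>>>> {
>>>> 	const struct sample_reg *r;
>>>> 	u64 mask = 0;
>>>> 	int reg;
>>>>
>>>> 	for (r = regs; r->name; r++) {
>>>> 		if (!r->mask)
>>>> 			continue;
>>>> 		reg = fls64(r->mask) - 1;
>>>> 		if (reg >= max_regs)
>>>> 			break;
>>>> 		if (probe(sample_type, reg, &masks[reg], &qwords[reg]))
>>>> 			mask |= BIT_ULL(reg);
>>>> 	}
>>>> 	return mask;
>>>> }
>>>>
>>>> __arch__simd_reg_mask() then becomes a thin wrapper that passes
>>>> __arch_simd_reg_mask as @probe and the x86_{intr,user}_simd_* arrays.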
>>>>
>>>>>> +
>>>>>> +uint64_t arch__intr_simd_reg_mask(void)
>>>>>> +{
>>>>>> + return __arch__simd_reg_mask(PERF_SAMPLE_REGS_INTR);
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t arch__user_simd_reg_mask(void)
>>>>>> +{
>>>>>> + return __arch__simd_reg_mask(PERF_SAMPLE_REGS_USER);
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t arch__intr_pred_reg_mask(void)
>>>>>> +{
>>>>>> + return __arch__pred_reg_mask(PERF_SAMPLE_REGS_INTR);
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t arch__user_pred_reg_mask(void)
>>>>>> +{
>>>>>> + return __arch__pred_reg_mask(PERF_SAMPLE_REGS_USER);
>>>>>> +}
>>>>>> +
>>>>>> +static uint64_t arch__simd_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
>>>>>> +{
>>>>>> + uint64_t mask = 0;
>>>>>> +
>>>>>> + *qwords = 0;
>>>>>> + if (reg < PERF_REG_X86_MAX_SIMD_REGS) {
>>>>>> + if (intr) {
>>>>>> + *qwords = x86_intr_simd_qwords[reg];
>>>>>> + mask = x86_intr_simd_mask[reg];
>>>>>> + } else {
>>>>>> + *qwords = x86_user_simd_qwords[reg];
>>>>>> + mask = x86_user_simd_mask[reg];
>>>>>> + }
>>>>>> + }
>>>>>> +
>>>>>> + return mask;
>>>>>> +}
>>>>>> +
>>>>>> +static uint64_t arch__pred_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
>>>>>> +{
>>>>>> + uint64_t mask = 0;
>>>>>> +
>>>>>> + *qwords = 0;
>>>>>> + if (reg < PERF_REG_X86_MAX_PRED_REGS) {
>>>>>> + if (intr) {
>>>>>> + *qwords = x86_intr_pred_qwords[reg];
>>>>>> + mask = x86_intr_pred_mask[reg];
>>>>>> + } else {
>>>>>> + *qwords = x86_user_pred_qwords[reg];
>>>>>> + mask = x86_user_pred_mask[reg];
>>>>>> + }
>>>>>> + }
>>>>>> +
>>>>>> + return mask;
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>> +{
>>>>>> + if (!x86_intr_simd_updated)
>>>>>> + arch__intr_simd_reg_mask();
>>>>>> + return arch__simd_reg_bitmap_qwords(reg, qwords, true);
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>> +{
>>>>>> + if (!x86_user_simd_updated)
>>>>>> + arch__user_simd_reg_mask();
>>>>>> + return arch__simd_reg_bitmap_qwords(reg, qwords, false);
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>> +{
>>>>>> + if (!x86_intr_pred_updated)
>>>>>> + arch__intr_pred_reg_mask();
>>>>>> + return arch__pred_reg_bitmap_qwords(reg, qwords, true);
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>> +{
>>>>>> + if (!x86_user_pred_updated)
>>>>>> + arch__user_pred_reg_mask();
>>>>>> + return arch__pred_reg_bitmap_qwords(reg, qwords, false);
>>>>>> +}
>>>>>> +
>>>>>> const struct sample_reg *arch__sample_reg_masks(void)
>>>>>> {
>>>>>> + if (has_cap_simd_regs())
>>>>>> + return sample_reg_masks_ext;
>>>>>> return sample_reg_masks;
>>>>>> }
>>>>>>
>>>>>> -uint64_t arch__intr_reg_mask(void)
>>>>>> +static uint64_t __arch__reg_mask(u64 sample_type, u64 mask, bool has_simd_regs)
>>>>>> {
>>>>>> struct perf_event_attr attr = {
>>>>>> - .type = PERF_TYPE_HARDWARE,
>>>>>> - .config = PERF_COUNT_HW_CPU_CYCLES,
>>>>>> - .sample_type = PERF_SAMPLE_REGS_INTR,
>>>>>> - .sample_regs_intr = PERF_REG_EXTENDED_MASK,
>>>>>> - .precise_ip = 1,
>>>>>> - .disabled = 1,
>>>>>> - .exclude_kernel = 1,
>>>>>> + .type = PERF_TYPE_HARDWARE,
>>>>>> + .config = PERF_COUNT_HW_CPU_CYCLES,
>>>>>> + .sample_type = sample_type,
>>>>>> + .precise_ip = 1,
>>>>>> + .disabled = 1,
>>>>>> + .exclude_kernel = 1,
>>>>>> + .sample_simd_regs_enabled = has_simd_regs,
>>>>>> };
>>>>>> int fd;
>>>>>> /*
>>>>>> * In an unnamed union, init it here to build on older gcc versions
>>>>>> */
>>>>>> attr.sample_period = 1;
>>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>> + attr.sample_regs_intr = mask;
>>>>>> + else
>>>>>> + attr.sample_regs_user = mask;
>>>>>>
>>>>>> if (perf_pmus__num_core_pmus() > 1) {
>>>>>> struct perf_pmu *pmu = NULL;
>>>>>> @@ -318,13 +738,41 @@ uint64_t arch__intr_reg_mask(void)
>>>>>> fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
>>>>>> if (fd != -1) {
>>>>>> close(fd);
>>>>>> - return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
>>>>>> + return mask;
>>>>>> }
>>>>>>
>>>>>> - return PERF_REGS_MASK;
>>>>>> + return 0;
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t arch__intr_reg_mask(void)
>>>>>> +{
>>>>>> + uint64_t mask = PERF_REGS_MASK;
>>>>>> +
>>>>>> + if (has_cap_simd_regs()) {
>>>>>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
>>>>>> + GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
>>>>>> + true);
>>>>> It's nice to label constant arguments like this with something like:
>>>>> /*has_simd_regs=*/true);
>>>>>
>>>>> Tools like clang-tidy even try to enforce the argument names match the comments.
>>>> Sure.
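>>>>
>>>> Will do, e.g.:
>>>>
>>>> 	mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
>>>> 				 GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
>>>> 				 /*has_simd_regs=*/true);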
>>>>
>>>>
>>>>>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
>>>>>> + BIT_ULL(PERF_REG_X86_SSP),
>>>>>> + true);
>>>>>> + } else {
>>>>>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR, PERF_REG_EXTENDED_MASK, false);
>>>>>> + }
>>>>>> +
>>>>>> + return mask;
>>>>>> }
>>>>>>
>>>>>> uint64_t arch__user_reg_mask(void)
>>>>>> {
>>>>>> - return PERF_REGS_MASK;
>>>>>> + uint64_t mask = PERF_REGS_MASK;
>>>>>> +
>>>>>> + if (has_cap_simd_regs()) {
>>>>>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
>>>>>> + GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
>>>>>> + true);
>>>>>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
>>>>>> + BIT_ULL(PERF_REG_X86_SSP),
>>>>>> + true);
>>>>>> + }
>>>>>> +
>>>>>> + return mask;
>>>>> The code is repetitive here, could we refactor into a single function
>>>>> passing in a user or instr value?
>>>> Sure. I'll extract the common part.
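>>>>
>>>> Something like this (untested sketch; the PERF_REG_EXTENDED_MASK
>>>> fallback only exists for the intr case, so it would stay in
>>>> arch__intr_reg_mask()):
>>>>
>>>> static uint64_t __arch__ext_reg_mask(u64 sample_type)
>>>> {
>>>> 	uint64_t mask = 0;
>>>>
>>>> 	if (has_cap_simd_regs()) {
>>>> 		mask |= __arch__reg_mask(sample_type,
>>>> 					 GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
>>>> 					 /*has_simd_regs=*/true);
>>>> 		mask |= __arch__reg_mask(sample_type,
>>>> 					 BIT_ULL(PERF_REG_X86_SSP),
>>>> 					 /*has_simd_regs=*/true);
>>>> 	}
>>>> 	return mask;
>>>> }
>>>>
>>>> uint64_t arch__user_reg_mask(void)
>>>> {
>>>> 	return PERF_REGS_MASK | __arch__ext_reg_mask(PERF_SAMPLE_REGS_USER);
>>>> }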
>>>>
>>>>
>>>>>> }
>>>>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>>>>>> index 56ebefd075f2..5d1d90cf9488 100644
>>>>>> --- a/tools/perf/util/evsel.c
>>>>>> +++ b/tools/perf/util/evsel.c
>>>>>> @@ -1461,12 +1461,39 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
>>>>>> if (opts->sample_intr_regs && !evsel->no_aux_samples &&
>>>>>> !evsel__is_dummy_event(evsel)) {
>>>>>> attr->sample_regs_intr = opts->sample_intr_regs;
>>>>>> + attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_intr);
>>>>>> + evsel__set_sample_bit(evsel, REGS_INTR);
>>>>>> + }
>>>>>> +
>>>>>> + if ((opts->sample_intr_vec_regs || opts->sample_intr_pred_regs) &&
>>>>>> + !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
>>>>>> + /* A non-zero pred qwords implies the set of SIMD registers is used */
>>>>>> + if (opts->sample_pred_regs_qwords)
>>>>>> + attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
>>>>>> + else
>>>>>> + attr->sample_simd_pred_reg_qwords = 1;
>>>>>> + attr->sample_simd_vec_reg_intr = opts->sample_intr_vec_regs;
>>>>>> + attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
>>>>>> + attr->sample_simd_pred_reg_intr = opts->sample_intr_pred_regs;
>>>>>> evsel__set_sample_bit(evsel, REGS_INTR);
>>>>>> }
>>>>>>
>>>>>> if (opts->sample_user_regs && !evsel->no_aux_samples &&
>>>>>> !evsel__is_dummy_event(evsel)) {
>>>>>> attr->sample_regs_user |= opts->sample_user_regs;
>>>>>> + attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_user);
>>>>>> + evsel__set_sample_bit(evsel, REGS_USER);
>>>>>> + }
>>>>>> +
>>>>>> + if ((opts->sample_user_vec_regs || opts->sample_user_pred_regs) &&
>>>>>> + !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
>>>>>> + if (opts->sample_pred_regs_qwords)
>>>>>> + attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
>>>>>> + else
>>>>>> + attr->sample_simd_pred_reg_qwords = 1;
>>>>>> + attr->sample_simd_vec_reg_user = opts->sample_user_vec_regs;
>>>>>> + attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
>>>>>> + attr->sample_simd_pred_reg_user = opts->sample_user_pred_regs;
>>>>>> evsel__set_sample_bit(evsel, REGS_USER);
>>>>>> }
>>>>>>
>>>>>> diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
>>>>>> index cda1c620968e..0bd100392889 100644
>>>>>> --- a/tools/perf/util/parse-regs-options.c
>>>>>> +++ b/tools/perf/util/parse-regs-options.c
>>>>>> @@ -4,19 +4,139 @@
>>>>>> #include <stdint.h>
>>>>>> #include <string.h>
>>>>>> #include <stdio.h>
>>>>>> +#include <linux/bitops.h>
>>>>>> #include "util/debug.h"
>>>>>> #include <subcmd/parse-options.h>
>>>>>> #include "util/perf_regs.h"
>>>>>> #include "util/parse-regs-options.h"
>>>>>> +#include "record.h"
>>>>>> +
>>>>>> +static void __print_simd_regs(bool intr, uint64_t simd_mask)
>>>>>> +{
>>>>>> + const struct sample_reg *r = NULL;
>>>>>> + uint64_t bitmap = 0;
>>>>>> + u16 qwords = 0;
>>>>>> + int reg_idx;
>>>>>> +
>>>>>> + if (!simd_mask)
>>>>>> + return;
>>>>>> +
>>>>>> + for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>>>>>> + if (!(r->mask & simd_mask))
>>>>>> + continue;
>>>>>> + reg_idx = fls64(r->mask) - 1;
>>>>>> + if (intr)
>>>>>> + bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>> + else
>>>>>> + bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>> + if (bitmap)
>>>>>> + fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
>>>>>> + }
>>>>>> +}
>>>>>> +
>>>>>> +static void __print_pred_regs(bool intr, uint64_t pred_mask)
>>>>>> +{
>>>>>> + const struct sample_reg *r = NULL;
>>>>>> + uint64_t bitmap = 0;
>>>>>> + u16 qwords = 0;
>>>>>> + int reg_idx;
>>>>>> +
>>>>>> + if (!pred_mask)
>>>>>> + return;
>>>>>> +
>>>>>> + for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>>>>>> + if (!(r->mask & pred_mask))
>>>>>> + continue;
>>>>>> + reg_idx = fls64(r->mask) - 1;
>>>>>> + if (intr)
>>>>>> + bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>> + else
>>>>>> + bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>> + if (bitmap)
>>>>>> + fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
>>>>>> + }
>>>>>> +}
>>>>>> +
>>>>>> +static bool __parse_simd_regs(struct record_opts *opts, char *s, bool intr)
>>>>>> +{
>>>>>> + const struct sample_reg *r = NULL;
>>>>>> + bool matched = false;
>>>>>> + uint64_t bitmap = 0;
>>>>>> + u16 qwords = 0;
>>>>>> + int reg_idx;
>>>>>> +
>>>>>> + for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>>>>>> + if (strcasecmp(s, r->name))
>>>>>> + continue;
>>>>>> + if (!fls64(r->mask))
>>>>>> + continue;
>>>>>> + reg_idx = fls64(r->mask) - 1;
>>>>>> + if (intr)
>>>>>> + bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>> + else
>>>>>> + bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>> + matched = true;
>>>>>> + break;
>>>>>> + }
>>>>>> +
>>>>>> + /* Just need the highest qwords */
>>>>> I'm not following here. Does the bitmap need to handle gaps?
>>>> Currently no. In theory, the kernel allows user space to sample only a
>>>> subset of SIMD registers, e.g., 0xff or 0xf0f for the XMM registers (HW
>>>> supports 16 XMM registers on x86_64), but the perf tool doesn't support
>>>> that, to avoid introducing too much complexity. Moreover, I don't think
>>>> end users have such a requirement. In most cases, users only know which
>>>> kinds of SIMD registers their programs use; they usually don't know or
>>>> care exactly which SIMD register is used, so each register type is only
>>>> sampled collectively.
>>>>
>>>>
>>>>>> + if (qwords > opts->sample_vec_regs_qwords) {
>>>>>> + opts->sample_vec_regs_qwords = qwords;
>>>>>> + if (intr)
>>>>>> + opts->sample_intr_vec_regs = bitmap;
>>>>>> + else
>>>>>> + opts->sample_user_vec_regs = bitmap;
>>>>>> + }
>>>>>> +
>>>>>> + return matched;
>>>>>> +}
>>>>>> +
>>>>>> +static bool __parse_pred_regs(struct record_opts *opts, char *s, bool intr)
>>>>>> +{
>>>>>> + const struct sample_reg *r = NULL;
>>>>>> + bool matched = false;
>>>>>> + uint64_t bitmap = 0;
>>>>>> + u16 qwords = 0;
>>>>>> + int reg_idx;
>>>>>> +
>>>>>> + for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>>>>>> + if (strcasecmp(s, r->name))
>>>>>> + continue;
>>>>>> + if (!fls64(r->mask))
>>>>>> + continue;
>>>>>> + reg_idx = fls64(r->mask) - 1;
>>>>>> + if (intr)
>>>>>> + bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>> + else
>>>>>> + bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>> + matched = true;
>>>>>> + break;
>>>>>> + }
>>>>>> +
>>>>>> + /* Just need the highest qwords */
>>>>> Again repetitive, could we have a single function?
>>>> Yes, at least the for loop can be extracted into a common function.
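>>>>
>>>> Rough sketch of the common lookup (untested):
>>>>
>>>> static bool __lookup_reg(const struct sample_reg *regs, const char *s,
>>>> 			 uint64_t (*bitmap_qwords)(int reg, u16 *qwords),
>>>> 			 uint64_t *bitmap, u16 *qwords)
>>>> {
>>>> 	const struct sample_reg *r;
>>>>
>>>> 	for (r = regs; r->name; r++) {
>>>> 		if (strcasecmp(s, r->name) || !fls64(r->mask))
>>>> 			continue;
>>>> 		*bitmap = bitmap_qwords(fls64(r->mask) - 1, qwords);
>>>> 		return true;
>>>> 	}
>>>> 	return false;
>>>> }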
>>>>
>>>>
>>>>>> + if (qwords > opts->sample_pred_regs_qwords) {
>>>>>> + opts->sample_pred_regs_qwords = qwords;
>>>>>> + if (intr)
>>>>>> + opts->sample_intr_pred_regs = bitmap;
>>>>>> + else
>>>>>> + opts->sample_user_pred_regs = bitmap;
>>>>>> + }
>>>>>> +
>>>>>> + return matched;
>>>>>> +}
>>>>>>
>>>>>> static int
>>>>>> __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>> {
>>>>>> uint64_t *mode = (uint64_t *)opt->value;
>>>>>> const struct sample_reg *r = NULL;
>>>>>> + struct record_opts *opts;
>>>>>> char *s, *os = NULL, *p;
>>>>>> - int ret = -1;
>>>>>> + bool has_simd_regs = false;
>>>>>> uint64_t mask;
>>>>>> + uint64_t simd_mask;
>>>>>> + uint64_t pred_mask;
>>>>>> + int ret = -1;
>>>>>>
>>>>>> if (unset)
>>>>>> return 0;
>>>>>> @@ -27,10 +147,17 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>> if (*mode)
>>>>>> return -1;
>>>>>>
>>>>>> - if (intr)
>>>>>> + if (intr) {
>>>>>> + opts = container_of(opt->value, struct record_opts, sample_intr_regs);
>>>>>> mask = arch__intr_reg_mask();
>>>>>> - else
>>>>>> + simd_mask = arch__intr_simd_reg_mask();
>>>>>> + pred_mask = arch__intr_pred_reg_mask();
>>>>>> + } else {
>>>>>> + opts = container_of(opt->value, struct record_opts, sample_user_regs);
>>>>>> mask = arch__user_reg_mask();
>>>>>> + simd_mask = arch__user_simd_reg_mask();
>>>>>> + pred_mask = arch__user_pred_reg_mask();
>>>>>> + }
>>>>>>
>>>>>> /* str may be NULL in case no arg is passed to -I */
>>>>>> if (str) {
>>>>>> @@ -50,10 +177,24 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>> if (r->mask & mask)
>>>>>> fprintf(stderr, "%s ", r->name);
>>>>>> }
>>>>>> + __print_simd_regs(intr, simd_mask);
>>>>>> + __print_pred_regs(intr, pred_mask);
>>>>>> fputc('\n', stderr);
>>>>>> /* just printing available regs */
>>>>>> goto error;
>>>>>> }
>>>>>> +
>>>>>> + if (simd_mask) {
>>>>>> + has_simd_regs = __parse_simd_regs(opts, s, intr);
>>>>>> + if (has_simd_regs)
>>>>>> + goto next;
>>>>>> + }
>>>>>> + if (pred_mask) {
>>>>>> + has_simd_regs = __parse_pred_regs(opts, s, intr);
>>>>>> + if (has_simd_regs)
>>>>>> + goto next;
>>>>>> + }
>>>>>> +
>>>>>> for (r = arch__sample_reg_masks(); r->name; r++) {
>>>>>> if ((r->mask & mask) && !strcasecmp(s, r->name))
>>>>>> break;
>>>>>> @@ -65,7 +206,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>> }
>>>>>>
>>>>>> *mode |= r->mask;
>>>>>> -
>>>>>> +next:
>>>>>> if (!p)
>>>>>> break;
>>>>>>
>>>>>> @@ -75,7 +216,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>> ret = 0;
>>>>>>
>>>>>> /* default to all possible regs */
>>>>>> - if (*mode == 0)
>>>>>> + if (*mode == 0 && !has_simd_regs)
>>>>>> *mode = mask;
>>>>>> error:
>>>>>> free(os);
>>>>>> diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
>>>>>> index 66b666d9ce64..fb0366d050cf 100644
>>>>>> --- a/tools/perf/util/perf_event_attr_fprintf.c
>>>>>> +++ b/tools/perf/util/perf_event_attr_fprintf.c
>>>>>> @@ -360,6 +360,12 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
>>>>>> PRINT_ATTRf(aux_start_paused, p_unsigned);
>>>>>> PRINT_ATTRf(aux_pause, p_unsigned);
>>>>>> PRINT_ATTRf(aux_resume, p_unsigned);
>>>>>> + PRINT_ATTRf(sample_simd_pred_reg_qwords, p_unsigned);
>>>>>> + PRINT_ATTRf(sample_simd_pred_reg_intr, p_hex);
>>>>>> + PRINT_ATTRf(sample_simd_pred_reg_user, p_hex);
>>>>>> + PRINT_ATTRf(sample_simd_vec_reg_qwords, p_unsigned);
>>>>>> + PRINT_ATTRf(sample_simd_vec_reg_intr, p_hex);
>>>>>> + PRINT_ATTRf(sample_simd_vec_reg_user, p_hex);
>>>>>>
>>>>>> return ret;
>>>>>> }
>>>>>> diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
>>>>>> index 44b90bbf2d07..e8a9fabc92e6 100644
>>>>>> --- a/tools/perf/util/perf_regs.c
>>>>>> +++ b/tools/perf/util/perf_regs.c
>>>>>> @@ -11,6 +11,11 @@ int __weak arch_sdt_arg_parse_op(char *old_op __maybe_unused,
>>>>>> return SDT_ARG_SKIP;
>>>>>> }
>>>>>>
>>>>>> +bool __weak arch_has_simd_regs(u64 mask __maybe_unused)
>>>>>> +{
>>>>>> + return false;
>>>>>> +}
>>>>>> +
>>>>>> uint64_t __weak arch__intr_reg_mask(void)
>>>>>> {
>>>>>> return 0;
>>>>>> @@ -21,6 +26,50 @@ uint64_t __weak arch__user_reg_mask(void)
>>>>>> return 0;
>>>>>> }
>>>>>>
>>>>>> +uint64_t __weak arch__intr_simd_reg_mask(void)
>>>>>> +{
>>>>>> + return 0;
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t __weak arch__user_simd_reg_mask(void)
>>>>>> +{
>>>>>> + return 0;
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t __weak arch__intr_pred_reg_mask(void)
>>>>>> +{
>>>>>> + return 0;
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t __weak arch__user_pred_reg_mask(void)
>>>>>> +{
>>>>>> + return 0;
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t __weak arch__intr_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>>>>>> +{
>>>>>> + *qwords = 0;
>>>>>> + return 0;
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t __weak arch__user_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>>>>>> +{
>>>>>> + *qwords = 0;
>>>>>> + return 0;
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t __weak arch__intr_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>>>>>> +{
>>>>>> + *qwords = 0;
>>>>>> + return 0;
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t __weak arch__user_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>>>>>> +{
>>>>>> + *qwords = 0;
>>>>>> + return 0;
>>>>>> +}
>>>>>> +
>>>>>> static const struct sample_reg sample_reg_masks[] = {
>>>>>> SMPL_REG_END
>>>>>> };
>>>>>> @@ -30,6 +79,16 @@ const struct sample_reg * __weak arch__sample_reg_masks(void)
>>>>>> return sample_reg_masks;
>>>>>> }
>>>>>>
>>>>>> +const struct sample_reg * __weak arch__sample_simd_reg_masks(void)
>>>>>> +{
>>>>>> + return sample_reg_masks;
>>>>>> +}
>>>>>> +
>>>>>> +const struct sample_reg * __weak arch__sample_pred_reg_masks(void)
>>>>>> +{
>>>>>> + return sample_reg_masks;
>>>>>> +}
>>>>> Thinking out loud. I wonder if there is a way to hide the weak
>>>>> functions. It seems the support is tied to PMUs, particularly core
>>>>> PMUs, perhaps we can push things into pmu and arch pmu code. Then we
>>>>> ask the PMU to parse the register strings, set up the perf_event_attr,
>>>>> etc. I'm somewhat scared these functions will be used on the report
>>>>> rather than record side of things, thereby breaking perf.data support
>>>>> when the host kernel does or doesn't have the SIMD support.
>>>> Ian, I don't quite follow you here.
>>>>
>>>> I don't quite understand what "push things into pmu and arch pmu code"
>>>> would look like. The current SIMD register support follows the same
>>>> approach as the general register support. If we intend to change that
>>>> approach entirely, we'd better do it in an independent patch-set.
>>>>
>>>> Why would these functions break perf.data reporting? perf-report checks
>>>> whether the PERF_SAMPLE_REGS_ABI_SIMD flag is set for each record; only
>>>> when the flag is set (indicating SIMD register data is appended to the
>>>> record) does perf-report try to parse the SIMD register data.
>>> Thanks Dapeng, sorry I wasn't clear. So, I've landed cleanups to
>>> remove weak symbols like:
>>> https://lore.kernel.org/lkml/20250724163302.596743-21-irogers@google.com/#t
>>>
>>> For these patches what I'm imagining is that there is a Nova Lake
>>> generated perf.data file. Using perf report, script, etc. on the Nova
>>> Lake should expose all of the same mask, qword, etc. values as when
>>> the perf.data was generated, and so things will work. If the perf.data
>>> file were taken to, say, my Alderlake, then what would happen? Generally,
>>> using the arch directory and weak symbols is a code smell indicating that
>>> cross-platform things are going to break - there should be sufficient data
>>> in the event and the perf_event_attr to fully decode what's going on.
>>> Sometimes tying things to a PMU name can avoid the use of the arch
>>> directory. We were able to avoid the arch directory to a good extent
>>> for the TPEBS code, even though it is a very modern Intel feature.
>> I see.
>>
>> But the sampling support for SIMD registers is different from the sample
>> weight processing in the patch
>> https://lore.kernel.org/lkml/20250724163302.596743-21-irogers@google.com/#t.
>> Each arch may support different kinds of SIMD registers, and furthermore
>> each kind of SIMD register may have a different register count and register
>> width. It's quite hard to come up with common functions or fields to
>> represent the names and attributes of these arch-specific SIMD registers.
>> This arch-specific information can only come from the arch-specific code,
>> so the __weak functions still look like the easiest way to implement this.
>>
>> I don't think perf.data parsing would break when moving from one platform
>> to a different platform of the same arch, e.g., from Nova Lake to Alder
>> Lake. To indicate the presence of SIMD registers in record data, a new ABI
>> flag "PERF_SAMPLE_REGS_ABI_SIMD" is introduced. If the perf tool on the
>> second platform is new enough to recognize this new flag, the SIMD
>> register data is parsed correctly. Even if the perf tool is old and has no
>> SIMD register support, the SIMD register data is just silently ignored and
>> should not break the parsing.
> That's good to know. I'm confused then why these functions can't just
> be within the arch directory? For example, we don't expose the
> intel-pt PMU code in the common code except for the parsing parts. A
> lot of that is handled by the default perf_event_attr initialization
> that every PMU can have its own variant of:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/pmu.h?h=perf-tools-next#n123
I see. From my point of view, there seems to be no essential difference
between a function pointer and a __weak function, and it looks hard to find
a common data structure to hold all these function pointers, which need to
be called in different places, like register name parsing, register data
dumping, and so on.
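
For illustration only (the names below are hypothetical, not from this
series), a function-pointer variant would look roughly like:

```
#include <stdbool.h>
#include <stdint.h>
#include <linux/types.h>	/* u16, as used by the tools headers */

struct sample_reg;	/* existing type from tools/perf/util/perf_regs.h */

/* Sketch: one table per arch instead of per-function __weak symbols. */
struct arch_ext_reg_ops {
	uint64_t (*intr_simd_reg_mask)(void);
	uint64_t (*user_simd_reg_mask)(void);
	uint64_t (*intr_pred_reg_mask)(void);
	uint64_t (*user_pred_reg_mask)(void);
	uint64_t (*simd_reg_bitmap_qwords)(int reg, u16 *qwords, bool intr);
	uint64_t (*pred_reg_bitmap_qwords)(int reg, u16 *qwords, bool intr);
	const struct sample_reg *(*sample_simd_reg_masks)(void);
	const struct sample_reg *(*sample_pred_reg_masks)(void);
};

/* Common code would then make NULL-checked calls, e.g.: */
static inline uint64_t ext_intr_simd_reg_mask(const struct arch_ext_reg_ops *ops)
{
	return (ops && ops->intr_simd_reg_mask) ? ops->intr_simd_reg_mask() : 0;
}
```

Either way the same set of call sites needs the indirection, which is the
point above: the table moves the dispatch around without removing it.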
>
> Perhaps this is all just evidence of tech debt in the perf_regs.c code
> :-/ The bit that's relevant to the patch here is that I think this is
> adding to the tech debt problem as 11 more functions are added to
> perf_regs.h.
Yeah, 11 new __weak functions does seem like too many. We could merge
functions of the same kind, e.g. merge *_simd_reg_mask() and
*_pred_reg_mask() into a single function with a type argument; the number
of newly added __weak functions could then shrink by half.
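
Something along these lines, as a rough sketch (naming is provisional):

```
/* Rough sketch of the merge: a register-class argument replaces the
 * *_simd_*() / *_pred_*() pairs, and a bool covers intr vs. user,
 * shrinking the eight mask/bitmap hooks down to two.
 */
enum ext_reg_type {
	EXT_REG_SIMD,	/* vector registers, e.g. XMM/YMM/ZMM on x86 */
	EXT_REG_PRED,	/* predicate registers, e.g. OPMASK on x86 */
};

uint64_t __weak arch__ext_reg_mask(enum ext_reg_type type __maybe_unused,
				   bool intr __maybe_unused)
{
	return 0;
}

uint64_t __weak arch__ext_reg_bitmap_qwords(enum ext_reg_type type __maybe_unused,
					    int reg __maybe_unused,
					    u16 *qwords, bool intr __maybe_unused)
{
	*qwords = 0;
	return 0;
}
```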
>
> Thanks,
> Ian
>
>>> Thanks,
>>> Ian
>>>
>>>
>>>
>>>>> Thanks,
>>>>> Ian
>>>>>
>>>>>> +
>>>>>> const char *perf_reg_name(int id, const char *arch)
>>>>>> {
>>>>>> const char *reg_name = NULL;
>>>>>> diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
>>>>>> index f2d0736d65cc..bce9c4cfd1bf 100644
>>>>>> --- a/tools/perf/util/perf_regs.h
>>>>>> +++ b/tools/perf/util/perf_regs.h
>>>>>> @@ -24,9 +24,20 @@ enum {
>>>>>> };
>>>>>>
>>>>>> int arch_sdt_arg_parse_op(char *old_op, char **new_op);
>>>>>> +bool arch_has_simd_regs(u64 mask);
>>>>>> uint64_t arch__intr_reg_mask(void);
>>>>>> uint64_t arch__user_reg_mask(void);
>>>>>> const struct sample_reg *arch__sample_reg_masks(void);
>>>>>> +const struct sample_reg *arch__sample_simd_reg_masks(void);
>>>>>> +const struct sample_reg *arch__sample_pred_reg_masks(void);
>>>>>> +uint64_t arch__intr_simd_reg_mask(void);
>>>>>> +uint64_t arch__user_simd_reg_mask(void);
>>>>>> +uint64_t arch__intr_pred_reg_mask(void);
>>>>>> +uint64_t arch__user_pred_reg_mask(void);
>>>>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>
>>>>>> const char *perf_reg_name(int id, const char *arch);
>>>>>> int perf_reg_value(u64 *valp, struct regs_dump *regs, int id);
>>>>>> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
>>>>>> index ea3a6c4657ee..825ffb4cc53f 100644
>>>>>> --- a/tools/perf/util/record.h
>>>>>> +++ b/tools/perf/util/record.h
>>>>>> @@ -59,7 +59,13 @@ struct record_opts {
>>>>>> unsigned int user_freq;
>>>>>> u64 branch_stack;
>>>>>> u64 sample_intr_regs;
>>>>>> + u64 sample_intr_vec_regs;
>>>>>> u64 sample_user_regs;
>>>>>> + u64 sample_user_vec_regs;
>>>>>> + u16 sample_pred_regs_qwords;
>>>>>> + u16 sample_vec_regs_qwords;
>>>>>> + u16 sample_intr_pred_regs;
>>>>>> + u16 sample_user_pred_regs;
>>>>>> u64 default_interval;
>>>>>> u64 user_interval;
>>>>>> size_t auxtrace_snapshot_size;
>>>>>> --
>>>>>> 2.34.1
>>>>>>
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [Patch v5 06/19] perf/x86: Add support for XMM registers in non-PEBS and REGS_USER
2025-12-04 15:47 ` Peter Zijlstra
@ 2025-12-05 6:37 ` Mi, Dapeng
0 siblings, 0 replies; 55+ messages in thread
From: Mi, Dapeng @ 2025-12-05 6:37 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang
On 12/4/2025 11:47 PM, Peter Zijlstra wrote:
> On Thu, Dec 04, 2025 at 04:17:35PM +0100, Peter Zijlstra wrote:
>> On Wed, Dec 03, 2025 at 02:54:47PM +0800, Dapeng Mi wrote:
>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>
>>> While collecting XMM registers in a PEBS record has been supported since
>>> Icelake, non-PEBS events have lacked this feature. By leveraging the
>>> xsaves instruction, it is now possible to snapshot XMM registers for
>>> non-PEBS events, completing the feature set.
>>>
>>> To utilize the xsaves instruction, a 64-byte aligned buffer is required.
>>> A per-CPU ext_regs_buf is added to store SIMD and other registers, with
>>> the buffer size being approximately 2K. The buffer is allocated using
>>> kzalloc_node(); since kmalloc() allocations with power-of-2 sizes are
>>> naturally aligned, this guarantees the required 64-byte alignment.
>>>
>>> The XMM sampling support is extended for both REGS_USER and REGS_INTR.
>>> For REGS_USER, perf_get_regs_user() returns the registers from
>>> task_pt_regs(current), which is a pt_regs structure. It needs to be
>>> copied to a user-space-specific x86_user_regs structure since the kernel
>>> may modify the pt_regs structure later.
>>>
>>> For PEBS, XMM registers are retrieved from PEBS records.
>>>
>>> In cases where userspace tasks are trapped within kernel mode (e.g.,
>>> during a syscall) when an NMI arrives, pt_regs information can still be
>>> retrieved from task_pt_regs(). However, capturing SIMD and other
>>> xsave-based registers in this scenario is challenging. Therefore,
>>> snapshots for these registers are omitted in such cases.
>>>
>>> The reasons are:
>>> - Profiling a userspace task that requires SIMD/eGPR registers typically
>>> involves NMIs hitting userspace, not kernel mode.
>>> - Although it is possible to retrieve values when the TIF_NEED_FPU_LOAD
>>> flag is set, the complexity introduced to handle this uncommon case in
>>> the critical path is not justified.
>>> - Additionally, checking the TIF_NEED_FPU_LOAD flag alone is insufficient.
>>> Some corner cases, such as an NMI occurring just after the flag switches
>>> but still in kernel mode, cannot be handled.
>> Urgh.. Dave, Thomas, is there any reason we could not set
>> TIF_NEED_FPU_LOAD *after* doing the XSAVE (clearing is already done
>> after restore).
>>
>> That way, when an NMI sees TIF_NEED_FPU_LOAD it knows the task copy is
>> consistent.
>>
>> I'm not at all sure this is complex, it just needs a little care.
>>
>> And then there is the deferred thing, just like unwind, we can defer
>> REGS_USER/STACK_USER much the same, except someone went and built all
>> that deferred stuff with unwind all tangled into it :/
> With something like the below, the NMI could do something like:
>
> struct xregs_state *xr = NULL;
>
> /*
> * fpu code does:
> * XSAVE
> * set_thread_flag(TIF_NEED_FPU_LOAD)
> * ...
> * XRSTOR
> * clear_thread_flag(TIF_NEED_FPU_LOAD)
> * therefore, when TIF_NEED_FPU_LOAD, the task fpu state holds a
> * whole copy.
> */
> if (test_thread_flag(TIF_NEED_FPU_LOAD)) {
> struct fpu *fpu = x86_task_fpu(current);
> /*
> * If __task_fpstate is set, it holds the right pointer,
> * otherwise fpstate will.
> */
> struct fpstate *fps = READ_ONCE(fpu->__task_fpstate);
> if (!fps)
> fps = fpu->fpstate;
> xr = &fps->regs.xregs_state;
> } else {
> /* like fpu_sync_fpstate(), except NMI local */
> xsave_nmi(xr, mask);
> }
>
> // frob xr into perf data
>
> Or did I miss something? I've not looked at this very long and the above
> was very vague on the actual issues.
>
>
> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index da233f20ae6f..0f91a0d7e799 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -359,18 +359,22 @@ int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest)
> struct fpstate *cur_fps = fpu->fpstate;
>
> fpregs_lock();
> - if (!cur_fps->is_confidential && !test_thread_flag(TIF_NEED_FPU_LOAD))
> + if (!cur_fps->is_confidential && !test_thread_flag(TIF_NEED_FPU_LOAD)) {
> save_fpregs_to_fpstate(fpu);
> + set_thread_flag(TIF_NEED_FPU_LOAD);
> + }
>
> /* Swap fpstate */
> if (enter_guest) {
> - fpu->__task_fpstate = cur_fps;
> + WRITE_ONCE(fpu->__task_fpstate, cur_fps);
> + barrier();
> fpu->fpstate = guest_fps;
> guest_fps->in_use = true;
> } else {
> guest_fps->in_use = false;
> fpu->fpstate = fpu->__task_fpstate;
> - fpu->__task_fpstate = NULL;
> + barrier();
> + WRITE_ONCE(fpu->__task_fpstate, NULL);
> }
>
> cur_fps = fpu->fpstate;
> @@ -456,8 +460,8 @@ void kernel_fpu_begin_mask(unsigned int kfpu_mask)
>
> if (!(current->flags & (PF_KTHREAD | PF_USER_WORKER)) &&
> !test_thread_flag(TIF_NEED_FPU_LOAD)) {
> - set_thread_flag(TIF_NEED_FPU_LOAD);
> save_fpregs_to_fpstate(x86_task_fpu(current));
> + set_thread_flag(TIF_NEED_FPU_LOAD);
> }
> __cpu_invalidate_fpregs_state();
>
OK, I will fold these changes into the next version and support SIMD
register sampling for the user-space (REGS_USER) case as well.
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
2025-12-05 4:00 ` Mi, Dapeng
@ 2025-12-05 6:38 ` Ian Rogers
2025-12-05 8:10 ` Mi, Dapeng
0 siblings, 1 reply; 55+ messages in thread
From: Ian Rogers @ 2025-12-05 6:38 UTC (permalink / raw)
To: Mi, Dapeng
Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang
On Thu, Dec 4, 2025 at 8:00 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>
>
> On 12/5/2025 12:16 AM, Ian Rogers wrote:
> > On Thu, Dec 4, 2025 at 1:20 AM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> >>
> >> On 12/4/2025 3:49 PM, Ian Rogers wrote:
> >>> On Wed, Dec 3, 2025 at 6:58 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> >>>> On 12/4/2025 8:17 AM, Ian Rogers wrote:
> >>>>> On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
> >>>>>> From: Kan Liang <kan.liang@linux.intel.com>
> >>>>>>
> >>>>>> This patch adds support for the newly introduced SIMD register sampling
> >>>>>> format by adding the following functions:
> >>>>>>
> >>>>>> uint64_t arch__intr_simd_reg_mask(void);
> >>>>>> uint64_t arch__user_simd_reg_mask(void);
> >>>>>> uint64_t arch__intr_pred_reg_mask(void);
> >>>>>> uint64_t arch__user_pred_reg_mask(void);
> >>>>>> uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>> uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>> uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>> uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>>
> >>>>>> The arch__{intr|user}_simd_reg_mask() functions retrieve the bitmap of
> >>>>>> supported SIMD registers, such as XMM/YMM/ZMM on x86 platforms.
> >>>>>>
> >>>>>> The arch__{intr|user}_pred_reg_mask() functions retrieve the bitmap of
> >>>>>> supported PRED registers, such as OPMASK on x86 platforms.
> >>>>>>
> >>>>>> The arch__{intr|user}_simd_reg_bitmap_qwords() functions provide the
> >>>>>> exact bitmap and number of qwords for a specific type of SIMD register.
> >>>>>> For example, for XMM registers on x86 platforms, the returned bitmap is
> >>>>>> 0xffff (XMM0 ~ XMM15) and the number of qwords is 2 (128 bits per XMM).
> >>>>>>
> >>>>>> The arch__{intr|user}_pred_reg_bitmap_qwords() functions provide the
> >>>>>> exact bitmap and number of qwords for a specific type of PRED register.
> >>>>>> For example, for OPMASK registers on x86 platforms, the returned bitmap
> >>>>>> is 0xff (OPMASK0 ~ OPMASK7) and the qwords number is 1 (64 bits for each
> >>>>>> OPMASK).
> >>>>>>
> >>>>>> Additionally, the function __parse_regs() is enhanced to support parsing
> >>>>>> these newly introduced SIMD registers. Currently, each type of register
> >>>>>> can only be sampled collectively; sampling a specific SIMD register is
> >>>>>> not supported. For example, all XMM registers are sampled together rather
> >>>>>> than sampling only XMM0.
> >>>>>>
> >>>>>> When multiple overlapping register types, such as XMM and YMM, are
> >>>>>> sampled simultaneously, only the superset (YMM registers) is sampled.
> >>>>>>
> >>>>>> With this patch, all supported sampling registers on x86 platforms are
> >>>>>> displayed as follows.
> >>>>>>
> >>>>>> $perf record -I?
> >>>>>> available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
> >>>>>> R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
> >>>>>> R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
> >>>>>>
> >>>>>> $perf record --user-regs=?
> >>>>>> available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
> >>>>>> R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
> >>>>>> R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
> >>>>>>
> >>>>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> >>>>>> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >>>>>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >>>>>> ---
> >>>>>> tools/perf/arch/x86/util/perf_regs.c | 470 +++++++++++++++++++++-
> >>>>>> tools/perf/util/evsel.c | 27 ++
> >>>>>> tools/perf/util/parse-regs-options.c | 151 ++++++-
> >>>>>> tools/perf/util/perf_event_attr_fprintf.c | 6 +
> >>>>>> tools/perf/util/perf_regs.c | 59 +++
> >>>>>> tools/perf/util/perf_regs.h | 11 +
> >>>>>> tools/perf/util/record.h | 6 +
> >>>>>> 7 files changed, 714 insertions(+), 16 deletions(-)
> >>>>>>
> >>>>>> diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
> >>>>>> index 12fd93f04802..db41430f3b07 100644
> >>>>>> --- a/tools/perf/arch/x86/util/perf_regs.c
> >>>>>> +++ b/tools/perf/arch/x86/util/perf_regs.c
> >>>>>> @@ -13,6 +13,49 @@
> >>>>>> #include "../../../util/pmu.h"
> >>>>>> #include "../../../util/pmus.h"
> >>>>>>
> >>>>>> +static const struct sample_reg sample_reg_masks_ext[] = {
> >>>>>> + SMPL_REG(AX, PERF_REG_X86_AX),
> >>>>>> + SMPL_REG(BX, PERF_REG_X86_BX),
> >>>>>> + SMPL_REG(CX, PERF_REG_X86_CX),
> >>>>>> + SMPL_REG(DX, PERF_REG_X86_DX),
> >>>>>> + SMPL_REG(SI, PERF_REG_X86_SI),
> >>>>>> + SMPL_REG(DI, PERF_REG_X86_DI),
> >>>>>> + SMPL_REG(BP, PERF_REG_X86_BP),
> >>>>>> + SMPL_REG(SP, PERF_REG_X86_SP),
> >>>>>> + SMPL_REG(IP, PERF_REG_X86_IP),
> >>>>>> + SMPL_REG(FLAGS, PERF_REG_X86_FLAGS),
> >>>>>> + SMPL_REG(CS, PERF_REG_X86_CS),
> >>>>>> + SMPL_REG(SS, PERF_REG_X86_SS),
> >>>>>> +#ifdef HAVE_ARCH_X86_64_SUPPORT
> >>>>>> + SMPL_REG(R8, PERF_REG_X86_R8),
> >>>>>> + SMPL_REG(R9, PERF_REG_X86_R9),
> >>>>>> + SMPL_REG(R10, PERF_REG_X86_R10),
> >>>>>> + SMPL_REG(R11, PERF_REG_X86_R11),
> >>>>>> + SMPL_REG(R12, PERF_REG_X86_R12),
> >>>>>> + SMPL_REG(R13, PERF_REG_X86_R13),
> >>>>>> + SMPL_REG(R14, PERF_REG_X86_R14),
> >>>>>> + SMPL_REG(R15, PERF_REG_X86_R15),
> >>>>>> + SMPL_REG(R16, PERF_REG_X86_R16),
> >>>>>> + SMPL_REG(R17, PERF_REG_X86_R17),
> >>>>>> + SMPL_REG(R18, PERF_REG_X86_R18),
> >>>>>> + SMPL_REG(R19, PERF_REG_X86_R19),
> >>>>>> + SMPL_REG(R20, PERF_REG_X86_R20),
> >>>>>> + SMPL_REG(R21, PERF_REG_X86_R21),
> >>>>>> + SMPL_REG(R22, PERF_REG_X86_R22),
> >>>>>> + SMPL_REG(R23, PERF_REG_X86_R23),
> >>>>>> + SMPL_REG(R24, PERF_REG_X86_R24),
> >>>>>> + SMPL_REG(R25, PERF_REG_X86_R25),
> >>>>>> + SMPL_REG(R26, PERF_REG_X86_R26),
> >>>>>> + SMPL_REG(R27, PERF_REG_X86_R27),
> >>>>>> + SMPL_REG(R28, PERF_REG_X86_R28),
> >>>>>> + SMPL_REG(R29, PERF_REG_X86_R29),
> >>>>>> + SMPL_REG(R30, PERF_REG_X86_R30),
> >>>>>> + SMPL_REG(R31, PERF_REG_X86_R31),
> >>>>>> + SMPL_REG(SSP, PERF_REG_X86_SSP),
> >>>>>> +#endif
> >>>>>> + SMPL_REG_END
> >>>>>> +};
> >>>>>> +
> >>>>>> static const struct sample_reg sample_reg_masks[] = {
> >>>>>> SMPL_REG(AX, PERF_REG_X86_AX),
> >>>>>> SMPL_REG(BX, PERF_REG_X86_BX),
> >>>>>> @@ -276,27 +319,404 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
> >>>>>> return SDT_ARG_VALID;
> >>>>>> }
> >>>>>>
> >>>>>> +static bool support_simd_reg(u64 sample_type, u16 qwords, u64 mask, bool pred)
> >>>>> To make the code easier to read, it'd be nice to document sample_type,
> >>>>> qwords and mask here.
> >>>> Sure.
> >>>>
> >>>>
> >>>>>> +{
> >>>>>> + struct perf_event_attr attr = {
> >>>>>> + .type = PERF_TYPE_HARDWARE,
> >>>>>> + .config = PERF_COUNT_HW_CPU_CYCLES,
> >>>>>> + .sample_type = sample_type,
> >>>>>> + .disabled = 1,
> >>>>>> + .exclude_kernel = 1,
> >>>>>> + .sample_simd_regs_enabled = 1,
> >>>>>> + };
> >>>>>> + int fd;
> >>>>>> +
> >>>>>> + attr.sample_period = 1;
> >>>>>> +
> >>>>>> + if (!pred) {
> >>>>>> + attr.sample_simd_vec_reg_qwords = qwords;
> >>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>>>> + attr.sample_simd_vec_reg_intr = mask;
> >>>>>> + else
> >>>>>> + attr.sample_simd_vec_reg_user = mask;
> >>>>>> + } else {
> >>>>>> + attr.sample_simd_pred_reg_qwords = PERF_X86_OPMASK_QWORDS;
> >>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>>>> + attr.sample_simd_pred_reg_intr = PERF_X86_SIMD_PRED_MASK;
> >>>>>> + else
> >>>>>> + attr.sample_simd_pred_reg_user = PERF_X86_SIMD_PRED_MASK;
> >>>>>> + }
> >>>>>> +
> >>>>>> + if (perf_pmus__num_core_pmus() > 1) {
> >>>>>> + struct perf_pmu *pmu = NULL;
> >>>>>> + __u64 type = PERF_TYPE_RAW;
> >>>>> It should be okay to do:
> >>>>> __u64 type = perf_pmus__find_core_pmu()->type
> >>>>> rather than have the whole loop below.
> >>>> Sure. Thanks.
> >>>>
> >>>>
> >>>>>> +
> >>>>>> + /*
> >>>>>> + * The same register set is supported among different hybrid PMUs.
> >>>>>> + * Only check the first available one.
> >>>>>> + */
> >>>>>> + while ((pmu = perf_pmus__scan_core(pmu)) != NULL) {
> >>>>>> + type = pmu->type;
> >>>>>> + break;
> >>>>>> + }
> >>>>>> + attr.config |= type << PERF_PMU_TYPE_SHIFT;
> >>>>>> + }
> >>>>>> +
> >>>>>> + event_attr_init(&attr);
> >>>>>> +
> >>>>>> + fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
> >>>>>> + if (fd != -1) {
> >>>>>> + close(fd);
> >>>>>> + return true;
> >>>>>> + }
> >>>>>> +
> >>>>>> + return false;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static bool __arch_simd_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
> >>>>>> +{
> >>>>>> + bool supported = false;
> >>>>>> + u64 bits;
> >>>>>> +
> >>>>>> + *mask = 0;
> >>>>>> + *qwords = 0;
> >>>>>> +
> >>>>>> + switch (reg) {
> >>>>>> + case PERF_REG_X86_XMM:
> >>>>>> + bits = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
> >>>>>> + supported = support_simd_reg(sample_type, PERF_X86_XMM_QWORDS, bits, false);
> >>>>>> + if (supported) {
> >>>>>> + *mask = bits;
> >>>>>> + *qwords = PERF_X86_XMM_QWORDS;
> >>>>>> + }
> >>>>>> + break;
> >>>>>> + case PERF_REG_X86_YMM:
> >>>>>> + bits = BIT_ULL(PERF_X86_SIMD_YMM_REGS) - 1;
> >>>>>> + supported = support_simd_reg(sample_type, PERF_X86_YMM_QWORDS, bits, false);
> >>>>>> + if (supported) {
> >>>>>> + *mask = bits;
> >>>>>> + *qwords = PERF_X86_YMM_QWORDS;
> >>>>>> + }
> >>>>>> + break;
> >>>>>> + case PERF_REG_X86_ZMM:
> >>>>>> + bits = BIT_ULL(PERF_X86_SIMD_ZMM_REGS) - 1;
> >>>>>> + supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
> >>>>>> + if (supported) {
> >>>>>> + *mask = bits;
> >>>>>> + *qwords = PERF_X86_ZMM_QWORDS;
> >>>>>> + break;
> >>>>>> + }
> >>>>>> +
> >>>>>> + bits = BIT_ULL(PERF_X86_SIMD_ZMMH_REGS) - 1;
> >>>>>> + supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
> >>>>>> + if (supported) {
> >>>>>> + *mask = bits;
> >>>>>> + *qwords = PERF_X86_ZMMH_QWORDS;
> >>>>>> + }
> >>>>>> + break;
> >>>>>> + default:
> >>>>>> + break;
> >>>>>> + }
> >>>>>> +
> >>>>>> + return supported;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static bool __arch_pred_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
> >>>>>> +{
> >>>>>> + bool supported = false;
> >>>>>> + u64 bits;
> >>>>>> +
> >>>>>> + *mask = 0;
> >>>>>> + *qwords = 0;
> >>>>>> +
> >>>>>> + switch (reg) {
> >>>>>> + case PERF_REG_X86_OPMASK:
> >>>>>> + bits = BIT_ULL(PERF_X86_SIMD_OPMASK_REGS) - 1;
> >>>>>> + supported = support_simd_reg(sample_type, PERF_X86_OPMASK_QWORDS, bits, true);
> >>>>>> + if (supported) {
> >>>>>> + *mask = bits;
> >>>>>> + *qwords = PERF_X86_OPMASK_QWORDS;
> >>>>>> + }
> >>>>>> + break;
> >>>>>> + default:
> >>>>>> + break;
> >>>>>> + }
> >>>>>> +
> >>>>>> + return supported;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static bool has_cap_simd_regs(void)
> >>>>>> +{
> >>>>>> + uint64_t mask = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
> >>>>>> + u16 qwords = PERF_X86_XMM_QWORDS;
> >>>>>> + static bool has_cap_simd_regs;
> >>>>>> + static bool cached;
> >>>>>> +
> >>>>>> + if (cached)
> >>>>>> + return has_cap_simd_regs;
> >>>>>> +
> >>>>>> + has_cap_simd_regs = __arch_simd_reg_mask(PERF_SAMPLE_REGS_INTR,
> >>>>>> + PERF_REG_X86_XMM, &mask, &qwords);
> >>>>>> + has_cap_simd_regs |= __arch_simd_reg_mask(PERF_SAMPLE_REGS_USER,
> >>>>>> + PERF_REG_X86_XMM, &mask, &qwords);
> >>>>>> + cached = true;
> >>>>>> +
> >>>>>> + return has_cap_simd_regs;
> >>>>>> +}
> >>>>>> +
> >>>>>> +bool arch_has_simd_regs(u64 mask)
> >>>>>> +{
> >>>>>> + return has_cap_simd_regs() &&
> >>>>>> + mask & GENMASK_ULL(PERF_REG_X86_SSP, PERF_REG_X86_R16);
> >>>>>> +}
> >>>>>> +
> >>>>>> +static const struct sample_reg sample_simd_reg_masks[] = {
> >>>>>> + SMPL_REG(XMM, PERF_REG_X86_XMM),
> >>>>>> + SMPL_REG(YMM, PERF_REG_X86_YMM),
> >>>>>> + SMPL_REG(ZMM, PERF_REG_X86_ZMM),
> >>>>>> + SMPL_REG_END
> >>>>>> +};
> >>>>>> +
> >>>>>> +static const struct sample_reg sample_pred_reg_masks[] = {
> >>>>>> + SMPL_REG(OPMASK, PERF_REG_X86_OPMASK),
> >>>>>> + SMPL_REG_END
> >>>>>> +};
> >>>>>> +
> >>>>>> +const struct sample_reg *arch__sample_simd_reg_masks(void)
> >>>>>> +{
> >>>>>> + return sample_simd_reg_masks;
> >>>>>> +}
> >>>>>> +
> >>>>>> +const struct sample_reg *arch__sample_pred_reg_masks(void)
> >>>>>> +{
> >>>>>> + return sample_pred_reg_masks;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static bool x86_intr_simd_updated;
> >>>>>> +static u64 x86_intr_simd_reg_mask;
> >>>>>> +static u64 x86_intr_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
> >>>>>> +static u16 x86_intr_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
> >>>>> Could we add some comments? I can kind of figure out that "updated" is
> >>>>> a lazy-initialization check and what the masks are; "qwords" is an odd
> >>>>> one. The comment could also point out that SIMD here doesn't mean the
> >>>>> machine supports SIMD, but that SIMD registers are supported in perf
> >>>>> events.
> >>>> Sure.
> >>>>
> >>>>
> >>>>>> +static bool x86_user_simd_updated;
> >>>>>> +static u64 x86_user_simd_reg_mask;
> >>>>>> +static u64 x86_user_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
> >>>>>> +static u16 x86_user_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
> >>>>>> +
> >>>>>> +static bool x86_intr_pred_updated;
> >>>>>> +static u64 x86_intr_pred_reg_mask;
> >>>>>> +static u64 x86_intr_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
> >>>>>> +static u16 x86_intr_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
> >>>>>> +static bool x86_user_pred_updated;
> >>>>>> +static u64 x86_user_pred_reg_mask;
> >>>>>> +static u64 x86_user_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
> >>>>>> +static u16 x86_user_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
> >>>>>> +
> >>>>>> +static uint64_t __arch__simd_reg_mask(u64 sample_type)
> >>>>>> +{
> >>>>>> + const struct sample_reg *r = NULL;
> >>>>>> + bool supported;
> >>>>>> + u64 mask = 0;
> >>>>>> + int reg;
> >>>>>> +
> >>>>>> + if (!has_cap_simd_regs())
> >>>>>> + return 0;
> >>>>>> +
> >>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_simd_updated)
> >>>>>> + return x86_intr_simd_reg_mask;
> >>>>>> +
> >>>>>> + if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_simd_updated)
> >>>>>> + return x86_user_simd_reg_mask;
> >>>>>> +
> >>>>>> + for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> >>>>>> + supported = false;
> >>>>>> +
> >>>>>> + if (!r->mask)
> >>>>>> + continue;
> >>>>>> + reg = fls64(r->mask) - 1;
> >>>>>> +
> >>>>>> + if (reg >= PERF_REG_X86_MAX_SIMD_REGS)
> >>>>>> + break;
> >>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>>>> + supported = __arch_simd_reg_mask(sample_type, reg,
> >>>>>> + &x86_intr_simd_mask[reg],
> >>>>>> + &x86_intr_simd_qwords[reg]);
> >>>>>> + else if (sample_type == PERF_SAMPLE_REGS_USER)
> >>>>>> + supported = __arch_simd_reg_mask(sample_type, reg,
> >>>>>> + &x86_user_simd_mask[reg],
> >>>>>> + &x86_user_simd_qwords[reg]);
> >>>>>> + if (supported)
> >>>>>> + mask |= BIT_ULL(reg);
> >>>>>> + }
> >>>>>> +
> >>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR) {
> >>>>>> + x86_intr_simd_reg_mask = mask;
> >>>>>> + x86_intr_simd_updated = true;
> >>>>>> + } else {
> >>>>>> + x86_user_simd_reg_mask = mask;
> >>>>>> + x86_user_simd_updated = true;
> >>>>>> + }
> >>>>>> +
> >>>>>> + return mask;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static uint64_t __arch__pred_reg_mask(u64 sample_type)
> >>>>>> +{
> >>>>>> + const struct sample_reg *r = NULL;
> >>>>>> + bool supported;
> >>>>>> + u64 mask = 0;
> >>>>>> + int reg;
> >>>>>> +
> >>>>>> + if (!has_cap_simd_regs())
> >>>>>> + return 0;
> >>>>>> +
> >>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_pred_updated)
> >>>>>> + return x86_intr_pred_reg_mask;
> >>>>>> +
> >>>>>> + if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_pred_updated)
> >>>>>> + return x86_user_pred_reg_mask;
> >>>>>> +
> >>>>>> + for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> >>>>>> + supported = false;
> >>>>>> +
> >>>>>> + if (!r->mask)
> >>>>>> + continue;
> >>>>>> + reg = fls64(r->mask) - 1;
> >>>>>> +
> >>>>>> + if (reg >= PERF_REG_X86_MAX_PRED_REGS)
> >>>>>> + break;
> >>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>>>> + supported = __arch_pred_reg_mask(sample_type, reg,
> >>>>>> + &x86_intr_pred_mask[reg],
> >>>>>> + &x86_intr_pred_qwords[reg]);
> >>>>>> + else if (sample_type == PERF_SAMPLE_REGS_USER)
> >>>>>> + supported = __arch_pred_reg_mask(sample_type, reg,
> >>>>>> + &x86_user_pred_mask[reg],
> >>>>>> + &x86_user_pred_qwords[reg]);
> >>>>>> + if (supported)
> >>>>>> + mask |= BIT_ULL(reg);
> >>>>>> + }
> >>>>>> +
> >>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR) {
> >>>>>> + x86_intr_pred_reg_mask = mask;
> >>>>>> + x86_intr_pred_updated = true;
> >>>>>> + } else {
> >>>>>> + x86_user_pred_reg_mask = mask;
> >>>>>> + x86_user_pred_updated = true;
> >>>>>> + }
> >>>>>> +
> >>>>>> + return mask;
> >>>>>> +}
> >>>>> This feels repetitive with __arch__simd_reg_mask, could they be
> >>>>> refactored together?
> >>>> hmm, it looks like we can extract the for loop as a common function. The
> >>>> other parts are hard to generalize since they manipulate different
> >>>> variables. If we wanted to generalize them, we would have to introduce
> >>>> lots of "if ... else" branches, which would make the code hard to read.
> >>>>
> >>>>
> >>>>>> +
> >>>>>> +uint64_t arch__intr_simd_reg_mask(void)
> >>>>>> +{
> >>>>>> + return __arch__simd_reg_mask(PERF_SAMPLE_REGS_INTR);
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t arch__user_simd_reg_mask(void)
> >>>>>> +{
> >>>>>> + return __arch__simd_reg_mask(PERF_SAMPLE_REGS_USER);
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t arch__intr_pred_reg_mask(void)
> >>>>>> +{
> >>>>>> + return __arch__pred_reg_mask(PERF_SAMPLE_REGS_INTR);
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t arch__user_pred_reg_mask(void)
> >>>>>> +{
> >>>>>> + return __arch__pred_reg_mask(PERF_SAMPLE_REGS_USER);
> >>>>>> +}
> >>>>>> +
> >>>>>> +static uint64_t arch__simd_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
> >>>>>> +{
> >>>>>> + uint64_t mask = 0;
> >>>>>> +
> >>>>>> + *qwords = 0;
> >>>>>> + if (reg < PERF_REG_X86_MAX_SIMD_REGS) {
> >>>>>> + if (intr) {
> >>>>>> + *qwords = x86_intr_simd_qwords[reg];
> >>>>>> + mask = x86_intr_simd_mask[reg];
> >>>>>> + } else {
> >>>>>> + *qwords = x86_user_simd_qwords[reg];
> >>>>>> + mask = x86_user_simd_mask[reg];
> >>>>>> + }
> >>>>>> + }
> >>>>>> +
> >>>>>> + return mask;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static uint64_t arch__pred_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
> >>>>>> +{
> >>>>>> + uint64_t mask = 0;
> >>>>>> +
> >>>>>> + *qwords = 0;
> >>>>>> + if (reg < PERF_REG_X86_MAX_PRED_REGS) {
> >>>>>> + if (intr) {
> >>>>>> + *qwords = x86_intr_pred_qwords[reg];
> >>>>>> + mask = x86_intr_pred_mask[reg];
> >>>>>> + } else {
> >>>>>> + *qwords = x86_user_pred_qwords[reg];
> >>>>>> + mask = x86_user_pred_mask[reg];
> >>>>>> + }
> >>>>>> + }
> >>>>>> +
> >>>>>> + return mask;
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords)
> >>>>>> +{
> >>>>>> + if (!x86_intr_simd_updated)
> >>>>>> + arch__intr_simd_reg_mask();
> >>>>>> + return arch__simd_reg_bitmap_qwords(reg, qwords, true);
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords)
> >>>>>> +{
> >>>>>> + if (!x86_user_simd_updated)
> >>>>>> + arch__user_simd_reg_mask();
> >>>>>> + return arch__simd_reg_bitmap_qwords(reg, qwords, false);
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords)
> >>>>>> +{
> >>>>>> + if (!x86_intr_pred_updated)
> >>>>>> + arch__intr_pred_reg_mask();
> >>>>>> + return arch__pred_reg_bitmap_qwords(reg, qwords, true);
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords)
> >>>>>> +{
> >>>>>> + if (!x86_user_pred_updated)
> >>>>>> + arch__user_pred_reg_mask();
> >>>>>> + return arch__pred_reg_bitmap_qwords(reg, qwords, false);
> >>>>>> +}
> >>>>>> +
> >>>>>> const struct sample_reg *arch__sample_reg_masks(void)
> >>>>>> {
> >>>>>> + if (has_cap_simd_regs())
> >>>>>> + return sample_reg_masks_ext;
> >>>>>> return sample_reg_masks;
> >>>>>> }
> >>>>>>
> >>>>>> -uint64_t arch__intr_reg_mask(void)
> >>>>>> +static uint64_t __arch__reg_mask(u64 sample_type, u64 mask, bool has_simd_regs)
> >>>>>> {
> >>>>>> struct perf_event_attr attr = {
> >>>>>> - .type = PERF_TYPE_HARDWARE,
> >>>>>> - .config = PERF_COUNT_HW_CPU_CYCLES,
> >>>>>> - .sample_type = PERF_SAMPLE_REGS_INTR,
> >>>>>> - .sample_regs_intr = PERF_REG_EXTENDED_MASK,
> >>>>>> - .precise_ip = 1,
> >>>>>> - .disabled = 1,
> >>>>>> - .exclude_kernel = 1,
> >>>>>> + .type = PERF_TYPE_HARDWARE,
> >>>>>> + .config = PERF_COUNT_HW_CPU_CYCLES,
> >>>>>> + .sample_type = sample_type,
> >>>>>> + .precise_ip = 1,
> >>>>>> + .disabled = 1,
> >>>>>> + .exclude_kernel = 1,
> >>>>>> + .sample_simd_regs_enabled = has_simd_regs,
> >>>>>> };
> >>>>>> int fd;
> >>>>>> /*
> >>>>>> * In an unnamed union, init it here to build on older gcc versions
> >>>>>> */
> >>>>>> attr.sample_period = 1;
> >>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>>>> + attr.sample_regs_intr = mask;
> >>>>>> + else
> >>>>>> + attr.sample_regs_user = mask;
> >>>>>>
> >>>>>> if (perf_pmus__num_core_pmus() > 1) {
> >>>>>> struct perf_pmu *pmu = NULL;
> >>>>>> @@ -318,13 +738,41 @@ uint64_t arch__intr_reg_mask(void)
> >>>>>> fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
> >>>>>> if (fd != -1) {
> >>>>>> close(fd);
> >>>>>> - return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
> >>>>>> + return mask;
> >>>>>> }
> >>>>>>
> >>>>>> - return PERF_REGS_MASK;
> >>>>>> + return 0;
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t arch__intr_reg_mask(void)
> >>>>>> +{
> >>>>>> + uint64_t mask = PERF_REGS_MASK;
> >>>>>> +
> >>>>>> + if (has_cap_simd_regs()) {
> >>>>>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
> >>>>>> + GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
> >>>>>> + true);
> >>>>> It's nice to label constant arguments like this something like:
> >>>>> /*has_simd_regs=*/true);
> >>>>>
> >>>>> Tools like clang-tidy even try to enforce the argument names match the comments.
> >>>> Sure.
> >>>>
> >>>>
> >>>>>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
> >>>>>> + BIT_ULL(PERF_REG_X86_SSP),
> >>>>>> + true);
> >>>>>> + } else
> >>>>>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR, PERF_REG_EXTENDED_MASK, false);
> >>>>>> +
> >>>>>> + return mask;
> >>>>>> }
> >>>>>>
> >>>>>> uint64_t arch__user_reg_mask(void)
> >>>>>> {
> >>>>>> - return PERF_REGS_MASK;
> >>>>>> + uint64_t mask = PERF_REGS_MASK;
> >>>>>> +
> >>>>>> + if (has_cap_simd_regs()) {
> >>>>>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
> >>>>>> + GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
> >>>>>> + true);
> >>>>>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
> >>>>>> + BIT_ULL(PERF_REG_X86_SSP),
> >>>>>> + true);
> >>>>>> + }
> >>>>>> +
> >>>>>> + return mask;
> >>>>> The code is repetitive here, could we refactor into a single function
> >>>>> passing in a user or instr value?
> >>>> Sure. Would extract the common part.
> >>>>
> >>>>
> >>>>>> }
> >>>>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> >>>>>> index 56ebefd075f2..5d1d90cf9488 100644
> >>>>>> --- a/tools/perf/util/evsel.c
> >>>>>> +++ b/tools/perf/util/evsel.c
> >>>>>> @@ -1461,12 +1461,39 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
> >>>>>> if (opts->sample_intr_regs && !evsel->no_aux_samples &&
> >>>>>> !evsel__is_dummy_event(evsel)) {
> >>>>>> attr->sample_regs_intr = opts->sample_intr_regs;
> >>>>>> + attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_intr);
> >>>>>> + evsel__set_sample_bit(evsel, REGS_INTR);
> >>>>>> + }
> >>>>>> +
> >>>>>> + if ((opts->sample_intr_vec_regs || opts->sample_intr_pred_regs) &&
> >>>>>> + !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
> >>>>>> + /* A non-zero pred qwords implies the set of SIMD registers is used */
> >>>>>> + if (opts->sample_pred_regs_qwords)
> >>>>>> + attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
> >>>>>> + else
> >>>>>> + attr->sample_simd_pred_reg_qwords = 1;
> >>>>>> + attr->sample_simd_vec_reg_intr = opts->sample_intr_vec_regs;
> >>>>>> + attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
> >>>>>> + attr->sample_simd_pred_reg_intr = opts->sample_intr_pred_regs;
> >>>>>> evsel__set_sample_bit(evsel, REGS_INTR);
> >>>>>> }
> >>>>>>
> >>>>>> if (opts->sample_user_regs && !evsel->no_aux_samples &&
> >>>>>> !evsel__is_dummy_event(evsel)) {
> >>>>>> attr->sample_regs_user |= opts->sample_user_regs;
> >>>>>> + attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_user);
> >>>>>> + evsel__set_sample_bit(evsel, REGS_USER);
> >>>>>> + }
> >>>>>> +
> >>>>>> + if ((opts->sample_user_vec_regs || opts->sample_user_pred_regs) &&
> >>>>>> + !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
> >>>>>> + if (opts->sample_pred_regs_qwords)
> >>>>>> + attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
> >>>>>> + else
> >>>>>> + attr->sample_simd_pred_reg_qwords = 1;
> >>>>>> + attr->sample_simd_vec_reg_user = opts->sample_user_vec_regs;
> >>>>>> + attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
> >>>>>> + attr->sample_simd_pred_reg_user = opts->sample_user_pred_regs;
> >>>>>> evsel__set_sample_bit(evsel, REGS_USER);
> >>>>>> }
> >>>>>>
> >>>>>> diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
> >>>>>> index cda1c620968e..0bd100392889 100644
> >>>>>> --- a/tools/perf/util/parse-regs-options.c
> >>>>>> +++ b/tools/perf/util/parse-regs-options.c
> >>>>>> @@ -4,19 +4,139 @@
> >>>>>> #include <stdint.h>
> >>>>>> #include <string.h>
> >>>>>> #include <stdio.h>
> >>>>>> +#include <linux/bitops.h>
> >>>>>> #include "util/debug.h"
> >>>>>> #include <subcmd/parse-options.h>
> >>>>>> #include "util/perf_regs.h"
> >>>>>> #include "util/parse-regs-options.h"
> >>>>>> +#include "record.h"
> >>>>>> +
> >>>>>> +static void __print_simd_regs(bool intr, uint64_t simd_mask)
> >>>>>> +{
> >>>>>> + const struct sample_reg *r = NULL;
> >>>>>> + uint64_t bitmap = 0;
> >>>>>> + u16 qwords = 0;
> >>>>>> + int reg_idx;
> >>>>>> +
> >>>>>> + if (!simd_mask)
> >>>>>> + return;
> >>>>>> +
> >>>>>> + for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> >>>>>> + if (!(r->mask & simd_mask))
> >>>>>> + continue;
> >>>>>> + reg_idx = fls64(r->mask) - 1;
> >>>>>> + if (intr)
> >>>>>> + bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>> + else
> >>>>>> + bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>> + if (bitmap)
> >>>>>> + fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
> >>>>>> + }
> >>>>>> +}
> >>>>>> +
> >>>>>> +static void __print_pred_regs(bool intr, uint64_t pred_mask)
> >>>>>> +{
> >>>>>> + const struct sample_reg *r = NULL;
> >>>>>> + uint64_t bitmap = 0;
> >>>>>> + u16 qwords = 0;
> >>>>>> + int reg_idx;
> >>>>>> +
> >>>>>> + if (!pred_mask)
> >>>>>> + return;
> >>>>>> +
> >>>>>> + for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> >>>>>> + if (!(r->mask & pred_mask))
> >>>>>> + continue;
> >>>>>> + reg_idx = fls64(r->mask) - 1;
> >>>>>> + if (intr)
> >>>>>> + bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>> + else
> >>>>>> + bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>> + if (bitmap)
> >>>>>> + fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
> >>>>>> + }
> >>>>>> +}
> >>>>>> +
> >>>>>> +static bool __parse_simd_regs(struct record_opts *opts, char *s, bool intr)
> >>>>>> +{
> >>>>>> + const struct sample_reg *r = NULL;
> >>>>>> + bool matched = false;
> >>>>>> + uint64_t bitmap = 0;
> >>>>>> + u16 qwords = 0;
> >>>>>> + int reg_idx;
> >>>>>> +
> >>>>>> + for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> >>>>>> + if (strcasecmp(s, r->name))
> >>>>>> + continue;
> >>>>>> + if (!fls64(r->mask))
> >>>>>> + continue;
> >>>>>> + reg_idx = fls64(r->mask) - 1;
> >>>>>> + if (intr)
> >>>>>> + bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>> + else
> >>>>>> + bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>> + matched = true;
> >>>>>> + break;
> >>>>>> + }
> >>>>>> +
> >>>>>> + /* Just need the highest qwords */
> >>>>> I'm not following here. Does the bitmap need to handle gaps?
> >>>> Currently no. In theory, the kernel lets user space sample only a subset
> >>>> of SIMD registers, e.g., 0xff or 0xf0f for the XMM registers (HW supports
> >>>> 16 XMM registers), but the perf tool doesn't support that, to avoid
> >>>> introducing too much complexity. Moreover, I don't think end users have
> >>>> such a requirement. In most cases, users know which kinds of SIMD
> >>>> registers their programs use but usually don't know or care about which
> >>>> exact SIMD register is used.
> >>>>
> >>>>
> >>>>>> + if (qwords > opts->sample_vec_regs_qwords) {
> >>>>>> + opts->sample_vec_regs_qwords = qwords;
> >>>>>> + if (intr)
> >>>>>> + opts->sample_intr_vec_regs = bitmap;
> >>>>>> + else
> >>>>>> + opts->sample_user_vec_regs = bitmap;
> >>>>>> + }
> >>>>>> +
> >>>>>> + return matched;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static bool __parse_pred_regs(struct record_opts *opts, char *s, bool intr)
> >>>>>> +{
> >>>>>> + const struct sample_reg *r = NULL;
> >>>>>> + bool matched = false;
> >>>>>> + uint64_t bitmap = 0;
> >>>>>> + u16 qwords = 0;
> >>>>>> + int reg_idx;
> >>>>>> +
> >>>>>> + for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> >>>>>> + if (strcasecmp(s, r->name))
> >>>>>> + continue;
> >>>>>> + if (!fls64(r->mask))
> >>>>>> + continue;
> >>>>>> + reg_idx = fls64(r->mask) - 1;
> >>>>>> + if (intr)
> >>>>>> + bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>> + else
> >>>>>> + bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>> + matched = true;
> >>>>>> + break;
> >>>>>> + }
> >>>>>> +
> >>>>>> + /* Just need the highest qwords */
> >>>>> Again repetitive, could we have a single function?
> >>>> Yes, I suppose the for loop at least can be extracted as a common function.
> >>>>
> >>>>
> >>>>>> + if (qwords > opts->sample_pred_regs_qwords) {
> >>>>>> + opts->sample_pred_regs_qwords = qwords;
> >>>>>> + if (intr)
> >>>>>> + opts->sample_intr_pred_regs = bitmap;
> >>>>>> + else
> >>>>>> + opts->sample_user_pred_regs = bitmap;
> >>>>>> + }
> >>>>>> +
> >>>>>> + return matched;
> >>>>>> +}
> >>>>>>
> >>>>>> static int
> >>>>>> __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>>> {
> >>>>>> uint64_t *mode = (uint64_t *)opt->value;
> >>>>>> const struct sample_reg *r = NULL;
> >>>>>> + struct record_opts *opts;
> >>>>>> char *s, *os = NULL, *p;
> >>>>>> - int ret = -1;
> >>>>>> + bool has_simd_regs = false;
> >>>>>> uint64_t mask;
> >>>>>> + uint64_t simd_mask;
> >>>>>> + uint64_t pred_mask;
> >>>>>> + int ret = -1;
> >>>>>>
> >>>>>> if (unset)
> >>>>>> return 0;
> >>>>>> @@ -27,10 +147,17 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>>> if (*mode)
> >>>>>> return -1;
> >>>>>>
> >>>>>> - if (intr)
> >>>>>> + if (intr) {
> >>>>>> + opts = container_of(opt->value, struct record_opts, sample_intr_regs);
> >>>>>> mask = arch__intr_reg_mask();
> >>>>>> - else
> >>>>>> + simd_mask = arch__intr_simd_reg_mask();
> >>>>>> + pred_mask = arch__intr_pred_reg_mask();
> >>>>>> + } else {
> >>>>>> + opts = container_of(opt->value, struct record_opts, sample_user_regs);
> >>>>>> mask = arch__user_reg_mask();
> >>>>>> + simd_mask = arch__user_simd_reg_mask();
> >>>>>> + pred_mask = arch__user_pred_reg_mask();
> >>>>>> + }
> >>>>>>
> >>>>>> /* str may be NULL in case no arg is passed to -I */
> >>>>>> if (str) {
> >>>>>> @@ -50,10 +177,24 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>>> if (r->mask & mask)
> >>>>>> fprintf(stderr, "%s ", r->name);
> >>>>>> }
> >>>>>> + __print_simd_regs(intr, simd_mask);
> >>>>>> + __print_pred_regs(intr, pred_mask);
> >>>>>> fputc('\n', stderr);
> >>>>>> /* just printing available regs */
> >>>>>> goto error;
> >>>>>> }
> >>>>>> +
> >>>>>> + if (simd_mask) {
> >>>>>> + has_simd_regs = __parse_simd_regs(opts, s, intr);
> >>>>>> + if (has_simd_regs)
> >>>>>> + goto next;
> >>>>>> + }
> >>>>>> + if (pred_mask) {
> >>>>>> + has_simd_regs = __parse_pred_regs(opts, s, intr);
> >>>>>> + if (has_simd_regs)
> >>>>>> + goto next;
> >>>>>> + }
> >>>>>> +
> >>>>>> for (r = arch__sample_reg_masks(); r->name; r++) {
> >>>>>> if ((r->mask & mask) && !strcasecmp(s, r->name))
> >>>>>> break;
> >>>>>> @@ -65,7 +206,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>>> }
> >>>>>>
> >>>>>> *mode |= r->mask;
> >>>>>> -
> >>>>>> +next:
> >>>>>> if (!p)
> >>>>>> break;
> >>>>>>
> >>>>>> @@ -75,7 +216,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>>> ret = 0;
> >>>>>>
> >>>>>> /* default to all possible regs */
> >>>>>> - if (*mode == 0)
> >>>>>> + if (*mode == 0 && !has_simd_regs)
> >>>>>> *mode = mask;
> >>>>>> error:
> >>>>>> free(os);
> >>>>>> diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
> >>>>>> index 66b666d9ce64..fb0366d050cf 100644
> >>>>>> --- a/tools/perf/util/perf_event_attr_fprintf.c
> >>>>>> +++ b/tools/perf/util/perf_event_attr_fprintf.c
> >>>>>> @@ -360,6 +360,12 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
> >>>>>> PRINT_ATTRf(aux_start_paused, p_unsigned);
> >>>>>> PRINT_ATTRf(aux_pause, p_unsigned);
> >>>>>> PRINT_ATTRf(aux_resume, p_unsigned);
> >>>>>> + PRINT_ATTRf(sample_simd_pred_reg_qwords, p_unsigned);
> >>>>>> + PRINT_ATTRf(sample_simd_pred_reg_intr, p_hex);
> >>>>>> + PRINT_ATTRf(sample_simd_pred_reg_user, p_hex);
> >>>>>> + PRINT_ATTRf(sample_simd_vec_reg_qwords, p_unsigned);
> >>>>>> + PRINT_ATTRf(sample_simd_vec_reg_intr, p_hex);
> >>>>>> + PRINT_ATTRf(sample_simd_vec_reg_user, p_hex);
> >>>>>>
> >>>>>> return ret;
> >>>>>> }
> >>>>>> diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
> >>>>>> index 44b90bbf2d07..e8a9fabc92e6 100644
> >>>>>> --- a/tools/perf/util/perf_regs.c
> >>>>>> +++ b/tools/perf/util/perf_regs.c
> >>>>>> @@ -11,6 +11,11 @@ int __weak arch_sdt_arg_parse_op(char *old_op __maybe_unused,
> >>>>>> return SDT_ARG_SKIP;
> >>>>>> }
> >>>>>>
> >>>>>> +bool __weak arch_has_simd_regs(u64 mask __maybe_unused)
> >>>>>> +{
> >>>>>> + return false;
> >>>>>> +}
> >>>>>> +
> >>>>>> uint64_t __weak arch__intr_reg_mask(void)
> >>>>>> {
> >>>>>> return 0;
> >>>>>> @@ -21,6 +26,50 @@ uint64_t __weak arch__user_reg_mask(void)
> >>>>>> return 0;
> >>>>>> }
> >>>>>>
> >>>>>> +uint64_t __weak arch__intr_simd_reg_mask(void)
> >>>>>> +{
> >>>>>> + return 0;
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t __weak arch__user_simd_reg_mask(void)
> >>>>>> +{
> >>>>>> + return 0;
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t __weak arch__intr_pred_reg_mask(void)
> >>>>>> +{
> >>>>>> + return 0;
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t __weak arch__user_pred_reg_mask(void)
> >>>>>> +{
> >>>>>> + return 0;
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t __weak arch__intr_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> >>>>>> +{
> >>>>>> + *qwords = 0;
> >>>>>> + return 0;
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t __weak arch__user_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> >>>>>> +{
> >>>>>> + *qwords = 0;
> >>>>>> + return 0;
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t __weak arch__intr_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> >>>>>> +{
> >>>>>> + *qwords = 0;
> >>>>>> + return 0;
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t __weak arch__user_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> >>>>>> +{
> >>>>>> + *qwords = 0;
> >>>>>> + return 0;
> >>>>>> +}
> >>>>>> +
> >>>>>> static const struct sample_reg sample_reg_masks[] = {
> >>>>>> SMPL_REG_END
> >>>>>> };
> >>>>>> @@ -30,6 +79,16 @@ const struct sample_reg * __weak arch__sample_reg_masks(void)
> >>>>>> return sample_reg_masks;
> >>>>>> }
> >>>>>>
> >>>>>> +const struct sample_reg * __weak arch__sample_simd_reg_masks(void)
> >>>>>> +{
> >>>>>> + return sample_reg_masks;
> >>>>>> +}
> >>>>>> +
> >>>>>> +const struct sample_reg * __weak arch__sample_pred_reg_masks(void)
> >>>>>> +{
> >>>>>> + return sample_reg_masks;
> >>>>>> +}
> >>>>> Thinking out loud. I wonder if there is a way to hide the weak
> >>>>> functions. It seems the support is tied to PMUs, particularly core
> >>>>> PMUs, perhaps we can push things into pmu and arch pmu code. Then we
> >>>>> ask the PMU to parse the register strings, set up the perf_event_attr,
> >>>>> etc. I'm somewhat scared these functions will be used on the report
> >>>>> rather than record side of things, thereby breaking perf.data support
> >>>>> when the host kernel does or doesn't have the SIMD support.
> >>>> Ian, I don't quite follow you here.
> >>>>
> >>>> I don't quite understand what "push things into pmu and arch pmu code"
> >>>> would look like. The current SIMD register support follows the same
> >>>> approach as the general register support. If we intend to change that
> >>>> approach entirely, we'd better do it in an independent patch-set.
> >>>>
> >>>> Why would these functions break perf.data reporting? perf-report checks
> >>>> whether the PERF_SAMPLE_REGS_ABI_SIMD flag is set for each record; only
> >>>> when the flag is set (indicating SIMD register data is appended to the
> >>>> record) does perf-report try to parse the SIMD register data.
> >>> Thanks Dapeng, sorry I wasn't clear. So, I've landed cleanups to
> >>> remove weak symbols like:
> >>> https://lore.kernel.org/lkml/20250724163302.596743-21-irogers@google.com/#t
> >>>
> >>> For these patches what I'm imagining is that there is a Nova Lake
> >>> generated perf.data file. Using perf report, script, etc. on the Nova
> >>> Lake should expose all of the same mask, qword, etc. values as when
> >>> the perf.data was generated, and so things will work. If the perf.data
> >>> file were taken to, say, my Alderlake, then what would happen? Generally,
> >>> using the arch directory and weak symbols is a code smell indicating that
> >>> cross-platform things are going to break - there should be sufficient data
> >>> in the event and the perf_event_attr to fully decode what's going on.
> >>> Sometimes tying things to a PMU name can avoid the use of the arch
> >>> directory. We were able to avoid the arch directory to a good extent
> >>> for the TPEBS code, even though it is a very modern Intel feature.
> >> I see.
> >>
> >> But the sampling support for SIMD registers is different from the sample
> >> weight processing in the patch
> >> https://lore.kernel.org/lkml/20250724163302.596743-21-irogers@google.com/#t.
> >> Each arch may support different kinds of SIMD registers, and furthermore
> >> each kind of SIMD register may have a different register count and register
> >> width. It's quite hard to come up with common functions or fields to
> >> represent the names and attributes of these arch-specific SIMD registers.
> >> This arch-specific information can only come from the arch-specific code,
> >> so the __weak functions still look like the easiest way to implement this.
> >>
> >> I don't think perf.data parsing would break when moving from one platform
> >> to a different platform of the same arch, e.g., from Nova Lake to Alder
> >> Lake. To indicate the presence of SIMD registers in record data, a new ABI
> >> flag "PERF_SAMPLE_REGS_ABI_SIMD" is introduced. If the perf tool on the
> >> second platform is new enough to recognize this new flag, the SIMD
> >> register data is parsed correctly. Even if the perf tool is old and has no
> >> SIMD register support, the SIMD register data is just silently ignored and
> >> should not break the parsing.
> > That's good to know. I'm confused then why these functions can't just
> > be within the arch directory? For example, we don't expose the
> > intel-pt PMU code in the common code except for the parsing parts. A
> > lot of that is handled by the default perf_event_attr initialization
> > that every PMU can have its own variant of:
> > https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/pmu.h?h=perf-tools-next#n123
>
> I see. From my point of view, there seems to be no essential difference
> between a function pointer and a __weak function, and it looks hard to find
> a common data structure to hold all these function pointers, which need to
> be called in different places, like register name parsing, register data
> dumping, and so on.
>
>
> >
> > Perhaps this is all just evidence of tech debt in the perf_regs.c code
> > :-/ The bit that's relevant to the patch here is that I think this is
> > adding to the tech debt problem as 11 more functions are added to
> > perf_regs.h.
>
> Yeah, 11 new __weak functions does seem like too many. We could merge
> functions of the same kind, e.g. merge *_simd_reg_mask() and
> *_pred_reg_mask() into a single function with a type argument; the number
> of newly added __weak functions could then shrink by half.
There could be a good reason for 11 weak functions :-) In perf_event.h
you've added this to the sample event:
```
+ * u64 regs[weight(mask)];
+ * struct {
+ * u16 nr_vectors;
+ * u16 vector_qwords;
+ * u16 nr_pred;
+ * u16 pred_qwords;
+ * u64 data[nr_vectors * vector_qwords + nr_pred
* pred_qwords];
+ * } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
+ * } && PERF_SAMPLE_REGS_USER
```
so these things are readable/writable outside of builds with arch/x86
compiled in, which is why it seems odd that there needs to be arch
code in the common code to handle them. Similar to how I needed to get
the retirement latency parsing out of the arch/x86 directory as
potentially you could be looking at a perf.data file with retirement
latencies in it on a non-x86 platform.
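
Concretely, a reader should only need those header fields, something like
this (a sketch only, assuming the layout quoted above; 'p' and 'abi' are
placeholders for the sample cursor and the ABI word):

```
/* Decode the SIMD block purely from the record, no arch hooks needed.
 * 'p' points just past regs[weight(mask)] and 'abi' comes from the
 * sample; all sizes are self-describing.
 */
if (abi & PERF_SAMPLE_REGS_ABI_SIMD) {
	u16 nr_vectors, vector_qwords, nr_pred, pred_qwords;

	memcpy(&nr_vectors,    p, 2); p += 2;
	memcpy(&vector_qwords, p, 2); p += 2;
	memcpy(&nr_pred,       p, 2); p += 2;
	memcpy(&pred_qwords,   p, 2); p += 2;

	/* the vector payload comes first, then the predicate payload */
	const u64 *vec_data  = (const u64 *)p;
	const u64 *pred_data = vec_data + (size_t)nr_vectors * vector_qwords;

	p = (const char *)(pred_data + (size_t)nr_pred * pred_qwords);
}
```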
Thanks,
Ian
>
> >
> > Thanks,
> > Ian
> >
> >>> Thanks,
> >>> Ian
> >>>
> >>>
> >>>
> >>>>> Thanks,
> >>>>> Ian
> >>>>>
> >>>>>> +
> >>>>>> const char *perf_reg_name(int id, const char *arch)
> >>>>>> {
> >>>>>> const char *reg_name = NULL;
> >>>>>> diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
> >>>>>> index f2d0736d65cc..bce9c4cfd1bf 100644
> >>>>>> --- a/tools/perf/util/perf_regs.h
> >>>>>> +++ b/tools/perf/util/perf_regs.h
> >>>>>> @@ -24,9 +24,20 @@ enum {
> >>>>>> };
> >>>>>>
> >>>>>> int arch_sdt_arg_parse_op(char *old_op, char **new_op);
> >>>>>> +bool arch_has_simd_regs(u64 mask);
> >>>>>> uint64_t arch__intr_reg_mask(void);
> >>>>>> uint64_t arch__user_reg_mask(void);
> >>>>>> const struct sample_reg *arch__sample_reg_masks(void);
> >>>>>> +const struct sample_reg *arch__sample_simd_reg_masks(void);
> >>>>>> +const struct sample_reg *arch__sample_pred_reg_masks(void);
> >>>>>> +uint64_t arch__intr_simd_reg_mask(void);
> >>>>>> +uint64_t arch__user_simd_reg_mask(void);
> >>>>>> +uint64_t arch__intr_pred_reg_mask(void);
> >>>>>> +uint64_t arch__user_pred_reg_mask(void);
> >>>>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>>
> >>>>>> const char *perf_reg_name(int id, const char *arch);
> >>>>>> int perf_reg_value(u64 *valp, struct regs_dump *regs, int id);
> >>>>>> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
> >>>>>> index ea3a6c4657ee..825ffb4cc53f 100644
> >>>>>> --- a/tools/perf/util/record.h
> >>>>>> +++ b/tools/perf/util/record.h
> >>>>>> @@ -59,7 +59,13 @@ struct record_opts {
> >>>>>> unsigned int user_freq;
> >>>>>> u64 branch_stack;
> >>>>>> u64 sample_intr_regs;
> >>>>>> + u64 sample_intr_vec_regs;
> >>>>>> u64 sample_user_regs;
> >>>>>> + u64 sample_user_vec_regs;
> >>>>>> + u16 sample_pred_regs_qwords;
> >>>>>> + u16 sample_vec_regs_qwords;
> >>>>>> + u16 sample_intr_pred_regs;
> >>>>>> + u16 sample_user_pred_regs;
> >>>>>> u64 default_interval;
> >>>>>> u64 user_interval;
> >>>>>> size_t auxtrace_snapshot_size;
> >>>>>> --
> >>>>>> 2.34.1
> >>>>>>
* Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
2025-12-05 6:38 ` Ian Rogers
@ 2025-12-05 8:10 ` Mi, Dapeng
2025-12-05 16:35 ` Ian Rogers
0 siblings, 1 reply; 55+ messages in thread
From: Mi, Dapeng @ 2025-12-05 8:10 UTC (permalink / raw)
To: Ian Rogers
Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang
On 12/5/2025 2:38 PM, Ian Rogers wrote:
> On Thu, Dec 4, 2025 at 8:00 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>
>> On 12/5/2025 12:16 AM, Ian Rogers wrote:
>>> On Thu, Dec 4, 2025 at 1:20 AM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>>> On 12/4/2025 3:49 PM, Ian Rogers wrote:
>>>>> On Wed, Dec 3, 2025 at 6:58 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>>>>> On 12/4/2025 8:17 AM, Ian Rogers wrote:
>>>>>>> On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
>>>>>>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>>>>>>
>>>>>>>> This patch adds support for the newly introduced SIMD register sampling
>>>>>>>> format by adding the following functions:
>>>>>>>>
>>>>>>>> uint64_t arch__intr_simd_reg_mask(void);
>>>>>>>> uint64_t arch__user_simd_reg_mask(void);
>>>>>>>> uint64_t arch__intr_pred_reg_mask(void);
>>>>>>>> uint64_t arch__user_pred_reg_mask(void);
>>>>>>>> uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>> uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>> uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>> uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>>
>>>>>>>> The arch__{intr|user}_simd_reg_mask() functions retrieve the bitmap of
>>>>>>>> supported SIMD registers, such as XMM/YMM/ZMM on x86 platforms.
>>>>>>>>
>>>>>>>> The arch__{intr|user}_pred_reg_mask() functions retrieve the bitmap of
>>>>>>>> supported PRED registers, such as OPMASK on x86 platforms.
>>>>>>>>
>>>>>>>> The arch__{intr|user}_simd_reg_bitmap_qwords() functions provide the
>>>>>>>> exact bitmap and number of qwords for a specific type of SIMD register.
>>>>>>>> For example, for XMM registers on x86 platforms, the returned bitmap is
>>>>>>>> 0xffff (XMM0 ~ XMM15) and the qwords number is 2 (128 bits for each XMM).
>>>>>>>>
>>>>>>>> The arch__{intr|user}_pred_reg_bitmap_qwords() functions provide the
>>>>>>>> exact bitmap and number of qwords for a specific type of PRED register.
>>>>>>>> For example, for OPMASK registers on x86 platforms, the returned bitmap
>>>>>>>> is 0xff (OPMASK0 ~ OPMASK7) and the qwords number is 1 (64 bits for each
>>>>>>>> OPMASK).
>>>>>>>>
>>>>>>>> Additionally, the function __parse_regs() is enhanced to support parsing
>>>>>>>> these newly introduced SIMD registers. Currently, each type of register
>>>>>>>> can only be sampled collectively; sampling a specific SIMD register is
>>>>>>>> not supported. For example, all XMM registers are sampled together rather
>>>>>>>> than sampling only XMM0.
>>>>>>>>
>>>>>>>> When multiple overlapping register types, such as XMM and YMM, are
>>>>>>>> sampled simultaneously, only the superset (YMM registers) is sampled.
>>>>>>>>
>>>>>>>> With this patch, all supported sampling registers on x86 platforms are
>>>>>>>> displayed as follows.
>>>>>>>>
>>>>>>>> $perf record -I?
>>>>>>>> available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>>>>>>>> R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>>>>>>>> R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>>>>>>>>
>>>>>>>> $perf record --user-regs=?
>>>>>>>> available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>>>>>>>> R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>>>>>>>> R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>>>>>>>>
>>>>>>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>>>>>>>> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>>>>>>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>>>>>>> ---
>>>>>>>> tools/perf/arch/x86/util/perf_regs.c | 470 +++++++++++++++++++++-
>>>>>>>> tools/perf/util/evsel.c | 27 ++
>>>>>>>> tools/perf/util/parse-regs-options.c | 151 ++++++-
>>>>>>>> tools/perf/util/perf_event_attr_fprintf.c | 6 +
>>>>>>>> tools/perf/util/perf_regs.c | 59 +++
>>>>>>>> tools/perf/util/perf_regs.h | 11 +
>>>>>>>> tools/perf/util/record.h | 6 +
>>>>>>>> 7 files changed, 714 insertions(+), 16 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
>>>>>>>> index 12fd93f04802..db41430f3b07 100644
>>>>>>>> --- a/tools/perf/arch/x86/util/perf_regs.c
>>>>>>>> +++ b/tools/perf/arch/x86/util/perf_regs.c
>>>>>>>> @@ -13,6 +13,49 @@
>>>>>>>> #include "../../../util/pmu.h"
>>>>>>>> #include "../../../util/pmus.h"
>>>>>>>>
>>>>>>>> +static const struct sample_reg sample_reg_masks_ext[] = {
>>>>>>>> + SMPL_REG(AX, PERF_REG_X86_AX),
>>>>>>>> + SMPL_REG(BX, PERF_REG_X86_BX),
>>>>>>>> + SMPL_REG(CX, PERF_REG_X86_CX),
>>>>>>>> + SMPL_REG(DX, PERF_REG_X86_DX),
>>>>>>>> + SMPL_REG(SI, PERF_REG_X86_SI),
>>>>>>>> + SMPL_REG(DI, PERF_REG_X86_DI),
>>>>>>>> + SMPL_REG(BP, PERF_REG_X86_BP),
>>>>>>>> + SMPL_REG(SP, PERF_REG_X86_SP),
>>>>>>>> + SMPL_REG(IP, PERF_REG_X86_IP),
>>>>>>>> + SMPL_REG(FLAGS, PERF_REG_X86_FLAGS),
>>>>>>>> + SMPL_REG(CS, PERF_REG_X86_CS),
>>>>>>>> + SMPL_REG(SS, PERF_REG_X86_SS),
>>>>>>>> +#ifdef HAVE_ARCH_X86_64_SUPPORT
>>>>>>>> + SMPL_REG(R8, PERF_REG_X86_R8),
>>>>>>>> + SMPL_REG(R9, PERF_REG_X86_R9),
>>>>>>>> + SMPL_REG(R10, PERF_REG_X86_R10),
>>>>>>>> + SMPL_REG(R11, PERF_REG_X86_R11),
>>>>>>>> + SMPL_REG(R12, PERF_REG_X86_R12),
>>>>>>>> + SMPL_REG(R13, PERF_REG_X86_R13),
>>>>>>>> + SMPL_REG(R14, PERF_REG_X86_R14),
>>>>>>>> + SMPL_REG(R15, PERF_REG_X86_R15),
>>>>>>>> + SMPL_REG(R16, PERF_REG_X86_R16),
>>>>>>>> + SMPL_REG(R17, PERF_REG_X86_R17),
>>>>>>>> + SMPL_REG(R18, PERF_REG_X86_R18),
>>>>>>>> + SMPL_REG(R19, PERF_REG_X86_R19),
>>>>>>>> + SMPL_REG(R20, PERF_REG_X86_R20),
>>>>>>>> + SMPL_REG(R21, PERF_REG_X86_R21),
>>>>>>>> + SMPL_REG(R22, PERF_REG_X86_R22),
>>>>>>>> + SMPL_REG(R23, PERF_REG_X86_R23),
>>>>>>>> + SMPL_REG(R24, PERF_REG_X86_R24),
>>>>>>>> + SMPL_REG(R25, PERF_REG_X86_R25),
>>>>>>>> + SMPL_REG(R26, PERF_REG_X86_R26),
>>>>>>>> + SMPL_REG(R27, PERF_REG_X86_R27),
>>>>>>>> + SMPL_REG(R28, PERF_REG_X86_R28),
>>>>>>>> + SMPL_REG(R29, PERF_REG_X86_R29),
>>>>>>>> + SMPL_REG(R30, PERF_REG_X86_R30),
>>>>>>>> + SMPL_REG(R31, PERF_REG_X86_R31),
>>>>>>>> + SMPL_REG(SSP, PERF_REG_X86_SSP),
>>>>>>>> +#endif
>>>>>>>> + SMPL_REG_END
>>>>>>>> +};
>>>>>>>> +
>>>>>>>> static const struct sample_reg sample_reg_masks[] = {
>>>>>>>> SMPL_REG(AX, PERF_REG_X86_AX),
>>>>>>>> SMPL_REG(BX, PERF_REG_X86_BX),
>>>>>>>> @@ -276,27 +319,404 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
>>>>>>>> return SDT_ARG_VALID;
>>>>>>>> }
>>>>>>>>
>>>>>>>> +static bool support_simd_reg(u64 sample_type, u16 qwords, u64 mask, bool pred)
>>>>>>> To make the code easier to read, it'd be nice to document sample_type,
>>>>>>> qwords and mask here.
>>>>>> Sure.
>>>>>>
>>>>>>
>>>>>>>> +{
>>>>>>>> + struct perf_event_attr attr = {
>>>>>>>> + .type = PERF_TYPE_HARDWARE,
>>>>>>>> + .config = PERF_COUNT_HW_CPU_CYCLES,
>>>>>>>> + .sample_type = sample_type,
>>>>>>>> + .disabled = 1,
>>>>>>>> + .exclude_kernel = 1,
>>>>>>>> + .sample_simd_regs_enabled = 1,
>>>>>>>> + };
>>>>>>>> + int fd;
>>>>>>>> +
>>>>>>>> + attr.sample_period = 1;
>>>>>>>> +
>>>>>>>> + if (!pred) {
>>>>>>>> + attr.sample_simd_vec_reg_qwords = qwords;
>>>>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>>>> + attr.sample_simd_vec_reg_intr = mask;
>>>>>>>> + else
>>>>>>>> + attr.sample_simd_vec_reg_user = mask;
>>>>>>>> + } else {
>>>>>>>> + attr.sample_simd_pred_reg_qwords = PERF_X86_OPMASK_QWORDS;
>>>>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>>>> + attr.sample_simd_pred_reg_intr = PERF_X86_SIMD_PRED_MASK;
>>>>>>>> + else
>>>>>>>> + attr.sample_simd_pred_reg_user = PERF_X86_SIMD_PRED_MASK;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + if (perf_pmus__num_core_pmus() > 1) {
>>>>>>>> + struct perf_pmu *pmu = NULL;
>>>>>>>> + __u64 type = PERF_TYPE_RAW;
>>>>>>> It should be okay to do:
>>>>>>> __u64 type = perf_pmus__find_core_pmu()->type
>>>>>>> rather than have the whole loop below.
>>>>>> Sure. Thanks.
>>>>>>
>>>>>>
>>>>>>>> +
>>>>>>>> + /*
>>>>>>>> + * The same register set is supported among different hybrid PMUs.
>>>>>>>> + * Only check the first available one.
>>>>>>>> + */
>>>>>>>> + while ((pmu = perf_pmus__scan_core(pmu)) != NULL) {
>>>>>>>> + type = pmu->type;
>>>>>>>> + break;
>>>>>>>> + }
>>>>>>>> + attr.config |= type << PERF_PMU_TYPE_SHIFT;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + event_attr_init(&attr);
>>>>>>>> +
>>>>>>>> + fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
>>>>>>>> + if (fd != -1) {
>>>>>>>> + close(fd);
>>>>>>>> + return true;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + return false;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static bool __arch_simd_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
>>>>>>>> +{
>>>>>>>> + bool supported = false;
>>>>>>>> + u64 bits;
>>>>>>>> +
>>>>>>>> + *mask = 0;
>>>>>>>> + *qwords = 0;
>>>>>>>> +
>>>>>>>> + switch (reg) {
>>>>>>>> + case PERF_REG_X86_XMM:
>>>>>>>> + bits = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
>>>>>>>> + supported = support_simd_reg(sample_type, PERF_X86_XMM_QWORDS, bits, false);
>>>>>>>> + if (supported) {
>>>>>>>> + *mask = bits;
>>>>>>>> + *qwords = PERF_X86_XMM_QWORDS;
>>>>>>>> + }
>>>>>>>> + break;
>>>>>>>> + case PERF_REG_X86_YMM:
>>>>>>>> + bits = BIT_ULL(PERF_X86_SIMD_YMM_REGS) - 1;
>>>>>>>> + supported = support_simd_reg(sample_type, PERF_X86_YMM_QWORDS, bits, false);
>>>>>>>> + if (supported) {
>>>>>>>> + *mask = bits;
>>>>>>>> + *qwords = PERF_X86_YMM_QWORDS;
>>>>>>>> + }
>>>>>>>> + break;
>>>>>>>> + case PERF_REG_X86_ZMM:
>>>>>>>> + bits = BIT_ULL(PERF_X86_SIMD_ZMM_REGS) - 1;
>>>>>>>> + supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
>>>>>>>> + if (supported) {
>>>>>>>> + *mask = bits;
>>>>>>>> + *qwords = PERF_X86_ZMM_QWORDS;
>>>>>>>> + break;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + bits = BIT_ULL(PERF_X86_SIMD_ZMMH_REGS) - 1;
>>>>>>>> + supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
>>>>>>>> + if (supported) {
>>>>>>>> + *mask = bits;
>>>>>>>> + *qwords = PERF_X86_ZMMH_QWORDS;
>>>>>>>> + }
>>>>>>>> + break;
>>>>>>>> + default:
>>>>>>>> + break;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + return supported;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static bool __arch_pred_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
>>>>>>>> +{
>>>>>>>> + bool supported = false;
>>>>>>>> + u64 bits;
>>>>>>>> +
>>>>>>>> + *mask = 0;
>>>>>>>> + *qwords = 0;
>>>>>>>> +
>>>>>>>> + switch (reg) {
>>>>>>>> + case PERF_REG_X86_OPMASK:
>>>>>>>> + bits = BIT_ULL(PERF_X86_SIMD_OPMASK_REGS) - 1;
>>>>>>>> + supported = support_simd_reg(sample_type, PERF_X86_OPMASK_QWORDS, bits, true);
>>>>>>>> + if (supported) {
>>>>>>>> + *mask = bits;
>>>>>>>> + *qwords = PERF_X86_OPMASK_QWORDS;
>>>>>>>> + }
>>>>>>>> + break;
>>>>>>>> + default:
>>>>>>>> + break;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + return supported;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static bool has_cap_simd_regs(void)
>>>>>>>> +{
>>>>>>>> + uint64_t mask = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
>>>>>>>> + u16 qwords = PERF_X86_XMM_QWORDS;
>>>>>>>> + static bool has_cap_simd_regs;
>>>>>>>> + static bool cached;
>>>>>>>> +
>>>>>>>> + if (cached)
>>>>>>>> + return has_cap_simd_regs;
>>>>>>>> +
>>>>>>>> + has_cap_simd_regs = __arch_simd_reg_mask(PERF_SAMPLE_REGS_INTR,
>>>>>>>> + PERF_REG_X86_XMM, &mask, &qwords);
>>>>>>>> + has_cap_simd_regs |= __arch_simd_reg_mask(PERF_SAMPLE_REGS_USER,
>>>>>>>> + PERF_REG_X86_XMM, &mask, &qwords);
>>>>>>>> + cached = true;
>>>>>>>> +
>>>>>>>> + return has_cap_simd_regs;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +bool arch_has_simd_regs(u64 mask)
>>>>>>>> +{
>>>>>>>> + return has_cap_simd_regs() &&
>>>>>>>> + mask & GENMASK_ULL(PERF_REG_X86_SSP, PERF_REG_X86_R16);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static const struct sample_reg sample_simd_reg_masks[] = {
>>>>>>>> + SMPL_REG(XMM, PERF_REG_X86_XMM),
>>>>>>>> + SMPL_REG(YMM, PERF_REG_X86_YMM),
>>>>>>>> + SMPL_REG(ZMM, PERF_REG_X86_ZMM),
>>>>>>>> + SMPL_REG_END
>>>>>>>> +};
>>>>>>>> +
>>>>>>>> +static const struct sample_reg sample_pred_reg_masks[] = {
>>>>>>>> + SMPL_REG(OPMASK, PERF_REG_X86_OPMASK),
>>>>>>>> + SMPL_REG_END
>>>>>>>> +};
>>>>>>>> +
>>>>>>>> +const struct sample_reg *arch__sample_simd_reg_masks(void)
>>>>>>>> +{
>>>>>>>> + return sample_simd_reg_masks;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +const struct sample_reg *arch__sample_pred_reg_masks(void)
>>>>>>>> +{
>>>>>>>> + return sample_pred_reg_masks;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static bool x86_intr_simd_updated;
>>>>>>>> +static u64 x86_intr_simd_reg_mask;
>>>>>>>> +static u64 x86_intr_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
>>>>>>>> +static u16 x86_intr_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
>>>>>>> Could we add some comments? I can kind of figure out the updated is a
>>>>>>> check for lazy initialization and what masks are, qwords is an odd
>>>>>>> one. The comment could also point out that SIMD doesn't mean the
>>>>>>> machine supports SIMD, but SIMD registers are supported in perf
>>>>>>> events.
>>>>>> Sure.
>>>>>>
>>>>>>
>>>>>>>> +static bool x86_user_simd_updated;
>>>>>>>> +static u64 x86_user_simd_reg_mask;
>>>>>>>> +static u64 x86_user_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
>>>>>>>> +static u16 x86_user_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
>>>>>>>> +
>>>>>>>> +static bool x86_intr_pred_updated;
>>>>>>>> +static u64 x86_intr_pred_reg_mask;
>>>>>>>> +static u64 x86_intr_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
>>>>>>>> +static u16 x86_intr_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
>>>>>>>> +static bool x86_user_pred_updated;
>>>>>>>> +static u64 x86_user_pred_reg_mask;
>>>>>>>> +static u64 x86_user_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
>>>>>>>> +static u16 x86_user_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
>>>>>>>> +
>>>>>>>> +static uint64_t __arch__simd_reg_mask(u64 sample_type)
>>>>>>>> +{
>>>>>>>> + const struct sample_reg *r = NULL;
>>>>>>>> + bool supported;
>>>>>>>> + u64 mask = 0;
>>>>>>>> + int reg;
>>>>>>>> +
>>>>>>>> + if (!has_cap_simd_regs())
>>>>>>>> + return 0;
>>>>>>>> +
>>>>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_simd_updated)
>>>>>>>> + return x86_intr_simd_reg_mask;
>>>>>>>> +
>>>>>>>> + if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_simd_updated)
>>>>>>>> + return x86_user_simd_reg_mask;
>>>>>>>> +
>>>>>>>> + for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>>>>>>>> + supported = false;
>>>>>>>> +
>>>>>>>> + if (!r->mask)
>>>>>>>> + continue;
>>>>>>>> + reg = fls64(r->mask) - 1;
>>>>>>>> +
>>>>>>>> + if (reg >= PERF_REG_X86_MAX_SIMD_REGS)
>>>>>>>> + break;
>>>>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>>>> + supported = __arch_simd_reg_mask(sample_type, reg,
>>>>>>>> + &x86_intr_simd_mask[reg],
>>>>>>>> + &x86_intr_simd_qwords[reg]);
>>>>>>>> + else if (sample_type == PERF_SAMPLE_REGS_USER)
>>>>>>>> + supported = __arch_simd_reg_mask(sample_type, reg,
>>>>>>>> + &x86_user_simd_mask[reg],
>>>>>>>> + &x86_user_simd_qwords[reg]);
>>>>>>>> + if (supported)
>>>>>>>> + mask |= BIT_ULL(reg);
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR) {
>>>>>>>> + x86_intr_simd_reg_mask = mask;
>>>>>>>> + x86_intr_simd_updated = true;
>>>>>>>> + } else {
>>>>>>>> + x86_user_simd_reg_mask = mask;
>>>>>>>> + x86_user_simd_updated = true;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + return mask;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static uint64_t __arch__pred_reg_mask(u64 sample_type)
>>>>>>>> +{
>>>>>>>> + const struct sample_reg *r = NULL;
>>>>>>>> + bool supported;
>>>>>>>> + u64 mask = 0;
>>>>>>>> + int reg;
>>>>>>>> +
>>>>>>>> + if (!has_cap_simd_regs())
>>>>>>>> + return 0;
>>>>>>>> +
>>>>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_pred_updated)
>>>>>>>> + return x86_intr_pred_reg_mask;
>>>>>>>> +
>>>>>>>> + if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_pred_updated)
>>>>>>>> + return x86_user_pred_reg_mask;
>>>>>>>> +
>>>>>>>> + for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>>>>>>>> + supported = false;
>>>>>>>> +
>>>>>>>> + if (!r->mask)
>>>>>>>> + continue;
>>>>>>>> + reg = fls64(r->mask) - 1;
>>>>>>>> +
>>>>>>>> + if (reg >= PERF_REG_X86_MAX_PRED_REGS)
>>>>>>>> + break;
>>>>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>>>> + supported = __arch_pred_reg_mask(sample_type, reg,
>>>>>>>> + &x86_intr_pred_mask[reg],
>>>>>>>> + &x86_intr_pred_qwords[reg]);
>>>>>>>> + else if (sample_type == PERF_SAMPLE_REGS_USER)
>>>>>>>> + supported = __arch_pred_reg_mask(sample_type, reg,
>>>>>>>> + &x86_user_pred_mask[reg],
>>>>>>>> + &x86_user_pred_qwords[reg]);
>>>>>>>> + if (supported)
>>>>>>>> + mask |= BIT_ULL(reg);
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR) {
>>>>>>>> + x86_intr_pred_reg_mask = mask;
>>>>>>>> + x86_intr_pred_updated = true;
>>>>>>>> + } else {
>>>>>>>> + x86_user_pred_reg_mask = mask;
>>>>>>>> + x86_user_pred_updated = true;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + return mask;
>>>>>>>> +}
>>>>>>> This feels repetitive with __arch__simd_reg_mask, could they be
>>>>>>> refactored together?
>>>>>> hmm, it looks we can extract the for loop as a common function. The other
>>>>>> parts are hard to be generalized since they are manipulating different
>>>>>> variables. If we want to generalize them, we have to introduce lots of "if
>>>>>> ... else" branches and that would make code hard to be read.
>>>>>>
>>>>>>
>>>>>>>> +
>>>>>>>> +uint64_t arch__intr_simd_reg_mask(void)
>>>>>>>> +{
>>>>>>>> + return __arch__simd_reg_mask(PERF_SAMPLE_REGS_INTR);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t arch__user_simd_reg_mask(void)
>>>>>>>> +{
>>>>>>>> + return __arch__simd_reg_mask(PERF_SAMPLE_REGS_USER);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t arch__intr_pred_reg_mask(void)
>>>>>>>> +{
>>>>>>>> + return __arch__pred_reg_mask(PERF_SAMPLE_REGS_INTR);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t arch__user_pred_reg_mask(void)
>>>>>>>> +{
>>>>>>>> + return __arch__pred_reg_mask(PERF_SAMPLE_REGS_USER);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static uint64_t arch__simd_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
>>>>>>>> +{
>>>>>>>> + uint64_t mask = 0;
>>>>>>>> +
>>>>>>>> + *qwords = 0;
>>>>>>>> + if (reg < PERF_REG_X86_MAX_SIMD_REGS) {
>>>>>>>> + if (intr) {
>>>>>>>> + *qwords = x86_intr_simd_qwords[reg];
>>>>>>>> + mask = x86_intr_simd_mask[reg];
>>>>>>>> + } else {
>>>>>>>> + *qwords = x86_user_simd_qwords[reg];
>>>>>>>> + mask = x86_user_simd_mask[reg];
>>>>>>>> + }
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + return mask;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static uint64_t arch__pred_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
>>>>>>>> +{
>>>>>>>> + uint64_t mask = 0;
>>>>>>>> +
>>>>>>>> + *qwords = 0;
>>>>>>>> + if (reg < PERF_REG_X86_MAX_PRED_REGS) {
>>>>>>>> + if (intr) {
>>>>>>>> + *qwords = x86_intr_pred_qwords[reg];
>>>>>>>> + mask = x86_intr_pred_mask[reg];
>>>>>>>> + } else {
>>>>>>>> + *qwords = x86_user_pred_qwords[reg];
>>>>>>>> + mask = x86_user_pred_mask[reg];
>>>>>>>> + }
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + return mask;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>>>> +{
>>>>>>>> + if (!x86_intr_simd_updated)
>>>>>>>> + arch__intr_simd_reg_mask();
>>>>>>>> + return arch__simd_reg_bitmap_qwords(reg, qwords, true);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>>>> +{
>>>>>>>> + if (!x86_user_simd_updated)
>>>>>>>> + arch__user_simd_reg_mask();
>>>>>>>> + return arch__simd_reg_bitmap_qwords(reg, qwords, false);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>>>> +{
>>>>>>>> + if (!x86_intr_pred_updated)
>>>>>>>> + arch__intr_pred_reg_mask();
>>>>>>>> + return arch__pred_reg_bitmap_qwords(reg, qwords, true);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>>>> +{
>>>>>>>> + if (!x86_user_pred_updated)
>>>>>>>> + arch__user_pred_reg_mask();
>>>>>>>> + return arch__pred_reg_bitmap_qwords(reg, qwords, false);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> const struct sample_reg *arch__sample_reg_masks(void)
>>>>>>>> {
>>>>>>>> + if (has_cap_simd_regs())
>>>>>>>> + return sample_reg_masks_ext;
>>>>>>>> return sample_reg_masks;
>>>>>>>> }
>>>>>>>>
>>>>>>>> -uint64_t arch__intr_reg_mask(void)
>>>>>>>> +static uint64_t __arch__reg_mask(u64 sample_type, u64 mask, bool has_simd_regs)
>>>>>>>> {
>>>>>>>> struct perf_event_attr attr = {
>>>>>>>> - .type = PERF_TYPE_HARDWARE,
>>>>>>>> - .config = PERF_COUNT_HW_CPU_CYCLES,
>>>>>>>> - .sample_type = PERF_SAMPLE_REGS_INTR,
>>>>>>>> - .sample_regs_intr = PERF_REG_EXTENDED_MASK,
>>>>>>>> - .precise_ip = 1,
>>>>>>>> - .disabled = 1,
>>>>>>>> - .exclude_kernel = 1,
>>>>>>>> + .type = PERF_TYPE_HARDWARE,
>>>>>>>> + .config = PERF_COUNT_HW_CPU_CYCLES,
>>>>>>>> + .sample_type = sample_type,
>>>>>>>> + .precise_ip = 1,
>>>>>>>> + .disabled = 1,
>>>>>>>> + .exclude_kernel = 1,
>>>>>>>> + .sample_simd_regs_enabled = has_simd_regs,
>>>>>>>> };
>>>>>>>> int fd;
>>>>>>>> /*
>>>>>>>> * In an unnamed union, init it here to build on older gcc versions
>>>>>>>> */
>>>>>>>> attr.sample_period = 1;
>>>>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>>>> + attr.sample_regs_intr = mask;
>>>>>>>> + else
>>>>>>>> + attr.sample_regs_user = mask;
>>>>>>>>
>>>>>>>> if (perf_pmus__num_core_pmus() > 1) {
>>>>>>>> struct perf_pmu *pmu = NULL;
>>>>>>>> @@ -318,13 +738,41 @@ uint64_t arch__intr_reg_mask(void)
>>>>>>>> fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
>>>>>>>> if (fd != -1) {
>>>>>>>> close(fd);
>>>>>>>> - return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
>>>>>>>> + return mask;
>>>>>>>> }
>>>>>>>>
>>>>>>>> - return PERF_REGS_MASK;
>>>>>>>> + return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t arch__intr_reg_mask(void)
>>>>>>>> +{
>>>>>>>> + uint64_t mask = PERF_REGS_MASK;
>>>>>>>> +
>>>>>>>> + if (has_cap_simd_regs()) {
>>>>>>>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
>>>>>>>> + GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
>>>>>>>> + true);
> >>>>>>> It's nice to label constant arguments like this with something like:
>>>>>>> /*has_simd_regs=*/true);
>>>>>>>
> >>>>>>> Tools like clang-tidy even try to enforce that the argument names match the comments.
>>>>>> Sure.
>>>>>>
>>>>>>
>>>>>>>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
>>>>>>>> + BIT_ULL(PERF_REG_X86_SSP),
>>>>>>>> + true);
>>>>>>>> + } else
>>>>>>>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR, PERF_REG_EXTENDED_MASK, false);
>>>>>>>> +
>>>>>>>> + return mask;
>>>>>>>> }
>>>>>>>>
>>>>>>>> uint64_t arch__user_reg_mask(void)
>>>>>>>> {
>>>>>>>> - return PERF_REGS_MASK;
>>>>>>>> + uint64_t mask = PERF_REGS_MASK;
>>>>>>>> +
>>>>>>>> + if (has_cap_simd_regs()) {
>>>>>>>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
>>>>>>>> + GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
>>>>>>>> + true);
>>>>>>>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
>>>>>>>> + BIT_ULL(PERF_REG_X86_SSP),
>>>>>>>> + true);
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + return mask;
> >>>>>>> The code is repetitive here, could we refactor it into a single
> >>>>>>> function passing in a user or intr value?
>>>>>> Sure. Would extract the common part.
>>>>>>
>>>>>>
>>>>>>>> }
>>>>>>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>>>>>>>> index 56ebefd075f2..5d1d90cf9488 100644
>>>>>>>> --- a/tools/perf/util/evsel.c
>>>>>>>> +++ b/tools/perf/util/evsel.c
>>>>>>>> @@ -1461,12 +1461,39 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
>>>>>>>> if (opts->sample_intr_regs && !evsel->no_aux_samples &&
>>>>>>>> !evsel__is_dummy_event(evsel)) {
>>>>>>>> attr->sample_regs_intr = opts->sample_intr_regs;
>>>>>>>> + attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_intr);
>>>>>>>> + evsel__set_sample_bit(evsel, REGS_INTR);
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + if ((opts->sample_intr_vec_regs || opts->sample_intr_pred_regs) &&
>>>>>>>> + !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
> >>>>>>>> + /* A non-zero pred qwords implies the set of SIMD registers is used */
>>>>>>>> + if (opts->sample_pred_regs_qwords)
>>>>>>>> + attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
>>>>>>>> + else
>>>>>>>> + attr->sample_simd_pred_reg_qwords = 1;
>>>>>>>> + attr->sample_simd_vec_reg_intr = opts->sample_intr_vec_regs;
>>>>>>>> + attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
>>>>>>>> + attr->sample_simd_pred_reg_intr = opts->sample_intr_pred_regs;
>>>>>>>> evsel__set_sample_bit(evsel, REGS_INTR);
>>>>>>>> }
>>>>>>>>
>>>>>>>> if (opts->sample_user_regs && !evsel->no_aux_samples &&
>>>>>>>> !evsel__is_dummy_event(evsel)) {
>>>>>>>> attr->sample_regs_user |= opts->sample_user_regs;
>>>>>>>> + attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_user);
>>>>>>>> + evsel__set_sample_bit(evsel, REGS_USER);
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + if ((opts->sample_user_vec_regs || opts->sample_user_pred_regs) &&
>>>>>>>> + !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
>>>>>>>> + if (opts->sample_pred_regs_qwords)
>>>>>>>> + attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
>>>>>>>> + else
>>>>>>>> + attr->sample_simd_pred_reg_qwords = 1;
>>>>>>>> + attr->sample_simd_vec_reg_user = opts->sample_user_vec_regs;
>>>>>>>> + attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
>>>>>>>> + attr->sample_simd_pred_reg_user = opts->sample_user_pred_regs;
>>>>>>>> evsel__set_sample_bit(evsel, REGS_USER);
>>>>>>>> }
>>>>>>>>
>>>>>>>> diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
>>>>>>>> index cda1c620968e..0bd100392889 100644
>>>>>>>> --- a/tools/perf/util/parse-regs-options.c
>>>>>>>> +++ b/tools/perf/util/parse-regs-options.c
>>>>>>>> @@ -4,19 +4,139 @@
>>>>>>>> #include <stdint.h>
>>>>>>>> #include <string.h>
>>>>>>>> #include <stdio.h>
>>>>>>>> +#include <linux/bitops.h>
>>>>>>>> #include "util/debug.h"
>>>>>>>> #include <subcmd/parse-options.h>
>>>>>>>> #include "util/perf_regs.h"
>>>>>>>> #include "util/parse-regs-options.h"
>>>>>>>> +#include "record.h"
>>>>>>>> +
>>>>>>>> +static void __print_simd_regs(bool intr, uint64_t simd_mask)
>>>>>>>> +{
>>>>>>>> + const struct sample_reg *r = NULL;
>>>>>>>> + uint64_t bitmap = 0;
>>>>>>>> + u16 qwords = 0;
>>>>>>>> + int reg_idx;
>>>>>>>> +
>>>>>>>> + if (!simd_mask)
>>>>>>>> + return;
>>>>>>>> +
>>>>>>>> + for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>>>>>>>> + if (!(r->mask & simd_mask))
>>>>>>>> + continue;
>>>>>>>> + reg_idx = fls64(r->mask) - 1;
>>>>>>>> + if (intr)
>>>>>>>> + bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>> + else
>>>>>>>> + bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>> + if (bitmap)
>>>>>>>> + fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
>>>>>>>> + }
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static void __print_pred_regs(bool intr, uint64_t pred_mask)
>>>>>>>> +{
>>>>>>>> + const struct sample_reg *r = NULL;
>>>>>>>> + uint64_t bitmap = 0;
>>>>>>>> + u16 qwords = 0;
>>>>>>>> + int reg_idx;
>>>>>>>> +
>>>>>>>> + if (!pred_mask)
>>>>>>>> + return;
>>>>>>>> +
>>>>>>>> + for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>>>>>>>> + if (!(r->mask & pred_mask))
>>>>>>>> + continue;
>>>>>>>> + reg_idx = fls64(r->mask) - 1;
>>>>>>>> + if (intr)
>>>>>>>> + bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>> + else
>>>>>>>> + bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>> + if (bitmap)
>>>>>>>> + fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
>>>>>>>> + }
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static bool __parse_simd_regs(struct record_opts *opts, char *s, bool intr)
>>>>>>>> +{
>>>>>>>> + const struct sample_reg *r = NULL;
>>>>>>>> + bool matched = false;
>>>>>>>> + uint64_t bitmap = 0;
>>>>>>>> + u16 qwords = 0;
>>>>>>>> + int reg_idx;
>>>>>>>> +
>>>>>>>> + for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>>>>>>>> + if (strcasecmp(s, r->name))
>>>>>>>> + continue;
>>>>>>>> + if (!fls64(r->mask))
>>>>>>>> + continue;
>>>>>>>> + reg_idx = fls64(r->mask) - 1;
>>>>>>>> + if (intr)
>>>>>>>> + bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>> + else
>>>>>>>> + bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>> + matched = true;
>>>>>>>> + break;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + /* Just need the highest qwords */
>>>>>>> I'm not following here. Does the bitmap need to handle gaps?
> >>>>>> Currently no. In theory, the kernel allows user space to sample only a
> >>>>>> subset of SIMD registers, e.g., 0xff or 0xf0f for XMM registers (HW
> >>>>>> supports 16 XMM registers on x86_64), but the perf tool doesn't support
> >>>>>> that, to avoid introducing too much complexity. Moreover, I don't think
> >>>>>> end users have such a requirement. In most cases, users know which kinds
> >>>>>> of SIMD registers their programs use but usually don't know or care
> >>>>>> which exact SIMD register is used.
>>>>>>
>>>>>>
>>>>>>>> + if (qwords > opts->sample_vec_regs_qwords) {
>>>>>>>> + opts->sample_vec_regs_qwords = qwords;
>>>>>>>> + if (intr)
>>>>>>>> + opts->sample_intr_vec_regs = bitmap;
>>>>>>>> + else
>>>>>>>> + opts->sample_user_vec_regs = bitmap;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + return matched;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static bool __parse_pred_regs(struct record_opts *opts, char *s, bool intr)
>>>>>>>> +{
>>>>>>>> + const struct sample_reg *r = NULL;
>>>>>>>> + bool matched = false;
>>>>>>>> + uint64_t bitmap = 0;
>>>>>>>> + u16 qwords = 0;
>>>>>>>> + int reg_idx;
>>>>>>>> +
>>>>>>>> + for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>>>>>>>> + if (strcasecmp(s, r->name))
>>>>>>>> + continue;
>>>>>>>> + if (!fls64(r->mask))
>>>>>>>> + continue;
>>>>>>>> + reg_idx = fls64(r->mask) - 1;
>>>>>>>> + if (intr)
>>>>>>>> + bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>> + else
>>>>>>>> + bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>> + matched = true;
>>>>>>>> + break;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + /* Just need the highest qwords */
>>>>>>> Again repetitive, could we have a single function?
> >>>>>> Yes, I suppose at least the for loop can be extracted into a common function.
>>>>>>
>>>>>>
>>>>>>>> + if (qwords > opts->sample_pred_regs_qwords) {
>>>>>>>> + opts->sample_pred_regs_qwords = qwords;
>>>>>>>> + if (intr)
>>>>>>>> + opts->sample_intr_pred_regs = bitmap;
>>>>>>>> + else
>>>>>>>> + opts->sample_user_pred_regs = bitmap;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + return matched;
>>>>>>>> +}
>>>>>>>>
>>>>>>>> static int
>>>>>>>> __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>>> {
>>>>>>>> uint64_t *mode = (uint64_t *)opt->value;
>>>>>>>> const struct sample_reg *r = NULL;
>>>>>>>> + struct record_opts *opts;
>>>>>>>> char *s, *os = NULL, *p;
>>>>>>>> - int ret = -1;
>>>>>>>> + bool has_simd_regs = false;
>>>>>>>> uint64_t mask;
>>>>>>>> + uint64_t simd_mask;
>>>>>>>> + uint64_t pred_mask;
>>>>>>>> + int ret = -1;
>>>>>>>>
>>>>>>>> if (unset)
>>>>>>>> return 0;
>>>>>>>> @@ -27,10 +147,17 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>>> if (*mode)
>>>>>>>> return -1;
>>>>>>>>
>>>>>>>> - if (intr)
>>>>>>>> + if (intr) {
>>>>>>>> + opts = container_of(opt->value, struct record_opts, sample_intr_regs);
>>>>>>>> mask = arch__intr_reg_mask();
>>>>>>>> - else
>>>>>>>> + simd_mask = arch__intr_simd_reg_mask();
>>>>>>>> + pred_mask = arch__intr_pred_reg_mask();
>>>>>>>> + } else {
>>>>>>>> + opts = container_of(opt->value, struct record_opts, sample_user_regs);
>>>>>>>> mask = arch__user_reg_mask();
>>>>>>>> + simd_mask = arch__user_simd_reg_mask();
>>>>>>>> + pred_mask = arch__user_pred_reg_mask();
>>>>>>>> + }
>>>>>>>>
>>>>>>>> /* str may be NULL in case no arg is passed to -I */
>>>>>>>> if (str) {
>>>>>>>> @@ -50,10 +177,24 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>>> if (r->mask & mask)
>>>>>>>> fprintf(stderr, "%s ", r->name);
>>>>>>>> }
>>>>>>>> + __print_simd_regs(intr, simd_mask);
>>>>>>>> + __print_pred_regs(intr, pred_mask);
>>>>>>>> fputc('\n', stderr);
>>>>>>>> /* just printing available regs */
>>>>>>>> goto error;
>>>>>>>> }
>>>>>>>> +
>>>>>>>> + if (simd_mask) {
>>>>>>>> + has_simd_regs = __parse_simd_regs(opts, s, intr);
>>>>>>>> + if (has_simd_regs)
>>>>>>>> + goto next;
>>>>>>>> + }
>>>>>>>> + if (pred_mask) {
>>>>>>>> + has_simd_regs = __parse_pred_regs(opts, s, intr);
>>>>>>>> + if (has_simd_regs)
>>>>>>>> + goto next;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> for (r = arch__sample_reg_masks(); r->name; r++) {
>>>>>>>> if ((r->mask & mask) && !strcasecmp(s, r->name))
>>>>>>>> break;
>>>>>>>> @@ -65,7 +206,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>>> }
>>>>>>>>
>>>>>>>> *mode |= r->mask;
>>>>>>>> -
>>>>>>>> +next:
>>>>>>>> if (!p)
>>>>>>>> break;
>>>>>>>>
>>>>>>>> @@ -75,7 +216,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>>> ret = 0;
>>>>>>>>
>>>>>>>> /* default to all possible regs */
>>>>>>>> - if (*mode == 0)
>>>>>>>> + if (*mode == 0 && !has_simd_regs)
>>>>>>>> *mode = mask;
>>>>>>>> error:
>>>>>>>> free(os);
>>>>>>>> diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
>>>>>>>> index 66b666d9ce64..fb0366d050cf 100644
>>>>>>>> --- a/tools/perf/util/perf_event_attr_fprintf.c
>>>>>>>> +++ b/tools/perf/util/perf_event_attr_fprintf.c
>>>>>>>> @@ -360,6 +360,12 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
>>>>>>>> PRINT_ATTRf(aux_start_paused, p_unsigned);
>>>>>>>> PRINT_ATTRf(aux_pause, p_unsigned);
>>>>>>>> PRINT_ATTRf(aux_resume, p_unsigned);
>>>>>>>> + PRINT_ATTRf(sample_simd_pred_reg_qwords, p_unsigned);
>>>>>>>> + PRINT_ATTRf(sample_simd_pred_reg_intr, p_hex);
>>>>>>>> + PRINT_ATTRf(sample_simd_pred_reg_user, p_hex);
>>>>>>>> + PRINT_ATTRf(sample_simd_vec_reg_qwords, p_unsigned);
>>>>>>>> + PRINT_ATTRf(sample_simd_vec_reg_intr, p_hex);
>>>>>>>> + PRINT_ATTRf(sample_simd_vec_reg_user, p_hex);
>>>>>>>>
>>>>>>>> return ret;
>>>>>>>> }
>>>>>>>> diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
>>>>>>>> index 44b90bbf2d07..e8a9fabc92e6 100644
>>>>>>>> --- a/tools/perf/util/perf_regs.c
>>>>>>>> +++ b/tools/perf/util/perf_regs.c
>>>>>>>> @@ -11,6 +11,11 @@ int __weak arch_sdt_arg_parse_op(char *old_op __maybe_unused,
>>>>>>>> return SDT_ARG_SKIP;
>>>>>>>> }
>>>>>>>>
>>>>>>>> +bool __weak arch_has_simd_regs(u64 mask __maybe_unused)
>>>>>>>> +{
>>>>>>>> + return false;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> uint64_t __weak arch__intr_reg_mask(void)
>>>>>>>> {
>>>>>>>> return 0;
>>>>>>>> @@ -21,6 +26,50 @@ uint64_t __weak arch__user_reg_mask(void)
>>>>>>>> return 0;
>>>>>>>> }
>>>>>>>>
>>>>>>>> +uint64_t __weak arch__intr_simd_reg_mask(void)
>>>>>>>> +{
>>>>>>>> + return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t __weak arch__user_simd_reg_mask(void)
>>>>>>>> +{
>>>>>>>> + return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t __weak arch__intr_pred_reg_mask(void)
>>>>>>>> +{
>>>>>>>> + return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t __weak arch__user_pred_reg_mask(void)
>>>>>>>> +{
>>>>>>>> + return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t __weak arch__intr_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>>>>>>>> +{
>>>>>>>> + *qwords = 0;
>>>>>>>> + return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t __weak arch__user_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>>>>>>>> +{
>>>>>>>> + *qwords = 0;
>>>>>>>> + return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t __weak arch__intr_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>>>>>>>> +{
>>>>>>>> + *qwords = 0;
>>>>>>>> + return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t __weak arch__user_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>>>>>>>> +{
>>>>>>>> + *qwords = 0;
>>>>>>>> + return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> static const struct sample_reg sample_reg_masks[] = {
>>>>>>>> SMPL_REG_END
>>>>>>>> };
>>>>>>>> @@ -30,6 +79,16 @@ const struct sample_reg * __weak arch__sample_reg_masks(void)
>>>>>>>> return sample_reg_masks;
>>>>>>>> }
>>>>>>>>
>>>>>>>> +const struct sample_reg * __weak arch__sample_simd_reg_masks(void)
>>>>>>>> +{
>>>>>>>> + return sample_reg_masks;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +const struct sample_reg * __weak arch__sample_pred_reg_masks(void)
>>>>>>>> +{
>>>>>>>> + return sample_reg_masks;
>>>>>>>> +}
>>>>>>> Thinking out loud. I wonder if there is a way to hide the weak
>>> functions. It seems the support is tied to PMUs, particularly core
>>> PMUs; perhaps we can push things into pmu and arch pmu code. Then we
>>>>>>> ask the PMU to parse the register strings, set up the perf_event_attr,
>>>>>>> etc. I'm somewhat scared these functions will be used on the report
>>>>>>> rather than record side of things, thereby breaking perf.data support
>>>>>>> when the host kernel does or doesn't have the SIMD support.
>>>>>> Ian, I don't quite follow your point here.
>>>>>>
>>>>>> I don't quite understand how we should "push things into pmu and
>>>>>> arch pmu code". The current SIMD register support follows the same
>>>>>> approach as the general register support. If we intend to change the
>>>>>> approach entirely, we'd better do that in an independent patch-set.
>>>>>>
>>>>>> Why would these functions break perf.data report? perf-report checks
>>>>>> whether the PERF_SAMPLE_REGS_ABI_SIMD flag is set for each record;
>>>>>> only when the flag is set (indicating SIMD register data is appended
>>>>>> to the record) does perf-report try to parse the SIMD register data.
>>>>> Thanks Dapeng, sorry I wasn't clear. So, I've landed cleanups to
>>>>> remove weak symbols like:
>>>>> https://lore.kernel.org/lkml/20250724163302.596743-21-irogers@google.com/#t
>>>>>
>>>>> For these patches what I'm imagining is that there is a Nova Lake
>>>>> generated perf.data file. Using perf report, script, etc. on the Nova
>>>>> Lake should expose all of the same mask, qword, etc. values as when
>>>>> the perf.data was generated and so things will work. If the perf.data
>>>>> file was taken to, say, my Alderlake, then what will happen? Generally
>>>>> using the arch directory and weak symbols is a code smell that cross
>>>>> platform things are going to break - there should be sufficient data
>>>>> in the event and the perf_event_attr to fully decode what's going on.
>>>>> Sometimes tying things to a PMU name can avoid the use of the arch
>>>>> directory. We were able to avoid the arch directory to a good extent
>>>>> for the TPEBS code, even though it is a very modern Intel feature.
>>>> I see.
>>>>
>>>> But the sampling support for SIMD registers is different from the sample
>>>> weight processing in the patch
>>>> https://lore.kernel.org/lkml/20250724163302.596743-21-irogers@google.com/#t.
>>>> Each arch may support different kinds of SIMD registers, and furthermore
>>>> each kind of SIMD register may have a different register count and
>>>> width. It's quite hard to come up with common functions or fields to
>>>> represent the name and attributes of these arch-specific SIMD registers.
>>>> This arch-specific information can only be provided by arch-specific code.
>>>> So it looks like the __weak functions are still the easiest way to implement this.
>>>>
>>>> I don't think the perf.data parsing would break when moving from one
>>>> platform to another of the same arch, e.g., from Nova Lake to Alder Lake.
>>>> To indicate the presence of SIMD registers in record data, a new ABI flag
>>>> "PERF_SAMPLE_REGS_ABI_SIMD" is introduced. If the perf tool on the 2nd
>>>> platform is new enough and recognizes this new flag, then the SIMD
>>>> register data is parsed correctly. Even if the perf tool is old
>>>> and has no support for SIMD registers, the SIMD register data is just
>>>> silently ignored and should not break the parsing.
>>> That's good to know. I'm confused then why these functions can't just
>>> be within the arch directory? For example, we don't expose the
>>> intel-pt PMU code in the common code except for the parsing parts. A
>>> lot of that is handled by the default perf_event_attr initialization
>>> that every PMU can have its own variant of:
>>> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/pmu.h?h=perf-tools-next#n123
>> I see. From my point of view, there seems to be no essential difference
>> between a function pointer and a __weak function, and it looks hard to find
>> a common data structure to hold all these function pointers, which need to
>> be called in different places, like register name parsing, register data
>> dumping ...
>>
>>
>>> Perhaps this is all just evidence of tech debt in the perf_regs.c code
>>> :-/ The bit that's relevant to the patch here is that I think this is
>>> adding to the tech debt problem as 11 more functions are added to
>>> perf_regs.h.
>> Yeah, 11 new __weak functions seem too many. We may merge the same kinds
>> of functions, like merging *_simd_reg_mask() and *_pred_reg_mask() into a
>> single function with a type argument; then the newly added __weak functions
>> could shrink by half.
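(To illustrate what I meant there, a minimal sketch; the enum and names
below are invented and not part of the posted patches, and the usual perf
includes for uint64_t/bool/__weak are assumed:)
```
/* Hypothetical register-class selector, not in the posted series. */
enum ext_reg_class {
	EXT_REG_SIMD,	/* vector registers, e.g. XMM/YMM/ZMM on x86 */
	EXT_REG_PRED,	/* predicate registers, e.g. OPMASK on x86 */
};

/*
 * One weak hook could stand in for the four
 * arch__{intr,user}_{simd,pred}_reg_mask() variants: 'intr' selects
 * REGS_INTR vs REGS_USER, 'rclass' selects the register kind.
 */
uint64_t __weak arch__ext_reg_mask(enum ext_reg_class rclass, bool intr)
{
	return 0;	/* no extended registers unless an arch overrides this */
}
```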
> There could be a good reason for 11 weak functions :-) In
> perf_event.h you've added this to the sample event:
> ```
> + * u64 regs[weight(mask)];
> + * struct {
> + * u16 nr_vectors;
> + * u16 vector_qwords;
> + * u16 nr_pred;
> + * u16 pred_qwords;
> + * u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
> + * } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
> + * } && PERF_SAMPLE_REGS_USER
> ```
> so these things are readable/writable outside of builds with arch/x86
> compiled in, which is why it seems odd that there needs to be arch
> code in the common code to handle them. This is similar to how I needed to
> get the retirement latency parsing out of the arch/x86 directory, as you
> could potentially be looking at a perf.data file with retirement latencies
> in it on a non-x86 platform.
Ian, I'm not sure if I fully get your point. If not, please correct me.
Although these newly introduced fields are generic and exist on all
architectures, they are not enough to get all the necessary information to
dump or parse the SIMD registers, e.g., the SIMD register name.
Let's take dumping the sampled values of SIMD registers as an example.
There could be different kinds of SIMD registers on different archs,
like XMM/YMM/ZMM on x86 and V-registers/Z-registers on ARM.
We only know the register count and width from the generic fields; we have
no way to directly know the exact name a given SIMD register corresponds
to. We have to call an arch-specific function to figure that out and then
print it.
At least for now, it looks like we still need these arch-specific functions ...
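For instance, something like this would still have to live somewhere (a
hypothetical sketch; the tables and helper are invented for illustration):
```
#include <string.h>

/* Hypothetical per-arch naming tables; indices follow each arch's own enum. */
static const char * const x86_simd_names[]   = { "XMM", "YMM", "ZMM" };
static const char * const arm64_simd_names[] = { "V", "Z" };

static const char *simd_reg_name(const char *arch, unsigned int reg_idx)
{
	if (!strcmp(arch, "x86") && reg_idx < 3)
		return x86_simd_names[reg_idx];
	if (!strcmp(arch, "arm64") && reg_idx < 2)
		return arm64_simd_names[reg_idx];
	return "UNKNOWN";
}
```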
>
> Thanks,
> Ian
>
>>> Thanks,
>>> Ian
>>>
>>>>> Thanks,
>>>>> Ian
>>>>>
>>>>>
>>>>>
>>>>>>> Thanks,
>>>>>>> Ian
>>>>>>>
>>>>>>>> +
>>>>>>>> const char *perf_reg_name(int id, const char *arch)
>>>>>>>> {
>>>>>>>> const char *reg_name = NULL;
>>>>>>>> diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
>>>>>>>> index f2d0736d65cc..bce9c4cfd1bf 100644
>>>>>>>> --- a/tools/perf/util/perf_regs.h
>>>>>>>> +++ b/tools/perf/util/perf_regs.h
>>>>>>>> @@ -24,9 +24,20 @@ enum {
>>>>>>>> };
>>>>>>>>
>>>>>>>> int arch_sdt_arg_parse_op(char *old_op, char **new_op);
>>>>>>>> +bool arch_has_simd_regs(u64 mask);
>>>>>>>> uint64_t arch__intr_reg_mask(void);
>>>>>>>> uint64_t arch__user_reg_mask(void);
>>>>>>>> const struct sample_reg *arch__sample_reg_masks(void);
>>>>>>>> +const struct sample_reg *arch__sample_simd_reg_masks(void);
>>>>>>>> +const struct sample_reg *arch__sample_pred_reg_masks(void);
>>>>>>>> +uint64_t arch__intr_simd_reg_mask(void);
>>>>>>>> +uint64_t arch__user_simd_reg_mask(void);
>>>>>>>> +uint64_t arch__intr_pred_reg_mask(void);
>>>>>>>> +uint64_t arch__user_pred_reg_mask(void);
>>>>>>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>>
>>>>>>>> const char *perf_reg_name(int id, const char *arch);
>>>>>>>> int perf_reg_value(u64 *valp, struct regs_dump *regs, int id);
>>>>>>>> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
>>>>>>>> index ea3a6c4657ee..825ffb4cc53f 100644
>>>>>>>> --- a/tools/perf/util/record.h
>>>>>>>> +++ b/tools/perf/util/record.h
>>>>>>>> @@ -59,7 +59,13 @@ struct record_opts {
>>>>>>>> unsigned int user_freq;
>>>>>>>> u64 branch_stack;
>>>>>>>> u64 sample_intr_regs;
>>>>>>>> + u64 sample_intr_vec_regs;
>>>>>>>> u64 sample_user_regs;
>>>>>>>> + u64 sample_user_vec_regs;
>>>>>>>> + u16 sample_pred_regs_qwords;
>>>>>>>> + u16 sample_vec_regs_qwords;
>>>>>>>> + u16 sample_intr_pred_regs;
>>>>>>>> + u16 sample_user_pred_regs;
>>>>>>>> u64 default_interval;
>>>>>>>> u64 user_interval;
>>>>>>>> size_t auxtrace_snapshot_size;
>>>>>>>> --
>>>>>>>> 2.34.1
>>>>>>>>
* Re: [Patch v5 06/19] perf/x86: Add support for XMM registers in non-PEBS and REGS_USER
2025-12-04 18:59 ` Dave Hansen
@ 2025-12-05 8:42 ` Peter Zijlstra
0 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2025-12-05 8:42 UTC (permalink / raw)
To: Dave Hansen
Cc: Dapeng Mi, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang
On Thu, Dec 04, 2025 at 10:59:15AM -0800, Dave Hansen wrote:
> On 12/4/25 07:17, Peter Zijlstra wrote:
> >> - Additionally, checking the TIF_NEED_FPU_LOAD flag alone is insufficient.
> >> Some corner cases, such as an NMI occurring just after the flag switches
> >> but still in kernel mode, cannot be handled.
> > Urgh.. Dave, Thomas, is there any reason we could not set
> > TIF_NEED_FPU_LOAD *after* doing the XSAVE (clearing is already done
> > after restore).
> >
> > That way, when an NMI sees TIF_NEED_FPU_LOAD it knows the task copy is
> > consistent.
>
> Something like the attached patch?
>
> I think that would be just fine. save_fpregs_to_fpstate() doesn't
> actually change the need for TIF_NEED_FPU_LOAD, so I don't think the
> ordering matters.
Right, I missed this one. And yes, I couldn't find any site where this
ordering mattered either. It's all with interrupts disabled, so normally
it all goes together. Only the NMI could observe the difference.
> diff --git a/arch/x86/include/asm/fpu/sched.h b/arch/x86/include/asm/fpu/sched.h
> index 89004f4ca208..2d57a7bf5406 100644
> --- a/arch/x86/include/asm/fpu/sched.h
> +++ b/arch/x86/include/asm/fpu/sched.h
> @@ -36,8 +36,8 @@ static inline void switch_fpu(struct task_struct *old, int cpu)
> !(old->flags & (PF_KTHREAD | PF_USER_WORKER))) {
> struct fpu *old_fpu = x86_task_fpu(old);
>
> - set_tsk_thread_flag(old, TIF_NEED_FPU_LOAD);
> save_fpregs_to_fpstate(old_fpu);
> + set_tsk_thread_flag(old, TIF_NEED_FPU_LOAD);
> /*
> * The save operation preserved register state, so the
> * fpu_fpregs_owner_ctx is still @old_fpu. Store the
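To spell out what that buys an NMI handler on the same CPU (a sketch only;
both helpers below are hypothetical, not existing kernel functions):
```
if (test_tsk_thread_flag(current, TIF_NEED_FPU_LOAD)) {
	/*
	 * With the reordering above, the flag becomes visible only after
	 * save_fpregs_to_fpstate() has finished, so the in-memory fpstate
	 * is a complete, consistent snapshot.
	 */
	read_saved_fpstate(current);	/* hypothetical helper */
} else {
	/* Registers are still live in hardware; snapshot them directly. */
	snapshot_live_fpregs();		/* hypothetical helper */
}
```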
* Re: [Patch v5 07/19] perf: Add sampling support for SIMD registers
2025-12-03 6:54 ` [Patch v5 07/19] perf: Add sampling support for SIMD registers Dapeng Mi
@ 2025-12-05 11:07 ` Peter Zijlstra
2025-12-08 5:24 ` Mi, Dapeng
2025-12-05 11:40 ` Peter Zijlstra
1 sibling, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2025-12-05 11:07 UTC (permalink / raw)
To: Dapeng Mi
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang
On Wed, Dec 03, 2025 at 02:54:48PM +0800, Dapeng Mi wrote:
> @@ -545,6 +547,25 @@ struct perf_event_attr {
> __u64 sig_data;
>
> __u64 config3; /* extension of config2 */
> +
> +
> + /*
> + * Defines set of SIMD registers to dump on samples.
> + * The sample_simd_regs_enabled !=0 implies the
> + * set of SIMD registers is used to config all SIMD registers.
> + * If !sample_simd_regs_enabled, sample_regs_XXX may be used to
> + * config some SIMD registers on X86.
> + */
> + union {
> + __u16 sample_simd_regs_enabled;
> + __u16 sample_simd_pred_reg_qwords;
> + };
> + __u32 sample_simd_pred_reg_intr;
> + __u32 sample_simd_pred_reg_user;
> + __u16 sample_simd_vec_reg_qwords;
> + __u64 sample_simd_vec_reg_intr;
> + __u64 sample_simd_vec_reg_user;
> + __u32 __reserved_4;
> };
This is poorly aligned and causes holes.
This:
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index d292f96bc06f..2deb8dd0ca37 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -545,6 +545,14 @@ struct perf_event_attr {
__u64 sig_data;
__u64 config3; /* extension of config2 */
+
+ __u16 sample_simd_pred_reg_qwords;
+ __u32 sample_simd_pred_reg_intr;
+ __u32 sample_simd_pred_reg_user;
+ __u16 sample_simd_vec_reg_qwords;
+ __u64 sample_simd_vec_reg_intr;
+ __u64 sample_simd_vec_reg_user;
+ __u32 __reserved_4;
};
/*
results in:
__u64 config3; /* 128 8 */
__u16 sample_simd_pred_reg_qwords; /* 136 2 */
/* XXX 2 bytes hole, try to pack */
__u32 sample_simd_pred_reg_intr; /* 140 4 */
__u32 sample_simd_pred_reg_user; /* 144 4 */
__u16 sample_simd_vec_reg_qwords; /* 148 2 */
/* XXX 2 bytes hole, try to pack */
__u64 sample_simd_vec_reg_intr; /* 152 8 */
__u64 sample_simd_vec_reg_user; /* 160 8 */
__u32 __reserved_4; /* 168 4 */
A better layout might be:
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index d292f96bc06f..f72707e9df68 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -545,6 +545,15 @@ struct perf_event_attr {
__u64 sig_data;
__u64 config3; /* extension of config2 */
+
+ __u16 sample_simd_pred_reg_qwords;
+ __u16 sample_simd_vec_reg_qwords;
+ __u32 __reserved_4;
+
+ __u32 sample_simd_pred_reg_intr;
+ __u32 sample_simd_pred_reg_user;
+ __u64 sample_simd_vec_reg_intr;
+ __u64 sample_simd_vec_reg_user;
};
/*
such that:
__u64 config3; /* 128 8 */
__u16 sample_simd_pred_reg_qwords; /* 136 2 */
__u16 sample_simd_vec_reg_qwords; /* 138 2 */
__u32 __reserved_4; /* 140 4 */
__u32 sample_simd_pred_reg_intr; /* 144 4 */
__u32 sample_simd_pred_reg_user; /* 148 4 */
__u64 sample_simd_vec_reg_intr; /* 152 8 */
__u64 sample_simd_vec_reg_user; /* 160 8 */
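A compile-time check against the offsets above could pin the layout down (a
sketch, not part of the patch; offsets taken from the pahole output):
```
#include <stddef.h>

/* Sketch: lock in the hole-free layout so later edits can't regress it. */
static_assert(offsetof(struct perf_event_attr, sample_simd_pred_reg_qwords) == 136,
	      "sample_simd_pred_reg_qwords misplaced");
static_assert(offsetof(struct perf_event_attr, sample_simd_pred_reg_intr) == 144,
	      "hole reintroduced before sample_simd_pred_reg_intr");
static_assert(offsetof(struct perf_event_attr, sample_simd_vec_reg_user) == 160,
	      "sample_simd_vec_reg_user misplaced");
```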
* Re: [Patch v5 08/19] perf/x86: Enable XMM sampling using sample_simd_vec_reg_* fields
2025-12-03 6:54 ` [Patch v5 08/19] perf/x86: Enable XMM sampling using sample_simd_vec_reg_* fields Dapeng Mi
@ 2025-12-05 11:25 ` Peter Zijlstra
2025-12-08 6:10 ` Mi, Dapeng
0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2025-12-05 11:25 UTC (permalink / raw)
To: Dapeng Mi
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang
On Wed, Dec 03, 2025 at 02:54:49PM +0800, Dapeng Mi wrote:
> diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
> index 7c9d2bb3833b..c3862e5fdd6d 100644
> --- a/arch/x86/include/uapi/asm/perf_regs.h
> +++ b/arch/x86/include/uapi/asm/perf_regs.h
> @@ -55,4 +55,21 @@ enum perf_event_x86_regs {
>
> #define PERF_REG_EXTENDED_MASK (~((1ULL << PERF_REG_X86_XMM0) - 1))
>
> +enum {
> + PERF_REG_X86_XMM,
> + PERF_REG_X86_MAX_SIMD_REGS,
> +};
> +
> +enum {
> + PERF_X86_SIMD_XMM_REGS = 16,
> + PERF_X86_SIMD_VEC_REGS_MAX = PERF_X86_SIMD_XMM_REGS,
> +};
> +
> +#define PERF_X86_SIMD_VEC_MASK GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
> +
> +enum {
> + PERF_X86_XMM_QWORDS = 2,
> + PERF_X86_SIMD_QWORDS_MAX = PERF_X86_XMM_QWORDS,
> +};
> +
> #endif /* _ASM_X86_PERF_REGS_H */
I don't understand this bit -- the next few patches add to it for YMM
and ZMM, but what's the point? I don't see why this is needed at all,
let alone why it needs to be UABI.
* Re: [Patch v5 07/19] perf: Add sampling support for SIMD registers
2025-12-03 6:54 ` [Patch v5 07/19] perf: Add sampling support for SIMD registers Dapeng Mi
2025-12-05 11:07 ` Peter Zijlstra
@ 2025-12-05 11:40 ` Peter Zijlstra
2025-12-08 6:00 ` Mi, Dapeng
1 sibling, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2025-12-05 11:40 UTC (permalink / raw)
To: Dapeng Mi
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang
On Wed, Dec 03, 2025 at 02:54:48PM +0800, Dapeng Mi wrote:
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 3e9c48fa2202..b19de038979e 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -7469,6 +7469,50 @@ perf_output_sample_regs(struct perf_output_handle *handle,
> }
> }
>
> +static void
> +perf_output_sample_simd_regs(struct perf_output_handle *handle,
> + struct perf_event *event,
> + struct pt_regs *regs,
> + u64 mask, u16 pred_mask)
> +{
> + u16 pred_qwords = event->attr.sample_simd_pred_reg_qwords;
> + u16 vec_qwords = event->attr.sample_simd_vec_reg_qwords;
> + u64 pred_bitmap = pred_mask;
> + u64 bitmap = mask;
> + u16 nr_vectors;
> + u16 nr_pred;
> + int bit;
> + u64 val;
> + u16 i;
> +
> + nr_vectors = hweight64(bitmap);
> + nr_pred = hweight64(pred_bitmap);
> +
> + perf_output_put(handle, nr_vectors);
> + perf_output_put(handle, vec_qwords);
> + perf_output_put(handle, nr_pred);
> + perf_output_put(handle, pred_qwords);
> +
> + if (nr_vectors) {
> + for_each_set_bit(bit, (unsigned long *)&bitmap,
This isn't right. Yes, we do this all the time in the x86 code, but there
we can assume little-endian byte order. This is core code and is also
used on big-endian systems where this is very much broken.
> + sizeof(bitmap) * BITS_PER_BYTE) {
> + for (i = 0; i < vec_qwords; i++) {
> + val = perf_simd_reg_value(regs, bit, i, false);
> + perf_output_put(handle, val);
> + }
> + }
> + }
> + if (nr_pred) {
> + for_each_set_bit(bit, (unsigned long *)&pred_bitmap,
> + sizeof(pred_bitmap) * BITS_PER_BYTE) {
> + for (i = 0; i < pred_qwords; i++) {
> + val = perf_simd_reg_value(regs, bit, i, true);
> + perf_output_put(handle, val);
> + }
> + }
> + }
> +}
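One endian-safe alternative (my sketch, not necessarily the fix that should
land) is to test bits with plain 64-bit arithmetic instead of casting to
unsigned long *; all the variables here come from the function above:
```
/* Sketch: same output as above, minus the (unsigned long *) cast. */
for (bit = 0; bit < 64; bit++) {
	if (!(bitmap & BIT_ULL(bit)))
		continue;
	for (i = 0; i < vec_qwords; i++) {
		val = perf_simd_reg_value(regs, bit, i, false);
		perf_output_put(handle, val);
	}
}
```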
* Re: [Patch v5 12/19] perf/x86: Enable eGPRs sampling using sample_regs_* fields
2025-12-03 6:54 ` [Patch v5 12/19] perf/x86: Enable eGPRs sampling using sample_regs_* fields Dapeng Mi
@ 2025-12-05 12:16 ` Peter Zijlstra
2025-12-08 6:11 ` Mi, Dapeng
0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2025-12-05 12:16 UTC (permalink / raw)
To: Dapeng Mi
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang
On Wed, Dec 03, 2025 at 02:54:53PM +0800, Dapeng Mi wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
>
> This patch enables sampling of APX eGPRs (R16 ~ R31) via the
> sample_regs_* fields.
>
> To sample eGPRs, the sample_simd_regs_enabled field must be set. This
> allows the spare space (reclaimed from the original XMM space) in the
> sample_regs_* fields to be used for representing eGPRs.
>
> The perf_reg_value() function needs to check if the
> PERF_SAMPLE_REGS_ABI_SIMD flag is set first, and then determine whether
> to output eGPRs or legacy XMM registers to userspace.
>
> The perf_reg_validate() function is enhanced to validate the eGPRs bitmap
> by adding a new argument, "simd_enabled".
>
> Currently, eGPRs sampling is only supported on the x86_64 architecture, as
> APX is only available on x86_64 platforms.
>
> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> ---
> arch/arm/kernel/perf_regs.c | 2 +-
> arch/arm64/kernel/perf_regs.c | 2 +-
> arch/csky/kernel/perf_regs.c | 2 +-
> arch/loongarch/kernel/perf_regs.c | 2 +-
> arch/mips/kernel/perf_regs.c | 2 +-
> arch/parisc/kernel/perf_regs.c | 2 +-
> arch/powerpc/perf/perf_regs.c | 2 +-
> arch/riscv/kernel/perf_regs.c | 2 +-
> arch/s390/kernel/perf_regs.c | 2 +-
Perhaps split out the part where you modify the arch function interface?
* Re: [Patch v5 13/19] perf/x86: Enable SSP sampling using sample_regs_* fields
2025-12-03 6:54 ` [Patch v5 13/19] perf/x86: Enable SSP " Dapeng Mi
@ 2025-12-05 12:20 ` Peter Zijlstra
2025-12-08 6:21 ` Mi, Dapeng
0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2025-12-05 12:20 UTC (permalink / raw)
To: Dapeng Mi
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang
On Wed, Dec 03, 2025 at 02:54:54PM +0800, Dapeng Mi wrote:
> diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
> index ca242db3720f..c925af4160ad 100644
> --- a/arch/x86/include/asm/perf_event.h
> +++ b/arch/x86/include/asm/perf_event.h
> @@ -729,6 +729,10 @@ struct x86_perf_regs {
> u64 *egpr_regs;
> struct apx_state *egpr;
> };
> + union {
> + u64 *cet_regs;
> + struct cet_user_state *cet;
> + };
> };
Are we envisioning more than just SSP?
* Re: [Patch v5 16/19] perf/x86: Activate back-to-back NMI detection for arch-PEBS induced NMIs
2025-12-03 6:54 ` [Patch v5 16/19] perf/x86: Activate back-to-back NMI detection for arch-PEBS induced NMIs Dapeng Mi
@ 2025-12-05 12:39 ` Peter Zijlstra
2025-12-07 20:44 ` Andi Kleen
2025-12-08 6:46 ` Mi, Dapeng
0 siblings, 2 replies; 55+ messages in thread
From: Peter Zijlstra @ 2025-12-05 12:39 UTC (permalink / raw)
To: Dapeng Mi
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao
On Wed, Dec 03, 2025 at 02:54:57PM +0800, Dapeng Mi wrote:
> When two or more identical PEBS events with the same sampling period are
> programmed on a mix of PDIST and non-PDIST counters, multiple
> back-to-back NMIs can be triggered.
This is a hardware defect -- albeit a fairly common one.
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index da48bcde8fce..a130d3f14844 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -3351,8 +3351,7 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
> */
> if (__test_and_clear_bit(GLOBAL_STATUS_ARCH_PEBS_THRESHOLD_BIT,
> (unsigned long *)&status)) {
> - handled++;
> - static_call(x86_pmu_drain_pebs)(regs, &data);
> + handled += static_call(x86_pmu_drain_pebs)(regs, &data);
>
> if (cpuc->events[INTEL_PMC_IDX_FIXED_SLOTS] &&
> is_pebs_counter_event_group(cpuc->events[INTEL_PMC_IDX_FIXED_SLOTS]))
Note that the old code would unconditionally do handled++, while the new code:
> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> index a01c72c03bd6..c7cdcd585574 100644
> --- a/arch/x86/events/intel/ds.c
> +++ b/arch/x86/events/intel/ds.c
> @@ -2759,7 +2759,7 @@ __intel_pmu_pebs_events(struct perf_event *event,
> __intel_pmu_pebs_last_event(event, iregs, regs, data, at, count, setup_sample);
> }
>
> -static void intel_pmu_drain_pebs_core(struct pt_regs *iregs, struct perf_sample_data *data)
> +static int intel_pmu_drain_pebs_core(struct pt_regs *iregs, struct perf_sample_data *data)
> {
> struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> struct debug_store *ds = cpuc->ds;
> @@ -2768,7 +2768,7 @@ static void intel_pmu_drain_pebs_core(struct pt_regs *iregs, struct perf_sample_
> int n;
>
> if (!x86_pmu.pebs_active)
> - return;
> + return 0;
>
> at = (struct pebs_record_core *)(unsigned long)ds->pebs_buffer_base;
> top = (struct pebs_record_core *)(unsigned long)ds->pebs_index;
> @@ -2779,22 +2779,24 @@ static void intel_pmu_drain_pebs_core(struct pt_regs *iregs, struct perf_sample_
> ds->pebs_index = ds->pebs_buffer_base;
>
> if (!test_bit(0, cpuc->active_mask))
> - return;
> + return 0;
>
> WARN_ON_ONCE(!event);
>
> if (!event->attr.precise_ip)
> - return;
> + return 0;
>
> n = top - at;
> if (n <= 0) {
> if (event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD)
> intel_pmu_save_and_restart_reload(event, 0);
> - return;
> + return 0;
> }
>
> __intel_pmu_pebs_events(event, iregs, data, at, top, 0, n,
> setup_pebs_fixed_sample_data);
> +
> + return 0;
> }
>
> static void intel_pmu_pebs_event_update_no_drain(struct cpu_hw_events *cpuc, u64 mask)
> @@ -2817,7 +2819,7 @@ static void intel_pmu_pebs_event_update_no_drain(struct cpu_hw_events *cpuc, u64
> }
> }
>
> -static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_data *data)
> +static int intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_data *data)
> {
> struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> struct debug_store *ds = cpuc->ds;
> @@ -2830,7 +2832,7 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_d
> u64 mask;
>
> if (!x86_pmu.pebs_active)
> - return;
> + return 0;
>
> base = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
> top = (struct pebs_record_nhm *)(unsigned long)ds->pebs_index;
> @@ -2846,7 +2848,7 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_d
>
> if (unlikely(base >= top)) {
> intel_pmu_pebs_event_update_no_drain(cpuc, mask);
> - return;
> + return 0;
> }
>
> for (at = base; at < top; at += x86_pmu.pebs_record_size) {
> @@ -2931,6 +2933,8 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_d
> setup_pebs_fixed_sample_data);
> }
> }
> +
> + return 0;
> }
>
> static __always_inline void
> @@ -2984,7 +2988,7 @@ __intel_pmu_handle_last_pebs_record(struct pt_regs *iregs,
>
> }
>
> -static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_data *data)
> +static int intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_data *data)
> {
> short counts[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS] = {};
> void *last[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS];
> @@ -2997,7 +3001,7 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
> u64 mask;
>
> if (!x86_pmu.pebs_active)
> - return;
> + return 0;
>
> base = (struct pebs_basic *)(unsigned long)ds->pebs_buffer_base;
> top = (struct pebs_basic *)(unsigned long)ds->pebs_index;
> @@ -3010,7 +3014,7 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
>
> if (unlikely(base >= top)) {
> intel_pmu_pebs_event_update_no_drain(cpuc, mask);
> - return;
> + return 0;
> }
>
> if (!iregs)
> @@ -3032,9 +3036,11 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
>
> __intel_pmu_handle_last_pebs_record(iregs, regs, data, mask, counts, last,
> setup_pebs_adaptive_sample_data);
> +
> + return 0;
> }
will now do handled += 0 for all these, which is a change in
behaviour (one way to preserve the old accounting is sketched after the
quoted hunk below). Also:
> -static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
> +static int intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
> struct perf_sample_data *data)
> {
> short counts[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS] = {};
> @@ -3044,13 +3050,14 @@ static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
> struct x86_perf_regs perf_regs;
> struct pt_regs *regs = &perf_regs.regs;
> void *base, *at, *top;
> + u64 events_bitmap = 0;
> u64 mask;
>
> rdmsrq(MSR_IA32_PEBS_INDEX, index.whole);
>
> if (unlikely(!index.wr)) {
> intel_pmu_pebs_event_update_no_drain(cpuc, X86_PMC_IDX_MAX);
> - return;
> + return 0;
> }
>
> base = cpuc->pebs_vaddr;
> @@ -3089,6 +3096,7 @@ static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
>
> basic = at + sizeof(struct arch_pebs_header);
> pebs_status = mask & basic->applicable_counters;
> + events_bitmap |= pebs_status;
> __intel_pmu_handle_pebs_record(iregs, regs, data, at,
> pebs_status, counts, last,
> setup_arch_pebs_sample_data);
> @@ -3108,6 +3116,8 @@ static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
> __intel_pmu_handle_last_pebs_record(iregs, regs, data, mask,
> counts, last,
> setup_arch_pebs_sample_data);
> +
/*
* Comment that explains the arch pebs defect goes here.
*/
> + return hweight64(events_bitmap);
> }
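To make the accounting change concrete, one way to preserve the old
behaviour for the drain paths that don't report a count would be
something like (a sketch only, not necessarily what you want):

	int drained = static_call(x86_pmu_drain_pebs)(regs, &data);

	/*
	 * Arch-PEBS reports how many events it drained; the legacy
	 * drain paths return 0, so keep counting those as one handled
	 * NMI to avoid unknown-NMI splats.
	 */
	handled += drained ?: 1;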
* Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
2025-12-05 8:10 ` Mi, Dapeng
@ 2025-12-05 16:35 ` Ian Rogers
2025-12-08 4:20 ` Mi, Dapeng
0 siblings, 1 reply; 55+ messages in thread
From: Ian Rogers @ 2025-12-05 16:35 UTC (permalink / raw)
To: Mi, Dapeng
Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang
On Fri, Dec 5, 2025 at 12:10 AM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>
>
> On 12/5/2025 2:38 PM, Ian Rogers wrote:
> > On Thu, Dec 4, 2025 at 8:00 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> >>
> >> On 12/5/2025 12:16 AM, Ian Rogers wrote:
> >>> On Thu, Dec 4, 2025 at 1:20 AM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> >>>> On 12/4/2025 3:49 PM, Ian Rogers wrote:
> >>>>> On Wed, Dec 3, 2025 at 6:58 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> >>>>>> On 12/4/2025 8:17 AM, Ian Rogers wrote:
> >>>>>>> On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
> >>>>>>>> From: Kan Liang <kan.liang@linux.intel.com>
> >>>>>>>>
> >>>>>>>> This patch adds support for the newly introduced SIMD register sampling
> >>>>>>>> format by adding the following functions:
> >>>>>>>>
> >>>>>>>> uint64_t arch__intr_simd_reg_mask(void);
> >>>>>>>> uint64_t arch__user_simd_reg_mask(void);
> >>>>>>>> uint64_t arch__intr_pred_reg_mask(void);
> >>>>>>>> uint64_t arch__user_pred_reg_mask(void);
> >>>>>>>> uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>>>> uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>>>> uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>>>> uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>>>>
> >>>>>>>> The arch__{intr|user}_simd_reg_mask() functions retrieve the bitmap of
> >>>>>>>> supported SIMD registers, such as XMM/YMM/ZMM on x86 platforms.
> >>>>>>>>
> >>>>>>>> The arch__{intr|user}_pred_reg_mask() functions retrieve the bitmap of
> >>>>>>>> supported PRED registers, such as OPMASK on x86 platforms.
> >>>>>>>>
> >>>>>>>> The arch__{intr|user}_simd_reg_bitmap_qwords() functions provide the
> >>>>>>>> exact bitmap and number of qwords for a specific type of SIMD register.
> >>>>>>>> For example, for XMM registers on x86 platforms, the returned bitmap is
> >>>>>>>> 0xffff (XMM0 ~ XMM15) and the qwords number is 2 (128 bits for each XMM).
> >>>>>>>>
> >>>>>>>> The arch__{intr|user}_pred_reg_bitmap_qwords() functions provide the
> >>>>>>>> exact bitmap and number of qwords for a specific type of PRED register.
> >>>>>>>> For example, for OPMASK registers on x86 platforms, the returned bitmap
> >>>>>>>> is 0xff (OPMASK0 ~ OPMASK7) and the qwords number is 1 (64 bits for each
> >>>>>>>> OPMASK).
> >>>>>>>>
> >>>>>>>> Additionally, the function __parse_regs() is enhanced to support parsing
> >>>>>>>> these newly introduced SIMD registers. Currently, each type of register
> >>>>>>>> can only be sampled collectively; sampling a specific SIMD register is
> >>>>>>>> not supported. For example, all XMM registers are sampled together rather
> >>>>>>>> than sampling only XMM0.
> >>>>>>>>
> >>>>>>>> When multiple overlapping register types, such as XMM and YMM, are
> >>>>>>>> sampled simultaneously, only the superset (YMM registers) is sampled.
> >>>>>>>>
> >>>>>>>> With this patch, all supported sampling registers on x86 platforms are
> >>>>>>>> displayed as follows.
> >>>>>>>>
> >>>>>>>> $perf record -I?
> >>>>>>>> available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
> >>>>>>>> R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
> >>>>>>>> R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
> >>>>>>>>
> >>>>>>>> $perf record --user-regs=?
> >>>>>>>> available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
> >>>>>>>> R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
> >>>>>>>> R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
> >>>>>>>>
> >>>>>>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> >>>>>>>> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >>>>>>>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >>>>>>>> ---
> >>>>>>>> tools/perf/arch/x86/util/perf_regs.c | 470 +++++++++++++++++++++-
> >>>>>>>> tools/perf/util/evsel.c | 27 ++
> >>>>>>>> tools/perf/util/parse-regs-options.c | 151 ++++++-
> >>>>>>>> tools/perf/util/perf_event_attr_fprintf.c | 6 +
> >>>>>>>> tools/perf/util/perf_regs.c | 59 +++
> >>>>>>>> tools/perf/util/perf_regs.h | 11 +
> >>>>>>>> tools/perf/util/record.h | 6 +
> >>>>>>>> 7 files changed, 714 insertions(+), 16 deletions(-)
> >>>>>>>>
> >>>>>>>> diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
> >>>>>>>> index 12fd93f04802..db41430f3b07 100644
> >>>>>>>> --- a/tools/perf/arch/x86/util/perf_regs.c
> >>>>>>>> +++ b/tools/perf/arch/x86/util/perf_regs.c
> >>>>>>>> @@ -13,6 +13,49 @@
> >>>>>>>> #include "../../../util/pmu.h"
> >>>>>>>> #include "../../../util/pmus.h"
> >>>>>>>>
> >>>>>>>> +static const struct sample_reg sample_reg_masks_ext[] = {
> >>>>>>>> + SMPL_REG(AX, PERF_REG_X86_AX),
> >>>>>>>> + SMPL_REG(BX, PERF_REG_X86_BX),
> >>>>>>>> + SMPL_REG(CX, PERF_REG_X86_CX),
> >>>>>>>> + SMPL_REG(DX, PERF_REG_X86_DX),
> >>>>>>>> + SMPL_REG(SI, PERF_REG_X86_SI),
> >>>>>>>> + SMPL_REG(DI, PERF_REG_X86_DI),
> >>>>>>>> + SMPL_REG(BP, PERF_REG_X86_BP),
> >>>>>>>> + SMPL_REG(SP, PERF_REG_X86_SP),
> >>>>>>>> + SMPL_REG(IP, PERF_REG_X86_IP),
> >>>>>>>> + SMPL_REG(FLAGS, PERF_REG_X86_FLAGS),
> >>>>>>>> + SMPL_REG(CS, PERF_REG_X86_CS),
> >>>>>>>> + SMPL_REG(SS, PERF_REG_X86_SS),
> >>>>>>>> +#ifdef HAVE_ARCH_X86_64_SUPPORT
> >>>>>>>> + SMPL_REG(R8, PERF_REG_X86_R8),
> >>>>>>>> + SMPL_REG(R9, PERF_REG_X86_R9),
> >>>>>>>> + SMPL_REG(R10, PERF_REG_X86_R10),
> >>>>>>>> + SMPL_REG(R11, PERF_REG_X86_R11),
> >>>>>>>> + SMPL_REG(R12, PERF_REG_X86_R12),
> >>>>>>>> + SMPL_REG(R13, PERF_REG_X86_R13),
> >>>>>>>> + SMPL_REG(R14, PERF_REG_X86_R14),
> >>>>>>>> + SMPL_REG(R15, PERF_REG_X86_R15),
> >>>>>>>> + SMPL_REG(R16, PERF_REG_X86_R16),
> >>>>>>>> + SMPL_REG(R17, PERF_REG_X86_R17),
> >>>>>>>> + SMPL_REG(R18, PERF_REG_X86_R18),
> >>>>>>>> + SMPL_REG(R19, PERF_REG_X86_R19),
> >>>>>>>> + SMPL_REG(R20, PERF_REG_X86_R20),
> >>>>>>>> + SMPL_REG(R21, PERF_REG_X86_R21),
> >>>>>>>> + SMPL_REG(R22, PERF_REG_X86_R22),
> >>>>>>>> + SMPL_REG(R23, PERF_REG_X86_R23),
> >>>>>>>> + SMPL_REG(R24, PERF_REG_X86_R24),
> >>>>>>>> + SMPL_REG(R25, PERF_REG_X86_R25),
> >>>>>>>> + SMPL_REG(R26, PERF_REG_X86_R26),
> >>>>>>>> + SMPL_REG(R27, PERF_REG_X86_R27),
> >>>>>>>> + SMPL_REG(R28, PERF_REG_X86_R28),
> >>>>>>>> + SMPL_REG(R29, PERF_REG_X86_R29),
> >>>>>>>> + SMPL_REG(R30, PERF_REG_X86_R30),
> >>>>>>>> + SMPL_REG(R31, PERF_REG_X86_R31),
> >>>>>>>> + SMPL_REG(SSP, PERF_REG_X86_SSP),
> >>>>>>>> +#endif
> >>>>>>>> + SMPL_REG_END
> >>>>>>>> +};
> >>>>>>>> +
> >>>>>>>> static const struct sample_reg sample_reg_masks[] = {
> >>>>>>>> SMPL_REG(AX, PERF_REG_X86_AX),
> >>>>>>>> SMPL_REG(BX, PERF_REG_X86_BX),
> >>>>>>>> @@ -276,27 +319,404 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
> >>>>>>>> return SDT_ARG_VALID;
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> +static bool support_simd_reg(u64 sample_type, u16 qwords, u64 mask, bool pred)
> >>>>>>> To make the code easier to read, it'd be nice to document sample_type,
> >>>>>>> qwords and mask here.
> >>>>>> Sure.
> >>>>>>
> >>>>>>
> >>>>>>>> +{
> >>>>>>>> + struct perf_event_attr attr = {
> >>>>>>>> + .type = PERF_TYPE_HARDWARE,
> >>>>>>>> + .config = PERF_COUNT_HW_CPU_CYCLES,
> >>>>>>>> + .sample_type = sample_type,
> >>>>>>>> + .disabled = 1,
> >>>>>>>> + .exclude_kernel = 1,
> >>>>>>>> + .sample_simd_regs_enabled = 1,
> >>>>>>>> + };
> >>>>>>>> + int fd;
> >>>>>>>> +
> >>>>>>>> + attr.sample_period = 1;
> >>>>>>>> +
> >>>>>>>> + if (!pred) {
> >>>>>>>> + attr.sample_simd_vec_reg_qwords = qwords;
> >>>>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>>>>>> + attr.sample_simd_vec_reg_intr = mask;
> >>>>>>>> + else
> >>>>>>>> + attr.sample_simd_vec_reg_user = mask;
> >>>>>>>> + } else {
> >>>>>>>> + attr.sample_simd_pred_reg_qwords = PERF_X86_OPMASK_QWORDS;
> >>>>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>>>>>> + attr.sample_simd_pred_reg_intr = PERF_X86_SIMD_PRED_MASK;
> >>>>>>>> + else
> >>>>>>>> + attr.sample_simd_pred_reg_user = PERF_X86_SIMD_PRED_MASK;
> >>>>>>>> + }
> >>>>>>>> +
> >>>>>>>> + if (perf_pmus__num_core_pmus() > 1) {
> >>>>>>>> + struct perf_pmu *pmu = NULL;
> >>>>>>>> + __u64 type = PERF_TYPE_RAW;
> >>>>>>> It should be okay to do:
> >>>>>>> __u64 type = perf_pmus__find_core_pmu()->type
> >>>>>>> rather than have the whole loop below.
> >>>>>> Sure. Thanks.
> >>>>>>
> >>>>>>
> >>>>>>>> +
> >>>>>>>> + /*
> >>>>>>>> + * The same register set is supported among different hybrid PMUs.
> >>>>>>>> + * Only check the first available one.
> >>>>>>>> + */
> >>>>>>>> + while ((pmu = perf_pmus__scan_core(pmu)) != NULL) {
> >>>>>>>> + type = pmu->type;
> >>>>>>>> + break;
> >>>>>>>> + }
> >>>>>>>> + attr.config |= type << PERF_PMU_TYPE_SHIFT;
> >>>>>>>> + }
> >>>>>>>> +
> >>>>>>>> + event_attr_init(&attr);
> >>>>>>>> +
> >>>>>>>> + fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
> >>>>>>>> + if (fd != -1) {
> >>>>>>>> + close(fd);
> >>>>>>>> + return true;
> >>>>>>>> + }
> >>>>>>>> +
> >>>>>>>> + return false;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static bool __arch_simd_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
> >>>>>>>> +{
> >>>>>>>> + bool supported = false;
> >>>>>>>> + u64 bits;
> >>>>>>>> +
> >>>>>>>> + *mask = 0;
> >>>>>>>> + *qwords = 0;
> >>>>>>>> +
> >>>>>>>> + switch (reg) {
> >>>>>>>> + case PERF_REG_X86_XMM:
> >>>>>>>> + bits = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
> >>>>>>>> + supported = support_simd_reg(sample_type, PERF_X86_XMM_QWORDS, bits, false);
> >>>>>>>> + if (supported) {
> >>>>>>>> + *mask = bits;
> >>>>>>>> + *qwords = PERF_X86_XMM_QWORDS;
> >>>>>>>> + }
> >>>>>>>> + break;
> >>>>>>>> + case PERF_REG_X86_YMM:
> >>>>>>>> + bits = BIT_ULL(PERF_X86_SIMD_YMM_REGS) - 1;
> >>>>>>>> + supported = support_simd_reg(sample_type, PERF_X86_YMM_QWORDS, bits, false);
> >>>>>>>> + if (supported) {
> >>>>>>>> + *mask = bits;
> >>>>>>>> + *qwords = PERF_X86_YMM_QWORDS;
> >>>>>>>> + }
> >>>>>>>> + break;
> >>>>>>>> + case PERF_REG_X86_ZMM:
> >>>>>>>> + bits = BIT_ULL(PERF_X86_SIMD_ZMM_REGS) - 1;
> >>>>>>>> + supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
> >>>>>>>> + if (supported) {
> >>>>>>>> + *mask = bits;
> >>>>>>>> + *qwords = PERF_X86_ZMM_QWORDS;
> >>>>>>>> + break;
> >>>>>>>> + }
> >>>>>>>> +
> >>>>>>>> + bits = BIT_ULL(PERF_X86_SIMD_ZMMH_REGS) - 1;
> >>>>>>>> + supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
> >>>>>>>> + if (supported) {
> >>>>>>>> + *mask = bits;
> >>>>>>>> + *qwords = PERF_X86_ZMMH_QWORDS;
> >>>>>>>> + }
> >>>>>>>> + break;
> >>>>>>>> + default:
> >>>>>>>> + break;
> >>>>>>>> + }
> >>>>>>>> +
> >>>>>>>> + return supported;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static bool __arch_pred_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
> >>>>>>>> +{
> >>>>>>>> + bool supported = false;
> >>>>>>>> + u64 bits;
> >>>>>>>> +
> >>>>>>>> + *mask = 0;
> >>>>>>>> + *qwords = 0;
> >>>>>>>> +
> >>>>>>>> + switch (reg) {
> >>>>>>>> + case PERF_REG_X86_OPMASK:
> >>>>>>>> + bits = BIT_ULL(PERF_X86_SIMD_OPMASK_REGS) - 1;
> >>>>>>>> + supported = support_simd_reg(sample_type, PERF_X86_OPMASK_QWORDS, bits, true);
> >>>>>>>> + if (supported) {
> >>>>>>>> + *mask = bits;
> >>>>>>>> + *qwords = PERF_X86_OPMASK_QWORDS;
> >>>>>>>> + }
> >>>>>>>> + break;
> >>>>>>>> + default:
> >>>>>>>> + break;
> >>>>>>>> + }
> >>>>>>>> +
> >>>>>>>> + return supported;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static bool has_cap_simd_regs(void)
> >>>>>>>> +{
> >>>>>>>> + uint64_t mask = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
> >>>>>>>> + u16 qwords = PERF_X86_XMM_QWORDS;
> >>>>>>>> + static bool has_cap_simd_regs;
> >>>>>>>> + static bool cached;
> >>>>>>>> +
> >>>>>>>> + if (cached)
> >>>>>>>> + return has_cap_simd_regs;
> >>>>>>>> +
> >>>>>>>> + has_cap_simd_regs = __arch_simd_reg_mask(PERF_SAMPLE_REGS_INTR,
> >>>>>>>> + PERF_REG_X86_XMM, &mask, &qwords);
> >>>>>>>> + has_cap_simd_regs |= __arch_simd_reg_mask(PERF_SAMPLE_REGS_USER,
> >>>>>>>> + PERF_REG_X86_XMM, &mask, &qwords);
> >>>>>>>> + cached = true;
> >>>>>>>> +
> >>>>>>>> + return has_cap_simd_regs;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +bool arch_has_simd_regs(u64 mask)
> >>>>>>>> +{
> >>>>>>>> + return has_cap_simd_regs() &&
> >>>>>>>> + mask & GENMASK_ULL(PERF_REG_X86_SSP, PERF_REG_X86_R16);
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static const struct sample_reg sample_simd_reg_masks[] = {
> >>>>>>>> + SMPL_REG(XMM, PERF_REG_X86_XMM),
> >>>>>>>> + SMPL_REG(YMM, PERF_REG_X86_YMM),
> >>>>>>>> + SMPL_REG(ZMM, PERF_REG_X86_ZMM),
> >>>>>>>> + SMPL_REG_END
> >>>>>>>> +};
> >>>>>>>> +
> >>>>>>>> +static const struct sample_reg sample_pred_reg_masks[] = {
> >>>>>>>> + SMPL_REG(OPMASK, PERF_REG_X86_OPMASK),
> >>>>>>>> + SMPL_REG_END
> >>>>>>>> +};
> >>>>>>>> +
> >>>>>>>> +const struct sample_reg *arch__sample_simd_reg_masks(void)
> >>>>>>>> +{
> >>>>>>>> + return sample_simd_reg_masks;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +const struct sample_reg *arch__sample_pred_reg_masks(void)
> >>>>>>>> +{
> >>>>>>>> + return sample_pred_reg_masks;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static bool x86_intr_simd_updated;
> >>>>>>>> +static u64 x86_intr_simd_reg_mask;
> >>>>>>>> +static u64 x86_intr_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
> >>>>>>>> +static u16 x86_intr_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
> >>>>>>> Could we add some comments? I can kind of figure out that "updated" is a
> >>>>>>> check for lazy initialization and what the masks are; "qwords" is an odd
> >>>>>>> one. The comment could also point out that SIMD doesn't mean the
> >>>>>>> machine supports SIMD, but that SIMD registers are supported in perf
> >>>>>>> events.
> >>>>>> Sure.
> >>>>>>
> >>>>>>
> >>>>>>>> +static bool x86_user_simd_updated;
> >>>>>>>> +static u64 x86_user_simd_reg_mask;
> >>>>>>>> +static u64 x86_user_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
> >>>>>>>> +static u16 x86_user_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
> >>>>>>>> +
> >>>>>>>> +static bool x86_intr_pred_updated;
> >>>>>>>> +static u64 x86_intr_pred_reg_mask;
> >>>>>>>> +static u64 x86_intr_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
> >>>>>>>> +static u16 x86_intr_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
> >>>>>>>> +static bool x86_user_pred_updated;
> >>>>>>>> +static u64 x86_user_pred_reg_mask;
> >>>>>>>> +static u64 x86_user_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
> >>>>>>>> +static u16 x86_user_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
> >>>>>>>> +
> >>>>>>>> +static uint64_t __arch__simd_reg_mask(u64 sample_type)
> >>>>>>>> +{
> >>>>>>>> + const struct sample_reg *r = NULL;
> >>>>>>>> + bool supported;
> >>>>>>>> + u64 mask = 0;
> >>>>>>>> + int reg;
> >>>>>>>> +
> >>>>>>>> + if (!has_cap_simd_regs())
> >>>>>>>> + return 0;
> >>>>>>>> +
> >>>>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_simd_updated)
> >>>>>>>> + return x86_intr_simd_reg_mask;
> >>>>>>>> +
> >>>>>>>> + if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_simd_updated)
> >>>>>>>> + return x86_user_simd_reg_mask;
> >>>>>>>> +
> >>>>>>>> + for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> >>>>>>>> + supported = false;
> >>>>>>>> +
> >>>>>>>> + if (!r->mask)
> >>>>>>>> + continue;
> >>>>>>>> + reg = fls64(r->mask) - 1;
> >>>>>>>> +
> >>>>>>>> + if (reg >= PERF_REG_X86_MAX_SIMD_REGS)
> >>>>>>>> + break;
> >>>>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>>>>>> + supported = __arch_simd_reg_mask(sample_type, reg,
> >>>>>>>> + &x86_intr_simd_mask[reg],
> >>>>>>>> + &x86_intr_simd_qwords[reg]);
> >>>>>>>> + else if (sample_type == PERF_SAMPLE_REGS_USER)
> >>>>>>>> + supported = __arch_simd_reg_mask(sample_type, reg,
> >>>>>>>> + &x86_user_simd_mask[reg],
> >>>>>>>> + &x86_user_simd_qwords[reg]);
> >>>>>>>> + if (supported)
> >>>>>>>> + mask |= BIT_ULL(reg);
> >>>>>>>> + }
> >>>>>>>> +
> >>>>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR) {
> >>>>>>>> + x86_intr_simd_reg_mask = mask;
> >>>>>>>> + x86_intr_simd_updated = true;
> >>>>>>>> + } else {
> >>>>>>>> + x86_user_simd_reg_mask = mask;
> >>>>>>>> + x86_user_simd_updated = true;
> >>>>>>>> + }
> >>>>>>>> +
> >>>>>>>> + return mask;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static uint64_t __arch__pred_reg_mask(u64 sample_type)
> >>>>>>>> +{
> >>>>>>>> + const struct sample_reg *r = NULL;
> >>>>>>>> + bool supported;
> >>>>>>>> + u64 mask = 0;
> >>>>>>>> + int reg;
> >>>>>>>> +
> >>>>>>>> + if (!has_cap_simd_regs())
> >>>>>>>> + return 0;
> >>>>>>>> +
> >>>>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_pred_updated)
> >>>>>>>> + return x86_intr_pred_reg_mask;
> >>>>>>>> +
> >>>>>>>> + if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_pred_updated)
> >>>>>>>> + return x86_user_pred_reg_mask;
> >>>>>>>> +
> >>>>>>>> + for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> >>>>>>>> + supported = false;
> >>>>>>>> +
> >>>>>>>> + if (!r->mask)
> >>>>>>>> + continue;
> >>>>>>>> + reg = fls64(r->mask) - 1;
> >>>>>>>> +
> >>>>>>>> + if (reg >= PERF_REG_X86_MAX_PRED_REGS)
> >>>>>>>> + break;
> >>>>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>>>>>> + supported = __arch_pred_reg_mask(sample_type, reg,
> >>>>>>>> + &x86_intr_pred_mask[reg],
> >>>>>>>> + &x86_intr_pred_qwords[reg]);
> >>>>>>>> + else if (sample_type == PERF_SAMPLE_REGS_USER)
> >>>>>>>> + supported = __arch_pred_reg_mask(sample_type, reg,
> >>>>>>>> + &x86_user_pred_mask[reg],
> >>>>>>>> + &x86_user_pred_qwords[reg]);
> >>>>>>>> + if (supported)
> >>>>>>>> + mask |= BIT_ULL(reg);
> >>>>>>>> + }
> >>>>>>>> +
> >>>>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR) {
> >>>>>>>> + x86_intr_pred_reg_mask = mask;
> >>>>>>>> + x86_intr_pred_updated = true;
> >>>>>>>> + } else {
> >>>>>>>> + x86_user_pred_reg_mask = mask;
> >>>>>>>> + x86_user_pred_updated = true;
> >>>>>>>> + }
> >>>>>>>> +
> >>>>>>>> + return mask;
> >>>>>>>> +}
> >>>>>>> This feels repetitive with __arch__simd_reg_mask, could they be
> >>>>>>> refactored together?
> >>>>>> hmm, it looks like we can extract the for loop into a common function. The
> >>>>>> other parts are hard to generalize since they manipulate different
> >>>>>> variables. If we want to generalize them, we have to introduce lots of "if
> >>>>>> ... else" branches, and that would make the code hard to read.
> >>>>>>
> >>>>>>
> >>>>>>>> +
> >>>>>>>> +uint64_t arch__intr_simd_reg_mask(void)
> >>>>>>>> +{
> >>>>>>>> + return __arch__simd_reg_mask(PERF_SAMPLE_REGS_INTR);
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t arch__user_simd_reg_mask(void)
> >>>>>>>> +{
> >>>>>>>> + return __arch__simd_reg_mask(PERF_SAMPLE_REGS_USER);
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t arch__intr_pred_reg_mask(void)
> >>>>>>>> +{
> >>>>>>>> + return __arch__pred_reg_mask(PERF_SAMPLE_REGS_INTR);
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t arch__user_pred_reg_mask(void)
> >>>>>>>> +{
> >>>>>>>> + return __arch__pred_reg_mask(PERF_SAMPLE_REGS_USER);
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static uint64_t arch__simd_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
> >>>>>>>> +{
> >>>>>>>> + uint64_t mask = 0;
> >>>>>>>> +
> >>>>>>>> + *qwords = 0;
> >>>>>>>> + if (reg < PERF_REG_X86_MAX_SIMD_REGS) {
> >>>>>>>> + if (intr) {
> >>>>>>>> + *qwords = x86_intr_simd_qwords[reg];
> >>>>>>>> + mask = x86_intr_simd_mask[reg];
> >>>>>>>> + } else {
> >>>>>>>> + *qwords = x86_user_simd_qwords[reg];
> >>>>>>>> + mask = x86_user_simd_mask[reg];
> >>>>>>>> + }
> >>>>>>>> + }
> >>>>>>>> +
> >>>>>>>> + return mask;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static uint64_t arch__pred_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
> >>>>>>>> +{
> >>>>>>>> + uint64_t mask = 0;
> >>>>>>>> +
> >>>>>>>> + *qwords = 0;
> >>>>>>>> + if (reg < PERF_REG_X86_MAX_PRED_REGS) {
> >>>>>>>> + if (intr) {
> >>>>>>>> + *qwords = x86_intr_pred_qwords[reg];
> >>>>>>>> + mask = x86_intr_pred_mask[reg];
> >>>>>>>> + } else {
> >>>>>>>> + *qwords = x86_user_pred_qwords[reg];
> >>>>>>>> + mask = x86_user_pred_mask[reg];
> >>>>>>>> + }
> >>>>>>>> + }
> >>>>>>>> +
> >>>>>>>> + return mask;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords)
> >>>>>>>> +{
> >>>>>>>> + if (!x86_intr_simd_updated)
> >>>>>>>> + arch__intr_simd_reg_mask();
> >>>>>>>> + return arch__simd_reg_bitmap_qwords(reg, qwords, true);
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords)
> >>>>>>>> +{
> >>>>>>>> + if (!x86_user_simd_updated)
> >>>>>>>> + arch__user_simd_reg_mask();
> >>>>>>>> + return arch__simd_reg_bitmap_qwords(reg, qwords, false);
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords)
> >>>>>>>> +{
> >>>>>>>> + if (!x86_intr_pred_updated)
> >>>>>>>> + arch__intr_pred_reg_mask();
> >>>>>>>> + return arch__pred_reg_bitmap_qwords(reg, qwords, true);
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords)
> >>>>>>>> +{
> >>>>>>>> + if (!x86_user_pred_updated)
> >>>>>>>> + arch__user_pred_reg_mask();
> >>>>>>>> + return arch__pred_reg_bitmap_qwords(reg, qwords, false);
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> const struct sample_reg *arch__sample_reg_masks(void)
> >>>>>>>> {
> >>>>>>>> + if (has_cap_simd_regs())
> >>>>>>>> + return sample_reg_masks_ext;
> >>>>>>>> return sample_reg_masks;
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> -uint64_t arch__intr_reg_mask(void)
> >>>>>>>> +static uint64_t __arch__reg_mask(u64 sample_type, u64 mask, bool has_simd_regs)
> >>>>>>>> {
> >>>>>>>> struct perf_event_attr attr = {
> >>>>>>>> - .type = PERF_TYPE_HARDWARE,
> >>>>>>>> - .config = PERF_COUNT_HW_CPU_CYCLES,
> >>>>>>>> - .sample_type = PERF_SAMPLE_REGS_INTR,
> >>>>>>>> - .sample_regs_intr = PERF_REG_EXTENDED_MASK,
> >>>>>>>> - .precise_ip = 1,
> >>>>>>>> - .disabled = 1,
> >>>>>>>> - .exclude_kernel = 1,
> >>>>>>>> + .type = PERF_TYPE_HARDWARE,
> >>>>>>>> + .config = PERF_COUNT_HW_CPU_CYCLES,
> >>>>>>>> + .sample_type = sample_type,
> >>>>>>>> + .precise_ip = 1,
> >>>>>>>> + .disabled = 1,
> >>>>>>>> + .exclude_kernel = 1,
> >>>>>>>> + .sample_simd_regs_enabled = has_simd_regs,
> >>>>>>>> };
> >>>>>>>> int fd;
> >>>>>>>> /*
> >>>>>>>> * In an unnamed union, init it here to build on older gcc versions
> >>>>>>>> */
> >>>>>>>> attr.sample_period = 1;
> >>>>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>>>>>> + attr.sample_regs_intr = mask;
> >>>>>>>> + else
> >>>>>>>> + attr.sample_regs_user = mask;
> >>>>>>>>
> >>>>>>>> if (perf_pmus__num_core_pmus() > 1) {
> >>>>>>>> struct perf_pmu *pmu = NULL;
> >>>>>>>> @@ -318,13 +738,41 @@ uint64_t arch__intr_reg_mask(void)
> >>>>>>>> fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
> >>>>>>>> if (fd != -1) {
> >>>>>>>> close(fd);
> >>>>>>>> - return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
> >>>>>>>> + return mask;
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> - return PERF_REGS_MASK;
> >>>>>>>> + return 0;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t arch__intr_reg_mask(void)
> >>>>>>>> +{
> >>>>>>>> + uint64_t mask = PERF_REGS_MASK;
> >>>>>>>> +
> >>>>>>>> + if (has_cap_simd_regs()) {
> >>>>>>>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
> >>>>>>>> + GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
> >>>>>>>> + true);
> >>>>>>> It's nice to label constant arguments like this something like:
> >>>>>>> /*has_simd_regs=*/true);
> >>>>>>>
> >>>>>>> Tools like clang-tidy even try to enforce that the argument names match the comments.
> >>>>>> Sure.
> >>>>>>
> >>>>>>
> >>>>>>>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
> >>>>>>>> + BIT_ULL(PERF_REG_X86_SSP),
> >>>>>>>> + true);
> >>>>>>>> + } else
> >>>>>>>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR, PERF_REG_EXTENDED_MASK, false);
> >>>>>>>> +
> >>>>>>>> + return mask;
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> uint64_t arch__user_reg_mask(void)
> >>>>>>>> {
> >>>>>>>> - return PERF_REGS_MASK;
> >>>>>>>> + uint64_t mask = PERF_REGS_MASK;
> >>>>>>>> +
> >>>>>>>> + if (has_cap_simd_regs()) {
> >>>>>>>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
> >>>>>>>> + GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
> >>>>>>>> + true);
> >>>>>>>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
> >>>>>>>> + BIT_ULL(PERF_REG_X86_SSP),
> >>>>>>>> + true);
> >>>>>>>> + }
> >>>>>>>> +
> >>>>>>>> + return mask;
> >>>>>>> The code is repetitive here, could we refactor into a single function
> >>>>>>> passing in a user or instr value?
> >>>>>> Sure. Would extract the common part.
> >>>>>>
> >>>>>>
> >>>>>>>> }
> >>>>>>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> >>>>>>>> index 56ebefd075f2..5d1d90cf9488 100644
> >>>>>>>> --- a/tools/perf/util/evsel.c
> >>>>>>>> +++ b/tools/perf/util/evsel.c
> >>>>>>>> @@ -1461,12 +1461,39 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
> >>>>>>>> if (opts->sample_intr_regs && !evsel->no_aux_samples &&
> >>>>>>>> !evsel__is_dummy_event(evsel)) {
> >>>>>>>> attr->sample_regs_intr = opts->sample_intr_regs;
> >>>>>>>> + attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_intr);
> >>>>>>>> + evsel__set_sample_bit(evsel, REGS_INTR);
> >>>>>>>> + }
> >>>>>>>> +
> >>>>>>>> + if ((opts->sample_intr_vec_regs || opts->sample_intr_pred_regs) &&
> >>>>>>>> + !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
> >>>>>>>> + /* A non-zero pred qwords implies the set of SIMD registers is used */
> >>>>>>>> + if (opts->sample_pred_regs_qwords)
> >>>>>>>> + attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
> >>>>>>>> + else
> >>>>>>>> + attr->sample_simd_pred_reg_qwords = 1;
> >>>>>>>> + attr->sample_simd_vec_reg_intr = opts->sample_intr_vec_regs;
> >>>>>>>> + attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
> >>>>>>>> + attr->sample_simd_pred_reg_intr = opts->sample_intr_pred_regs;
> >>>>>>>> evsel__set_sample_bit(evsel, REGS_INTR);
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> if (opts->sample_user_regs && !evsel->no_aux_samples &&
> >>>>>>>> !evsel__is_dummy_event(evsel)) {
> >>>>>>>> attr->sample_regs_user |= opts->sample_user_regs;
> >>>>>>>> + attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_user);
> >>>>>>>> + evsel__set_sample_bit(evsel, REGS_USER);
> >>>>>>>> + }
> >>>>>>>> +
> >>>>>>>> + if ((opts->sample_user_vec_regs || opts->sample_user_pred_regs) &&
> >>>>>>>> + !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
> >>>>>>>> + if (opts->sample_pred_regs_qwords)
> >>>>>>>> + attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
> >>>>>>>> + else
> >>>>>>>> + attr->sample_simd_pred_reg_qwords = 1;
> >>>>>>>> + attr->sample_simd_vec_reg_user = opts->sample_user_vec_regs;
> >>>>>>>> + attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
> >>>>>>>> + attr->sample_simd_pred_reg_user = opts->sample_user_pred_regs;
> >>>>>>>> evsel__set_sample_bit(evsel, REGS_USER);
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
> >>>>>>>> index cda1c620968e..0bd100392889 100644
> >>>>>>>> --- a/tools/perf/util/parse-regs-options.c
> >>>>>>>> +++ b/tools/perf/util/parse-regs-options.c
> >>>>>>>> @@ -4,19 +4,139 @@
> >>>>>>>> #include <stdint.h>
> >>>>>>>> #include <string.h>
> >>>>>>>> #include <stdio.h>
> >>>>>>>> +#include <linux/bitops.h>
> >>>>>>>> #include "util/debug.h"
> >>>>>>>> #include <subcmd/parse-options.h>
> >>>>>>>> #include "util/perf_regs.h"
> >>>>>>>> #include "util/parse-regs-options.h"
> >>>>>>>> +#include "record.h"
> >>>>>>>> +
> >>>>>>>> +static void __print_simd_regs(bool intr, uint64_t simd_mask)
> >>>>>>>> +{
> >>>>>>>> + const struct sample_reg *r = NULL;
> >>>>>>>> + uint64_t bitmap = 0;
> >>>>>>>> + u16 qwords = 0;
> >>>>>>>> + int reg_idx;
> >>>>>>>> +
> >>>>>>>> + if (!simd_mask)
> >>>>>>>> + return;
> >>>>>>>> +
> >>>>>>>> + for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> >>>>>>>> + if (!(r->mask & simd_mask))
> >>>>>>>> + continue;
> >>>>>>>> + reg_idx = fls64(r->mask) - 1;
> >>>>>>>> + if (intr)
> >>>>>>>> + bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>>>> + else
> >>>>>>>> + bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>>>> + if (bitmap)
> >>>>>>>> + fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
> >>>>>>>> + }
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static void __print_pred_regs(bool intr, uint64_t pred_mask)
> >>>>>>>> +{
> >>>>>>>> + const struct sample_reg *r = NULL;
> >>>>>>>> + uint64_t bitmap = 0;
> >>>>>>>> + u16 qwords = 0;
> >>>>>>>> + int reg_idx;
> >>>>>>>> +
> >>>>>>>> + if (!pred_mask)
> >>>>>>>> + return;
> >>>>>>>> +
> >>>>>>>> + for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> >>>>>>>> + if (!(r->mask & pred_mask))
> >>>>>>>> + continue;
> >>>>>>>> + reg_idx = fls64(r->mask) - 1;
> >>>>>>>> + if (intr)
> >>>>>>>> + bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>>>> + else
> >>>>>>>> + bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>>>> + if (bitmap)
> >>>>>>>> + fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
> >>>>>>>> + }
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static bool __parse_simd_regs(struct record_opts *opts, char *s, bool intr)
> >>>>>>>> +{
> >>>>>>>> + const struct sample_reg *r = NULL;
> >>>>>>>> + bool matched = false;
> >>>>>>>> + uint64_t bitmap = 0;
> >>>>>>>> + u16 qwords = 0;
> >>>>>>>> + int reg_idx;
> >>>>>>>> +
> >>>>>>>> + for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> >>>>>>>> + if (strcasecmp(s, r->name))
> >>>>>>>> + continue;
> >>>>>>>> + if (!fls64(r->mask))
> >>>>>>>> + continue;
> >>>>>>>> + reg_idx = fls64(r->mask) - 1;
> >>>>>>>> + if (intr)
> >>>>>>>> + bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>>>> + else
> >>>>>>>> + bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>>>> + matched = true;
> >>>>>>>> + break;
> >>>>>>>> + }
> >>>>>>>> +
> >>>>>>>> + /* Just need the highest qwords */
> >>>>>>> I'm not following here. Does the bitmap need to handle gaps?
> >>>>>> Currently no. In theory, the kernel allows user space to sample only a
> >>>>>> subset of SIMD registers, e.g., 0xff or 0xf0f for XMM registers (the HW
> >>>>>> supports 16 XMM registers), but this isn't supported in order to avoid
> >>>>>> introducing too much complexity in perf tools. Moreover, I don't think end
> >>>>>> users have such a requirement. In most cases, users only know which
> >>>>>> kinds of SIMD registers their programs use but usually neither know nor
> >>>>>> care which exact SIMD register is used.
> >>>>>>
> >>>>>>
> >>>>>>>> + if (qwords > opts->sample_vec_regs_qwords) {
> >>>>>>>> + opts->sample_vec_regs_qwords = qwords;
> >>>>>>>> + if (intr)
> >>>>>>>> + opts->sample_intr_vec_regs = bitmap;
> >>>>>>>> + else
> >>>>>>>> + opts->sample_user_vec_regs = bitmap;
> >>>>>>>> + }
> >>>>>>>> +
> >>>>>>>> + return matched;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static bool __parse_pred_regs(struct record_opts *opts, char *s, bool intr)
> >>>>>>>> +{
> >>>>>>>> + const struct sample_reg *r = NULL;
> >>>>>>>> + bool matched = false;
> >>>>>>>> + uint64_t bitmap = 0;
> >>>>>>>> + u16 qwords = 0;
> >>>>>>>> + int reg_idx;
> >>>>>>>> +
> >>>>>>>> + for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> >>>>>>>> + if (strcasecmp(s, r->name))
> >>>>>>>> + continue;
> >>>>>>>> + if (!fls64(r->mask))
> >>>>>>>> + continue;
> >>>>>>>> + reg_idx = fls64(r->mask) - 1;
> >>>>>>>> + if (intr)
> >>>>>>>> + bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>>>> + else
> >>>>>>>> + bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>>>> + matched = true;
> >>>>>>>> + break;
> >>>>>>>> + }
> >>>>>>>> +
> >>>>>>>> + /* Just need the highest qwords */
> >>>>>>> Again repetitive, could we have a single function?
> >>>>>> Yes, I suppose at least the for loop can be extracted into a common function.
> >>>>>>
> >>>>>>
> >>>>>>>> + if (qwords > opts->sample_pred_regs_qwords) {
> >>>>>>>> + opts->sample_pred_regs_qwords = qwords;
> >>>>>>>> + if (intr)
> >>>>>>>> + opts->sample_intr_pred_regs = bitmap;
> >>>>>>>> + else
> >>>>>>>> + opts->sample_user_pred_regs = bitmap;
> >>>>>>>> + }
> >>>>>>>> +
> >>>>>>>> + return matched;
> >>>>>>>> +}
> >>>>>>>>
> >>>>>>>> static int
> >>>>>>>> __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>>>>> {
> >>>>>>>> uint64_t *mode = (uint64_t *)opt->value;
> >>>>>>>> const struct sample_reg *r = NULL;
> >>>>>>>> + struct record_opts *opts;
> >>>>>>>> char *s, *os = NULL, *p;
> >>>>>>>> - int ret = -1;
> >>>>>>>> + bool has_simd_regs = false;
> >>>>>>>> uint64_t mask;
> >>>>>>>> + uint64_t simd_mask;
> >>>>>>>> + uint64_t pred_mask;
> >>>>>>>> + int ret = -1;
> >>>>>>>>
> >>>>>>>> if (unset)
> >>>>>>>> return 0;
> >>>>>>>> @@ -27,10 +147,17 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>>>>> if (*mode)
> >>>>>>>> return -1;
> >>>>>>>>
> >>>>>>>> - if (intr)
> >>>>>>>> + if (intr) {
> >>>>>>>> + opts = container_of(opt->value, struct record_opts, sample_intr_regs);
> >>>>>>>> mask = arch__intr_reg_mask();
> >>>>>>>> - else
> >>>>>>>> + simd_mask = arch__intr_simd_reg_mask();
> >>>>>>>> + pred_mask = arch__intr_pred_reg_mask();
> >>>>>>>> + } else {
> >>>>>>>> + opts = container_of(opt->value, struct record_opts, sample_user_regs);
> >>>>>>>> mask = arch__user_reg_mask();
> >>>>>>>> + simd_mask = arch__user_simd_reg_mask();
> >>>>>>>> + pred_mask = arch__user_pred_reg_mask();
> >>>>>>>> + }
> >>>>>>>>
> >>>>>>>> /* str may be NULL in case no arg is passed to -I */
> >>>>>>>> if (str) {
> >>>>>>>> @@ -50,10 +177,24 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>>>>> if (r->mask & mask)
> >>>>>>>> fprintf(stderr, "%s ", r->name);
> >>>>>>>> }
> >>>>>>>> + __print_simd_regs(intr, simd_mask);
> >>>>>>>> + __print_pred_regs(intr, pred_mask);
> >>>>>>>> fputc('\n', stderr);
> >>>>>>>> /* just printing available regs */
> >>>>>>>> goto error;
> >>>>>>>> }
> >>>>>>>> +
> >>>>>>>> + if (simd_mask) {
> >>>>>>>> + has_simd_regs = __parse_simd_regs(opts, s, intr);
> >>>>>>>> + if (has_simd_regs)
> >>>>>>>> + goto next;
> >>>>>>>> + }
> >>>>>>>> + if (pred_mask) {
> >>>>>>>> + has_simd_regs = __parse_pred_regs(opts, s, intr);
> >>>>>>>> + if (has_simd_regs)
> >>>>>>>> + goto next;
> >>>>>>>> + }
> >>>>>>>> +
> >>>>>>>> for (r = arch__sample_reg_masks(); r->name; r++) {
> >>>>>>>> if ((r->mask & mask) && !strcasecmp(s, r->name))
> >>>>>>>> break;
> >>>>>>>> @@ -65,7 +206,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> *mode |= r->mask;
> >>>>>>>> -
> >>>>>>>> +next:
> >>>>>>>> if (!p)
> >>>>>>>> break;
> >>>>>>>>
> >>>>>>>> @@ -75,7 +216,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>>>>> ret = 0;
> >>>>>>>>
> >>>>>>>> /* default to all possible regs */
> >>>>>>>> - if (*mode == 0)
> >>>>>>>> + if (*mode == 0 && !has_simd_regs)
> >>>>>>>> *mode = mask;
> >>>>>>>> error:
> >>>>>>>> free(os);
> >>>>>>>> diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
> >>>>>>>> index 66b666d9ce64..fb0366d050cf 100644
> >>>>>>>> --- a/tools/perf/util/perf_event_attr_fprintf.c
> >>>>>>>> +++ b/tools/perf/util/perf_event_attr_fprintf.c
> >>>>>>>> @@ -360,6 +360,12 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
> >>>>>>>> PRINT_ATTRf(aux_start_paused, p_unsigned);
> >>>>>>>> PRINT_ATTRf(aux_pause, p_unsigned);
> >>>>>>>> PRINT_ATTRf(aux_resume, p_unsigned);
> >>>>>>>> + PRINT_ATTRf(sample_simd_pred_reg_qwords, p_unsigned);
> >>>>>>>> + PRINT_ATTRf(sample_simd_pred_reg_intr, p_hex);
> >>>>>>>> + PRINT_ATTRf(sample_simd_pred_reg_user, p_hex);
> >>>>>>>> + PRINT_ATTRf(sample_simd_vec_reg_qwords, p_unsigned);
> >>>>>>>> + PRINT_ATTRf(sample_simd_vec_reg_intr, p_hex);
> >>>>>>>> + PRINT_ATTRf(sample_simd_vec_reg_user, p_hex);
> >>>>>>>>
> >>>>>>>> return ret;
> >>>>>>>> }
> >>>>>>>> diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
> >>>>>>>> index 44b90bbf2d07..e8a9fabc92e6 100644
> >>>>>>>> --- a/tools/perf/util/perf_regs.c
> >>>>>>>> +++ b/tools/perf/util/perf_regs.c
> >>>>>>>> @@ -11,6 +11,11 @@ int __weak arch_sdt_arg_parse_op(char *old_op __maybe_unused,
> >>>>>>>> return SDT_ARG_SKIP;
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> +bool __weak arch_has_simd_regs(u64 mask __maybe_unused)
> >>>>>>>> +{
> >>>>>>>> + return false;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> uint64_t __weak arch__intr_reg_mask(void)
> >>>>>>>> {
> >>>>>>>> return 0;
> >>>>>>>> @@ -21,6 +26,50 @@ uint64_t __weak arch__user_reg_mask(void)
> >>>>>>>> return 0;
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> +uint64_t __weak arch__intr_simd_reg_mask(void)
> >>>>>>>> +{
> >>>>>>>> + return 0;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t __weak arch__user_simd_reg_mask(void)
> >>>>>>>> +{
> >>>>>>>> + return 0;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t __weak arch__intr_pred_reg_mask(void)
> >>>>>>>> +{
> >>>>>>>> + return 0;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t __weak arch__user_pred_reg_mask(void)
> >>>>>>>> +{
> >>>>>>>> + return 0;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t __weak arch__intr_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> >>>>>>>> +{
> >>>>>>>> + *qwords = 0;
> >>>>>>>> + return 0;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t __weak arch__user_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> >>>>>>>> +{
> >>>>>>>> + *qwords = 0;
> >>>>>>>> + return 0;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t __weak arch__intr_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> >>>>>>>> +{
> >>>>>>>> + *qwords = 0;
> >>>>>>>> + return 0;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t __weak arch__user_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> >>>>>>>> +{
> >>>>>>>> + *qwords = 0;
> >>>>>>>> + return 0;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> static const struct sample_reg sample_reg_masks[] = {
> >>>>>>>> SMPL_REG_END
> >>>>>>>> };
> >>>>>>>> @@ -30,6 +79,16 @@ const struct sample_reg * __weak arch__sample_reg_masks(void)
> >>>>>>>> return sample_reg_masks;
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> +const struct sample_reg * __weak arch__sample_simd_reg_masks(void)
> >>>>>>>> +{
> >>>>>>>> + return sample_reg_masks;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +const struct sample_reg * __weak arch__sample_pred_reg_masks(void)
> >>>>>>>> +{
> >>>>>>>> + return sample_reg_masks;
> >>>>>>>> +}
> >>>>>>> Thinking out loud. I wonder if there is a way to hide the weak
> >>>>>>> functions. It seems the support is tied to PMUs, particularly core
> >>>>>>> PMUs, so perhaps we can push things into pmu and arch pmu code. Then we
> >>>>>>> ask the PMU to parse the register strings, set up the perf_event_attr,
> >>>>>>> etc. I'm somewhat scared these functions will be used on the report
> >>>>>>> rather than record side of things, thereby breaking perf.data support
> >>>>>>> when the host kernel does or doesn't have the SIMD support.
> >>>>>> Ian, I don't quite follow you.
> >>>>>>
> >>>>>> I don't quite understand what we should do to "push things into pmu and
> >>>>>> arch pmu code". The current SIMD register support follows the same
> >>>>>> approach as the general register support. If we intend to change the
> >>>>>> approach entirely, we'd better do it in an independent patch set.
> >>>>>>
> >>>>>> Why would these functions break perf.data reporting? perf-report checks
> >>>>>> whether the PERF_SAMPLE_REGS_ABI_SIMD flag is set for each record; only
> >>>>>> when the flag is set (indicating SIMD register data is appended to the
> >>>>>> record) does perf-report try to parse the SIMD register data.
> >>>>> Thanks Dapeng, sorry I wasn't clear. So, I've landed clean ups to
> >>>>> remove weak symbols like:
> >>>>> https://lore.kernel.org/lkml/20250724163302.596743-21-irogers@google.com/#t
> >>>>>
> >>>>> For these patches what I'm imagining is that there is a Nova Lake
> >>>>> generated perf.data file. Using perf report, script, etc. on the Nova
> >>>>> Lake should expose all of the same mask, qword, etc. values as when
> >>>>> the perf.data was generated and so things will work. If the perf.data
> >>>>> file was taken to, say, my Alderlake, then what will happen? Generally,
> >>>>> using the arch directory and weak symbols is a code smell that cross-
> >>>>> platform things are going to break - there should be sufficient data
> >>>>> in the event and the perf_event_attr to fully decode what's going on.
> >>>>> Sometimes tying things to a PMU name can avoid the use of the arch
> >>>>> directory. We were able to avoid the arch directory to a good extent
> >>>>> for the TPEBS code, even though it is a very modern Intel feature.
> >>>> I see.
> >>>>
> >>>> But the sampling support for SIMD registers is different from the sample
> >>>> weight processing in the patch
> >>>> https://lore.kernel.org/lkml/20250724163302.596743-21-irogers@google.com/#t.
> >>>> Each arch may support different kinds of SIMD registers, and furthermore
> >>>> each kind of SIMD register may have a different register count and register
> >>>> width. It's quite hard to come up with common functions or fields to
> >>>> represent the names and attributes of these arch-specific SIMD registers.
> >>>> This arch-specific information can only be provided by arch-specific code,
> >>>> so it looks like the __weak functions are still the easiest way to implement this.
> >>>>
> >>>> I don't think perf.data parsing would break when moving from one platform
> >>>> to another of the same arch, e.g., from Nova Lake to Alder Lake.
> >>>> To indicate the presence of SIMD registers in record data, a new ABI flag,
> >>>> "PERF_SAMPLE_REGS_ABI_SIMD", is introduced. If the perf tool on the 2nd
> >>>> platform is new enough to recognize this new flag, then the SIMD
> >>>> register data is parsed correctly. Even if the perf tool is old
> >>>> and has no SIMD register support, the SIMD register data is just
> >>>> silently ignored and should not break the parsing.
> >>> That's good to know. I'm confused then why these functions can't just
> >>> be within the arch directory? For example, we don't expose the
> >>> intel-pt PMU code in the common code except for the parsing parts. A
> >>> lot of that is handled by the default perf_event_attr initialization
> >>> that every PMU can have its own variant of:
> >>> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/pmu.h?h=perf-tools-next#n123
> >> I see. From my point of view, there seems to be no essential difference
> >> between a function pointer and a __weak function, and it looks hard to find
> >> a common data structure to hold all these function pointers, which need to
> >> be called in different places, like register name parsing, register data dumping ...
> >>
> >>
> >>> Perhaps this is all just evidence of tech debt in the perf_regs.c code
> >>> :-/ The bit that's relevant to the patch here is that I think this is
> >>> adding to the tech debt problem as 11 more functions are added to
> >>> perf_regs.h.
> >> Yeah, 11 new __weak functions does seem like too much. We could merge the
> >> same kinds of functions, like merging *_simd_reg_mask() and
> >> *_pred_reg_mask() into a single function with a type argument; then the
> >> number of newly added __weak functions could shrink by half.
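> >> Something like this, as a rough sketch (names not final), dispatching to
> >> the existing per-type functions:
> >>
> >> 	uint64_t arch__simd_reg_mask(u64 sample_type, bool pred)
> >> 	{
> >> 		if (sample_type == PERF_SAMPLE_REGS_INTR)
> >> 			return pred ? arch__intr_pred_reg_mask()
> >> 				    : arch__intr_simd_reg_mask();
> >> 		return pred ? arch__user_pred_reg_mask()
> >> 			    : arch__user_simd_reg_mask();
> >> 	}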
> > There could be a good reason for 11 weak functions :-) In the
> > perf_event.h you've added to the sample event:
> > ```
> > + * u64 regs[weight(mask)];
> > + * struct {
> > + * u16 nr_vectors;
> > + * u16 vector_qwords;
> > + * u16 nr_pred;
> > + * u16 pred_qwords;
> > + * u64 data[nr_vectors * vector_qwords + nr_pred
> > * pred_qwords];
> > + * } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
> > + * } && PERF_SAMPLE_REGS_USER
> > ```
> > so these things are readable/writable outside of builds with arch/x86
> > compiled in, which is why it seems odd that there needs to be arch
> > code in the common code to handle them. Similar to how I needed to get
> > the retirement latency parsing out of the arch/x86 directory as
> > potentially you could be looking at a perf.data file with retirement
> > latencies in it on a non-x86 platform.
>
> Ian, I'm not sure I fully get your point. If not, please correct me.
>
> Although these newly introduced fields are generic and exist on all
> architectures, they are not enough to get all the necessary information to
> dump or parse the SIMD registers, e.g., the SIMD register name.
>
> Let's take dumping the sampled value of SIMD registers as an example.
> We know there can be different kinds of SIMD registers on different archs,
> like XMM/YMM/ZMM on x86 and V-registers/Z-registers on ARM.
>
> Currently we only know the register number and width from the generic
> fields; we have no way to directly know the exact name this SIMD register
> corresponds to. We have to involve an arch-specific function to figure that
> out and then print them.
>
> At least for now, it looks like we still need these arch-specific functions ...
Thanks Dapeng. I started by thinking out loud, so I'm not saying this
is something to necessarily fix in the patch series but it probably is
something that needs to be fixed.
You mention that different archs have different registers and so we
need different routines for those archs, implying weak symbols, etc.
We do actually have generic register dumping code in get_dwarf_regstr:
https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/dwarf-regs.c?h=perf-tools-next#n33
It takes the dwarf register number, the ELF Ehdr e_machine and for the
purposes of csky the e_flags. If you want the e_machine for the perf
binary itself (such as in perf record when you don't yet have a
perf.data file) there is an EM_HOST value:
https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/include/dwarf-regs.h?h=perf-tools-next#n27
Perf has historically used a CPUID string, but I'd like to deprecate
that in favor of just using e_machine (and possibly e_flags) values.
We should probably have CPUID string to e_machine conversion utility
functions and remove cpuid from the perf_env:
https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/env.h?h=perf-tools-next#n67
but anyway, my point isn't about the e_machine values.
What I'm trying to say is that weak symbols and code in the arch directory
inherently mean that cross-platform development will break. For
example, before:
https://lore.kernel.org/lkml/20250724163302.596743-21-irogers@google.com/
perf_parse_sample_weight simply didn't exist outside of PowerPC
and x86. This meant that the part of the perf event in the perf.data file
containing the sample weights couldn't be parsed on, say, an ARM64 build
of perf, so the values couldn't even be dumped in perf script.
The values are, however, described in the cross platform perf sample
event format, much as the SIMD registers are here.
It seems that, as we have at least a CPUID string from the perf.data
header features, a perf_event_attr and the register number, we
should be able to do something like get_dwarf_regstr. Such a function
wouldn't be in the arch directory as we wouldn't want to interpret
registers in events just on x86 platforms (as with the retirement
latency). If we're not able to do this then there seems to be
something wrong with the SIMD change and perhaps we need to capture
more information in the perf.data file header.
Thanks,
Ian
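
As a rough illustration of the get_dwarf_regstr-style approach described
above, a generic lookup could key SIMD register names off the ELF e_machine
value instead of weak per-arch symbols. This is only a sketch under that
assumption; the function name, tables and register groupings below are
hypothetical, not perf's actual API:

```c
/*
 * Sketch of an e_machine-keyed SIMD register name lookup, modeled on
 * get_dwarf_regstr(). All names and tables here are hypothetical.
 */
#include <elf.h>
#include <stddef.h>

static const char * const x86_simd_names[]   = { "XMM", "YMM", "ZMM", "OPMASK" };
static const char * const arm64_simd_names[] = { "V", "Z", "P" };

static const char *perf_simd_reg_name(unsigned int e_machine, unsigned int reg)
{
	const char * const *names;
	size_t nr;

	switch (e_machine) {
	case EM_X86_64:
		names = x86_simd_names;
		nr = sizeof(x86_simd_names) / sizeof(x86_simd_names[0]);
		break;
	case EM_AARCH64:
		names = arm64_simd_names;
		nr = sizeof(arm64_simd_names) / sizeof(arm64_simd_names[0]);
		break;
	default:
		return NULL; /* unknown arch: caller falls back to a raw dump */
	}
	return reg < nr ? names[reg] : NULL;
}
```

With something like this, a perf built for any arch could resolve names for
a perf.data file recorded elsewhere, which is the cross-platform property
argued for above.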
> >
> > Thanks,
> > Ian
> >
> >>> Thanks,
> >>> Ian
> >>>
> >>>>> Thanks,
> >>>>> Ian
> >>>>>
> >>>>>
> >>>>>
> >>>>>>> Thanks,
> >>>>>>> Ian
> >>>>>>>
> >>>>>>>> +
> >>>>>>>> const char *perf_reg_name(int id, const char *arch)
> >>>>>>>> {
> >>>>>>>> const char *reg_name = NULL;
> >>>>>>>> diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
> >>>>>>>> index f2d0736d65cc..bce9c4cfd1bf 100644
> >>>>>>>> --- a/tools/perf/util/perf_regs.h
> >>>>>>>> +++ b/tools/perf/util/perf_regs.h
> >>>>>>>> @@ -24,9 +24,20 @@ enum {
> >>>>>>>> };
> >>>>>>>>
> >>>>>>>> int arch_sdt_arg_parse_op(char *old_op, char **new_op);
> >>>>>>>> +bool arch_has_simd_regs(u64 mask);
> >>>>>>>> uint64_t arch__intr_reg_mask(void);
> >>>>>>>> uint64_t arch__user_reg_mask(void);
> >>>>>>>> const struct sample_reg *arch__sample_reg_masks(void);
> >>>>>>>> +const struct sample_reg *arch__sample_simd_reg_masks(void);
> >>>>>>>> +const struct sample_reg *arch__sample_pred_reg_masks(void);
> >>>>>>>> +uint64_t arch__intr_simd_reg_mask(void);
> >>>>>>>> +uint64_t arch__user_simd_reg_mask(void);
> >>>>>>>> +uint64_t arch__intr_pred_reg_mask(void);
> >>>>>>>> +uint64_t arch__user_pred_reg_mask(void);
> >>>>>>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>>>>
> >>>>>>>> const char *perf_reg_name(int id, const char *arch);
> >>>>>>>> int perf_reg_value(u64 *valp, struct regs_dump *regs, int id);
> >>>>>>>> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
> >>>>>>>> index ea3a6c4657ee..825ffb4cc53f 100644
> >>>>>>>> --- a/tools/perf/util/record.h
> >>>>>>>> +++ b/tools/perf/util/record.h
> >>>>>>>> @@ -59,7 +59,13 @@ struct record_opts {
> >>>>>>>> unsigned int user_freq;
> >>>>>>>> u64 branch_stack;
> >>>>>>>> u64 sample_intr_regs;
> >>>>>>>> + u64 sample_intr_vec_regs;
> >>>>>>>> u64 sample_user_regs;
> >>>>>>>> + u64 sample_user_vec_regs;
> >>>>>>>> + u16 sample_pred_regs_qwords;
> >>>>>>>> + u16 sample_vec_regs_qwords;
> >>>>>>>> + u16 sample_intr_pred_regs;
> >>>>>>>> + u16 sample_user_pred_regs;
> >>>>>>>> u64 default_interval;
> >>>>>>>> u64 user_interval;
> >>>>>>>> size_t auxtrace_snapshot_size;
> >>>>>>>> --
> >>>>>>>> 2.34.1
> >>>>>>>>
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [Patch v5 16/19] perf/x86: Activate back-to-back NMI detection for arch-PEBS induced NMIs
2025-12-05 12:39 ` Peter Zijlstra
@ 2025-12-07 20:44 ` Andi Kleen
2025-12-08 6:46 ` Mi, Dapeng
1 sibling, 0 replies; 55+ messages in thread
From: Andi Kleen @ 2025-12-07 20:44 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Dapeng Mi, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
Jiri Olsa, Alexander Shishkin, Eranian Stephane, Mark Rutland,
broonie, Ravi Bangoria, linux-kernel, linux-perf-users, Zide Chen,
Falcon Thomas, Dapeng Mi, Xudong Hao
On Fri, Dec 05, 2025 at 01:39:40PM +0100, Peter Zijlstra wrote:
> On Wed, Dec 03, 2025 at 02:54:57PM +0800, Dapeng Mi wrote:
> > When two or more identical PEBS events with the same sampling period are
> > programmed on a mix of PDIST and non-PDIST counters, multiple
> > back-to-back NMIs can be triggered.
>
> This is a hardware defect -- albeit a fairly common one.
Actually I disagree on that. PEBS is essentially a shared-memory
protocol between two asynchronous agents. To prevent this you would somehow
need a locking protocol for the memory; otherwise the sender (PEBS) has
no way to know that the PMI handler has finished reading the memory
buffers.
So it cannot know that the second event was already parsed, and
has to send the second PMI just in case.
It didn't happen with the legacy PEBS because it always
collapsed multiple counters into one, but that was really a race
too.
-Andi
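
To make the race concrete, here is a minimal sketch of the back-to-back NMI
heuristic being discussed: if one NMI handled more than one PMU event, the
next otherwise-unknown NMI at the same instruction pointer is assumed to be
the already-consumed second event and is swallowed. The names here are
illustrative, not the kernel's; the real logic lives in the x86 perf NMI
handling code:

```c
/*
 * Simplified sketch of back-to-back NMI detection. If one NMI handled
 * several PMU events, a follow-up NMI may already be latched for an
 * event we just consumed, so swallow one subsequent "unknown" NMI at
 * the same instruction pointer instead of warning about it.
 */
static unsigned long last_nmi_rip;
static int swallow_one_nmi;

static int pmu_nmi(unsigned long rip, int handled)
{
	if (handled > 1) {
		swallow_one_nmi = 1;	/* arm a one-shot swallow window */
		last_nmi_rip = rip;
	} else if (!handled && swallow_one_nmi && rip == last_nmi_rip) {
		swallow_one_nmi = 0;	/* spurious back-to-back NMI */
		return 1;		/* claim it: no "suspicious NMI" splat */
	}
	return handled;
}
```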
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
2025-12-05 16:35 ` Ian Rogers
@ 2025-12-08 4:20 ` Mi, Dapeng
0 siblings, 0 replies; 55+ messages in thread
From: Mi, Dapeng @ 2025-12-08 4:20 UTC (permalink / raw)
To: Ian Rogers
Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang
On 12/6/2025 12:35 AM, Ian Rogers wrote:
> On Fri, Dec 5, 2025 at 12:10 AM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>
>> On 12/5/2025 2:38 PM, Ian Rogers wrote:
>>> On Thu, Dec 4, 2025 at 8:00 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>>> On 12/5/2025 12:16 AM, Ian Rogers wrote:
>>>>> On Thu, Dec 4, 2025 at 1:20 AM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>>>>> On 12/4/2025 3:49 PM, Ian Rogers wrote:
>>>>>>> On Wed, Dec 3, 2025 at 6:58 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>>>>>>> On 12/4/2025 8:17 AM, Ian Rogers wrote:
>>>>>>>>> On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
>>>>>>>>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>>>>>>>>
>>>>>>>>>> This patch adds support for the newly introduced SIMD register sampling
>>>>>>>>>> format by adding the following functions:
>>>>>>>>>>
>>>>>>>>>> uint64_t arch__intr_simd_reg_mask(void);
>>>>>>>>>> uint64_t arch__user_simd_reg_mask(void);
>>>>>>>>>> uint64_t arch__intr_pred_reg_mask(void);
>>>>>>>>>> uint64_t arch__user_pred_reg_mask(void);
>>>>>>>>>> uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>>>> uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>>>> uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>>>> uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>>>>
>>>>>>>>>> The arch__{intr|user}_simd_reg_mask() functions retrieve the bitmap of
>>>>>>>>>> supported SIMD registers, such as XMM/YMM/ZMM on x86 platforms.
>>>>>>>>>>
>>>>>>>>>> The arch__{intr|user}_pred_reg_mask() functions retrieve the bitmap of
>>>>>>>>>> supported PRED registers, such as OPMASK on x86 platforms.
>>>>>>>>>>
>>>>>>>>>> The arch__{intr|user}_simd_reg_bitmap_qwords() functions provide the
>>>>>>>>>> exact bitmap and number of qwords for a specific type of SIMD register.
>>>>>>>>>> For example, for XMM registers on x86 platforms, the returned bitmap is
>>>>>>>>>> 0xffff (XMM0 ~ XMM15) and the qwords number is 2 (128 bits for each XMM).
>>>>>>>>>>
>>>>>>>>>> The arch__{intr|user}_pred_reg_bitmap_qwords() functions provide the
>>>>>>>>>> exact bitmap and number of qwords for a specific type of PRED register.
>>>>>>>>>> For example, for OPMASK registers on x86 platforms, the returned bitmap
>>>>>>>>>> is 0xff (OPMASK0 ~ OPMASK7) and the qwords number is 1 (64 bits for each
>>>>>>>>>> OPMASK).
>>>>>>>>>>
>>>>>>>>>> Additionally, the function __parse_regs() is enhanced to support parsing
>>>>>>>>>> these newly introduced SIMD registers. Currently, each type of register
>>>>>>>>>> can only be sampled collectively; sampling a specific SIMD register is
>>>>>>>>>> not supported. For example, all XMM registers are sampled together rather
>>>>>>>>>> than sampling only XMM0.
>>>>>>>>>>
>>>>>>>>>> When multiple overlapping register types, such as XMM and YMM, are
>>>>>>>>>> sampled simultaneously, only the superset (YMM registers) is sampled.
>>>>>>>>>>
>>>>>>>>>> With this patch, all supported sampling registers on x86 platforms are
>>>>>>>>>> displayed as follows.
>>>>>>>>>>
>>>>>>>>>> $perf record -I?
>>>>>>>>>> available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>>>>>>>>>> R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>>>>>>>>>> R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>>>>>>>>>>
>>>>>>>>>> $perf record --user-regs=?
>>>>>>>>>> available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>>>>>>>>>> R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>>>>>>>>>> R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>>>>>>>>>> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>>>>>>>>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>>>>>>>>> ---
>>>>>>>>>> tools/perf/arch/x86/util/perf_regs.c | 470 +++++++++++++++++++++-
>>>>>>>>>> tools/perf/util/evsel.c | 27 ++
>>>>>>>>>> tools/perf/util/parse-regs-options.c | 151 ++++++-
>>>>>>>>>> tools/perf/util/perf_event_attr_fprintf.c | 6 +
>>>>>>>>>> tools/perf/util/perf_regs.c | 59 +++
>>>>>>>>>> tools/perf/util/perf_regs.h | 11 +
>>>>>>>>>> tools/perf/util/record.h | 6 +
>>>>>>>>>> 7 files changed, 714 insertions(+), 16 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
>>>>>>>>>> index 12fd93f04802..db41430f3b07 100644
>>>>>>>>>> --- a/tools/perf/arch/x86/util/perf_regs.c
>>>>>>>>>> +++ b/tools/perf/arch/x86/util/perf_regs.c
>>>>>>>>>> @@ -13,6 +13,49 @@
>>>>>>>>>> #include "../../../util/pmu.h"
>>>>>>>>>> #include "../../../util/pmus.h"
>>>>>>>>>>
>>>>>>>>>> +static const struct sample_reg sample_reg_masks_ext[] = {
>>>>>>>>>> + SMPL_REG(AX, PERF_REG_X86_AX),
>>>>>>>>>> + SMPL_REG(BX, PERF_REG_X86_BX),
>>>>>>>>>> + SMPL_REG(CX, PERF_REG_X86_CX),
>>>>>>>>>> + SMPL_REG(DX, PERF_REG_X86_DX),
>>>>>>>>>> + SMPL_REG(SI, PERF_REG_X86_SI),
>>>>>>>>>> + SMPL_REG(DI, PERF_REG_X86_DI),
>>>>>>>>>> + SMPL_REG(BP, PERF_REG_X86_BP),
>>>>>>>>>> + SMPL_REG(SP, PERF_REG_X86_SP),
>>>>>>>>>> + SMPL_REG(IP, PERF_REG_X86_IP),
>>>>>>>>>> + SMPL_REG(FLAGS, PERF_REG_X86_FLAGS),
>>>>>>>>>> + SMPL_REG(CS, PERF_REG_X86_CS),
>>>>>>>>>> + SMPL_REG(SS, PERF_REG_X86_SS),
>>>>>>>>>> +#ifdef HAVE_ARCH_X86_64_SUPPORT
>>>>>>>>>> + SMPL_REG(R8, PERF_REG_X86_R8),
>>>>>>>>>> + SMPL_REG(R9, PERF_REG_X86_R9),
>>>>>>>>>> + SMPL_REG(R10, PERF_REG_X86_R10),
>>>>>>>>>> + SMPL_REG(R11, PERF_REG_X86_R11),
>>>>>>>>>> + SMPL_REG(R12, PERF_REG_X86_R12),
>>>>>>>>>> + SMPL_REG(R13, PERF_REG_X86_R13),
>>>>>>>>>> + SMPL_REG(R14, PERF_REG_X86_R14),
>>>>>>>>>> + SMPL_REG(R15, PERF_REG_X86_R15),
>>>>>>>>>> + SMPL_REG(R16, PERF_REG_X86_R16),
>>>>>>>>>> + SMPL_REG(R17, PERF_REG_X86_R17),
>>>>>>>>>> + SMPL_REG(R18, PERF_REG_X86_R18),
>>>>>>>>>> + SMPL_REG(R19, PERF_REG_X86_R19),
>>>>>>>>>> + SMPL_REG(R20, PERF_REG_X86_R20),
>>>>>>>>>> + SMPL_REG(R21, PERF_REG_X86_R21),
>>>>>>>>>> + SMPL_REG(R22, PERF_REG_X86_R22),
>>>>>>>>>> + SMPL_REG(R23, PERF_REG_X86_R23),
>>>>>>>>>> + SMPL_REG(R24, PERF_REG_X86_R24),
>>>>>>>>>> + SMPL_REG(R25, PERF_REG_X86_R25),
>>>>>>>>>> + SMPL_REG(R26, PERF_REG_X86_R26),
>>>>>>>>>> + SMPL_REG(R27, PERF_REG_X86_R27),
>>>>>>>>>> + SMPL_REG(R28, PERF_REG_X86_R28),
>>>>>>>>>> + SMPL_REG(R29, PERF_REG_X86_R29),
>>>>>>>>>> + SMPL_REG(R30, PERF_REG_X86_R30),
>>>>>>>>>> + SMPL_REG(R31, PERF_REG_X86_R31),
>>>>>>>>>> + SMPL_REG(SSP, PERF_REG_X86_SSP),
>>>>>>>>>> +#endif
>>>>>>>>>> + SMPL_REG_END
>>>>>>>>>> +};
>>>>>>>>>> +
>>>>>>>>>> static const struct sample_reg sample_reg_masks[] = {
>>>>>>>>>> SMPL_REG(AX, PERF_REG_X86_AX),
>>>>>>>>>> SMPL_REG(BX, PERF_REG_X86_BX),
>>>>>>>>>> @@ -276,27 +319,404 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
>>>>>>>>>> return SDT_ARG_VALID;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> +static bool support_simd_reg(u64 sample_type, u16 qwords, u64 mask, bool pred)
>>>>>>>>> To make the code easier to read, it'd be nice to document sample_type,
>>>>>>>>> qwords and mask here.
>>>>>>>> Sure.
>>>>>>>>
>>>>>>>>
>>>>>>>>>> +{
>>>>>>>>>> + struct perf_event_attr attr = {
>>>>>>>>>> + .type = PERF_TYPE_HARDWARE,
>>>>>>>>>> + .config = PERF_COUNT_HW_CPU_CYCLES,
>>>>>>>>>> + .sample_type = sample_type,
>>>>>>>>>> + .disabled = 1,
>>>>>>>>>> + .exclude_kernel = 1,
>>>>>>>>>> + .sample_simd_regs_enabled = 1,
>>>>>>>>>> + };
>>>>>>>>>> + int fd;
>>>>>>>>>> +
>>>>>>>>>> + attr.sample_period = 1;
>>>>>>>>>> +
>>>>>>>>>> + if (!pred) {
>>>>>>>>>> + attr.sample_simd_vec_reg_qwords = qwords;
>>>>>>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>>>>>> + attr.sample_simd_vec_reg_intr = mask;
>>>>>>>>>> + else
>>>>>>>>>> + attr.sample_simd_vec_reg_user = mask;
>>>>>>>>>> + } else {
>>>>>>>>>> + attr.sample_simd_pred_reg_qwords = PERF_X86_OPMASK_QWORDS;
>>>>>>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>>>>>> + attr.sample_simd_pred_reg_intr = PERF_X86_SIMD_PRED_MASK;
>>>>>>>>>> + else
>>>>>>>>>> + attr.sample_simd_pred_reg_user = PERF_X86_SIMD_PRED_MASK;
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> + if (perf_pmus__num_core_pmus() > 1) {
>>>>>>>>>> + struct perf_pmu *pmu = NULL;
>>>>>>>>>> + __u64 type = PERF_TYPE_RAW;
>>>>>>>>> It should be okay to do:
>>>>>>>>> __u64 type = perf_pmus__find_core_pmu()->type
>>>>>>>>> rather than have the whole loop below.
>>>>>>>> Sure. Thanks.
>>>>>>>>
>>>>>>>>
>>>>>>>>>> +
>>>>>>>>>> + /*
>>>>>>>>>> + * The same register set is supported among different hybrid PMUs.
>>>>>>>>>> + * Only check the first available one.
>>>>>>>>>> + */
>>>>>>>>>> + while ((pmu = perf_pmus__scan_core(pmu)) != NULL) {
>>>>>>>>>> + type = pmu->type;
>>>>>>>>>> + break;
>>>>>>>>>> + }
>>>>>>>>>> + attr.config |= type << PERF_PMU_TYPE_SHIFT;
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> + event_attr_init(&attr);
>>>>>>>>>> +
>>>>>>>>>> + fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
>>>>>>>>>> + if (fd != -1) {
>>>>>>>>>> + close(fd);
>>>>>>>>>> + return true;
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> + return false;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static bool __arch_simd_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
>>>>>>>>>> +{
>>>>>>>>>> + bool supported = false;
>>>>>>>>>> + u64 bits;
>>>>>>>>>> +
>>>>>>>>>> + *mask = 0;
>>>>>>>>>> + *qwords = 0;
>>>>>>>>>> +
>>>>>>>>>> + switch (reg) {
>>>>>>>>>> + case PERF_REG_X86_XMM:
>>>>>>>>>> + bits = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
>>>>>>>>>> + supported = support_simd_reg(sample_type, PERF_X86_XMM_QWORDS, bits, false);
>>>>>>>>>> + if (supported) {
>>>>>>>>>> + *mask = bits;
>>>>>>>>>> + *qwords = PERF_X86_XMM_QWORDS;
>>>>>>>>>> + }
>>>>>>>>>> + break;
>>>>>>>>>> + case PERF_REG_X86_YMM:
>>>>>>>>>> + bits = BIT_ULL(PERF_X86_SIMD_YMM_REGS) - 1;
>>>>>>>>>> + supported = support_simd_reg(sample_type, PERF_X86_YMM_QWORDS, bits, false);
>>>>>>>>>> + if (supported) {
>>>>>>>>>> + *mask = bits;
>>>>>>>>>> + *qwords = PERF_X86_YMM_QWORDS;
>>>>>>>>>> + }
>>>>>>>>>> + break;
>>>>>>>>>> + case PERF_REG_X86_ZMM:
>>>>>>>>>> + bits = BIT_ULL(PERF_X86_SIMD_ZMM_REGS) - 1;
>>>>>>>>>> + supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
>>>>>>>>>> + if (supported) {
>>>>>>>>>> + *mask = bits;
>>>>>>>>>> + *qwords = PERF_X86_ZMM_QWORDS;
>>>>>>>>>> + break;
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> + bits = BIT_ULL(PERF_X86_SIMD_ZMMH_REGS) - 1;
>>>>>>>>>> + supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
>>>>>>>>>> + if (supported) {
>>>>>>>>>> + *mask = bits;
>>>>>>>>>> + *qwords = PERF_X86_ZMMH_QWORDS;
>>>>>>>>>> + }
>>>>>>>>>> + break;
>>>>>>>>>> + default:
>>>>>>>>>> + break;
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> + return supported;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static bool __arch_pred_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
>>>>>>>>>> +{
>>>>>>>>>> + bool supported = false;
>>>>>>>>>> + u64 bits;
>>>>>>>>>> +
>>>>>>>>>> + *mask = 0;
>>>>>>>>>> + *qwords = 0;
>>>>>>>>>> +
>>>>>>>>>> + switch (reg) {
>>>>>>>>>> + case PERF_REG_X86_OPMASK:
>>>>>>>>>> + bits = BIT_ULL(PERF_X86_SIMD_OPMASK_REGS) - 1;
>>>>>>>>>> + supported = support_simd_reg(sample_type, PERF_X86_OPMASK_QWORDS, bits, true);
>>>>>>>>>> + if (supported) {
>>>>>>>>>> + *mask = bits;
>>>>>>>>>> + *qwords = PERF_X86_OPMASK_QWORDS;
>>>>>>>>>> + }
>>>>>>>>>> + break;
>>>>>>>>>> + default:
>>>>>>>>>> + break;
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> + return supported;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static bool has_cap_simd_regs(void)
>>>>>>>>>> +{
>>>>>>>>>> + uint64_t mask = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
>>>>>>>>>> + u16 qwords = PERF_X86_XMM_QWORDS;
>>>>>>>>>> + static bool has_cap_simd_regs;
>>>>>>>>>> + static bool cached;
>>>>>>>>>> +
>>>>>>>>>> + if (cached)
>>>>>>>>>> + return has_cap_simd_regs;
>>>>>>>>>> +
>>>>>>>>>> + has_cap_simd_regs = __arch_simd_reg_mask(PERF_SAMPLE_REGS_INTR,
>>>>>>>>>> + PERF_REG_X86_XMM, &mask, &qwords);
>>>>>>>>>> + has_cap_simd_regs |= __arch_simd_reg_mask(PERF_SAMPLE_REGS_USER,
>>>>>>>>>> + PERF_REG_X86_XMM, &mask, &qwords);
>>>>>>>>>> + cached = true;
>>>>>>>>>> +
>>>>>>>>>> + return has_cap_simd_regs;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +bool arch_has_simd_regs(u64 mask)
>>>>>>>>>> +{
>>>>>>>>>> + return has_cap_simd_regs() &&
>>>>>>>>>> + mask & GENMASK_ULL(PERF_REG_X86_SSP, PERF_REG_X86_R16);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static const struct sample_reg sample_simd_reg_masks[] = {
>>>>>>>>>> + SMPL_REG(XMM, PERF_REG_X86_XMM),
>>>>>>>>>> + SMPL_REG(YMM, PERF_REG_X86_YMM),
>>>>>>>>>> + SMPL_REG(ZMM, PERF_REG_X86_ZMM),
>>>>>>>>>> + SMPL_REG_END
>>>>>>>>>> +};
>>>>>>>>>> +
>>>>>>>>>> +static const struct sample_reg sample_pred_reg_masks[] = {
>>>>>>>>>> + SMPL_REG(OPMASK, PERF_REG_X86_OPMASK),
>>>>>>>>>> + SMPL_REG_END
>>>>>>>>>> +};
>>>>>>>>>> +
>>>>>>>>>> +const struct sample_reg *arch__sample_simd_reg_masks(void)
>>>>>>>>>> +{
>>>>>>>>>> + return sample_simd_reg_masks;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +const struct sample_reg *arch__sample_pred_reg_masks(void)
>>>>>>>>>> +{
>>>>>>>>>> + return sample_pred_reg_masks;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static bool x86_intr_simd_updated;
>>>>>>>>>> +static u64 x86_intr_simd_reg_mask;
>>>>>>>>>> +static u64 x86_intr_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
>>>>>>>>>> +static u16 x86_intr_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
>>>>>>>>> Could we add some comments? I can kind of figure out that "updated" is a
>>>>>>>>> check for lazy initialization and what the masks are; "qwords" is an odd
>>>>>>>>> one. The comment could also point out that SIMD doesn't mean the
>>>>>>>>> machine supports SIMD, but that SIMD registers are supported in perf
>>>>>>>>> events.
>>>>>>>> Sure.
>>>>>>>>
>>>>>>>>
>>>>>>>>>> +static bool x86_user_simd_updated;
>>>>>>>>>> +static u64 x86_user_simd_reg_mask;
>>>>>>>>>> +static u64 x86_user_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
>>>>>>>>>> +static u16 x86_user_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
>>>>>>>>>> +
>>>>>>>>>> +static bool x86_intr_pred_updated;
>>>>>>>>>> +static u64 x86_intr_pred_reg_mask;
>>>>>>>>>> +static u64 x86_intr_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
>>>>>>>>>> +static u16 x86_intr_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
>>>>>>>>>> +static bool x86_user_pred_updated;
>>>>>>>>>> +static u64 x86_user_pred_reg_mask;
>>>>>>>>>> +static u64 x86_user_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
>>>>>>>>>> +static u16 x86_user_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
>>>>>>>>>> +
>>>>>>>>>> +static uint64_t __arch__simd_reg_mask(u64 sample_type)
>>>>>>>>>> +{
>>>>>>>>>> + const struct sample_reg *r = NULL;
>>>>>>>>>> + bool supported;
>>>>>>>>>> + u64 mask = 0;
>>>>>>>>>> + int reg;
>>>>>>>>>> +
>>>>>>>>>> + if (!has_cap_simd_regs())
>>>>>>>>>> + return 0;
>>>>>>>>>> +
>>>>>>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_simd_updated)
>>>>>>>>>> + return x86_intr_simd_reg_mask;
>>>>>>>>>> +
>>>>>>>>>> + if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_simd_updated)
>>>>>>>>>> + return x86_user_simd_reg_mask;
>>>>>>>>>> +
>>>>>>>>>> + for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>>>>>>>>>> + supported = false;
>>>>>>>>>> +
>>>>>>>>>> + if (!r->mask)
>>>>>>>>>> + continue;
>>>>>>>>>> + reg = fls64(r->mask) - 1;
>>>>>>>>>> +
>>>>>>>>>> + if (reg >= PERF_REG_X86_MAX_SIMD_REGS)
>>>>>>>>>> + break;
>>>>>>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>>>>>> + supported = __arch_simd_reg_mask(sample_type, reg,
>>>>>>>>>> + &x86_intr_simd_mask[reg],
>>>>>>>>>> + &x86_intr_simd_qwords[reg]);
>>>>>>>>>> + else if (sample_type == PERF_SAMPLE_REGS_USER)
>>>>>>>>>> + supported = __arch_simd_reg_mask(sample_type, reg,
>>>>>>>>>> + &x86_user_simd_mask[reg],
>>>>>>>>>> + &x86_user_simd_qwords[reg]);
>>>>>>>>>> + if (supported)
>>>>>>>>>> + mask |= BIT_ULL(reg);
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR) {
>>>>>>>>>> + x86_intr_simd_reg_mask = mask;
>>>>>>>>>> + x86_intr_simd_updated = true;
>>>>>>>>>> + } else {
>>>>>>>>>> + x86_user_simd_reg_mask = mask;
>>>>>>>>>> + x86_user_simd_updated = true;
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> + return mask;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static uint64_t __arch__pred_reg_mask(u64 sample_type)
>>>>>>>>>> +{
>>>>>>>>>> + const struct sample_reg *r = NULL;
>>>>>>>>>> + bool supported;
>>>>>>>>>> + u64 mask = 0;
>>>>>>>>>> + int reg;
>>>>>>>>>> +
>>>>>>>>>> + if (!has_cap_simd_regs())
>>>>>>>>>> + return 0;
>>>>>>>>>> +
>>>>>>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_pred_updated)
>>>>>>>>>> + return x86_intr_pred_reg_mask;
>>>>>>>>>> +
>>>>>>>>>> + if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_pred_updated)
>>>>>>>>>> + return x86_user_pred_reg_mask;
>>>>>>>>>> +
>>>>>>>>>> + for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>>>>>>>>>> + supported = false;
>>>>>>>>>> +
>>>>>>>>>> + if (!r->mask)
>>>>>>>>>> + continue;
>>>>>>>>>> + reg = fls64(r->mask) - 1;
>>>>>>>>>> +
>>>>>>>>>> + if (reg >= PERF_REG_X86_MAX_PRED_REGS)
>>>>>>>>>> + break;
>>>>>>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>>>>>> + supported = __arch_pred_reg_mask(sample_type, reg,
>>>>>>>>>> + &x86_intr_pred_mask[reg],
>>>>>>>>>> + &x86_intr_pred_qwords[reg]);
>>>>>>>>>> + else if (sample_type == PERF_SAMPLE_REGS_USER)
>>>>>>>>>> + supported = __arch_pred_reg_mask(sample_type, reg,
>>>>>>>>>> + &x86_user_pred_mask[reg],
>>>>>>>>>> + &x86_user_pred_qwords[reg]);
>>>>>>>>>> + if (supported)
>>>>>>>>>> + mask |= BIT_ULL(reg);
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR) {
>>>>>>>>>> + x86_intr_pred_reg_mask = mask;
>>>>>>>>>> + x86_intr_pred_updated = true;
>>>>>>>>>> + } else {
>>>>>>>>>> + x86_user_pred_reg_mask = mask;
>>>>>>>>>> + x86_user_pred_updated = true;
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> + return mask;
>>>>>>>>>> +}
>>>>>>>>> This feels repetitive with __arch__simd_reg_mask, could they be
>>>>>>>>> refactored together?
>>>>>>>> hmm, it looks like we can extract the for loop into a common function. The
>>>>>>>> other parts are hard to generalize since they manipulate different
>>>>>>>> variables. If we tried to generalize them, we would have to introduce lots
>>>>>>>> of "if ... else" branches, and that would make the code hard to read.
>>>>>>>>
>>>>>>>>
>>>>>>>>>> +
>>>>>>>>>> +uint64_t arch__intr_simd_reg_mask(void)
>>>>>>>>>> +{
>>>>>>>>>> + return __arch__simd_reg_mask(PERF_SAMPLE_REGS_INTR);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t arch__user_simd_reg_mask(void)
>>>>>>>>>> +{
>>>>>>>>>> + return __arch__simd_reg_mask(PERF_SAMPLE_REGS_USER);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t arch__intr_pred_reg_mask(void)
>>>>>>>>>> +{
>>>>>>>>>> + return __arch__pred_reg_mask(PERF_SAMPLE_REGS_INTR);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t arch__user_pred_reg_mask(void)
>>>>>>>>>> +{
>>>>>>>>>> + return __arch__pred_reg_mask(PERF_SAMPLE_REGS_USER);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static uint64_t arch__simd_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
>>>>>>>>>> +{
>>>>>>>>>> + uint64_t mask = 0;
>>>>>>>>>> +
>>>>>>>>>> + *qwords = 0;
>>>>>>>>>> + if (reg < PERF_REG_X86_MAX_SIMD_REGS) {
>>>>>>>>>> + if (intr) {
>>>>>>>>>> + *qwords = x86_intr_simd_qwords[reg];
>>>>>>>>>> + mask = x86_intr_simd_mask[reg];
>>>>>>>>>> + } else {
>>>>>>>>>> + *qwords = x86_user_simd_qwords[reg];
>>>>>>>>>> + mask = x86_user_simd_mask[reg];
>>>>>>>>>> + }
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> + return mask;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static uint64_t arch__pred_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
>>>>>>>>>> +{
>>>>>>>>>> + uint64_t mask = 0;
>>>>>>>>>> +
>>>>>>>>>> + *qwords = 0;
>>>>>>>>>> + if (reg < PERF_REG_X86_MAX_PRED_REGS) {
>>>>>>>>>> + if (intr) {
>>>>>>>>>> + *qwords = x86_intr_pred_qwords[reg];
>>>>>>>>>> + mask = x86_intr_pred_mask[reg];
>>>>>>>>>> + } else {
>>>>>>>>>> + *qwords = x86_user_pred_qwords[reg];
>>>>>>>>>> + mask = x86_user_pred_mask[reg];
>>>>>>>>>> + }
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> + return mask;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>>>>>> +{
>>>>>>>>>> + if (!x86_intr_simd_updated)
>>>>>>>>>> + arch__intr_simd_reg_mask();
>>>>>>>>>> + return arch__simd_reg_bitmap_qwords(reg, qwords, true);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>>>>>> +{
>>>>>>>>>> + if (!x86_user_simd_updated)
>>>>>>>>>> + arch__user_simd_reg_mask();
>>>>>>>>>> + return arch__simd_reg_bitmap_qwords(reg, qwords, false);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>>>>>> +{
>>>>>>>>>> + if (!x86_intr_pred_updated)
>>>>>>>>>> + arch__intr_pred_reg_mask();
>>>>>>>>>> + return arch__pred_reg_bitmap_qwords(reg, qwords, true);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>>>>>> +{
>>>>>>>>>> + if (!x86_user_pred_updated)
>>>>>>>>>> + arch__user_pred_reg_mask();
>>>>>>>>>> + return arch__pred_reg_bitmap_qwords(reg, qwords, false);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> const struct sample_reg *arch__sample_reg_masks(void)
>>>>>>>>>> {
>>>>>>>>>> + if (has_cap_simd_regs())
>>>>>>>>>> + return sample_reg_masks_ext;
>>>>>>>>>> return sample_reg_masks;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> -uint64_t arch__intr_reg_mask(void)
>>>>>>>>>> +static uint64_t __arch__reg_mask(u64 sample_type, u64 mask, bool has_simd_regs)
>>>>>>>>>> {
>>>>>>>>>> struct perf_event_attr attr = {
>>>>>>>>>> - .type = PERF_TYPE_HARDWARE,
>>>>>>>>>> - .config = PERF_COUNT_HW_CPU_CYCLES,
>>>>>>>>>> - .sample_type = PERF_SAMPLE_REGS_INTR,
>>>>>>>>>> - .sample_regs_intr = PERF_REG_EXTENDED_MASK,
>>>>>>>>>> - .precise_ip = 1,
>>>>>>>>>> - .disabled = 1,
>>>>>>>>>> - .exclude_kernel = 1,
>>>>>>>>>> + .type = PERF_TYPE_HARDWARE,
>>>>>>>>>> + .config = PERF_COUNT_HW_CPU_CYCLES,
>>>>>>>>>> + .sample_type = sample_type,
>>>>>>>>>> + .precise_ip = 1,
>>>>>>>>>> + .disabled = 1,
>>>>>>>>>> + .exclude_kernel = 1,
>>>>>>>>>> + .sample_simd_regs_enabled = has_simd_regs,
>>>>>>>>>> };
>>>>>>>>>> int fd;
>>>>>>>>>> /*
>>>>>>>>>> * In an unnamed union, init it here to build on older gcc versions
>>>>>>>>>> */
>>>>>>>>>> attr.sample_period = 1;
>>>>>>>>>> + if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>>>>>> + attr.sample_regs_intr = mask;
>>>>>>>>>> + else
>>>>>>>>>> + attr.sample_regs_user = mask;
>>>>>>>>>>
>>>>>>>>>> if (perf_pmus__num_core_pmus() > 1) {
>>>>>>>>>> struct perf_pmu *pmu = NULL;
>>>>>>>>>> @@ -318,13 +738,41 @@ uint64_t arch__intr_reg_mask(void)
>>>>>>>>>> fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
>>>>>>>>>> if (fd != -1) {
>>>>>>>>>> close(fd);
>>>>>>>>>> - return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
>>>>>>>>>> + return mask;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> - return PERF_REGS_MASK;
>>>>>>>>>> + return 0;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t arch__intr_reg_mask(void)
>>>>>>>>>> +{
>>>>>>>>>> + uint64_t mask = PERF_REGS_MASK;
>>>>>>>>>> +
>>>>>>>>>> + if (has_cap_simd_regs()) {
>>>>>>>>>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
>>>>>>>>>> + GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
>>>>>>>>>> + true);
>>>>>>>>> It's nice to label constant arguments like this something like:
>>>>>>>>> /*has_simd_regs=*/true);
>>>>>>>>>
>>>>>>>>> Tools like clang-tidy even try to enforce that argument names match the comments.
>>>>>>>> Sure.
>>>>>>>>
>>>>>>>>
>>>>>>>>>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
>>>>>>>>>> + BIT_ULL(PERF_REG_X86_SSP),
>>>>>>>>>> + true);
>>>>>>>>>> + } else
>>>>>>>>>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR, PERF_REG_EXTENDED_MASK, false);
>>>>>>>>>> +
>>>>>>>>>> + return mask;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> uint64_t arch__user_reg_mask(void)
>>>>>>>>>> {
>>>>>>>>>> - return PERF_REGS_MASK;
>>>>>>>>>> + uint64_t mask = PERF_REGS_MASK;
>>>>>>>>>> +
>>>>>>>>>> + if (has_cap_simd_regs()) {
>>>>>>>>>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
>>>>>>>>>> + GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
>>>>>>>>>> + true);
>>>>>>>>>> + mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
>>>>>>>>>> + BIT_ULL(PERF_REG_X86_SSP),
>>>>>>>>>> + true);
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> + return mask;
>>>>>>>>> The code is repetitive here, could we refactor into a single function
>>>>>>>>> passing in a user or instr value?
>>>>>>>> Sure. Would extract the common part.
>>>>>>>>
>>>>>>>>
>>>>>>>>>> }
>>>>>>>>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>>>>>>>>>> index 56ebefd075f2..5d1d90cf9488 100644
>>>>>>>>>> --- a/tools/perf/util/evsel.c
>>>>>>>>>> +++ b/tools/perf/util/evsel.c
>>>>>>>>>> @@ -1461,12 +1461,39 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
>>>>>>>>>> if (opts->sample_intr_regs && !evsel->no_aux_samples &&
>>>>>>>>>> !evsel__is_dummy_event(evsel)) {
>>>>>>>>>> attr->sample_regs_intr = opts->sample_intr_regs;
>>>>>>>>>> + attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_intr);
>>>>>>>>>> + evsel__set_sample_bit(evsel, REGS_INTR);
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> + if ((opts->sample_intr_vec_regs || opts->sample_intr_pred_regs) &&
>>>>>>>>>> + !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
>>>>>>>>>> + /* A nonzero pred qwords implies the set of SIMD registers is used */
>>>>>>>>>> + if (opts->sample_pred_regs_qwords)
>>>>>>>>>> + attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
>>>>>>>>>> + else
>>>>>>>>>> + attr->sample_simd_pred_reg_qwords = 1;
>>>>>>>>>> + attr->sample_simd_vec_reg_intr = opts->sample_intr_vec_regs;
>>>>>>>>>> + attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
>>>>>>>>>> + attr->sample_simd_pred_reg_intr = opts->sample_intr_pred_regs;
>>>>>>>>>> evsel__set_sample_bit(evsel, REGS_INTR);
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> if (opts->sample_user_regs && !evsel->no_aux_samples &&
>>>>>>>>>> !evsel__is_dummy_event(evsel)) {
>>>>>>>>>> attr->sample_regs_user |= opts->sample_user_regs;
>>>>>>>>>> + attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_user);
>>>>>>>>>> + evsel__set_sample_bit(evsel, REGS_USER);
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> + if ((opts->sample_user_vec_regs || opts->sample_user_pred_regs) &&
>>>>>>>>>> + !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
>>>>>>>>>> + if (opts->sample_pred_regs_qwords)
>>>>>>>>>> + attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
>>>>>>>>>> + else
>>>>>>>>>> + attr->sample_simd_pred_reg_qwords = 1;
>>>>>>>>>> + attr->sample_simd_vec_reg_user = opts->sample_user_vec_regs;
>>>>>>>>>> + attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
>>>>>>>>>> + attr->sample_simd_pred_reg_user = opts->sample_user_pred_regs;
>>>>>>>>>> evsel__set_sample_bit(evsel, REGS_USER);
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
>>>>>>>>>> index cda1c620968e..0bd100392889 100644
>>>>>>>>>> --- a/tools/perf/util/parse-regs-options.c
>>>>>>>>>> +++ b/tools/perf/util/parse-regs-options.c
>>>>>>>>>> @@ -4,19 +4,139 @@
>>>>>>>>>> #include <stdint.h>
>>>>>>>>>> #include <string.h>
>>>>>>>>>> #include <stdio.h>
>>>>>>>>>> +#include <linux/bitops.h>
>>>>>>>>>> #include "util/debug.h"
>>>>>>>>>> #include <subcmd/parse-options.h>
>>>>>>>>>> #include "util/perf_regs.h"
>>>>>>>>>> #include "util/parse-regs-options.h"
>>>>>>>>>> +#include "record.h"
>>>>>>>>>> +
>>>>>>>>>> +static void __print_simd_regs(bool intr, uint64_t simd_mask)
>>>>>>>>>> +{
>>>>>>>>>> + const struct sample_reg *r = NULL;
>>>>>>>>>> + uint64_t bitmap = 0;
>>>>>>>>>> + u16 qwords = 0;
>>>>>>>>>> + int reg_idx;
>>>>>>>>>> +
>>>>>>>>>> + if (!simd_mask)
>>>>>>>>>> + return;
>>>>>>>>>> +
>>>>>>>>>> + for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>>>>>>>>>> + if (!(r->mask & simd_mask))
>>>>>>>>>> + continue;
>>>>>>>>>> + reg_idx = fls64(r->mask) - 1;
>>>>>>>>>> + if (intr)
>>>>>>>>>> + bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>>>> + else
>>>>>>>>>> + bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>>>> + if (bitmap)
>>>>>>>>>> + fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
>>>>>>>>>> + }
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static void __print_pred_regs(bool intr, uint64_t pred_mask)
>>>>>>>>>> +{
>>>>>>>>>> + const struct sample_reg *r = NULL;
>>>>>>>>>> + uint64_t bitmap = 0;
>>>>>>>>>> + u16 qwords = 0;
>>>>>>>>>> + int reg_idx;
>>>>>>>>>> +
>>>>>>>>>> + if (!pred_mask)
>>>>>>>>>> + return;
>>>>>>>>>> +
>>>>>>>>>> + for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>>>>>>>>>> + if (!(r->mask & pred_mask))
>>>>>>>>>> + continue;
>>>>>>>>>> + reg_idx = fls64(r->mask) - 1;
>>>>>>>>>> + if (intr)
>>>>>>>>>> + bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>>>> + else
>>>>>>>>>> + bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>>>> + if (bitmap)
>>>>>>>>>> + fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
>>>>>>>>>> + }
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static bool __parse_simd_regs(struct record_opts *opts, char *s, bool intr)
>>>>>>>>>> +{
>>>>>>>>>> + const struct sample_reg *r = NULL;
>>>>>>>>>> + bool matched = false;
>>>>>>>>>> + uint64_t bitmap = 0;
>>>>>>>>>> + u16 qwords = 0;
>>>>>>>>>> + int reg_idx;
>>>>>>>>>> +
>>>>>>>>>> + for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>>>>>>>>>> + if (strcasecmp(s, r->name))
>>>>>>>>>> + continue;
>>>>>>>>>> + if (!fls64(r->mask))
>>>>>>>>>> + continue;
>>>>>>>>>> + reg_idx = fls64(r->mask) - 1;
>>>>>>>>>> + if (intr)
>>>>>>>>>> + bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>>>> + else
>>>>>>>>>> + bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>>>> + matched = true;
>>>>>>>>>> + break;
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> + /* Just need the highest qwords */
>>>>>>>>> I'm not following here. Does the bitmap need to handle gaps?
>>>>>>>> Currently no. In theory, the kernel lets user space sample only a
>>>>>>>> subset of SIMD registers, e.g., 0xff or 0xf0f for XMM registers (HW
>>>>>>>> supports 16 XMM registers), but this isn't supported in order to avoid
>>>>>>>> introducing too much complexity in perf tools. Moreover, I don't think end
>>>>>>>> users have such a requirement. In most cases, users only know which
>>>>>>>> kinds of SIMD registers their programs use, but usually don't know or
>>>>>>>> care about which exact SIMD register is used.
>>>>>>>>
>>>>>>>>
>>>>>>>>>> + if (qwords > opts->sample_vec_regs_qwords) {
>>>>>>>>>> + opts->sample_vec_regs_qwords = qwords;
>>>>>>>>>> + if (intr)
>>>>>>>>>> + opts->sample_intr_vec_regs = bitmap;
>>>>>>>>>> + else
>>>>>>>>>> + opts->sample_user_vec_regs = bitmap;
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> + return matched;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static bool __parse_pred_regs(struct record_opts *opts, char *s, bool intr)
>>>>>>>>>> +{
>>>>>>>>>> + const struct sample_reg *r = NULL;
>>>>>>>>>> + bool matched = false;
>>>>>>>>>> + uint64_t bitmap = 0;
>>>>>>>>>> + u16 qwords = 0;
>>>>>>>>>> + int reg_idx;
>>>>>>>>>> +
>>>>>>>>>> + for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>>>>>>>>>> + if (strcasecmp(s, r->name))
>>>>>>>>>> + continue;
>>>>>>>>>> + if (!fls64(r->mask))
>>>>>>>>>> + continue;
>>>>>>>>>> + reg_idx = fls64(r->mask) - 1;
>>>>>>>>>> + if (intr)
>>>>>>>>>> + bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>>>> + else
>>>>>>>>>> + bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>>>> + matched = true;
>>>>>>>>>> + break;
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> + /* Just need the highest qwords */
>>>>>>>>> Again repetitive, could we have a single function?
>>>>>>>> Yes, I suppose the for loop at least can be extracted as a common function.
>>>>>>>>
>>>>>>>>
>>>>>>>>>> + if (qwords > opts->sample_pred_regs_qwords) {
>>>>>>>>>> + opts->sample_pred_regs_qwords = qwords;
>>>>>>>>>> + if (intr)
>>>>>>>>>> + opts->sample_intr_pred_regs = bitmap;
>>>>>>>>>> + else
>>>>>>>>>> + opts->sample_user_pred_regs = bitmap;
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> + return matched;
>>>>>>>>>> +}
>>>>>>>>>>
>>>>>>>>>> static int
>>>>>>>>>> __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>>>>> {
>>>>>>>>>> uint64_t *mode = (uint64_t *)opt->value;
>>>>>>>>>> const struct sample_reg *r = NULL;
>>>>>>>>>> + struct record_opts *opts;
>>>>>>>>>> char *s, *os = NULL, *p;
>>>>>>>>>> - int ret = -1;
>>>>>>>>>> + bool has_simd_regs = false;
>>>>>>>>>> uint64_t mask;
>>>>>>>>>> + uint64_t simd_mask;
>>>>>>>>>> + uint64_t pred_mask;
>>>>>>>>>> + int ret = -1;
>>>>>>>>>>
>>>>>>>>>> if (unset)
>>>>>>>>>> return 0;
>>>>>>>>>> @@ -27,10 +147,17 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>>>>> if (*mode)
>>>>>>>>>> return -1;
>>>>>>>>>>
>>>>>>>>>> - if (intr)
>>>>>>>>>> + if (intr) {
>>>>>>>>>> + opts = container_of(opt->value, struct record_opts, sample_intr_regs);
>>>>>>>>>> mask = arch__intr_reg_mask();
>>>>>>>>>> - else
>>>>>>>>>> + simd_mask = arch__intr_simd_reg_mask();
>>>>>>>>>> + pred_mask = arch__intr_pred_reg_mask();
>>>>>>>>>> + } else {
>>>>>>>>>> + opts = container_of(opt->value, struct record_opts, sample_user_regs);
>>>>>>>>>> mask = arch__user_reg_mask();
>>>>>>>>>> + simd_mask = arch__user_simd_reg_mask();
>>>>>>>>>> + pred_mask = arch__user_pred_reg_mask();
>>>>>>>>>> + }
>>>>>>>>>>
>>>>>>>>>> /* str may be NULL in case no arg is passed to -I */
>>>>>>>>>> if (str) {
>>>>>>>>>> @@ -50,10 +177,24 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>>>>> if (r->mask & mask)
>>>>>>>>>> fprintf(stderr, "%s ", r->name);
>>>>>>>>>> }
>>>>>>>>>> + __print_simd_regs(intr, simd_mask);
>>>>>>>>>> + __print_pred_regs(intr, pred_mask);
>>>>>>>>>> fputc('\n', stderr);
>>>>>>>>>> /* just printing available regs */
>>>>>>>>>> goto error;
>>>>>>>>>> }
>>>>>>>>>> +
>>>>>>>>>> + if (simd_mask) {
>>>>>>>>>> + has_simd_regs = __parse_simd_regs(opts, s, intr);
>>>>>>>>>> + if (has_simd_regs)
>>>>>>>>>> + goto next;
>>>>>>>>>> + }
>>>>>>>>>> + if (pred_mask) {
>>>>>>>>>> + has_simd_regs = __parse_pred_regs(opts, s, intr);
>>>>>>>>>> + if (has_simd_regs)
>>>>>>>>>> + goto next;
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> for (r = arch__sample_reg_masks(); r->name; r++) {
>>>>>>>>>> if ((r->mask & mask) && !strcasecmp(s, r->name))
>>>>>>>>>> break;
>>>>>>>>>> @@ -65,7 +206,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> *mode |= r->mask;
>>>>>>>>>> -
>>>>>>>>>> +next:
>>>>>>>>>> if (!p)
>>>>>>>>>> break;
>>>>>>>>>>
>>>>>>>>>> @@ -75,7 +216,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>>>>> ret = 0;
>>>>>>>>>>
>>>>>>>>>> /* default to all possible regs */
>>>>>>>>>> - if (*mode == 0)
>>>>>>>>>> + if (*mode == 0 && !has_simd_regs)
>>>>>>>>>> *mode = mask;
>>>>>>>>>> error:
>>>>>>>>>> free(os);
>>>>>>>>>> diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
>>>>>>>>>> index 66b666d9ce64..fb0366d050cf 100644
>>>>>>>>>> --- a/tools/perf/util/perf_event_attr_fprintf.c
>>>>>>>>>> +++ b/tools/perf/util/perf_event_attr_fprintf.c
>>>>>>>>>> @@ -360,6 +360,12 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
>>>>>>>>>> PRINT_ATTRf(aux_start_paused, p_unsigned);
>>>>>>>>>> PRINT_ATTRf(aux_pause, p_unsigned);
>>>>>>>>>> PRINT_ATTRf(aux_resume, p_unsigned);
>>>>>>>>>> + PRINT_ATTRf(sample_simd_pred_reg_qwords, p_unsigned);
>>>>>>>>>> + PRINT_ATTRf(sample_simd_pred_reg_intr, p_hex);
>>>>>>>>>> + PRINT_ATTRf(sample_simd_pred_reg_user, p_hex);
>>>>>>>>>> + PRINT_ATTRf(sample_simd_vec_reg_qwords, p_unsigned);
>>>>>>>>>> + PRINT_ATTRf(sample_simd_vec_reg_intr, p_hex);
>>>>>>>>>> + PRINT_ATTRf(sample_simd_vec_reg_user, p_hex);
>>>>>>>>>>
>>>>>>>>>> return ret;
>>>>>>>>>> }
>>>>>>>>>> diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
>>>>>>>>>> index 44b90bbf2d07..e8a9fabc92e6 100644
>>>>>>>>>> --- a/tools/perf/util/perf_regs.c
>>>>>>>>>> +++ b/tools/perf/util/perf_regs.c
>>>>>>>>>> @@ -11,6 +11,11 @@ int __weak arch_sdt_arg_parse_op(char *old_op __maybe_unused,
>>>>>>>>>> return SDT_ARG_SKIP;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> +bool __weak arch_has_simd_regs(u64 mask __maybe_unused)
>>>>>>>>>> +{
>>>>>>>>>> + return false;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> uint64_t __weak arch__intr_reg_mask(void)
>>>>>>>>>> {
>>>>>>>>>> return 0;
>>>>>>>>>> @@ -21,6 +26,50 @@ uint64_t __weak arch__user_reg_mask(void)
>>>>>>>>>> return 0;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> +uint64_t __weak arch__intr_simd_reg_mask(void)
>>>>>>>>>> +{
>>>>>>>>>> + return 0;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t __weak arch__user_simd_reg_mask(void)
>>>>>>>>>> +{
>>>>>>>>>> + return 0;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t __weak arch__intr_pred_reg_mask(void)
>>>>>>>>>> +{
>>>>>>>>>> + return 0;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t __weak arch__user_pred_reg_mask(void)
>>>>>>>>>> +{
>>>>>>>>>> + return 0;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t __weak arch__intr_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>>>>>>>>>> +{
>>>>>>>>>> + *qwords = 0;
>>>>>>>>>> + return 0;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t __weak arch__user_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>>>>>>>>>> +{
>>>>>>>>>> + *qwords = 0;
>>>>>>>>>> + return 0;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t __weak arch__intr_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>>>>>>>>>> +{
>>>>>>>>>> + *qwords = 0;
>>>>>>>>>> + return 0;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t __weak arch__user_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>>>>>>>>>> +{
>>>>>>>>>> + *qwords = 0;
>>>>>>>>>> + return 0;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> static const struct sample_reg sample_reg_masks[] = {
>>>>>>>>>> SMPL_REG_END
>>>>>>>>>> };
>>>>>>>>>> @@ -30,6 +79,16 @@ const struct sample_reg * __weak arch__sample_reg_masks(void)
>>>>>>>>>> return sample_reg_masks;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> +const struct sample_reg * __weak arch__sample_simd_reg_masks(void)
>>>>>>>>>> +{
>>>>>>>>>> + return sample_reg_masks;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +const struct sample_reg * __weak arch__sample_pred_reg_masks(void)
>>>>>>>>>> +{
>>>>>>>>>> + return sample_reg_masks;
>>>>>>>>>> +}
>>>>>>>>> Thinking out loud. I wonder if there is a way to hide the weak
>>>>>>>>> functions. It seems the support is tied to PMUs, particularly core
>>>>>>>>> PMUs, perhaps we can push things into pmu and arch pmu code. Then we
>>>>>>>>> ask the PMU to parse the register strings, set up the perf_event_attr,
>>>>>>>>> etc. I'm somewhat scared these functions will be used on the report
>>>>>>>>> rather than record side of things, thereby breaking perf.data support
>>>>>>>>> when the host kernel does or doesn't have the SIMD support.
>>>>>>>> Ian, I don't quite follow your words.
>>>>>>>>
>>>>>>>> I don't quite understand what we should do to "push things into pmu and
>>>>>>>> arch pmu code". The current SIMD register support follows the same approach
>>>>>>>> as the general register support. If we intend to change the approach
>>>>>>>> entirely, we'd better have an independent patch set.
>>>>>>>>
>>>>>>>> Why would these functions break perf.data reporting? perf-report checks
>>>>>>>> whether the PERF_SAMPLE_REGS_ABI_SIMD flag is set for each record; only
>>>>>>>> when the flag is set (indicating there is SIMD register data appended to
>>>>>>>> the record) does perf-report try to parse the SIMD register data.
>>>>>>> Thanks Dapeng, sorry I wasn't clear. So, I've landed clean ups to
>>>>>>> remove weak symbols like:
>>>>>>> https://lore.kernel.org/lkml/20250724163302.596743-21-irogers@google.com/#t
>>>>>>>
>>>>>>> For these patches what I'm imagining is that there is a Nova Lake
>>>>>>> generated perf.data file. Using perf report, script, etc. on the Nova
>>>>>>> Lake should expose all of the same mask, qword, etc. values as when
>>>>>>> the perf.data was generated and so things will work. If the perf.data
>>>>>>> file was taken to say my Alderlake then what will happen? Generally
>>>>>>> using the arch directory and weak symbols is a code smell that cross
>>>>>>> platform things are going to break - there should be sufficient data
>>>>>>> in the event and the perf_event_attr to fully decode what's going on.
>>>>>>> Sometimes tying things to a PMU name can avoid the use of the arch
>>>>>>> directory. We were able to avoid the arch directory to a good extent
>>>>>>> for the TPEBS code, even though it is a very modern Intel feature.
>>>>>> I see.
>>>>>>
>>>>>> But the sampling support for SIMD registers is different from the sample
>>>>>> weight processing in the patch
>>>>>> https://lore.kernel.org/lkml/20250724163302.596743-21-irogers@google.com/#t.
>>>>>> Each arch may support different kinds of SIMD registers, and furthermore
>>>>>> each kind of SIMD register may have a different register count and register
>>>>>> width. It's quite hard to come up with common functions or fields to
>>>>>> represent the name and attributes of these arch-specific SIMD registers.
>>>>>> This arch-specific information can only be provided by the arch-specific code.
>>>>>> So it looks like the __weak functions are still the easiest way to implement this.
>>>>>>
>>>>>> I don't think perf.data parsing would break when moving from one platform
>>>>>> to another of the same arch, e.g., from Nova Lake to Alder Lake.
>>>>>> To indicate the presence of SIMD registers in record data, a new ABI flag
>>>>>> "PERF_SAMPLE_REGS_ABI_SIMD" is introduced. If the perf tool on the 2nd
>>>>>> platform is new enough to recognize this new flag, the SIMD register data
>>>>>> would be parsed correctly. Even if the perf tool is old and has no SIMD
>>>>>> register support, the SIMD register data would just be silently ignored
>>>>>> and should not break the parsing.
>>>>> That's good to know. I'm confused then why these functions can't just
>>>>> be within the arch directory? For example, we don't expose the
>>>>> intel-pt PMU code in the common code except for the parsing parts. A
>>>>> lot of that is handled by the default perf_event_attr initialization
>>>>> that every PMU can have its own variant of:
>>>>> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/pmu.h?h=perf-tools-next#n123
>>>> I see. From my point of view, there seems to be no essential difference
>>>> between a function pointer and a __weak function, and it looks hard to find
>>>> a common data structure to hold all these function pointers, which need to
>>>> be called in different places, like register name parsing, register data
>>>> dumping ...
>>>>
>>>>
>>>>> Perhaps this is all just evidence of tech debt in the perf_regs.c code
>>>>> :-/ The bit that's relevant to the patch here is that I think this is
>>>>> adding to the tech debt problem as 11 more functions are added to
>>>>> perf_regs.h.
>>>> Yeah, 11 new __weak functions seem too many. We could merge functions of the
>>>> same kind, like merging *_simd_reg_mask() and *_pred_reg_mask() into a
>>>> single function with a type argument; the newly added __weak functions would
>>>> then shrink by half.
>>> There could be a good reason for 11 weak functions :-) In the
>>> perf_event.h you've added to the sample event:
>>> ```
>>> + * u64 regs[weight(mask)];
>>> + * struct {
>>> + * u16 nr_vectors;
>>> + * u16 vector_qwords;
>>> + * u16 nr_pred;
>>> + * u16 pred_qwords;
>>> + * u64 data[nr_vectors * vector_qwords + nr_pred
>>> * pred_qwords];
>>> + * } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
>>> + * } && PERF_SAMPLE_REGS_USER
>>> ```
>>> so these things are readable/writable outside of builds with arch/x86
>>> compiled in, which is why it seems odd that there needs to be arch
>>> code in the common code to handle them. Similar to how I needed to get
>>> the retirement latency parsing out of the arch/x86 directory as
>>> potentially you could be looking at a perf.data file with retirement
>>> latencies in it on a non-x86 platform.
>> Ian, I'm not sure if I fully get your point. If not, please correct me.
>>
>> Although these newly introduced fields are generic and exist on all
>> architectures, they are not enough to get all the necessary information to
>> dump or parse the SIMD registers, e.g., the SIMD register name.
>>
>> Let's take dumping the sampled values of SIMD registers as an example.
>> We know there could be different kinds of SIMD registers on different archs,
>> like XMM/YMM/ZMM on x86 and V-registers/Z-registers on ARM.
>>
>> Currently we only know the register number and width from the generic fields;
>> we have no way to directly know the exact name a given SIMD register
>> corresponds to. We have to involve an arch-specific function to figure it out
>> and then print it.
>>
>> At least for now, it looks like we still need these arch-specific functions ...
> Thanks Dapeng. I started by thinking out loud, so I'm not saying this
> is something that necessarily needs fixing in this patch series, but it
> probably is something that needs to be fixed.
>
> You mention that different archs have different registers and so we
> need different routines for those archs, implying weak symbols, etc.
> We do actually have generic register dumping code in get_dwarf_regstr:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/dwarf-regs.c?h=perf-tools-next#n33
> It takes the dwarf register number, the ELF Ehdr e_machine and for the
> purposes of csky the e_flags. If you want the e_machine for the perf
> binary itself (such as in perf record when you don't yet have a
> perf.data file) there is an EM_HOST value:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/include/dwarf-regs.h?h=perf-tools-next#n27
> Perf has historically used a CPUID string, but I'd like to deprecate
> that in favor of just using e_machine (and possibly e_flags) values.
> We should probably have CPUID string to e_machine conversion utility
> functions and remove cpuid from the perf_env:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/env.h?h=perf-tools-next#n67
> but anyway, my point isn't about the e_machine values.
>
> What I'm trying to say is that weak symbols and code in the arch directory
> inherently mean that cross-platform development will break. For
> example, before:
> https://lore.kernel.org/lkml/20250724163302.596743-21-irogers@google.com/
> perf_parse_sample_weight simply didn't exist outside of PowerPC
> and x86. This meant that the part of the perf event in the perf.data file
> containing the sample weights couldn't be parsed on, say, an ARM64 build
> of perf, so the values couldn't even be dumped in perf script.
> The values are, however, described in the cross platform perf sample
> event format, much as the SIMD registers are here.
>
> It seems that, as we have at least a CPUID string from the perf.data
> header features, a perf_event_attr and the register number, we
> should be able to do something like get_dwarf_regstr. Such a function
> wouldn't be in the arch directory as we wouldn't want to interpret
> registers in events just on x86 platforms (as with the retirement
> latency). If we're not able to do this then there seems to be
> something wrong with the SIMD change and perhaps we need to capture
> more information in the perf.data file header.
Thanks Ian for your detailed explanation. I understand your point now.
I originally thought there would be no requirement to parse a perf.data
file on a machine with a totally different arch, but it seems there is,
as you said.
Then I suppose we need to do the same thing for
perf_reg_value()/perf_simd_reg_value(), just like perf_reg_name() does, but
currently the "arch" string comes from the perf_env__arch() helper, which
is the arch perf is running on rather than the arch that was sampled.
Anyway, I think we can make the retirement of the __weak functions the 1st
step. As for the replacement of cpuid or env->arch with EM_HOST or
something else (I'm not sure how complex it would be, but I suppose it
would not be simple), we'd better have an independent patch set to
implement it since it has no direct relationship with the current SIMD
register sampling support.
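
For reference, walking the SIMD block appended to a sample (per the layout
quoted earlier in the thread) would look roughly like the sketch below. It
is only an illustration of the record format, under the assumption that the
four u16 header fields pack into a single u64; the struct and function
names are made up, not perf's actual parser:

```c
/*
 * Illustrative walk of the SIMD block appended to a sample when
 * PERF_SAMPLE_REGS_ABI_SIMD is set. Only the layout follows the
 * quoted format; the names are invented for this sketch.
 */
#include <inttypes.h>
#include <stddef.h>
#include <stdio.h>

struct simd_hdr {
	uint16_t nr_vectors;	/* number of vector registers dumped */
	uint16_t vector_qwords;	/* u64s per vector register          */
	uint16_t nr_pred;	/* number of predicate registers     */
	uint16_t pred_qwords;	/* u64s per predicate register       */
};

static void dump_simd_block(const uint64_t *p)
{
	const struct simd_hdr *h = (const struct simd_hdr *)p;
	const uint64_t *data = p + 1; /* the four u16 fields fill one u64 */

	for (int v = 0; v < h->nr_vectors; v++)
		for (int q = 0; q < h->vector_qwords; q++)
			printf("vec%d[%d]: %016" PRIx64 "\n", v, q,
			       data[(size_t)v * h->vector_qwords + q]);

	data += (size_t)h->nr_vectors * h->vector_qwords;
	for (int k = 0; k < h->nr_pred; k++)
		for (int q = 0; q < h->pred_qwords; q++)
			printf("pred%d[%d]: %016" PRIx64 "\n", k, q,
			       data[(size_t)k * h->pred_qwords + q]);
}
```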
>
> Thanks,
> Ian
>
>>> Thanks,
>>> Ian
>>>
>>>>> Thanks,
>>>>> Ian
>>>>>
>>>>>>> Thanks,
>>>>>>> Ian
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Ian
>>>>>>>>>
>>>>>>>>>> +
>>>>>>>>>> const char *perf_reg_name(int id, const char *arch)
>>>>>>>>>> {
>>>>>>>>>> const char *reg_name = NULL;
>>>>>>>>>> diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
>>>>>>>>>> index f2d0736d65cc..bce9c4cfd1bf 100644
>>>>>>>>>> --- a/tools/perf/util/perf_regs.h
>>>>>>>>>> +++ b/tools/perf/util/perf_regs.h
>>>>>>>>>> @@ -24,9 +24,20 @@ enum {
>>>>>>>>>> };
>>>>>>>>>>
>>>>>>>>>> int arch_sdt_arg_parse_op(char *old_op, char **new_op);
>>>>>>>>>> +bool arch_has_simd_regs(u64 mask);
>>>>>>>>>> uint64_t arch__intr_reg_mask(void);
>>>>>>>>>> uint64_t arch__user_reg_mask(void);
>>>>>>>>>> const struct sample_reg *arch__sample_reg_masks(void);
>>>>>>>>>> +const struct sample_reg *arch__sample_simd_reg_masks(void);
>>>>>>>>>> +const struct sample_reg *arch__sample_pred_reg_masks(void);
>>>>>>>>>> +uint64_t arch__intr_simd_reg_mask(void);
>>>>>>>>>> +uint64_t arch__user_simd_reg_mask(void);
>>>>>>>>>> +uint64_t arch__intr_pred_reg_mask(void);
>>>>>>>>>> +uint64_t arch__user_pred_reg_mask(void);
>>>>>>>>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>>>>
>>>>>>>>>> const char *perf_reg_name(int id, const char *arch);
>>>>>>>>>> int perf_reg_value(u64 *valp, struct regs_dump *regs, int id);
>>>>>>>>>> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
>>>>>>>>>> index ea3a6c4657ee..825ffb4cc53f 100644
>>>>>>>>>> --- a/tools/perf/util/record.h
>>>>>>>>>> +++ b/tools/perf/util/record.h
>>>>>>>>>> @@ -59,7 +59,13 @@ struct record_opts {
>>>>>>>>>> unsigned int user_freq;
>>>>>>>>>> u64 branch_stack;
>>>>>>>>>> u64 sample_intr_regs;
>>>>>>>>>> + u64 sample_intr_vec_regs;
>>>>>>>>>> u64 sample_user_regs;
>>>>>>>>>> + u64 sample_user_vec_regs;
>>>>>>>>>> + u16 sample_pred_regs_qwords;
>>>>>>>>>> + u16 sample_vec_regs_qwords;
>>>>>>>>>> + u16 sample_intr_pred_regs;
>>>>>>>>>> + u16 sample_user_pred_regs;
>>>>>>>>>> u64 default_interval;
>>>>>>>>>> u64 user_interval;
>>>>>>>>>> size_t auxtrace_snapshot_size;
>>>>>>>>>> --
>>>>>>>>>> 2.34.1
>>>>>>>>>>
* Re: [Patch v5 07/19] perf: Add sampling support for SIMD registers
2025-12-05 11:07 ` Peter Zijlstra
@ 2025-12-08 5:24 ` Mi, Dapeng
0 siblings, 0 replies; 55+ messages in thread
From: Mi, Dapeng @ 2025-12-08 5:24 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang
On 12/5/2025 7:07 PM, Peter Zijlstra wrote:
> On Wed, Dec 03, 2025 at 02:54:48PM +0800, Dapeng Mi wrote:
>
>> @@ -545,6 +547,25 @@ struct perf_event_attr {
>> __u64 sig_data;
>>
>> __u64 config3; /* extension of config2 */
>> +
>> +
>> + /*
>> + * Defines set of SIMD registers to dump on samples.
>> + * The sample_simd_regs_enabled !=0 implies the
>> + * set of SIMD registers is used to config all SIMD registers.
>> + * If !sample_simd_regs_enabled, sample_regs_XXX may be used to
>> + * config some SIMD registers on X86.
>> + */
>> + union {
>> + __u16 sample_simd_regs_enabled;
>> + __u16 sample_simd_pred_reg_qwords;
>> + };
>> + __u32 sample_simd_pred_reg_intr;
>> + __u32 sample_simd_pred_reg_user;
>> + __u16 sample_simd_vec_reg_qwords;
>> + __u64 sample_simd_vec_reg_intr;
>> + __u64 sample_simd_vec_reg_user;
>> + __u32 __reserved_4;
>> };
> This is poorly aligned and causes holes.
>
> This:
>
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index d292f96bc06f..2deb8dd0ca37 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -545,6 +545,14 @@ struct perf_event_attr {
> __u64 sig_data;
>
> __u64 config3; /* extension of config2 */
> +
> + __u16 sample_simd_pred_reg_qwords;
> + __u32 sample_simd_pred_reg_intr;
> + __u32 sample_simd_pred_reg_user;
> + __u16 sample_simd_vec_reg_qwords;
> + __u64 sample_simd_vec_reg_intr;
> + __u64 sample_simd_vec_reg_user;
> + __u32 __reserved_4;
> };
>
> /*
>
> results in:
>
> __u64 config3; /* 128 8 */
> __u16 sample_simd_pred_reg_qwords; /* 136 2 */
>
> /* XXX 2 bytes hole, try to pack */
>
> __u32 sample_simd_pred_reg_intr; /* 140 4 */
> __u32 sample_simd_pred_reg_user; /* 144 4 */
> __u16 sample_simd_vec_reg_qwords; /* 148 2 */
>
> /* XXX 2 bytes hole, try to pack */
>
> __u64 sample_simd_vec_reg_intr; /* 152 8 */
> __u64 sample_simd_vec_reg_user; /* 160 8 */
> __u32 __reserved_4; /* 168 4 */
>
>
>
> A better layout might be:
>
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index d292f96bc06f..f72707e9df68 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -545,6 +545,15 @@ struct perf_event_attr {
> __u64 sig_data;
>
> __u64 config3; /* extension of config2 */
> +
> + __u16 sample_simd_pred_reg_qwords;
> + __u16 sample_simd_vec_reg_qwords;
> + __u32 __reserved_4;
> +
> + __u32 sample_simd_pred_reg_intr;
> + __u32 sample_simd_pred_reg_user;
> + __u64 sample_simd_vec_reg_intr;
> + __u64 sample_simd_vec_reg_user;
> };
>
> /*
>
> such that:
>
> __u64 config3; /* 128 8 */
> __u16 sample_simd_pred_reg_qwords; /* 136 2 */
> __u16 sample_simd_vec_reg_qwords; /* 138 2 */
> __u32 __reserved_4; /* 140 4 */
> __u32 sample_simd_pred_reg_intr; /* 144 4 */
> __u32 sample_simd_pred_reg_user; /* 148 4 */
> __u64 sample_simd_vec_reg_intr; /* 152 8 */
> __u64 sample_simd_vec_reg_user; /* 160 8 */
>
Sure. Thanks.
>
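(For what it's worth, the repacked layout can also be checked at compile
time rather than with pahole. A stand-alone sketch, using a stand-in
struct for the tail of perf_event_attr, so the offsets below are each
real offset minus 128, i.e. relative to config3 at 0:)

	#include <assert.h>
	#include <stddef.h>
	#include <stdint.h>

	struct attr_tail {		/* stand-in, not the real UAPI struct */
		uint64_t config3;
		uint16_t sample_simd_pred_reg_qwords;
		uint16_t sample_simd_vec_reg_qwords;
		uint32_t __reserved_4;
		uint32_t sample_simd_pred_reg_intr;
		uint32_t sample_simd_pred_reg_user;
		uint64_t sample_simd_vec_reg_intr;
		uint64_t sample_simd_vec_reg_user;
	};

	/* Mirrors the pahole output above: 138-128, 144-128, 160-128. */
	static_assert(offsetof(struct attr_tail, sample_simd_vec_reg_qwords) == 10, "");
	static_assert(offsetof(struct attr_tail, sample_simd_pred_reg_intr) == 16, "");
	static_assert(offsetof(struct attr_tail, sample_simd_vec_reg_user) == 32, "");
	static_assert(sizeof(struct attr_tail) == 40, "no holes, no tail padding");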
* Re: [Patch v5 07/19] perf: Add sampling support for SIMD registers
2025-12-05 11:40 ` Peter Zijlstra
@ 2025-12-08 6:00 ` Mi, Dapeng
0 siblings, 0 replies; 55+ messages in thread
From: Mi, Dapeng @ 2025-12-08 6:00 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang
On 12/5/2025 7:40 PM, Peter Zijlstra wrote:
> On Wed, Dec 03, 2025 at 02:54:48PM +0800, Dapeng Mi wrote:
>
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 3e9c48fa2202..b19de038979e 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -7469,6 +7469,50 @@ perf_output_sample_regs(struct perf_output_handle *handle,
>> }
>> }
>>
>> +static void
>> +perf_output_sample_simd_regs(struct perf_output_handle *handle,
>> + struct perf_event *event,
>> + struct pt_regs *regs,
>> + u64 mask, u16 pred_mask)
>> +{
>> + u16 pred_qwords = event->attr.sample_simd_pred_reg_qwords;
>> + u16 vec_qwords = event->attr.sample_simd_vec_reg_qwords;
>> + u64 pred_bitmap = pred_mask;
>> + u64 bitmap = mask;
>> + u16 nr_vectors;
>> + u16 nr_pred;
>> + int bit;
>> + u64 val;
>> + u16 i;
>> +
>> + nr_vectors = hweight64(bitmap);
>> + nr_pred = hweight64(pred_bitmap);
>> +
>> + perf_output_put(handle, nr_vectors);
>> + perf_output_put(handle, vec_qwords);
>> + perf_output_put(handle, nr_pred);
>> + perf_output_put(handle, pred_qwords);
>> +
>> + if (nr_vectors) {
>> + for_each_set_bit(bit, (unsigned long *)&bitmap,
> This isn't right. Yes we do this all the time in the x86 code, but there
> we can assume little-endian byte order. This is core code and is also
> used on big-endian systems where this is very much broken.
Oh, yes, I ignored endianness. I will fix it in the next version. Thanks.
>
>> + sizeof(bitmap) * BITS_PER_BYTE) {
>> + for (i = 0; i < vec_qwords; i++) {
>> + val = perf_simd_reg_value(regs, bit, i, false);
>> + perf_output_put(handle, val);
>> + }
>> + }
>> + }
>> + if (nr_pred) {
>> + for_each_set_bit(bit, (unsigned long *)&pred_bitmap,
>> + sizeof(pred_bitmap) * BITS_PER_BYTE) {
>> + for (i = 0; i < pred_qwords; i++) {
>> + val = perf_simd_reg_value(regs, bit, i, true);
>> + perf_output_put(handle, val);
>> + }
>> + }
>> + }
>> +}
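(On the endianness point: casting a u64 to unsigned long * breaks on
32-bit big-endian, where the two unsigned long words land in the wrong
order; in the kernel the usual fix is to copy the mask into a real
bitmap first, e.g. with bitmap_from_u64(), before for_each_set_bit().
As a stand-alone illustration, a walk that avoids the cast entirely:)

	#include <stdint.h>
	#include <stdio.h>

	/* Walk the set bits of a u64 portably, independent of byte order
	 * and of sizeof(unsigned long); stand-in for the output loop. */
	static void walk_simd_mask(uint64_t bitmap, unsigned int qwords)
	{
		for (int bit = 0; bit < 64; bit++) {
			if (!(bitmap & (1ULL << bit)))
				continue;
			for (unsigned int i = 0; i < qwords; i++)
				printf("reg %d, qword %u\n", bit, i);
		}
	}

	int main(void)
	{
		walk_simd_mask(0x5, 2);	/* regs 0 and 2, two qwords each */
		return 0;
	}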
* Re: [Patch v5 08/19] perf/x86: Enable XMM sampling using sample_simd_vec_reg_* fields
2025-12-05 11:25 ` Peter Zijlstra
@ 2025-12-08 6:10 ` Mi, Dapeng
0 siblings, 0 replies; 55+ messages in thread
From: Mi, Dapeng @ 2025-12-08 6:10 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang
On 12/5/2025 7:25 PM, Peter Zijlstra wrote:
> On Wed, Dec 03, 2025 at 02:54:49PM +0800, Dapeng Mi wrote:
>
>> diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
>> index 7c9d2bb3833b..c3862e5fdd6d 100644
>> --- a/arch/x86/include/uapi/asm/perf_regs.h
>> +++ b/arch/x86/include/uapi/asm/perf_regs.h
>> @@ -55,4 +55,21 @@ enum perf_event_x86_regs {
>>
>> #define PERF_REG_EXTENDED_MASK (~((1ULL << PERF_REG_X86_XMM0) - 1))
>>
>> +enum {
>> + PERF_REG_X86_XMM,
>> + PERF_REG_X86_MAX_SIMD_REGS,
>> +};
>> +
>> +enum {
>> + PERF_X86_SIMD_XMM_REGS = 16,
>> + PERF_X86_SIMD_VEC_REGS_MAX = PERF_X86_SIMD_XMM_REGS,
>> +};
>> +
>> +#define PERF_X86_SIMD_VEC_MASK GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
>> +
>> +enum {
>> + PERF_X86_XMM_QWORDS = 2,
>> + PERF_X86_SIMD_QWORDS_MAX = PERF_X86_XMM_QWORDS,
>> +};
>> +
>> #endif /* _ASM_X86_PERF_REGS_H */
> I don't understand this bit -- the next few patches add to it for YMM
> and ZMM, but what's the point? I don't see why this is needed at all,
> let alone why it needs to be UABI.
Currently these bits are only used by the user-space perf tool. Let me
remove them from the perf_regs.h header.
>
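(If useful, once the counts leave the UABI header they could simply
become tools-side constants — a sketch, with the location and spellings
illustrative only:)

	/* e.g. somewhere under tools/perf/arch/x86/ — not UABI */
	#define PERF_X86_SIMD_XMM_REGS	16	/* XMM0..XMM15 */
	#define PERF_X86_XMM_QWORDS	2	/* 128 bits = 2 qwords */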
* Re: [Patch v5 12/19] perf/x86: Enable eGPRs sampling using sample_regs_* fields
2025-12-05 12:16 ` Peter Zijlstra
@ 2025-12-08 6:11 ` Mi, Dapeng
0 siblings, 0 replies; 55+ messages in thread
From: Mi, Dapeng @ 2025-12-08 6:11 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang
On 12/5/2025 8:16 PM, Peter Zijlstra wrote:
> On Wed, Dec 03, 2025 at 02:54:53PM +0800, Dapeng Mi wrote:
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> This patch enables sampling of APX eGPRs (R16 ~ R31) via the
>> sample_regs_* fields.
>>
>> To sample eGPRs, the sample_simd_regs_enabled field must be set. This
>> allows the spare space (reclaimed from the original XMM space) in the
>> sample_regs_* fields to be used for representing eGPRs.
>>
>> The perf_reg_value() function needs to check if the
>> PERF_SAMPLE_REGS_ABI_SIMD flag is set first, and then determine whether
>> to output eGPRs or legacy XMM registers to userspace.
>>
>> The perf_reg_validate() function is enhanced to validate the eGPRs bitmap
>> by adding a new argument, "simd_enabled".
>>
>> Currently, eGPRs sampling is only supported on the x86_64 architecture, as
>> APX is only available on x86_64 platforms.
>>
>> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> ---
>> arch/arm/kernel/perf_regs.c | 2 +-
>> arch/arm64/kernel/perf_regs.c | 2 +-
>> arch/csky/kernel/perf_regs.c | 2 +-
>> arch/loongarch/kernel/perf_regs.c | 2 +-
>> arch/mips/kernel/perf_regs.c | 2 +-
>> arch/parisc/kernel/perf_regs.c | 2 +-
>> arch/powerpc/perf/perf_regs.c | 2 +-
>> arch/riscv/kernel/perf_regs.c | 2 +-
>> arch/s390/kernel/perf_regs.c | 2 +-
> Perhaps split out the part where you modify the arch function interface?
Sure.
>
* Re: [Patch v5 13/19] perf/x86: Enable SSP sampling using sample_regs_* fields
2025-12-05 12:20 ` Peter Zijlstra
@ 2025-12-08 6:21 ` Mi, Dapeng
0 siblings, 0 replies; 55+ messages in thread
From: Mi, Dapeng @ 2025-12-08 6:21 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
Kan Liang
On 12/5/2025 8:20 PM, Peter Zijlstra wrote:
> On Wed, Dec 03, 2025 at 02:54:54PM +0800, Dapeng Mi wrote:
>> diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
>> index ca242db3720f..c925af4160ad 100644
>> --- a/arch/x86/include/asm/perf_event.h
>> +++ b/arch/x86/include/asm/perf_event.h
>> @@ -729,6 +729,10 @@ struct x86_perf_regs {
>> u64 *egpr_regs;
>> struct apx_state *egpr;
>> };
>> + union {
>> + u64 *cet_regs;
>> + struct cet_user_state *cet;
>> + };
>> };
> Are we envisioning more than just SSP?
Not that I'm aware of; currently only SSP is supported.
>
>
* Re: [Patch v5 16/19] perf/x86: Activate back-to-back NMI detection for arch-PEBS induced NMIs
2025-12-05 12:39 ` Peter Zijlstra
2025-12-07 20:44 ` Andi Kleen
@ 2025-12-08 6:46 ` Mi, Dapeng
2025-12-08 8:50 ` Peter Zijlstra
1 sibling, 1 reply; 55+ messages in thread
From: Mi, Dapeng @ 2025-12-08 6:46 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao
On 12/5/2025 8:39 PM, Peter Zijlstra wrote:
> On Wed, Dec 03, 2025 at 02:54:57PM +0800, Dapeng Mi wrote:
>> When two or more identical PEBS events with the same sampling period are
>> programmed on a mix of PDIST and non-PDIST counters, multiple
>> back-to-back NMIs can be triggered.
> This is a hardware defect -- albeit a fairly common one.
>
>
>> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
>> index da48bcde8fce..a130d3f14844 100644
>> --- a/arch/x86/events/intel/core.c
>> +++ b/arch/x86/events/intel/core.c
>> @@ -3351,8 +3351,7 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
>> */
>> if (__test_and_clear_bit(GLOBAL_STATUS_ARCH_PEBS_THRESHOLD_BIT,
>> (unsigned long *)&status)) {
>> - handled++;
>> - static_call(x86_pmu_drain_pebs)(regs, &data);
>> + handled += static_call(x86_pmu_drain_pebs)(regs, &data);
>>
>> if (cpuc->events[INTEL_PMC_IDX_FIXED_SLOTS] &&
>> is_pebs_counter_event_group(cpuc->events[INTEL_PMC_IDX_FIXED_SLOTS]))
> Note that the old code would return handled++, while the new code:
>
>> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
>> index a01c72c03bd6..c7cdcd585574 100644
>> --- a/arch/x86/events/intel/ds.c
>> +++ b/arch/x86/events/intel/ds.c
>> @@ -2759,7 +2759,7 @@ __intel_pmu_pebs_events(struct perf_event *event,
>> __intel_pmu_pebs_last_event(event, iregs, regs, data, at, count, setup_sample);
>> }
>>
>> -static void intel_pmu_drain_pebs_core(struct pt_regs *iregs, struct perf_sample_data *data)
>> +static int intel_pmu_drain_pebs_core(struct pt_regs *iregs, struct perf_sample_data *data)
>> {
>> struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
>> struct debug_store *ds = cpuc->ds;
>> @@ -2768,7 +2768,7 @@ static void intel_pmu_drain_pebs_core(struct pt_regs *iregs, struct perf_sample_
>> int n;
>>
>> if (!x86_pmu.pebs_active)
>> - return;
>> + return 0;
>>
>> at = (struct pebs_record_core *)(unsigned long)ds->pebs_buffer_base;
>> top = (struct pebs_record_core *)(unsigned long)ds->pebs_index;
>> @@ -2779,22 +2779,24 @@ static void intel_pmu_drain_pebs_core(struct pt_regs *iregs, struct perf_sample_
>> ds->pebs_index = ds->pebs_buffer_base;
>>
>> if (!test_bit(0, cpuc->active_mask))
>> - return;
>> + return 0;
>>
>> WARN_ON_ONCE(!event);
>>
>> if (!event->attr.precise_ip)
>> - return;
>> + return 0;
>>
>> n = top - at;
>> if (n <= 0) {
>> if (event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD)
>> intel_pmu_save_and_restart_reload(event, 0);
>> - return;
>> + return 0;
>> }
>>
>> __intel_pmu_pebs_events(event, iregs, data, at, top, 0, n,
>> setup_pebs_fixed_sample_data);
>> +
>> + return 0;
>> }
>>
>> static void intel_pmu_pebs_event_update_no_drain(struct cpu_hw_events *cpuc, u64 mask)
>> @@ -2817,7 +2819,7 @@ static void intel_pmu_pebs_event_update_no_drain(struct cpu_hw_events *cpuc, u64
>> }
>> }
>>
>> -static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_data *data)
>> +static int intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_data *data)
>> {
>> struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
>> struct debug_store *ds = cpuc->ds;
>> @@ -2830,7 +2832,7 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_d
>> u64 mask;
>>
>> if (!x86_pmu.pebs_active)
>> - return;
>> + return 0;
>>
>> base = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
>> top = (struct pebs_record_nhm *)(unsigned long)ds->pebs_index;
>> @@ -2846,7 +2848,7 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_d
>>
>> if (unlikely(base >= top)) {
>> intel_pmu_pebs_event_update_no_drain(cpuc, mask);
>> - return;
>> + return 0;
>> }
>>
>> for (at = base; at < top; at += x86_pmu.pebs_record_size) {
>> @@ -2931,6 +2933,8 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_d
>> setup_pebs_fixed_sample_data);
>> }
>> }
>> +
>> + return 0;
>> }
>>
>> static __always_inline void
>> @@ -2984,7 +2988,7 @@ __intel_pmu_handle_last_pebs_record(struct pt_regs *iregs,
>>
>> }
>>
>> -static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_data *data)
>> +static int intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_data *data)
>> {
>> short counts[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS] = {};
>> void *last[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS];
>> @@ -2997,7 +3001,7 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
>> u64 mask;
>>
>> if (!x86_pmu.pebs_active)
>> - return;
>> + return 0;
>>
>> base = (struct pebs_basic *)(unsigned long)ds->pebs_buffer_base;
>> top = (struct pebs_basic *)(unsigned long)ds->pebs_index;
>> @@ -3010,7 +3014,7 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
>>
>> if (unlikely(base >= top)) {
>> intel_pmu_pebs_event_update_no_drain(cpuc, mask);
>> - return;
>> + return 0;
>> }
>>
>> if (!iregs)
>> @@ -3032,9 +3036,11 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
>>
>> __intel_pmu_handle_last_pebs_record(iregs, regs, data, mask, counts, last,
>> setup_pebs_adaptive_sample_data);
>> +
>> + return 0;
>> }
> will now return handled+=0 for all these. Which is a change in
> behaviour. Also:
This change only takes effect for arch-PEBS. For legacy PEBS, "handled"
is still incremented unconditionally even though the *_drain_pebs()
helpers always return 0.
/*
* PEBS overflow sets bit 62 in the global status register
*/
if (__test_and_clear_bit(GLOBAL_STATUS_BUFFER_OVF_BIT, (unsigned long
*)&status)) {
u64 pebs_enabled = cpuc->pebs_enabled;
handled++;
x86_pmu_handle_guest_pebs(regs, &data);
static_call(x86_pmu_drain_pebs)(regs, &data);
>
>> -static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
>> +static int intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
>> struct perf_sample_data *data)
>> {
>> short counts[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS] = {};
>> @@ -3044,13 +3050,14 @@ static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
>> struct x86_perf_regs perf_regs;
>> struct pt_regs *regs = &perf_regs.regs;
>> void *base, *at, *top;
>> + u64 events_bitmap = 0;
>> u64 mask;
>>
>> rdmsrq(MSR_IA32_PEBS_INDEX, index.whole);
>>
>> if (unlikely(!index.wr)) {
>> intel_pmu_pebs_event_update_no_drain(cpuc, X86_PMC_IDX_MAX);
>> - return;
>> + return 0;
If index.wr is 0, it indicates that no PEBS record has been written into
the buffer since the last drain. In that case no PEBS PMI should be
generated; if one is generated anyway, something must be wrong. The 0
return value then triggers the "suspicious NMI" warning, which usefully
flags that something is wrong.
>> }
>>
>> base = cpuc->pebs_vaddr;
>> @@ -3089,6 +3096,7 @@ static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
>>
>> basic = at + sizeof(struct arch_pebs_header);
>> pebs_status = mask & basic->applicable_counters;
>> + events_bitmap |= pebs_status;
>> __intel_pmu_handle_pebs_record(iregs, regs, data, at,
>> pebs_status, counts, last,
>> setup_arch_pebs_sample_data);
>> @@ -3108,6 +3116,8 @@ static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
>> __intel_pmu_handle_last_pebs_record(iregs, regs, data, mask,
>> counts, last,
>> setup_arch_pebs_sample_data);
>> +
> /*
> * Comment that explains the arch pebs defect goes here.
> */
>> + return hweight64(events_bitmap);
>> }
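(For reference, one possible wording for that comment, drawing only on
the defect as described in this thread — a draft, not Peter's words:)

	/*
	 * Identical PEBS events with the same period on a mix of PDIST
	 * and non-PDIST counters can fire multiple back-to-back NMIs.
	 * Return the number of distinct events serviced so the extra
	 * NMIs are absorbed by back-to-back NMI detection instead of
	 * tripping the "suspicious NMI" warning.
	 */
	return hweight64(events_bitmap);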
* Re: [Patch v5 16/19] perf/x86: Activate back-to-back NMI detection for arch-PEBS induced NMIs
2025-12-08 6:46 ` Mi, Dapeng
@ 2025-12-08 8:50 ` Peter Zijlstra
2025-12-08 8:53 ` Mi, Dapeng
0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2025-12-08 8:50 UTC (permalink / raw)
To: Mi, Dapeng
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao
On Mon, Dec 08, 2025 at 02:46:44PM +0800, Mi, Dapeng wrote:
> This change only takes effect for arch-PEBS. For legacy PEBS, "handled"
> is still incremented unconditionally even though the *_drain_pebs()
> helpers always return 0.
>
> /*
> * PEBS overflow sets bit 62 in the global status register
> */
> if (__test_and_clear_bit(GLOBAL_STATUS_BUFFER_OVF_BIT, (unsigned long
> *)&status)) {
> u64 pebs_enabled = cpuc->pebs_enabled;
>
> handled++;
> x86_pmu_handle_guest_pebs(regs, &data);
> static_call(x86_pmu_drain_pebs)(regs, &data);
>
Oh gawd. Please don't do that. If you change the calling convention of
that function, please have it be used consistently.
* Re: [Patch v5 16/19] perf/x86: Activate back-to-back NMI detection for arch-PEBS induced NMIs
2025-12-08 8:50 ` Peter Zijlstra
@ 2025-12-08 8:53 ` Mi, Dapeng
0 siblings, 0 replies; 55+ messages in thread
From: Mi, Dapeng @ 2025-12-08 8:53 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao
On 12/8/2025 4:50 PM, Peter Zijlstra wrote:
> On Mon, Dec 08, 2025 at 02:46:44PM +0800, Mi, Dapeng wrote:
>
>> This change only takes effect for arch-PEBS. For legacy PEBS, "handled"
>> is still incremented unconditionally even though the *_drain_pebs()
>> helpers always return 0.
>>
>> /*
>> * PEBS overflow sets bit 62 in the global status register
>> */
>> if (__test_and_clear_bit(GLOBAL_STATUS_BUFFER_OVF_BIT, (unsigned long
>> *)&status)) {
>> u64 pebs_enabled = cpuc->pebs_enabled;
>>
>> handled++;
>> x86_pmu_handle_guest_pebs(regs, &data);
>> static_call(x86_pmu_drain_pebs)(regs, &data);
>>
> Oh gawd. Please don't do that. If you change the calling convention of
> that function, please have it be used consistently.
Sure. I will make the same change for legacy PEBS so the behavior is consistent.
>
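(A sketch of the consistent form in handle_pmi_common(), assuming the
legacy drain helpers are likewise converted to return the number of
events they actually serviced:)

	/*
	 * PEBS overflow sets bit 62 in the global status register
	 */
	if (__test_and_clear_bit(GLOBAL_STATUS_BUFFER_OVF_BIT,
				 (unsigned long *)&status)) {
		x86_pmu_handle_guest_pebs(regs, &data);
		/* count what the drain actually handled; no blind handled++ */
		handled += static_call(x86_pmu_drain_pebs)(regs, &data);
	}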
end of thread [newest: 2025-12-08 8:53 UTC]
Thread overview: 55+ messages
2025-12-03 6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
2025-12-03 6:54 ` [Patch v5 01/19] perf: Eliminate duplicate arch-specific functions definations Dapeng Mi
2025-12-03 6:54 ` [Patch v5 02/19] perf/x86: Use x86_perf_regs in the x86 nmi handler Dapeng Mi
2025-12-03 6:54 ` [Patch v5 03/19] perf/x86: Introduce x86-specific x86_pmu_setup_regs_data() Dapeng Mi
2025-12-03 6:54 ` [Patch v5 04/19] x86/fpu/xstate: Add xsaves_nmi() helper Dapeng Mi
2025-12-03 6:54 ` [Patch v5 05/19] perf: Move and rename has_extended_regs() for ARCH-specific use Dapeng Mi
2025-12-03 6:54 ` [Patch v5 06/19] perf/x86: Add support for XMM registers in non-PEBS and REGS_USER Dapeng Mi
2025-12-04 15:17 ` Peter Zijlstra
2025-12-04 15:47 ` Peter Zijlstra
2025-12-05 6:37 ` Mi, Dapeng
2025-12-04 18:59 ` Dave Hansen
2025-12-05 8:42 ` Peter Zijlstra
2025-12-03 6:54 ` [Patch v5 07/19] perf: Add sampling support for SIMD registers Dapeng Mi
2025-12-05 11:07 ` Peter Zijlstra
2025-12-08 5:24 ` Mi, Dapeng
2025-12-05 11:40 ` Peter Zijlstra
2025-12-08 6:00 ` Mi, Dapeng
2025-12-03 6:54 ` [Patch v5 08/19] perf/x86: Enable XMM sampling using sample_simd_vec_reg_* fields Dapeng Mi
2025-12-05 11:25 ` Peter Zijlstra
2025-12-08 6:10 ` Mi, Dapeng
2025-12-03 6:54 ` [Patch v5 09/19] perf/x86: Enable YMM " Dapeng Mi
2025-12-03 6:54 ` [Patch v5 10/19] perf/x86: Enable ZMM " Dapeng Mi
2025-12-03 6:54 ` [Patch v5 11/19] perf/x86: Enable OPMASK sampling using sample_simd_pred_reg_* fields Dapeng Mi
2025-12-03 6:54 ` [Patch v5 12/19] perf/x86: Enable eGPRs sampling using sample_regs_* fields Dapeng Mi
2025-12-05 12:16 ` Peter Zijlstra
2025-12-08 6:11 ` Mi, Dapeng
2025-12-03 6:54 ` [Patch v5 13/19] perf/x86: Enable SSP " Dapeng Mi
2025-12-05 12:20 ` Peter Zijlstra
2025-12-08 6:21 ` Mi, Dapeng
2025-12-03 6:54 ` [Patch v5 14/19] perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS capability Dapeng Mi
2025-12-03 6:54 ` [Patch v5 15/19] perf/x86/intel: Enable arch-PEBS based SIMD/eGPRs/SSP sampling Dapeng Mi
2025-12-03 6:54 ` [Patch v5 16/19] perf/x86: Activate back-to-back NMI detection for arch-PEBS induced NMIs Dapeng Mi
2025-12-05 12:39 ` Peter Zijlstra
2025-12-07 20:44 ` Andi Kleen
2025-12-08 6:46 ` Mi, Dapeng
2025-12-08 8:50 ` Peter Zijlstra
2025-12-08 8:53 ` Mi, Dapeng
2025-12-03 6:54 ` [Patch v5 17/19] perf headers: Sync with the kernel headers Dapeng Mi
2025-12-03 23:43 ` Ian Rogers
2025-12-04 1:37 ` Mi, Dapeng
2025-12-04 7:28 ` Ian Rogers
2025-12-03 6:54 ` [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format Dapeng Mi
2025-12-04 0:17 ` Ian Rogers
2025-12-04 2:58 ` Mi, Dapeng
2025-12-04 7:49 ` Ian Rogers
2025-12-04 9:20 ` Mi, Dapeng
2025-12-04 16:16 ` Ian Rogers
2025-12-05 4:00 ` Mi, Dapeng
2025-12-05 6:38 ` Ian Rogers
2025-12-05 8:10 ` Mi, Dapeng
2025-12-05 16:35 ` Ian Rogers
2025-12-08 4:20 ` Mi, Dapeng
2025-12-03 6:55 ` [Patch v5 19/19] perf regs: Enable dumping of SIMD registers Dapeng Mi
2025-12-04 0:24 ` [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Ian Rogers
2025-12-04 3:28 ` Mi, Dapeng