public inbox for linux-perf-users@vger.kernel.org
* [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf
@ 2025-12-03  6:54 Dapeng Mi
  2025-12-03  6:54 ` [Patch v5 01/19] perf: Eliminate duplicate arch-specific function definitions Dapeng Mi
                   ` (20 more replies)
  0 siblings, 21 replies; 86+ messages in thread
From: Dapeng Mi @ 2025-12-03  6:54 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Dapeng Mi

Changes since V4:
- Rewrite some function comments and commit messages (Dave)
- Add arch-PEBS based SIMD/eGPRs/SSP sampling support (Patch 15/19)
- Fix the "suspicious NMI" warning observed on PTL/NVL P-core and DMR by
  activating the back-to-back NMI detection mechanism (Patch 16/19)
- Fix some minor issues on perf-tool patches (Patch 18/19)

Changes since V3:
- Drop the SIMD registers if an NMI hits kernel mode for REGS_USER.
- Only dump the available regs, rather than zeroing and dumping the
  unavailable regs. The dumped registers may therefore be a subset of
  the requested registers.
- Some minor updates to address Dapeng's comments in V3.

Changes since V2:
- Use the FPU format for the x86_pmu.ext_regs_mask as well
- Add a check before invoking xsaves_nmi()
- Add perf_simd_reg_check() to retrieve the number of available
  registers. If the kernel fails to get the requested registers, e.g.,
  XSAVES fails, nothing is dumped to userspace (V2 dumped all 0s).
- Add POC perf tool patches

Changes since V1:
- Apply the new interfaces to configure and dump the SIMD registers
- Utilize the existing FPU functions, e.g., xstate_calculate_size() and
  get_xsave_addr().

Starting from Intel Ice Lake, XMM registers can be collected in a PEBS
record. Future Architecture PEBS will include additional registers such
as YMM, ZMM, OPMASK, SSP and APX eGPRs, contingent on hardware support.

This patch set introduces a software solution that removes the
hardware dependency by using the XSAVES instruction to retrieve the
requested registers in the overflow handler. The feature is therefore
no longer limited to PEBS events or specific platforms. While the
hardware solution remains preferable due to its lower overhead and
higher accuracy, the software approach provides a viable alternative.

The solution is theoretically compatible with all x86 platforms, but
is currently only enabled on newer platforms: Sapphire Rapids and later
P-core server platforms, Sierra Forest and later E-core server
platforms, and recent client platforms such as Arrow Lake, Panther Lake
and Nova Lake.

Newly supported registers include YMM, ZMM, OPMASK, SSP, and APX eGPRs.
Due to space constraints in sample_regs_user/intr, new fields have been 
introduced in the perf_event_attr structure to accommodate these
registers.

After a long discussion in V1
(https://lore.kernel.org/lkml/3f1c9a9e-cb63-47ff-a5e9-06555fa6cc9a@linux.intel.com/),
the following new fields are introduced.

@@ -543,6 +545,25 @@ struct perf_event_attr {
        __u64   sig_data;

        __u64   config3; /* extension of config2 */
+
+
+       /*
+        * Defines the set of SIMD registers to dump on samples.
+        * A non-zero sample_simd_regs_enabled implies that the
+        * sample_simd_* fields are used to configure all SIMD registers.
+        * If !sample_simd_regs_enabled, sample_regs_XXX may be used to
+        * configure some SIMD registers on x86.
+        */
+       union {
+               __u16 sample_simd_regs_enabled;
+               __u16 sample_simd_pred_reg_qwords;
+       };
+       __u32 sample_simd_pred_reg_intr;
+       __u32 sample_simd_pred_reg_user;
+       __u16 sample_simd_vec_reg_qwords;
+       __u64 sample_simd_vec_reg_intr;
+       __u64 sample_simd_vec_reg_user;
+       __u32 __reserved_4;
 };
@@ -1016,7 +1037,15 @@ enum perf_event_type {
         *      } && PERF_SAMPLE_BRANCH_STACK
         *
         *      { u64                   abi; # enum perf_sample_regs_abi
-        *        u64                   regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
+        *        u64                   regs[weight(mask)];
+        *        struct {
+        *              u16 nr_vectors;
+        *              u16 vector_qwords;
+        *              u16 nr_pred;
+        *              u16 pred_qwords;
+        *              u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
+        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
+        *      } && PERF_SAMPLE_REGS_USER
         *
         *      { u64                   size;
         *        char                  data[size];
@@ -1043,7 +1072,15 @@ enum perf_event_type {
         *      { u64                   data_src; } && PERF_SAMPLE_DATA_SRC
         *      { u64                   transaction; } && PERF_SAMPLE_TRANSACTION
         *      { u64                   abi; # enum perf_sample_regs_abi
-        *        u64                   regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
+        *        u64                   regs[weight(mask)];
+        *        struct {
+        *              u16 nr_vectors;
+        *              u16 vector_qwords;
+        *              u16 nr_pred;
+        *              u16 pred_qwords;
+        *              u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
+        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
+        *      } && PERF_SAMPLE_REGS_INTR
         *      { u64                   phys_addr;} && PERF_SAMPLE_PHYS_ADDR
         *      { u64                   cgroup;} && PERF_SAMPLE_CGROUP
         *      { u64                   data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE


To maintain simplicity, a single width field per register class,
sample_simd_{vec|pred}_reg_qwords, indicates the register width in
qwords. For example:
- sample_simd_vec_reg_qwords = 2 for XMM registers (128 bits) on x86
- sample_simd_vec_reg_qwords = 4 for YMM registers (256 bits) on x86

Four additional fields, sample_simd_{vec|pred}_reg_{intr|user}, hold
the bitmap of registers to sample. For instance, the bitmap for the x86
XMM registers is 0xffff (16 XMM registers). Although users can
theoretically sample a subset of the registers, the current perf-tool
implementation samples all registers of each type to avoid complexity.

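For illustration, this is how a tool might fill these fields to request
all 16 XMM registers. The struct below is a hypothetical userspace
mirror of the new perf_event_attr fields quoted above; an installed
linux/perf_event.h will not have them until this series lands:

```c
#include <stdint.h>

/*
 * Hypothetical mirror of the new perf_event_attr fields from the diff
 * above (real names, fake struct): not a uapi definition.
 */
struct simd_sample_cfg {
	uint16_t sample_simd_regs_enabled;
	uint16_t sample_simd_vec_reg_qwords;
	uint64_t sample_simd_vec_reg_intr;
};

/* Request all 16 XMM registers (128 bits each) for REGS_INTR. */
static void request_all_xmm(struct simd_sample_cfg *cfg)
{
	cfg->sample_simd_regs_enabled = 1;
	cfg->sample_simd_vec_reg_qwords = 2;	/* XMM = 128 bits = 2 qwords */
	cfg->sample_simd_vec_reg_intr = 0xffff;	/* bitmap: XMM0..XMM15 */
}
```

With this configuration each sample would carry 16 * 2 = 32 qwords of
vector register payload.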
A new ABI flag, PERF_SAMPLE_REGS_ABI_SIMD, is introduced to signal user
space tools that SIMD registers are present in sampling records. When
this flag is set, extra SIMD register data follows the general register
data, laid out as follows:

   u16 nr_vectors;
   u16 vector_qwords;
   u16 nr_pred;
   u16 pred_qwords;
   u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];

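A consumer walking the sample buffer needs the total size of this block
to find the next sample field. A minimal userspace sketch of that
arithmetic (the struct name is made up; the layout mirrors the list
above):

```c
#include <stddef.h>
#include <stdint.h>

/* Mirror of the SIMD register block layout quoted above. */
struct simd_regs_hdr {
	uint16_t nr_vectors;
	uint16_t vector_qwords;
	uint16_t nr_pred;
	uint16_t pred_qwords;
	/* u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords] follows */
};

/* Bytes a parser must consume for one SIMD block, header included. */
static size_t simd_block_size(const struct simd_regs_hdr *h)
{
	size_t qwords = (size_t)h->nr_vectors * h->vector_qwords +
			(size_t)h->nr_pred * h->pred_qwords;

	return sizeof(struct simd_regs_hdr) + qwords * sizeof(uint64_t);
}
```

For the ZMM/OPMASK dump shown below (32 vectors of 8 qwords, 8
predicates of 1 qword), this comes to 8 + 264 * 8 = 2120 bytes.
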
With this patch set, sampling for the aforementioned registers is
supported on the Intel Nova Lake platform.

Examples:
 $perf record -I?
 available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
 R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
 R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7

 $perf record --user-regs=?
 available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
 R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
 R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7

 $perf record -e branches:p -Iax,bx,r8,r16,r31,ssp,xmm,ymm,zmm,opmask -c 100000 ./test
 $perf report -D

 ... ...
 14027761992115 0xcf30 [0x8a8]: PERF_RECORD_SAMPLE(IP, 0x1): 29964/29964:
 0xffffffff9f085e24 period: 100000 addr: 0
 ... intr regs: mask 0x18001010003 ABI 64-bit
 .... AX    0xdffffc0000000000
 .... BX    0xffff8882297685e8
 .... R8    0x0000000000000000
 .... R16   0x0000000000000000
 .... R31   0x0000000000000000
 .... SSP   0x0000000000000000
 ... SIMD ABI nr_vectors 32 vector_qwords 8 nr_pred 8 pred_qwords 1
 .... ZMM  [0] 0xffffffffffffffff
 .... ZMM  [0] 0x0000000000000001
 .... ZMM  [0] 0x0000000000000000
 .... ZMM  [0] 0x0000000000000000
 .... ZMM  [0] 0x0000000000000000
 .... ZMM  [0] 0x0000000000000000
 .... ZMM  [0] 0x0000000000000000
 .... ZMM  [0] 0x0000000000000000
 .... ZMM  [1] 0x003a6b6165506d56
 ... ...
 .... ZMM  [31] 0x0000000000000000
 .... ZMM  [31] 0x0000000000000000
 .... ZMM  [31] 0x0000000000000000
 .... ZMM  [31] 0x0000000000000000
 .... ZMM  [31] 0x0000000000000000
 .... ZMM  [31] 0x0000000000000000
 .... ZMM  [31] 0x0000000000000000
 .... ZMM  [31] 0x0000000000000000
 .... OPMASK[0] 0x00000000fffffe00
 .... OPMASK[1] 0x0000000000ffffff
 .... OPMASK[2] 0x000000000000007f
 .... OPMASK[3] 0x0000000000000000
 .... OPMASK[4] 0x0000000000010080
 .... OPMASK[5] 0x0000000000000000
 .... OPMASK[6] 0x0000400004000000
 .... OPMASK[7] 0x0000000000000000
 ... ...

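The "mask 0x18001010003" line above can be decoded by walking the set
bits of the sample_regs mask. A small sketch (the mapping of bits 24-39
to R16-R31 and bit 40 to SSP follows this series' extended x86
perf_regs layout and is an assumption here):

```c
#include <stdint.h>

/* Collect the set-bit positions of a sample_regs mask, lowest first. */
static int mask_to_indices(uint64_t mask, int *idx, int max)
{
	int n = 0;

	while (mask && n < max) {
		idx[n++] = __builtin_ctzll(mask);
		mask &= mask - 1;	/* clear the lowest set bit */
	}
	return n;
}
```

For 0x18001010003 this yields bits 0, 1, 16, 24, 39 and 40, matching
the AX, BX, R8, R16, R31 and SSP lines in the dump.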

History:
  v4: https://lore.kernel.org/all/20250925061213.178796-1-dapeng1.mi@linux.intel.com/
  v3: https://lore.kernel.org/lkml/20250815213435.1702022-1-kan.liang@linux.intel.com/
  v2: https://lore.kernel.org/lkml/20250626195610.405379-1-kan.liang@linux.intel.com/
  v1: https://lore.kernel.org/lkml/20250613134943.3186517-1-kan.liang@linux.intel.com/

Dapeng Mi (3):
  perf: Eliminate duplicate arch-specific function definitions
  perf/x86/intel: Enable arch-PEBS based SIMD/eGPRs/SSP sampling
  perf/x86: Activate back-to-back NMI detection for arch-PEBS induced
    NMIs

Kan Liang (16):
  perf/x86: Use x86_perf_regs in the x86 nmi handler
  perf/x86: Introduce x86-specific x86_pmu_setup_regs_data()
  x86/fpu/xstate: Add xsaves_nmi() helper
  perf: Move and rename has_extended_regs() for ARCH-specific use
  perf/x86: Add support for XMM registers in non-PEBS and REGS_USER
  perf: Add sampling support for SIMD registers
  perf/x86: Enable XMM sampling using sample_simd_vec_reg_* fields
  perf/x86: Enable YMM sampling using sample_simd_vec_reg_* fields
  perf/x86: Enable ZMM sampling using sample_simd_vec_reg_* fields
  perf/x86: Enable OPMASK sampling using sample_simd_pred_reg_* fields
  perf/x86: Enable eGPRs sampling using sample_regs_* fields
  perf/x86: Enable SSP sampling using sample_regs_* fields
  perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS capability
  perf headers: Sync with the kernel headers
  perf parse-regs: Support new SIMD sampling format
  perf regs: Enable dumping of SIMD registers

 arch/arm/kernel/perf_regs.c                   |   8 +-
 arch/arm64/kernel/perf_regs.c                 |   8 +-
 arch/csky/kernel/perf_regs.c                  |   8 +-
 arch/loongarch/kernel/perf_regs.c             |   8 +-
 arch/mips/kernel/perf_regs.c                  |   8 +-
 arch/parisc/kernel/perf_regs.c                |   8 +-
 arch/powerpc/perf/perf_regs.c                 |   2 +-
 arch/riscv/kernel/perf_regs.c                 |   8 +-
 arch/s390/kernel/perf_regs.c                  |   2 +-
 arch/x86/events/core.c                        | 326 +++++++++++-
 arch/x86/events/intel/core.c                  | 117 ++++-
 arch/x86/events/intel/ds.c                    | 134 ++++-
 arch/x86/events/perf_event.h                  |  85 +++-
 arch/x86/include/asm/fpu/xstate.h             |   3 +
 arch/x86/include/asm/msr-index.h              |   7 +
 arch/x86/include/asm/perf_event.h             |  38 +-
 arch/x86/include/uapi/asm/perf_regs.h         |  62 +++
 arch/x86/kernel/fpu/xstate.c                  |  25 +-
 arch/x86/kernel/perf_regs.c                   | 131 ++++-
 include/linux/perf_event.h                    |  16 +
 include/linux/perf_regs.h                     |  36 +-
 include/uapi/linux/perf_event.h               |  45 +-
 kernel/events/core.c                          | 132 ++++-
 tools/arch/x86/include/uapi/asm/perf_regs.h   |  62 +++
 tools/include/uapi/linux/perf_event.h         |  45 +-
 tools/perf/arch/x86/util/perf_regs.c          | 470 +++++++++++++++++-
 tools/perf/util/evsel.c                       |  47 ++
 tools/perf/util/parse-regs-options.c          | 151 +++++-
 .../perf/util/perf-regs-arch/perf_regs_x86.c  |  43 ++
 tools/perf/util/perf_event_attr_fprintf.c     |   6 +
 tools/perf/util/perf_regs.c                   |  59 +++
 tools/perf/util/perf_regs.h                   |  11 +
 tools/perf/util/record.h                      |   6 +
 tools/perf/util/sample.h                      |  10 +
 tools/perf/util/session.c                     |  78 ++-
 35 files changed, 2012 insertions(+), 193 deletions(-)


base-commit: 9929dffce5ed7e2988e0274f4db98035508b16d9
prerequisite-patch-id: a15bcd62a8dcd219d17489eef88b66ea5488a2a0
-- 
2.34.1


^ permalink raw reply	[flat|nested] 86+ messages in thread

* [Patch v5 01/19] perf: Eliminate duplicate arch-specific function definitions
  2025-12-03  6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
@ 2025-12-03  6:54 ` Dapeng Mi
  2025-12-03  6:54 ` [Patch v5 02/19] perf/x86: Use x86_perf_regs in the x86 nmi handler Dapeng Mi
                   ` (19 subsequent siblings)
  20 siblings, 0 replies; 86+ messages in thread
From: Dapeng Mi @ 2025-12-03  6:54 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Dapeng Mi

Define common default __weak functions for perf_reg_value(),
perf_reg_validate(), perf_reg_abi() and perf_get_regs_user(). This
helps eliminate the duplicated arch-specific definitions.

No functional change intended.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/arm/kernel/perf_regs.c       |  6 ------
 arch/arm64/kernel/perf_regs.c     |  6 ------
 arch/csky/kernel/perf_regs.c      |  6 ------
 arch/loongarch/kernel/perf_regs.c |  6 ------
 arch/mips/kernel/perf_regs.c      |  6 ------
 arch/parisc/kernel/perf_regs.c    |  6 ------
 arch/riscv/kernel/perf_regs.c     |  6 ------
 arch/x86/kernel/perf_regs.c       |  6 ------
 include/linux/perf_regs.h         | 32 ++++++-------------------------
 kernel/events/core.c              | 22 +++++++++++++++++++++
 10 files changed, 28 insertions(+), 74 deletions(-)

diff --git a/arch/arm/kernel/perf_regs.c b/arch/arm/kernel/perf_regs.c
index 0529f90395c9..d575a4c3ca56 100644
--- a/arch/arm/kernel/perf_regs.c
+++ b/arch/arm/kernel/perf_regs.c
@@ -31,9 +31,3 @@ u64 perf_reg_abi(struct task_struct *task)
 	return PERF_SAMPLE_REGS_ABI_32;
 }
 
-void perf_get_regs_user(struct perf_regs *regs_user,
-			struct pt_regs *regs)
-{
-	regs_user->regs = task_pt_regs(current);
-	regs_user->abi = perf_reg_abi(current);
-}
diff --git a/arch/arm64/kernel/perf_regs.c b/arch/arm64/kernel/perf_regs.c
index b4eece3eb17d..70e2f13f587f 100644
--- a/arch/arm64/kernel/perf_regs.c
+++ b/arch/arm64/kernel/perf_regs.c
@@ -98,9 +98,3 @@ u64 perf_reg_abi(struct task_struct *task)
 		return PERF_SAMPLE_REGS_ABI_64;
 }
 
-void perf_get_regs_user(struct perf_regs *regs_user,
-			struct pt_regs *regs)
-{
-	regs_user->regs = task_pt_regs(current);
-	regs_user->abi = perf_reg_abi(current);
-}
diff --git a/arch/csky/kernel/perf_regs.c b/arch/csky/kernel/perf_regs.c
index 09b7f88a2d6a..94601f37b596 100644
--- a/arch/csky/kernel/perf_regs.c
+++ b/arch/csky/kernel/perf_regs.c
@@ -31,9 +31,3 @@ u64 perf_reg_abi(struct task_struct *task)
 	return PERF_SAMPLE_REGS_ABI_32;
 }
 
-void perf_get_regs_user(struct perf_regs *regs_user,
-			struct pt_regs *regs)
-{
-	regs_user->regs = task_pt_regs(current);
-	regs_user->abi = perf_reg_abi(current);
-}
diff --git a/arch/loongarch/kernel/perf_regs.c b/arch/loongarch/kernel/perf_regs.c
index 263ac4ab5af6..8dd604f01745 100644
--- a/arch/loongarch/kernel/perf_regs.c
+++ b/arch/loongarch/kernel/perf_regs.c
@@ -45,9 +45,3 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 	return regs->regs[idx];
 }
 
-void perf_get_regs_user(struct perf_regs *regs_user,
-			struct pt_regs *regs)
-{
-	regs_user->regs = task_pt_regs(current);
-	regs_user->abi = perf_reg_abi(current);
-}
diff --git a/arch/mips/kernel/perf_regs.c b/arch/mips/kernel/perf_regs.c
index e686780d1647..7736d3c5ebd2 100644
--- a/arch/mips/kernel/perf_regs.c
+++ b/arch/mips/kernel/perf_regs.c
@@ -60,9 +60,3 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 	return (s64)v; /* Sign extend if 32-bit. */
 }
 
-void perf_get_regs_user(struct perf_regs *regs_user,
-			struct pt_regs *regs)
-{
-	regs_user->regs = task_pt_regs(current);
-	regs_user->abi = perf_reg_abi(current);
-}
diff --git a/arch/parisc/kernel/perf_regs.c b/arch/parisc/kernel/perf_regs.c
index 68458e2f6197..87e6990569a7 100644
--- a/arch/parisc/kernel/perf_regs.c
+++ b/arch/parisc/kernel/perf_regs.c
@@ -53,9 +53,3 @@ u64 perf_reg_abi(struct task_struct *task)
 	return PERF_SAMPLE_REGS_ABI_64;
 }
 
-void perf_get_regs_user(struct perf_regs *regs_user,
-			struct pt_regs *regs)
-{
-	regs_user->regs = task_pt_regs(current);
-	regs_user->abi = perf_reg_abi(current);
-}
diff --git a/arch/riscv/kernel/perf_regs.c b/arch/riscv/kernel/perf_regs.c
index fd304a248de6..3bba8deababb 100644
--- a/arch/riscv/kernel/perf_regs.c
+++ b/arch/riscv/kernel/perf_regs.c
@@ -35,9 +35,3 @@ u64 perf_reg_abi(struct task_struct *task)
 #endif
 }
 
-void perf_get_regs_user(struct perf_regs *regs_user,
-			struct pt_regs *regs)
-{
-	regs_user->regs = task_pt_regs(current);
-	regs_user->abi = perf_reg_abi(current);
-}
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 624703af80a1..81204cb7f723 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -100,12 +100,6 @@ u64 perf_reg_abi(struct task_struct *task)
 	return PERF_SAMPLE_REGS_ABI_32;
 }
 
-void perf_get_regs_user(struct perf_regs *regs_user,
-			struct pt_regs *regs)
-{
-	regs_user->regs = task_pt_regs(current);
-	regs_user->abi = perf_reg_abi(current);
-}
 #else /* CONFIG_X86_64 */
 #define REG_NOSUPPORT ((1ULL << PERF_REG_X86_DS) | \
 		       (1ULL << PERF_REG_X86_ES) | \
diff --git a/include/linux/perf_regs.h b/include/linux/perf_regs.h
index f632c5725f16..144bcc3ff19f 100644
--- a/include/linux/perf_regs.h
+++ b/include/linux/perf_regs.h
@@ -9,6 +9,12 @@ struct perf_regs {
 	struct pt_regs	*regs;
 };
 
+u64 perf_reg_value(struct pt_regs *regs, int idx);
+int perf_reg_validate(u64 mask);
+u64 perf_reg_abi(struct task_struct *task);
+void perf_get_regs_user(struct perf_regs *regs_user,
+			struct pt_regs *regs);
+
 #ifdef CONFIG_HAVE_PERF_REGS
 #include <asm/perf_regs.h>
 
@@ -16,35 +22,9 @@ struct perf_regs {
 #define PERF_REG_EXTENDED_MASK	0
 #endif
 
-u64 perf_reg_value(struct pt_regs *regs, int idx);
-int perf_reg_validate(u64 mask);
-u64 perf_reg_abi(struct task_struct *task);
-void perf_get_regs_user(struct perf_regs *regs_user,
-			struct pt_regs *regs);
 #else
 
 #define PERF_REG_EXTENDED_MASK	0
 
-static inline u64 perf_reg_value(struct pt_regs *regs, int idx)
-{
-	return 0;
-}
-
-static inline int perf_reg_validate(u64 mask)
-{
-	return mask ? -ENOSYS : 0;
-}
-
-static inline u64 perf_reg_abi(struct task_struct *task)
-{
-	return PERF_SAMPLE_REGS_ABI_NONE;
-}
-
-static inline void perf_get_regs_user(struct perf_regs *regs_user,
-				      struct pt_regs *regs)
-{
-	regs_user->regs = task_pt_regs(current);
-	regs_user->abi = perf_reg_abi(current);
-}
 #endif /* CONFIG_HAVE_PERF_REGS */
 #endif /* _LINUX_PERF_REGS_H */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f6a08c73f783..efc938c6a2be 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7431,6 +7431,28 @@ unsigned long perf_instruction_pointer(struct perf_event *event,
 	return perf_arch_instruction_pointer(regs);
 }
 
+u64 __weak perf_reg_value(struct pt_regs *regs, int idx)
+{
+	return 0;
+}
+
+int __weak perf_reg_validate(u64 mask)
+{
+	return mask ? -ENOSYS : 0;
+}
+
+u64 __weak perf_reg_abi(struct task_struct *task)
+{
+	return PERF_SAMPLE_REGS_ABI_NONE;
+}
+
+void __weak perf_get_regs_user(struct perf_regs *regs_user,
+			       struct pt_regs *regs)
+{
+	regs_user->regs = task_pt_regs(current);
+	regs_user->abi = perf_reg_abi(current);
+}
+
 static void
 perf_output_sample_regs(struct perf_output_handle *handle,
 			struct pt_regs *regs, u64 mask)
-- 
2.34.1

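The __weak mechanism this patch relies on can be demonstrated in
userspace with GCC/Clang (toy names, not the kernel symbols): a weak
default definition is used unless another translation unit links in a
strong one, which is how each arch's perf_reg_abi() replaces the
common default.

```c
#include <stdint.h>

#define DEMO_PERF_SAMPLE_REGS_ABI_NONE 0

/*
 * Toy model of the patch: a weak default in "common code". An
 * architecture would override it with a strong definition of the same
 * symbol in another object file; with no override linked in, the weak
 * body below is what runs.
 */
__attribute__((weak)) uint64_t demo_perf_reg_abi(void)
{
	return DEMO_PERF_SAMPLE_REGS_ABI_NONE;
}
```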


* [Patch v5 02/19] perf/x86: Use x86_perf_regs in the x86 nmi handler
  2025-12-03  6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
  2025-12-03  6:54 ` [Patch v5 01/19] perf: Eliminate duplicate arch-specific function definitions Dapeng Mi
@ 2025-12-03  6:54 ` Dapeng Mi
  2025-12-03  6:54 ` [Patch v5 03/19] perf/x86: Introduce x86-specific x86_pmu_setup_regs_data() Dapeng Mi
                   ` (18 subsequent siblings)
  20 siblings, 0 replies; 86+ messages in thread
From: Dapeng Mi @ 2025-12-03  6:54 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

More and more registers will be supported in overflow samples, e.g.,
more vector registers, SSP, etc. The generic pt_regs struct cannot
store all of them. Use an x86-specific x86_perf_regs instead.

The struct pt_regs *regs is still passed to x86_pmu_handle_irq(). There
is no functional change for the existing code.

AMD IBS's NMI handler doesn't utilize the static call
x86_pmu_handle_irq(), so the x86_perf_regs struct doesn't apply to AMD
IBS. It can be added separately later when AMD IBS supports more
registers.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/core.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 5d0d5e466c62..ef3bf8fbc97f 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1762,6 +1762,7 @@ void perf_events_lapic_init(void)
 static int
 perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
 {
+	struct x86_perf_regs x86_regs;
 	u64 start_clock;
 	u64 finish_clock;
 	int ret;
@@ -1774,7 +1775,8 @@ perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
 		return NMI_DONE;
 
 	start_clock = sched_clock();
-	ret = static_call(x86_pmu_handle_irq)(regs);
+	x86_regs.regs = *regs;
+	ret = static_call(x86_pmu_handle_irq)(&x86_regs.regs);
 	finish_clock = sched_clock();
 
 	perf_sample_event_took(finish_clock - start_clock);
-- 
2.34.1

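The trick that makes this work is that the pt_regs is the first member
of x86_perf_regs, so the pt_regs pointer handed through the static call
can later be cast back to reach the extended per-sample state. A
userspace toy model (struct contents assumed for illustration):

```c
#include <stdint.h>

/* Toy stand-ins for pt_regs and x86_perf_regs. */
struct demo_pt_regs {
	uint64_t ip;
};

struct demo_x86_perf_regs {
	struct demo_pt_regs regs;	/* must stay the first member */
	uint64_t ssp;			/* room for extra sampled state */
};

/*
 * Because 'regs' is the first member, a pointer to it has the same
 * address as its container, so the cast below is well-defined C and
 * recovers the extended structure.
 */
static struct demo_x86_perf_regs *to_x86_regs(struct demo_pt_regs *regs)
{
	return (struct demo_x86_perf_regs *)regs;
}
```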


* [Patch v5 03/19] perf/x86: Introduce x86-specific x86_pmu_setup_regs_data()
  2025-12-03  6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
  2025-12-03  6:54 ` [Patch v5 01/19] perf: Eliminate duplicate arch-specific functions definations Dapeng Mi
  2025-12-03  6:54 ` [Patch v5 02/19] perf/x86: Use x86_perf_regs in the x86 nmi handler Dapeng Mi
@ 2025-12-03  6:54 ` Dapeng Mi
  2025-12-03  6:54 ` [Patch v5 04/19] x86/fpu/xstate: Add xsaves_nmi() helper Dapeng Mi
                   ` (17 subsequent siblings)
  20 siblings, 0 replies; 86+ messages in thread
From: Dapeng Mi @ 2025-12-03  6:54 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

The current perf/x86 implementation uses the generic functions
perf_sample_regs_user() and perf_sample_regs_intr() to set up register
data for sampling records. While this approach works for general
registers, it falls short when adding sampling support for the SIMD and
APX eGPR registers on x86 platforms.

To address this, introduce the x86-specific function
x86_pmu_setup_regs_data() to set up register data.

At present, x86_pmu_setup_regs_data() mirrors the logic of the generic
functions perf_sample_regs_user() and perf_sample_regs_intr().
Subsequent patches will introduce x86-specific enhancements.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/core.c       | 32 ++++++++++++++++++++++++++++++++
 arch/x86/events/intel/ds.c   |  9 ++++++---
 arch/x86/events/perf_event.h |  4 ++++
 3 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index ef3bf8fbc97f..dcdd2c2d68ee 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1695,6 +1695,38 @@ static void x86_pmu_del(struct perf_event *event, int flags)
 	static_call_cond(x86_pmu_del)(event);
 }
 
+void x86_pmu_setup_regs_data(struct perf_event *event,
+			     struct perf_sample_data *data,
+			     struct pt_regs *regs)
+{
+	u64 sample_type = event->attr.sample_type;
+
+	if (sample_type & PERF_SAMPLE_REGS_USER) {
+		if (user_mode(regs)) {
+			data->regs_user.abi = perf_reg_abi(current);
+			data->regs_user.regs = regs;
+		} else if (!(current->flags & PF_KTHREAD)) {
+			perf_get_regs_user(&data->regs_user, regs);
+		} else {
+			data->regs_user.abi = PERF_SAMPLE_REGS_ABI_NONE;
+			data->regs_user.regs = NULL;
+		}
+		data->dyn_size += sizeof(u64);
+		if (data->regs_user.regs)
+			data->dyn_size += hweight64(event->attr.sample_regs_user) * sizeof(u64);
+		data->sample_flags |= PERF_SAMPLE_REGS_USER;
+	}
+
+	if (sample_type & PERF_SAMPLE_REGS_INTR) {
+		data->regs_intr.regs = regs;
+		data->regs_intr.abi = perf_reg_abi(current);
+		data->dyn_size += sizeof(u64);
+		if (data->regs_intr.regs)
+			data->dyn_size += hweight64(event->attr.sample_regs_intr) * sizeof(u64);
+		data->sample_flags |= PERF_SAMPLE_REGS_INTR;
+	}
+}
+
 int x86_pmu_handle_irq(struct pt_regs *regs)
 {
 	struct perf_sample_data data;
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 2e170f2093ac..c7351f476d8c 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -2180,6 +2180,7 @@ static inline void __setup_pebs_basic_group(struct perf_event *event,
 }
 
 static inline void __setup_pebs_gpr_group(struct perf_event *event,
+					  struct perf_sample_data *data,
 					  struct pt_regs *regs,
 					  struct pebs_gprs *gprs,
 					  u64 sample_type)
@@ -2189,8 +2190,10 @@ static inline void __setup_pebs_gpr_group(struct perf_event *event,
 		regs->flags &= ~PERF_EFLAGS_EXACT;
 	}
 
-	if (sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER))
+	if (sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)) {
 		adaptive_pebs_save_regs(regs, gprs);
+		x86_pmu_setup_regs_data(event, data, regs);
+	}
 }
 
 static inline void __setup_pebs_meminfo_group(struct perf_event *event,
@@ -2283,7 +2286,7 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
 		gprs = next_record;
 		next_record = gprs + 1;
 
-		__setup_pebs_gpr_group(event, regs, gprs, sample_type);
+		__setup_pebs_gpr_group(event, data, regs, gprs, sample_type);
 	}
 
 	if (format_group & PEBS_DATACFG_MEMINFO) {
@@ -2407,7 +2410,7 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
 		gprs = next_record;
 		next_record = gprs + 1;
 
-		__setup_pebs_gpr_group(event, regs,
+		__setup_pebs_gpr_group(event, data, regs,
 				       (struct pebs_gprs *)gprs,
 				       sample_type);
 	}
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 3161ec0a3416..80e52e937638 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1294,6 +1294,10 @@ void x86_pmu_enable_event(struct perf_event *event);
 
 int x86_pmu_handle_irq(struct pt_regs *regs);
 
+void x86_pmu_setup_regs_data(struct perf_event *event,
+			     struct perf_sample_data *data,
+			     struct pt_regs *regs);
+
 void x86_pmu_show_pmu_cap(struct pmu *pmu);
 
 static inline int x86_pmu_num_counters(struct pmu *pmu)
-- 
2.34.1

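The dyn_size bookkeeping in x86_pmu_setup_regs_data() amounts to: one
u64 for the abi word, plus hweight64(mask) u64s when register contents
are actually available. A userspace sketch of that arithmetic (the
function name is made up):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Size contribution of one PERF_SAMPLE_REGS_{USER,INTR} group: the
 * abi word is always emitted; register values only when available.
 */
static size_t regs_sample_size(uint64_t sample_regs_mask, int regs_available)
{
	size_t size = sizeof(uint64_t);		/* abi word */

	if (regs_available)
		size += (size_t)__builtin_popcountll(sample_regs_mask) *
			sizeof(uint64_t);
	return size;
}
```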


* [Patch v5 04/19] x86/fpu/xstate: Add xsaves_nmi() helper
  2025-12-03  6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (2 preceding siblings ...)
  2025-12-03  6:54 ` [Patch v5 03/19] perf/x86: Introduce x86-specific x86_pmu_setup_regs_data() Dapeng Mi
@ 2025-12-03  6:54 ` Dapeng Mi
  2025-12-03  6:54 ` [Patch v5 05/19] perf: Move and rename has_extended_regs() for ARCH-specific use Dapeng Mi
                   ` (16 subsequent siblings)
  20 siblings, 0 replies; 86+ messages in thread
From: Dapeng Mi @ 2025-12-03  6:54 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

Add xsaves_nmi() to save supported xstate components in an NMI handler.

This function is similar to xsaves(), but should only be called within
an NMI handler. It returns the actual register contents at the moment
the NMI occurs.

Currently the perf subsystem is the sole user of this helper. It uses
this function to snapshot the SIMD (XMM/YMM/ZMM) and APX eGPR
registers, which will be added in subsequent patches.

Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/include/asm/fpu/xstate.h |  1 +
 arch/x86/kernel/fpu/xstate.c      | 23 +++++++++++++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 7a7dc9d56027..38fa8ff26559 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -110,6 +110,7 @@ int xfeature_size(int xfeature_nr);
 
 void xsaves(struct xregs_state *xsave, u64 mask);
 void xrstors(struct xregs_state *xsave, u64 mask);
+void xsaves_nmi(struct xregs_state *xsave, u64 mask);
 
 int xfd_enable_feature(u64 xfd_err);
 
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 28e4fd65c9da..e3b8afed8b2c 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1474,6 +1474,29 @@ void xrstors(struct xregs_state *xstate, u64 mask)
 	WARN_ON_ONCE(err);
 }
 
+/**
+ * xsaves_nmi - Save selected components to a kernel xstate buffer in NMI
+ * @xstate:	Pointer to the buffer
+ * @mask:	Feature mask to select the components to save
+ *
+ * This function is similar to xsaves(), but should only be called within
+ * an NMI handler. This function returns the actual register contents at
+ * the moment the NMI occurs.
+ *
+ * Currently, the perf subsystem is the sole user of this helper. It uses
+ * the function to snapshot the SIMD (XMM/YMM/ZMM) and APX eGPR registers.
+ */
+void xsaves_nmi(struct xregs_state *xstate, u64 mask)
+{
+	int err;
+
+	if (!in_nmi())
+		return;
+
+	XSTATE_OP(XSAVES, xstate, (u32)mask, (u32)(mask >> 32), err);
+	WARN_ON_ONCE(err);
+}
+
 #if IS_ENABLED(CONFIG_KVM)
 void fpstate_clear_xstate_component(struct fpstate *fpstate, unsigned int xfeature)
 {
-- 
2.34.1

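For reference, the mask handed to xsaves_nmi() is built from XSAVE
state-component bit numbers. The values below are the architectural
component numbers from the Intel SDM; which components perf actually
requests depends on x86_pmu.ext_regs_mask, not shown here:

```c
#include <stdint.h>

/* XSAVE state-component bit numbers (Intel SDM vol. 1, ch. 13). */
enum {
	DEMO_XFEATURE_SSE	= 1,	/* XMM0-15 */
	DEMO_XFEATURE_YMM	= 2,	/* YMM0-15 high halves */
	DEMO_XFEATURE_OPMASK	= 5,	/* k0-k7 */
	DEMO_XFEATURE_ZMM_H	= 6,	/* ZMM0-15 high halves */
	DEMO_XFEATURE_HI16_ZMM	= 7,	/* ZMM16-31 */
};

/* Build the component mask a caller might hand to xsaves_nmi(). */
static uint64_t demo_simd_xfeature_mask(void)
{
	return (1ULL << DEMO_XFEATURE_SSE) |
	       (1ULL << DEMO_XFEATURE_YMM) |
	       (1ULL << DEMO_XFEATURE_OPMASK) |
	       (1ULL << DEMO_XFEATURE_ZMM_H) |
	       (1ULL << DEMO_XFEATURE_HI16_ZMM);
}
```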


* [Patch v5 05/19] perf: Move and rename has_extended_regs() for ARCH-specific use
  2025-12-03  6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (3 preceding siblings ...)
  2025-12-03  6:54 ` [Patch v5 04/19] x86/fpu/xstate: Add xsaves_nmi() helper Dapeng Mi
@ 2025-12-03  6:54 ` Dapeng Mi
  2025-12-03  6:54 ` [Patch v5 06/19] perf/x86: Add support for XMM registers in non-PEBS and REGS_USER Dapeng Mi
                   ` (15 subsequent siblings)
  20 siblings, 0 replies; 86+ messages in thread
From: Dapeng Mi @ 2025-12-03  6:54 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

The has_extended_regs() function will be utilized in ARCH-specific code.
To facilitate this, move it to the header file perf_event.h.

Additionally, the function is renamed to event_has_extended_regs(), which
aligns with the existing naming conventions.

No functional change intended.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 include/linux/perf_event.h | 8 ++++++++
 kernel/events/core.c       | 8 +-------
 2 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 9870d768db4c..5153b70d09c8 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1526,6 +1526,14 @@ perf_event__output_id_sample(struct perf_event *event,
 extern void
 perf_log_lost_samples(struct perf_event *event, u64 lost);
 
+static inline bool event_has_extended_regs(struct perf_event *event)
+{
+	struct perf_event_attr *attr = &event->attr;
+
+	return (attr->sample_regs_user & PERF_REG_EXTENDED_MASK) ||
+	       (attr->sample_regs_intr & PERF_REG_EXTENDED_MASK);
+}
+
 static inline bool event_has_any_exclude_flag(struct perf_event *event)
 {
 	struct perf_event_attr *attr = &event->attr;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index efc938c6a2be..3e9c48fa2202 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -12664,12 +12664,6 @@ int perf_pmu_unregister(struct pmu *pmu)
 }
 EXPORT_SYMBOL_GPL(perf_pmu_unregister);
 
-static inline bool has_extended_regs(struct perf_event *event)
-{
-	return (event->attr.sample_regs_user & PERF_REG_EXTENDED_MASK) ||
-	       (event->attr.sample_regs_intr & PERF_REG_EXTENDED_MASK);
-}
-
 static int perf_try_init_event(struct pmu *pmu, struct perf_event *event)
 {
 	struct perf_event_context *ctx = NULL;
@@ -12704,7 +12698,7 @@ static int perf_try_init_event(struct pmu *pmu, struct perf_event *event)
 		goto err_pmu;
 
 	if (!(pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS) &&
-	    has_extended_regs(event)) {
+	    event_has_extended_regs(event)) {
 		ret = -EOPNOTSUPP;
 		goto err_destroy;
 	}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [Patch v5 06/19] perf/x86: Add support for XMM registers in non-PEBS and REGS_USER
  2025-12-03  6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (4 preceding siblings ...)
  2025-12-03  6:54 ` [Patch v5 05/19] perf: Move and rename has_extended_regs() for ARCH-specific use Dapeng Mi
@ 2025-12-03  6:54 ` Dapeng Mi
  2025-12-04 15:17   ` Peter Zijlstra
  2025-12-03  6:54 ` [Patch v5 07/19] perf: Add sampling support for SIMD registers Dapeng Mi
                   ` (14 subsequent siblings)
  20 siblings, 1 reply; 86+ messages in thread
From: Dapeng Mi @ 2025-12-03  6:54 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

While collecting XMM registers in a PEBS record has been supported since
Ice Lake, non-PEBS events have lacked this feature. By leveraging the
xsaves instruction, it is now possible to snapshot XMM registers for
non-PEBS events, completing the feature set.

To utilize the xsaves instruction, a 64-byte aligned buffer is required.
A per-CPU ext_regs_buf is added to store SIMD and other registers; the
buffer size is approximately 2K. The buffer is allocated with
kzalloc_node(), which guarantees 64-byte alignment because kmalloc()
naturally aligns allocations whose size is a power of 2.

The XMM sampling support is extended to both REGS_USER and REGS_INTR.
For REGS_USER, perf_get_regs_user() returns the registers from
task_pt_regs(current), which is a pt_regs structure. It needs to be
copied to the user-space-specific x86_user_regs structure since the
kernel may modify the pt_regs structure later.

For PEBS, XMM registers are retrieved from PEBS records.

In cases where userspace tasks are trapped within kernel mode (e.g.,
during a syscall) when an NMI arrives, pt_regs information can still be
retrieved from task_pt_regs(). However, capturing SIMD and other
xsave-based registers in this scenario is challenging. Therefore,
snapshots for these registers are omitted in such cases.

The reasons are:
- Profiling a userspace task that requires SIMD/eGPR registers typically
  involves NMIs hitting userspace, not kernel mode.
- Although it is possible to retrieve values when the TIF_NEED_FPU_LOAD
  flag is set, the complexity introduced to handle this uncommon case in
  the critical path is not justified.
- Additionally, checking the TIF_NEED_FPU_LOAD flag alone is insufficient.
  Some corner cases, such as an NMI occurring just after the flag switches
  but still in kernel mode, cannot be handled.

Future support for additional vector registers is anticipated.
An ext_regs_mask is added to track the supported vector register groups.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/core.c            | 175 ++++++++++++++++++++++++++----
 arch/x86/events/intel/core.c      |  29 ++++-
 arch/x86/events/intel/ds.c        |  20 ++--
 arch/x86/events/perf_event.h      |  11 +-
 arch/x86/include/asm/fpu/xstate.h |   2 +
 arch/x86/include/asm/perf_event.h |   5 +-
 arch/x86/kernel/fpu/xstate.c      |   2 +-
 7 files changed, 212 insertions(+), 32 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index dcdd2c2d68ee..0d33668b1927 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -406,6 +406,62 @@ set_ext_hw_attr(struct hw_perf_event *hwc, struct perf_event *event)
 	return x86_pmu_extra_regs(val, event);
 }
 
+static DEFINE_PER_CPU(struct xregs_state *, ext_regs_buf);
+
+static void x86_pmu_get_ext_regs(struct x86_perf_regs *perf_regs, u64 mask)
+{
+	struct xregs_state *xsave = per_cpu(ext_regs_buf, smp_processor_id());
+	u64 valid_mask = x86_pmu.ext_regs_mask & mask;
+
+	if (WARN_ON_ONCE(!xsave))
+		return;
+
+	xsaves_nmi(xsave, valid_mask);
+
+	/* Filtered by what XSAVE really gives */
+	valid_mask &= xsave->header.xfeatures;
+
+	if (valid_mask & XFEATURE_MASK_SSE)
+		perf_regs->xmm_space = xsave->i387.xmm_space;
+}
+
+static void release_ext_regs_buffers(void)
+{
+	int cpu;
+
+	if (!x86_pmu.ext_regs_mask)
+		return;
+
+	for_each_possible_cpu(cpu) {
+		kfree(per_cpu(ext_regs_buf, cpu));
+		per_cpu(ext_regs_buf, cpu) = NULL;
+	}
+}
+
+static void reserve_ext_regs_buffers(void)
+{
+	bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED);
+	unsigned int size;
+	int cpu;
+
+	if (!x86_pmu.ext_regs_mask)
+		return;
+
+	size = xstate_calculate_size(x86_pmu.ext_regs_mask, compacted);
+
+	for_each_possible_cpu(cpu) {
+		per_cpu(ext_regs_buf, cpu) = kzalloc_node(size, GFP_KERNEL,
+							  cpu_to_node(cpu));
+		if (!per_cpu(ext_regs_buf, cpu))
+			goto err;
+	}
+
+	return;
+
+err:
+	release_ext_regs_buffers();
+}
+
 int x86_reserve_hardware(void)
 {
 	int err = 0;
@@ -418,6 +474,7 @@ int x86_reserve_hardware(void)
 			} else {
 				reserve_ds_buffers();
 				reserve_lbr_buffers();
+				reserve_ext_regs_buffers();
 			}
 		}
 		if (!err)
@@ -434,6 +491,7 @@ void x86_release_hardware(void)
 		release_pmc_hardware();
 		release_ds_buffers();
 		release_lbr_buffers();
+		release_ext_regs_buffers();
 		mutex_unlock(&pmc_reserve_mutex);
 	}
 }
@@ -651,19 +709,17 @@ int x86_pmu_hw_config(struct perf_event *event)
 			return -EINVAL;
 	}
 
-	/* sample_regs_user never support XMM registers */
-	if (unlikely(event->attr.sample_regs_user & PERF_REG_EXTENDED_MASK))
-		return -EINVAL;
-	/*
-	 * Besides the general purpose registers, XMM registers may
-	 * be collected in PEBS on some platforms, e.g. Icelake
-	 */
-	if (unlikely(event->attr.sample_regs_intr & PERF_REG_EXTENDED_MASK)) {
-		if (!(event->pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS))
-			return -EINVAL;
-
-		if (!event->attr.precise_ip)
-			return -EINVAL;
+	if (event->attr.sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)) {
+		/*
+		 * Besides the general purpose registers, XMM registers may
+		 * be collected as well.
+		 */
+		if (event_has_extended_regs(event)) {
+			if (!(event->pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS))
+				return -EINVAL;
+			if (!event->attr.precise_ip)
+				return -EINVAL;
+		}
 	}
 
 	return x86_setup_perfctr(event);
@@ -1695,38 +1751,115 @@ static void x86_pmu_del(struct perf_event *event, int flags)
 	static_call_cond(x86_pmu_del)(event);
 }
 
-void x86_pmu_setup_regs_data(struct perf_event *event,
-			     struct perf_sample_data *data,
-			     struct pt_regs *regs)
+static DEFINE_PER_CPU(struct x86_perf_regs, x86_user_regs);
+
+static struct x86_perf_regs *
+x86_pmu_perf_get_regs_user(struct perf_sample_data *data,
+			   struct pt_regs *regs)
+{
+	struct x86_perf_regs *x86_regs_user = this_cpu_ptr(&x86_user_regs);
+	struct perf_regs regs_user;
+
+	perf_get_regs_user(&regs_user, regs);
+	data->regs_user.abi = regs_user.abi;
+	if (regs_user.regs) {
+		x86_regs_user->regs = *regs_user.regs;
+		data->regs_user.regs = &x86_regs_user->regs;
+	} else
+		data->regs_user.regs = NULL;
+	return x86_regs_user;
+}
+
+static bool x86_pmu_user_req_pt_regs_only(struct perf_event *event)
 {
-	u64 sample_type = event->attr.sample_type;
+	return !(event->attr.sample_regs_user & PERF_REG_EXTENDED_MASK);
+}
+
+inline void x86_pmu_clear_perf_regs(struct pt_regs *regs)
+{
+	struct x86_perf_regs *perf_regs = container_of(regs, struct x86_perf_regs, regs);
+
+	perf_regs->xmm_regs = NULL;
+}
+
+static void x86_pmu_setup_basic_regs_data(struct perf_event *event,
+					  struct perf_sample_data *data,
+					  struct pt_regs *regs)
+{
+	struct perf_event_attr *attr = &event->attr;
+	u64 sample_type = attr->sample_type;
+	struct x86_perf_regs *perf_regs;
+
+	if (!(attr->sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)))
+		return;
 
 	if (sample_type & PERF_SAMPLE_REGS_USER) {
+		perf_regs = container_of(regs, struct x86_perf_regs, regs);
+
 		if (user_mode(regs)) {
 			data->regs_user.abi = perf_reg_abi(current);
 			data->regs_user.regs = regs;
-		} else if (!(current->flags & PF_KTHREAD)) {
-			perf_get_regs_user(&data->regs_user, regs);
+		} else if (!(current->flags & PF_KTHREAD) &&
+			   x86_pmu_user_req_pt_regs_only(event)) {
+			/*
+			 * There is no guarantee that the kernel never
+			 * touches registers outside of the pt_regs,
+			 * especially as more and more registers
+			 * (e.g., SIMD, eGPR) are added. The live data
+			 * cannot be used.
+			 * Dump the registers when only pt_regs are required.
+			 */
+			perf_regs = x86_pmu_perf_get_regs_user(data, regs);
 		} else {
 			data->regs_user.abi = PERF_SAMPLE_REGS_ABI_NONE;
 			data->regs_user.regs = NULL;
 		}
 		data->dyn_size += sizeof(u64);
 		if (data->regs_user.regs)
-			data->dyn_size += hweight64(event->attr.sample_regs_user) * sizeof(u64);
+			data->dyn_size += hweight64(attr->sample_regs_user) * sizeof(u64);
 		data->sample_flags |= PERF_SAMPLE_REGS_USER;
 	}
 
 	if (sample_type & PERF_SAMPLE_REGS_INTR) {
+		perf_regs = container_of(regs, struct x86_perf_regs, regs);
+
 		data->regs_intr.regs = regs;
 		data->regs_intr.abi = perf_reg_abi(current);
 		data->dyn_size += sizeof(u64);
 		if (data->regs_intr.regs)
-			data->dyn_size += hweight64(event->attr.sample_regs_intr) * sizeof(u64);
+			data->dyn_size += hweight64(attr->sample_regs_intr) * sizeof(u64);
 		data->sample_flags |= PERF_SAMPLE_REGS_INTR;
 	}
 }
 
+static void x86_pmu_sample_ext_regs(struct perf_event *event,
+				    struct pt_regs *regs,
+				    u64 ignore_mask)
+{
+	struct x86_perf_regs *perf_regs = container_of(regs, struct x86_perf_regs, regs);
+	u64 mask = 0;
+
+	if (event_has_extended_regs(event))
+		mask |= XFEATURE_MASK_SSE;
+
+	mask &= ~ignore_mask;
+	if (mask)
+		x86_pmu_get_ext_regs(perf_regs, mask);
+}
+
+void x86_pmu_setup_regs_data(struct perf_event *event,
+			     struct perf_sample_data *data,
+			     struct pt_regs *regs,
+			     u64 ignore_mask)
+{
+	x86_pmu_setup_basic_regs_data(event, data, regs);
+	/*
+	 * ignore_mask indicates the extended regs already sampled
+	 * by PEBS, which are unnecessary to sample again.
+	 */
+	x86_pmu_sample_ext_regs(event, regs, ignore_mask);
+}
+
 int x86_pmu_handle_irq(struct pt_regs *regs)
 {
 	struct perf_sample_data data;
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 81e6c8bcabde..b5c89e8eabb2 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3410,6 +3410,9 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
 		if (has_branch_stack(event))
 			intel_pmu_lbr_save_brstack(&data, cpuc, event);
 
+		x86_pmu_clear_perf_regs(regs);
+		x86_pmu_setup_regs_data(event, &data, regs, 0);
+
 		perf_event_overflow(event, &data, regs);
 	}
 
@@ -5619,8 +5622,30 @@ static inline void __intel_update_large_pebs_flags(struct pmu *pmu)
 	}
 }
 
-#define counter_mask(_gp, _fixed) ((_gp) | ((u64)(_fixed) << INTEL_PMC_IDX_FIXED))
+static void intel_extended_regs_init(struct pmu *pmu)
+{
+	/*
+	 * Extend the vector registers support to non-PEBS.
+	 * The feature is limited to newer Intel machines with
+	 * PEBS V4+ or archPerfmonExt (0x23) enabled for now.
+	 * In theory, the vector registers can be retrieved as
+	 * long as the CPU supports them. Support for older
+	 * generations may be added later if there is a
+	 * requirement.
+	 * Only support the extension when XSAVES is available.
+	 */
+	if (!boot_cpu_has(X86_FEATURE_XSAVES))
+		return;
 
+	if (!boot_cpu_has(X86_FEATURE_XMM) ||
+	    !cpu_has_xfeatures(XFEATURE_MASK_SSE, NULL))
+		return;
+
+	x86_pmu.ext_regs_mask |= XFEATURE_MASK_SSE;
+	x86_get_pmu(smp_processor_id())->capabilities |= PERF_PMU_CAP_EXTENDED_REGS;
+}
+
+#define counter_mask(_gp, _fixed) ((_gp) | ((u64)(_fixed) << INTEL_PMC_IDX_FIXED))
 static void update_pmu_cap(struct pmu *pmu)
 {
 	unsigned int eax, ebx, ecx, edx;
@@ -5682,6 +5707,8 @@ static void update_pmu_cap(struct pmu *pmu)
 		/* Perf Metric (Bit 15) and PEBS via PT (Bit 16) are hybrid enumeration */
 		rdmsrq(MSR_IA32_PERF_CAPABILITIES, hybrid(pmu, intel_cap).capabilities);
 	}
+
+	intel_extended_regs_init(pmu);
 }
 
 static void intel_pmu_check_hybrid_pmus(struct x86_hybrid_pmu *pmu)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index c7351f476d8c..af462f69cd1c 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1473,8 +1473,7 @@ static u64 pebs_update_adaptive_cfg(struct perf_event *event)
 	if (gprs || (attr->precise_ip < 2) || tsx_weight)
 		pebs_data_cfg |= PEBS_DATACFG_GP;
 
-	if ((sample_type & PERF_SAMPLE_REGS_INTR) &&
-	    (attr->sample_regs_intr & PERF_REG_EXTENDED_MASK))
+	if (event_has_extended_regs(event))
 		pebs_data_cfg |= PEBS_DATACFG_XMMS;
 
 	if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
@@ -2190,10 +2189,8 @@ static inline void __setup_pebs_gpr_group(struct perf_event *event,
 		regs->flags &= ~PERF_EFLAGS_EXACT;
 	}
 
-	if (sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)) {
+	if (sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER))
 		adaptive_pebs_save_regs(regs, gprs);
-		x86_pmu_setup_regs_data(event, data, regs);
-	}
 }
 
 static inline void __setup_pebs_meminfo_group(struct perf_event *event,
@@ -2251,6 +2248,7 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
 	struct pebs_meminfo *meminfo = NULL;
 	struct pebs_gprs *gprs = NULL;
 	struct x86_perf_regs *perf_regs;
+	u64 ignore_mask = 0;
 	u64 format_group;
 	u16 retire;
 
@@ -2258,7 +2256,7 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
 		return;
 
 	perf_regs = container_of(regs, struct x86_perf_regs, regs);
-	perf_regs->xmm_regs = NULL;
+	x86_pmu_clear_perf_regs(regs);
 
 	format_group = basic->format_group;
 
@@ -2305,6 +2303,7 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
 	if (format_group & PEBS_DATACFG_XMMS) {
 		struct pebs_xmm *xmm = next_record;
 
+		ignore_mask |= XFEATURE_MASK_SSE;
 		next_record = xmm + 1;
 		perf_regs->xmm_regs = xmm->xmm;
 	}
@@ -2343,6 +2342,8 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
 		next_record += nr * sizeof(u64);
 	}
 
+	x86_pmu_setup_regs_data(event, data, regs, ignore_mask);
+
 	WARN_ONCE(next_record != __pebs + basic->format_size,
 			"PEBS record size %u, expected %llu, config %llx\n",
 			basic->format_size,
@@ -2368,6 +2369,7 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
 	struct arch_pebs_aux *meminfo = NULL;
 	struct arch_pebs_gprs *gprs = NULL;
 	struct x86_perf_regs *perf_regs;
+	u64 ignore_mask = 0;
 	void *next_record;
 	void *at = __pebs;
 
@@ -2375,7 +2377,7 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
 		return;
 
 	perf_regs = container_of(regs, struct x86_perf_regs, regs);
-	perf_regs->xmm_regs = NULL;
+	x86_pmu_clear_perf_regs(regs);
 
 	__setup_perf_sample_data(event, iregs, data);
 
@@ -2430,6 +2432,7 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
 
 		next_record += sizeof(struct arch_pebs_xer_header);
 
+		ignore_mask |= XFEATURE_MASK_SSE;
 		xmm = next_record;
 		perf_regs->xmm_regs = xmm->xmm;
 		next_record = xmm + 1;
@@ -2477,6 +2480,8 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
 		at = at + header->size;
 		goto again;
 	}
+
+	x86_pmu_setup_regs_data(event, data, regs, ignore_mask);
 }
 
 static inline void *
@@ -3137,6 +3142,7 @@ static void __init intel_ds_pebs_init(void)
 				x86_pmu.flags |= PMU_FL_PEBS_ALL;
 				x86_pmu.pebs_capable = ~0ULL;
 				pebs_qual = "-baseline";
+				x86_pmu.ext_regs_mask |= XFEATURE_MASK_SSE;
 				x86_get_pmu(smp_processor_id())->capabilities |= PERF_PMU_CAP_EXTENDED_REGS;
 			} else {
 				/* Only basic record supported */
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 80e52e937638..3c470d79aa65 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1009,6 +1009,12 @@ struct x86_pmu {
 	struct extra_reg *extra_regs;
 	unsigned int flags;
 
+	/*
+	 * Extended regs, e.g., vector registers
+	 * Utilize the same format as the XFEATURE_MASK_*
+	 */
+	u64		ext_regs_mask;
+
 	/*
 	 * Intel host/guest support (KVM)
 	 */
@@ -1294,9 +1300,12 @@ void x86_pmu_enable_event(struct perf_event *event);
 
 int x86_pmu_handle_irq(struct pt_regs *regs);
 
+void x86_pmu_clear_perf_regs(struct pt_regs *regs);
+
 void x86_pmu_setup_regs_data(struct perf_event *event,
 			     struct perf_sample_data *data,
-			     struct pt_regs *regs);
+			     struct pt_regs *regs,
+			     u64 ignore_mask);
 
 void x86_pmu_show_pmu_cap(struct pmu *pmu);
 
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 38fa8ff26559..19dec5f0b1c7 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -112,6 +112,8 @@ void xsaves(struct xregs_state *xsave, u64 mask);
 void xrstors(struct xregs_state *xsave, u64 mask);
 void xsaves_nmi(struct xregs_state *xsave, u64 mask);
 
+unsigned int xstate_calculate_size(u64 xfeatures, bool compacted);
+
 int xfd_enable_feature(u64 xfd_err);
 
 #ifdef CONFIG_X86_64
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 7276ba70c88a..3b368de9f803 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -704,7 +704,10 @@ extern void perf_events_lapic_init(void);
 struct pt_regs;
 struct x86_perf_regs {
 	struct pt_regs	regs;
-	u64		*xmm_regs;
+	union {
+		u64	*xmm_regs;
+		u32	*xmm_space;	/* for xsaves */
+	};
 };
 
 extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index e3b8afed8b2c..33142bccc075 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -586,7 +586,7 @@ static bool __init check_xstate_against_struct(int nr)
 	return true;
 }
 
-static unsigned int xstate_calculate_size(u64 xfeatures, bool compacted)
+unsigned int xstate_calculate_size(u64 xfeatures, bool compacted)
 {
 	unsigned int topmost = fls64(xfeatures) -  1;
 	unsigned int offset, i;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [Patch v5 07/19] perf: Add sampling support for SIMD registers
  2025-12-03  6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (5 preceding siblings ...)
  2025-12-03  6:54 ` [Patch v5 06/19] perf/x86: Add support for XMM registers in non-PEBS and REGS_USER Dapeng Mi
@ 2025-12-03  6:54 ` Dapeng Mi
  2025-12-05 11:07   ` Peter Zijlstra
  2025-12-05 11:40   ` Peter Zijlstra
  2025-12-03  6:54 ` [Patch v5 08/19] perf/x86: Enable XMM sampling using sample_simd_vec_reg_* fields Dapeng Mi
                   ` (13 subsequent siblings)
  20 siblings, 2 replies; 86+ messages in thread
From: Dapeng Mi @ 2025-12-03  6:54 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

Users may be interested in sampling SIMD registers during profiling.
The current sample_regs_* structure does not have sufficient space
for all SIMD registers.

To address this, new attribute fields sample_simd_{pred,vec}_reg_* are
added to struct perf_event_attr to represent the SIMD registers that are
expected to be sampled.

Currently, the perf/x86 code supports XMM registers in sample_regs_*.
To unify the configuration of SIMD registers and ensure a consistent
method for configuring XMM and other SIMD registers, a new event
attribute field, sample_simd_regs_enabled, is introduced. When
sample_simd_regs_enabled is set, it indicates that all SIMD registers,
including XMM, will be represented by the newly introduced
sample_simd_{pred,vec}_reg_* fields. The original XMM space in
sample_regs_* is reserved for future use.

Since SIMD registers are wider than 64 bits, a new output format is
introduced. The number and width of SIMD registers are dumped first,
followed by the register values. The number and width are based on the
user's configuration. If they differ (e.g., on ARM), an ARCH-specific
perf_output_sample_simd_regs function can be implemented separately.

A new ABI, PERF_SAMPLE_REGS_ABI_SIMD, is added to indicate the new format.
The enum perf_sample_regs_abi is now a bitmap. This change should not
impact existing tools, as the version and bitmap remain the same for
values 1 and 2.

Additionally, two new __weak functions are introduced:
- perf_simd_reg_value(): Retrieves the value of the requested SIMD
  register.
- perf_simd_reg_validate(): Validates the configuration of the SIMD
  registers.

A new flag, PERF_PMU_CAP_SIMD_REGS, is added to indicate that the PMU
supports SIMD register dumping. An error is generated if
sample_simd_{pred|vec}_reg_* is mistakenly set for a PMU that does not
support this capability.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 include/linux/perf_event.h      |  8 +++
 include/linux/perf_regs.h       |  4 ++
 include/uapi/linux/perf_event.h | 45 ++++++++++++++--
 kernel/events/core.c            | 96 +++++++++++++++++++++++++++++++--
 4 files changed, 146 insertions(+), 7 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 5153b70d09c8..87d3bdbef30e 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -305,6 +305,7 @@ struct perf_event_pmu_context;
 #define PERF_PMU_CAP_EXTENDED_HW_TYPE	0x0100
 #define PERF_PMU_CAP_AUX_PAUSE		0x0200
 #define PERF_PMU_CAP_AUX_PREFER_LARGE	0x0400
+#define PERF_PMU_CAP_SIMD_REGS		0x0800
 
 /**
  * pmu::scope
@@ -1526,6 +1527,13 @@ perf_event__output_id_sample(struct perf_event *event,
 extern void
 perf_log_lost_samples(struct perf_event *event, u64 lost);
 
+static inline bool event_has_simd_regs(struct perf_event *event)
+{
+	struct perf_event_attr *attr = &event->attr;
+
+	return attr->sample_simd_regs_enabled != 0;
+}
+
 static inline bool event_has_extended_regs(struct perf_event *event)
 {
 	struct perf_event_attr *attr = &event->attr;
diff --git a/include/linux/perf_regs.h b/include/linux/perf_regs.h
index 144bcc3ff19f..518f28c6a7d4 100644
--- a/include/linux/perf_regs.h
+++ b/include/linux/perf_regs.h
@@ -14,6 +14,10 @@ int perf_reg_validate(u64 mask);
 u64 perf_reg_abi(struct task_struct *task);
 void perf_get_regs_user(struct perf_regs *regs_user,
 			struct pt_regs *regs);
+int perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
+			   u16 pred_qwords, u32 pred_mask);
+u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
+			u16 qwords_idx, bool pred);
 
 #ifdef CONFIG_HAVE_PERF_REGS
 #include <asm/perf_regs.h>
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index d292f96bc06f..f1474da32622 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -314,8 +314,9 @@ enum {
  */
 enum perf_sample_regs_abi {
 	PERF_SAMPLE_REGS_ABI_NONE		= 0,
-	PERF_SAMPLE_REGS_ABI_32			= 1,
-	PERF_SAMPLE_REGS_ABI_64			= 2,
+	PERF_SAMPLE_REGS_ABI_32			= (1 << 0),
+	PERF_SAMPLE_REGS_ABI_64			= (1 << 1),
+	PERF_SAMPLE_REGS_ABI_SIMD		= (1 << 2),
 };
 
 /*
@@ -382,6 +383,7 @@ enum perf_event_read_format {
 #define PERF_ATTR_SIZE_VER6			120	/* Add: aux_sample_size */
 #define PERF_ATTR_SIZE_VER7			128	/* Add: sig_data */
 #define PERF_ATTR_SIZE_VER8			136	/* Add: config3 */
+#define PERF_ATTR_SIZE_VER9			168	/* Add: sample_simd_{pred,vec}_reg_* */
 
 /*
  * 'struct perf_event_attr' contains various attributes that define
@@ -545,6 +547,25 @@ struct perf_event_attr {
 	__u64	sig_data;
 
 	__u64	config3; /* extension of config2 */
+
+
+	/*
+	 * Defines the set of SIMD registers to dump on samples.
+	 * A non-zero sample_simd_regs_enabled implies that these
+	 * fields configure all SIMD registers.
+	 * If !sample_simd_regs_enabled, sample_regs_XXX may be used to
+	 * configure some SIMD registers on x86.
+	 */
+	union {
+		__u16 sample_simd_regs_enabled;
+		__u16 sample_simd_pred_reg_qwords;
+	};
+	__u32 sample_simd_pred_reg_intr;
+	__u32 sample_simd_pred_reg_user;
+	__u16 sample_simd_vec_reg_qwords;
+	__u64 sample_simd_vec_reg_intr;
+	__u64 sample_simd_vec_reg_user;
+	__u32 __reserved_4;
 };
 
 /*
@@ -1018,7 +1039,15 @@ enum perf_event_type {
 	 *      } && PERF_SAMPLE_BRANCH_STACK
 	 *
 	 *	{ u64			abi; # enum perf_sample_regs_abi
-	 *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
+	 *	  u64			regs[weight(mask)];
+	 *	  struct {
+	 *		u16 nr_vectors;
+	 *		u16 vector_qwords;
+	 *		u16 nr_pred;
+	 *		u16 pred_qwords;
+	 *		u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
+	 *	  } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
+	 *	} && PERF_SAMPLE_REGS_USER
 	 *
 	 *	{ u64			size;
 	 *	  char			data[size];
@@ -1045,7 +1074,15 @@ enum perf_event_type {
 	 *	{ u64			data_src; } && PERF_SAMPLE_DATA_SRC
 	 *	{ u64			transaction; } && PERF_SAMPLE_TRANSACTION
 	 *	{ u64			abi; # enum perf_sample_regs_abi
-	 *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
+	 *	  u64			regs[weight(mask)];
+	 *	  struct {
+	 *		u16 nr_vectors;
+	 *		u16 vector_qwords;
+	 *		u16 nr_pred;
+	 *		u16 pred_qwords;
+	 *		u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
+	 *	  } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
+	 *	} && PERF_SAMPLE_REGS_INTR
 	 *	{ u64			phys_addr;} && PERF_SAMPLE_PHYS_ADDR
 	 *	{ u64			cgroup;} && PERF_SAMPLE_CGROUP
 	 *	{ u64			data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 3e9c48fa2202..b19de038979e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7469,6 +7469,50 @@ perf_output_sample_regs(struct perf_output_handle *handle,
 	}
 }
 
+static void
+perf_output_sample_simd_regs(struct perf_output_handle *handle,
+			     struct perf_event *event,
+			     struct pt_regs *regs,
+			     u64 mask, u16 pred_mask)
+{
+	u16 pred_qwords = event->attr.sample_simd_pred_reg_qwords;
+	u16 vec_qwords = event->attr.sample_simd_vec_reg_qwords;
+	u64 pred_bitmap = pred_mask;
+	u64 bitmap = mask;
+	u16 nr_vectors;
+	u16 nr_pred;
+	int bit;
+	u64 val;
+	u16 i;
+
+	nr_vectors = hweight64(bitmap);
+	nr_pred = hweight64(pred_bitmap);
+
+	perf_output_put(handle, nr_vectors);
+	perf_output_put(handle, vec_qwords);
+	perf_output_put(handle, nr_pred);
+	perf_output_put(handle, pred_qwords);
+
+	if (nr_vectors) {
+		for_each_set_bit(bit, (unsigned long *)&bitmap,
+				 sizeof(bitmap) * BITS_PER_BYTE) {
+			for (i = 0; i < vec_qwords; i++) {
+				val = perf_simd_reg_value(regs, bit, i, false);
+				perf_output_put(handle, val);
+			}
+		}
+	}
+	if (nr_pred) {
+		for_each_set_bit(bit, (unsigned long *)&pred_bitmap,
+				 sizeof(pred_bitmap) * BITS_PER_BYTE) {
+			for (i = 0; i < pred_qwords; i++) {
+				val = perf_simd_reg_value(regs, bit, i, true);
+				perf_output_put(handle, val);
+			}
+		}
+	}
+}
+
 static void perf_sample_regs_user(struct perf_regs *regs_user,
 				  struct pt_regs *regs)
 {
@@ -7490,6 +7534,17 @@ static void perf_sample_regs_intr(struct perf_regs *regs_intr,
 	regs_intr->abi  = perf_reg_abi(current);
 }
 
+int __weak perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
+				  u16 pred_qwords, u32 pred_mask)
+{
+	return vec_qwords || vec_mask || pred_qwords || pred_mask ? -ENOSYS : 0;
+}
+
+u64 __weak perf_simd_reg_value(struct pt_regs *regs, int idx,
+			       u16 qwords_idx, bool pred)
+{
+	return 0;
+}
 
 /*
  * Get remaining task size from user stack pointer.
@@ -8022,10 +8077,17 @@ void perf_output_sample(struct perf_output_handle *handle,
 		perf_output_put(handle, abi);
 
 		if (abi) {
-			u64 mask = event->attr.sample_regs_user;
+			struct perf_event_attr *attr = &event->attr;
+			u64 mask = attr->sample_regs_user;
 			perf_output_sample_regs(handle,
 						data->regs_user.regs,
 						mask);
+			if (abi & PERF_SAMPLE_REGS_ABI_SIMD) {
+				perf_output_sample_simd_regs(handle, event,
+							     data->regs_user.regs,
+							     attr->sample_simd_vec_reg_user,
+							     attr->sample_simd_pred_reg_user);
+			}
 		}
 	}
 
@@ -8053,11 +8115,18 @@ void perf_output_sample(struct perf_output_handle *handle,
 		perf_output_put(handle, abi);
 
 		if (abi) {
-			u64 mask = event->attr.sample_regs_intr;
+			struct perf_event_attr *attr = &event->attr;
+			u64 mask = attr->sample_regs_intr;
 
 			perf_output_sample_regs(handle,
 						data->regs_intr.regs,
 						mask);
+			if (abi & PERF_SAMPLE_REGS_ABI_SIMD) {
+				perf_output_sample_simd_regs(handle, event,
+							     data->regs_intr.regs,
+							     attr->sample_simd_vec_reg_intr,
+							     attr->sample_simd_pred_reg_intr);
+			}
 		}
 	}
 
@@ -12697,6 +12766,12 @@ static int perf_try_init_event(struct pmu *pmu, struct perf_event *event)
 	if (ret)
 		goto err_pmu;
 
+	if (!(pmu->capabilities & PERF_PMU_CAP_SIMD_REGS) &&
+	    event_has_simd_regs(event)) {
+		ret = -EOPNOTSUPP;
+		goto err_destroy;
+	}
+
 	if (!(pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS) &&
 	    event_has_extended_regs(event)) {
 		ret = -EOPNOTSUPP;
@@ -13238,6 +13313,12 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
 		ret = perf_reg_validate(attr->sample_regs_user);
 		if (ret)
 			return ret;
+		ret = perf_simd_reg_validate(attr->sample_simd_vec_reg_qwords,
+					     attr->sample_simd_vec_reg_user,
+					     attr->sample_simd_pred_reg_qwords,
+					     attr->sample_simd_pred_reg_user);
+		if (ret)
+			return ret;
 	}
 
 	if (attr->sample_type & PERF_SAMPLE_STACK_USER) {
@@ -13258,8 +13339,17 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
 	if (!attr->sample_max_stack)
 		attr->sample_max_stack = sysctl_perf_event_max_stack;
 
-	if (attr->sample_type & PERF_SAMPLE_REGS_INTR)
+	if (attr->sample_type & PERF_SAMPLE_REGS_INTR) {
 		ret = perf_reg_validate(attr->sample_regs_intr);
+		if (ret)
+			return ret;
+		ret = perf_simd_reg_validate(attr->sample_simd_vec_reg_qwords,
+					     attr->sample_simd_vec_reg_intr,
+					     attr->sample_simd_pred_reg_qwords,
+					     attr->sample_simd_pred_reg_intr);
+		if (ret)
+			return ret;
+	}
 
 #ifndef CONFIG_CGROUP_PERF
 	if (attr->sample_type & PERF_SAMPLE_CGROUP)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [Patch v5 08/19] perf/x86: Enable XMM sampling using sample_simd_vec_reg_* fields
  2025-12-03  6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (6 preceding siblings ...)
  2025-12-03  6:54 ` [Patch v5 07/19] perf: Add sampling support for SIMD registers Dapeng Mi
@ 2025-12-03  6:54 ` Dapeng Mi
  2025-12-05 11:25   ` Peter Zijlstra
  2025-12-03  6:54 ` [Patch v5 09/19] perf/x86: Enable YMM " Dapeng Mi
                   ` (12 subsequent siblings)
  20 siblings, 1 reply; 86+ messages in thread
From: Dapeng Mi @ 2025-12-03  6:54 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

Add support for sampling XMM registers via the sample_simd_vec_reg_*
fields.

When sample_simd_regs_enabled is set, the original XMM space in the
sample_regs_* fields is treated as reserved, and -EINVAL is returned to
user space if any bit is set in that space.

The perf_reg_value() function requires ABI information to understand the
layout of sample_regs. To accommodate this, a new abi field is introduced
in struct x86_perf_regs.

Additionally, the x86-specific perf_simd_reg_value() function is
implemented to retrieve the XMM register values.
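
As a side note (not part of the patch): the per-sample size accounting
that the series performs with hweight64() can be sketched in isolation.
The helper names below are made up for illustration; only the arithmetic
mirrors the kernel code.

```c
#include <stddef.h>
#include <stdint.h>

/* Population count; the kernel uses hweight64() for the same purpose. */
static unsigned int popcount64(uint64_t v)
{
	unsigned int n = 0;

	while (v) {
		v &= v - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/* Bytes added to the sample payload for the requested SIMD registers:
 * one qword (u64) per selected register per qword of register width. */
static size_t simd_payload_bytes(uint64_t vec_mask, uint16_t vec_qwords,
				 uint32_t pred_mask, uint16_t pred_qwords)
{
	return (popcount64(vec_mask) * vec_qwords +
		popcount64(pred_mask) * pred_qwords) * sizeof(uint64_t);
}
```

With all 16 XMM registers requested (vec_mask 0xffff, 2 qwords each), the
payload grows by 256 bytes per sample.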

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/core.c                | 78 ++++++++++++++++++++++++++-
 arch/x86/events/intel/ds.c            |  2 +-
 arch/x86/events/perf_event.h          | 12 +++++
 arch/x86/include/asm/perf_event.h     |  1 +
 arch/x86/include/uapi/asm/perf_regs.h | 17 ++++++
 arch/x86/kernel/perf_regs.c           | 51 +++++++++++++++++-
 6 files changed, 158 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 0d33668b1927..8f7e7e81daaf 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -719,6 +719,22 @@ int x86_pmu_hw_config(struct perf_event *event)
 				return -EINVAL;
 			if (!event->attr.precise_ip)
 				return -EINVAL;
+			if (event->attr.sample_simd_regs_enabled)
+				return -EINVAL;
+		}
+
+		if (event_has_simd_regs(event)) {
+			if (!(event->pmu->capabilities & PERF_PMU_CAP_SIMD_REGS))
+				return -EINVAL;
+			/* Vector width set but no vector registers requested */
+			if (event->attr.sample_simd_vec_reg_qwords &&
+			    !event->attr.sample_simd_vec_reg_intr &&
+			    !event->attr.sample_simd_vec_reg_user)
+				return -EINVAL;
+			/* The requested vector register set is not supported */
+			if (event_needs_xmm(event) &&
+			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_SSE))
+				return -EINVAL;
 		}
 	}
 
@@ -1760,6 +1776,7 @@ x86_pmu_perf_get_regs_user(struct perf_sample_data *data,
 	struct x86_perf_regs *x86_regs_user = this_cpu_ptr(&x86_user_regs);
 	struct perf_regs regs_user;
 
+	x86_regs_user->abi = PERF_SAMPLE_REGS_ABI_NONE;
 	perf_get_regs_user(&regs_user, regs);
 	data->regs_user.abi = regs_user.abi;
 	if (regs_user.regs) {
@@ -1772,9 +1789,26 @@ x86_pmu_perf_get_regs_user(struct perf_sample_data *data,
 
 static bool x86_pmu_user_req_pt_regs_only(struct perf_event *event)
 {
+	if (event->attr.sample_simd_regs_enabled)
+		return false;
 	return !(event->attr.sample_regs_user & PERF_REG_EXTENDED_MASK);
 }
 
+static inline void
+x86_pmu_update_ext_regs_size(struct perf_event_attr *attr,
+			     struct perf_sample_data *data,
+			     struct pt_regs *regs,
+			     u64 mask, u16 pred_mask)
+{
+	u16 pred_qwords = attr->sample_simd_pred_reg_qwords;
+	u16 vec_qwords = attr->sample_simd_vec_reg_qwords;
+	u64 pred_bitmap = pred_mask;
+	u64 bitmap = mask;
+
+	data->dyn_size += (hweight64(bitmap) * vec_qwords +
+			   hweight64(pred_bitmap) * pred_qwords) * sizeof(u64);
+}
+
 inline void x86_pmu_clear_perf_regs(struct pt_regs *regs)
 {
 	struct x86_perf_regs *perf_regs = container_of(regs, struct x86_perf_regs, regs);
@@ -1795,6 +1829,7 @@ static void x86_pmu_setup_basic_regs_data(struct perf_event *event,
 
 	if (sample_type & PERF_SAMPLE_REGS_USER) {
 		perf_regs = container_of(regs, struct x86_perf_regs, regs);
+		perf_regs->abi = PERF_SAMPLE_REGS_ABI_NONE;
 
 		if (user_mode(regs)) {
 			data->regs_user.abi = perf_reg_abi(current);
@@ -1817,17 +1852,24 @@ static void x86_pmu_setup_basic_regs_data(struct perf_event *event,
 		data->dyn_size += sizeof(u64);
 		if (data->regs_user.regs)
 			data->dyn_size += hweight64(attr->sample_regs_user) * sizeof(u64);
+		perf_regs->abi |= data->regs_user.abi;
+		if (attr->sample_simd_regs_enabled)
+			perf_regs->abi |= PERF_SAMPLE_REGS_ABI_SIMD;
 		data->sample_flags |= PERF_SAMPLE_REGS_USER;
 	}
 
 	if (sample_type & PERF_SAMPLE_REGS_INTR) {
 		perf_regs = container_of(regs, struct x86_perf_regs, regs);
+		perf_regs->abi = PERF_SAMPLE_REGS_ABI_NONE;
 
 		data->regs_intr.regs = regs;
 		data->regs_intr.abi = perf_reg_abi(current);
 		data->dyn_size += sizeof(u64);
 		if (data->regs_intr.regs)
 			data->dyn_size += hweight64(attr->sample_regs_intr) * sizeof(u64);
+		perf_regs->abi |= data->regs_intr.abi;
+		if (attr->sample_simd_regs_enabled)
+			perf_regs->abi |= PERF_SAMPLE_REGS_ABI_SIMD;
 		data->sample_flags |= PERF_SAMPLE_REGS_INTR;
 	}
 }
@@ -1839,7 +1881,7 @@ static void x86_pmu_sample_ext_regs(struct perf_event *event,
 	struct x86_perf_regs *perf_regs = container_of(regs, struct x86_perf_regs, regs);
 	u64 mask = 0;
 
-	if (event_has_extended_regs(event))
+	if (event_needs_xmm(event))
 		mask |= XFEATURE_MASK_SSE;
 
 	mask &= ~ignore_mask;
@@ -1847,6 +1889,39 @@ static void x86_pmu_sample_ext_regs(struct perf_event *event,
 		x86_pmu_get_ext_regs(perf_regs, mask);
 }
 
+static void x86_pmu_setup_extended_regs_data(struct perf_event *event,
+					     struct perf_sample_data *data,
+					     struct pt_regs *regs)
+{
+	struct perf_event_attr *attr = &event->attr;
+	u64 sample_type = attr->sample_type;
+
+	if (!attr->sample_simd_regs_enabled)
+		return;
+
+	if (!(attr->sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)))
+		return;
+
+	/* Update the data[] size */
+	if (sample_type & PERF_SAMPLE_REGS_USER && data->regs_user.abi) {
+		/* num and qwords of vector and pred registers */
+		data->dyn_size += sizeof(u64);
+		data->regs_user.abi |= PERF_SAMPLE_REGS_ABI_SIMD;
+		x86_pmu_update_ext_regs_size(attr, data, data->regs_user.regs,
+					     attr->sample_simd_vec_reg_user,
+					     attr->sample_simd_pred_reg_user);
+	}
+
+	if (sample_type & PERF_SAMPLE_REGS_INTR && data->regs_intr.abi) {
+		/* num and qwords of vector and pred registers */
+		data->dyn_size += sizeof(u64);
+		data->regs_intr.abi |= PERF_SAMPLE_REGS_ABI_SIMD;
+		x86_pmu_update_ext_regs_size(attr, data, data->regs_intr.regs,
+					     attr->sample_simd_vec_reg_intr,
+					     attr->sample_simd_pred_reg_intr);
+	}
+}
+
 void x86_pmu_setup_regs_data(struct perf_event *event,
 			     struct perf_sample_data *data,
 			     struct pt_regs *regs,
@@ -1858,6 +1933,7 @@ void x86_pmu_setup_regs_data(struct perf_event *event,
 	 * which is unnecessary to sample again.
 	 */
 	x86_pmu_sample_ext_regs(event, regs, ignore_mask);
+	x86_pmu_setup_extended_regs_data(event, data, regs);
 }
 
 int x86_pmu_handle_irq(struct pt_regs *regs)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index af462f69cd1c..79cba323eeb1 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1473,7 +1473,7 @@ static u64 pebs_update_adaptive_cfg(struct perf_event *event)
 	if (gprs || (attr->precise_ip < 2) || tsx_weight)
 		pebs_data_cfg |= PEBS_DATACFG_GP;
 
-	if (event_has_extended_regs(event))
+	if (event_needs_xmm(event))
 		pebs_data_cfg |= PEBS_DATACFG_XMMS;
 
 	if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 3c470d79aa65..e5d8ad024553 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -133,6 +133,18 @@ static inline bool is_acr_event_group(struct perf_event *event)
 	return check_leader_group(event->group_leader, PERF_X86_EVENT_ACR);
 }
 
+static inline bool event_needs_xmm(struct perf_event *event)
+{
+	if (event->attr.sample_simd_regs_enabled &&
+	    event->attr.sample_simd_vec_reg_qwords >= PERF_X86_XMM_QWORDS)
+		return true;
+
+	if (!event->attr.sample_simd_regs_enabled &&
+	    event_has_extended_regs(event))
+		return true;
+	return false;
+}
+
 struct amd_nb {
 	int nb_id;  /* NorthBridge id */
 	int refcnt; /* reference count */
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 3b368de9f803..5d623805bf87 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -704,6 +704,7 @@ extern void perf_events_lapic_init(void);
 struct pt_regs;
 struct x86_perf_regs {
 	struct pt_regs	regs;
+	u64		abi;
 	union {
 		u64	*xmm_regs;
 		u32	*xmm_space;	/* for xsaves */
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index 7c9d2bb3833b..c3862e5fdd6d 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -55,4 +55,21 @@ enum perf_event_x86_regs {
 
 #define PERF_REG_EXTENDED_MASK	(~((1ULL << PERF_REG_X86_XMM0) - 1))
 
+enum {
+	PERF_REG_X86_XMM,
+	PERF_REG_X86_MAX_SIMD_REGS,
+};
+
+enum {
+	PERF_X86_SIMD_XMM_REGS      = 16,
+	PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_XMM_REGS,
+};
+
+#define PERF_X86_SIMD_VEC_MASK		GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
+
+enum {
+	PERF_X86_XMM_QWORDS      = 2,
+	PERF_X86_SIMD_QWORDS_MAX = PERF_X86_XMM_QWORDS,
+};
+
 #endif /* _ASM_X86_PERF_REGS_H */
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 81204cb7f723..9947a6b5c260 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -63,6 +63,9 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 
 	if (idx >= PERF_REG_X86_XMM0 && idx < PERF_REG_X86_XMM_MAX) {
 		perf_regs = container_of(regs, struct x86_perf_regs, regs);
+		/* SIMD registers are moved to dedicated sample_simd_vec_reg */
+		if (perf_regs->abi & PERF_SAMPLE_REGS_ABI_SIMD)
+			return 0;
 		if (!perf_regs->xmm_regs)
 			return 0;
 		return perf_regs->xmm_regs[idx - PERF_REG_X86_XMM0];
@@ -74,6 +77,51 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 	return regs_get_register(regs, pt_regs_offset[idx]);
 }
 
+u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
+			u16 qwords_idx, bool pred)
+{
+	struct x86_perf_regs *perf_regs =
+			container_of(regs, struct x86_perf_regs, regs);
+
+	if (pred)
+		return 0;
+
+	if (WARN_ON_ONCE(idx >= PERF_X86_SIMD_VEC_REGS_MAX ||
+			 qwords_idx >= PERF_X86_SIMD_QWORDS_MAX))
+		return 0;
+
+	if (qwords_idx < PERF_X86_XMM_QWORDS) {
+		if (!perf_regs->xmm_regs)
+			return 0;
+		return perf_regs->xmm_regs[idx * PERF_X86_XMM_QWORDS +
+					   qwords_idx];
+	}
+
+	return 0;
+}
+
+int perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
+			   u16 pred_qwords, u32 pred_mask)
+{
+	/* pred_qwords implies sample_simd_{pred,vec}_reg_* are supported */
+	if (!pred_qwords)
+		return 0;
+
+	if (!vec_qwords) {
+		if (vec_mask)
+			return -EINVAL;
+	} else {
+		if (vec_qwords != PERF_X86_XMM_QWORDS)
+			return -EINVAL;
+		if (vec_mask & ~PERF_X86_SIMD_VEC_MASK)
+			return -EINVAL;
+	}
+	if (pred_mask)
+		return -EINVAL;
+
+	return 0;
+}
+
 #define PERF_REG_X86_RESERVED	(((1ULL << PERF_REG_X86_XMM0) - 1) & \
 				 ~((1ULL << PERF_REG_X86_MAX) - 1))
 
@@ -108,7 +156,8 @@ u64 perf_reg_abi(struct task_struct *task)
 
 int perf_reg_validate(u64 mask)
 {
-	if (!mask || (mask & (REG_NOSUPPORT | PERF_REG_X86_RESERVED)))
+	/* The mask may be 0 if only SIMD registers are of interest */
+	if (mask & (REG_NOSUPPORT | PERF_REG_X86_RESERVED))
 		return -EINVAL;
 
 	return 0;
-- 
2.34.1



* [Patch v5 09/19] perf/x86: Enable YMM sampling using sample_simd_vec_reg_* fields
  2025-12-03  6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (7 preceding siblings ...)
  2025-12-03  6:54 ` [Patch v5 08/19] perf/x86: Enable XMM sampling using sample_simd_vec_reg_* fields Dapeng Mi
@ 2025-12-03  6:54 ` Dapeng Mi
  2025-12-03  6:54 ` [Patch v5 10/19] perf/x86: Enable ZMM " Dapeng Mi
                   ` (11 subsequent siblings)
  20 siblings, 0 replies; 86+ messages in thread
From: Dapeng Mi @ 2025-12-03  6:54 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

Add support for sampling YMM registers via the sample_simd_vec_reg_*
fields.

Each YMM register consists of 4 qwords (u64), assembled from two halves:
XMM (the lower 2 qwords) and YMMH (the upper 2 qwords). Although both the
XMM and YMMH data can be retrieved with a single XSAVES instruction, they
are stored in separate XSAVE components. The perf_simd_reg_value()
function assembles these halves into a complete YMM register for output
to userspace.

Additionally, sample_simd_vec_reg_qwords should be set to 4 to indicate
YMM sampling.
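
Not part of the patch: a minimal userspace sketch of the half-assembly
described above, with flat u64 arrays standing in for the XMM and YMMH
XSAVE components and a hypothetical helper name (ymm_qword):

```c
#include <stdint.h>

#define XMM_QWORDS	2	/* mirrors PERF_X86_XMM_QWORDS */
#define YMMH_QWORDS	2	/* mirrors PERF_X86_YMMH_QWORDS */

/* Return qword 'q' (0..3) of YMM register 'idx': qwords 0-1 come from
 * the XMM component, qwords 2-3 from the YMMH component. */
static uint64_t ymm_qword(const uint64_t *xmm, const uint64_t *ymmh,
			  int idx, int q)
{
	if (q < XMM_QWORDS)
		return xmm[idx * XMM_QWORDS + q];
	return ymmh[idx * YMMH_QWORDS + (q - XMM_QWORDS)];
}
```

This is the same indexing perf_simd_reg_value() performs in the patch,
minus the NULL checks on the component pointers.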

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/core.c                | 9 +++++++++
 arch/x86/events/perf_event.h          | 9 +++++++++
 arch/x86/include/asm/perf_event.h     | 4 ++++
 arch/x86/include/uapi/asm/perf_regs.h | 8 ++++++--
 arch/x86/kernel/perf_regs.c           | 8 +++++++-
 5 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 8f7e7e81daaf..b1e62c061d9e 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -423,6 +423,9 @@ static void x86_pmu_get_ext_regs(struct x86_perf_regs *perf_regs, u64 mask)
 
 	if (valid_mask & XFEATURE_MASK_SSE)
 		perf_regs->xmm_space = xsave->i387.xmm_space;
+
+	if (valid_mask & XFEATURE_MASK_YMM)
+		perf_regs->ymmh = get_xsave_addr(xsave, XFEATURE_YMM);
 }
 
 static void release_ext_regs_buffers(void)
@@ -735,6 +738,9 @@ int x86_pmu_hw_config(struct perf_event *event)
 			if (event_needs_xmm(event) &&
 			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_SSE))
 				return -EINVAL;
+			if (event_needs_ymm(event) &&
+			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_YMM))
+				return -EINVAL;
 		}
 	}
 
@@ -1814,6 +1820,7 @@ inline void x86_pmu_clear_perf_regs(struct pt_regs *regs)
 	struct x86_perf_regs *perf_regs = container_of(regs, struct x86_perf_regs, regs);
 
 	perf_regs->xmm_regs = NULL;
+	perf_regs->ymmh_regs = NULL;
 }
 
 static void x86_pmu_setup_basic_regs_data(struct perf_event *event,
@@ -1883,6 +1890,8 @@ static void x86_pmu_sample_ext_regs(struct perf_event *event,
 
 	if (event_needs_xmm(event))
 		mask |= XFEATURE_MASK_SSE;
+	if (event_needs_ymm(event))
+		mask |= XFEATURE_MASK_YMM;
 
 	mask &= ~ignore_mask;
 	if (mask)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index e5d8ad024553..3d4577a1bb7d 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -145,6 +145,15 @@ static inline bool event_needs_xmm(struct perf_event *event)
 	return false;
 }
 
+static inline bool event_needs_ymm(struct perf_event *event)
+{
+	if (event->attr.sample_simd_regs_enabled &&
+	    event->attr.sample_simd_vec_reg_qwords >= PERF_X86_YMM_QWORDS)
+		return true;
+
+	return false;
+}
+
 struct amd_nb {
 	int nb_id;  /* NorthBridge id */
 	int refcnt; /* reference count */
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 5d623805bf87..25f5ae60f72f 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -709,6 +709,10 @@ struct x86_perf_regs {
 		u64	*xmm_regs;
 		u32	*xmm_space;	/* for xsaves */
 	};
+	union {
+		u64	*ymmh_regs;
+		struct ymmh_struct *ymmh;
+	};
 };
 
 extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index c3862e5fdd6d..4fd598785f6d 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -57,19 +57,23 @@ enum perf_event_x86_regs {
 
 enum {
 	PERF_REG_X86_XMM,
+	PERF_REG_X86_YMM,
 	PERF_REG_X86_MAX_SIMD_REGS,
 };
 
 enum {
 	PERF_X86_SIMD_XMM_REGS      = 16,
-	PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_XMM_REGS,
+	PERF_X86_SIMD_YMM_REGS      = 16,
+	PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_YMM_REGS,
 };
 
 #define PERF_X86_SIMD_VEC_MASK		GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
 
 enum {
 	PERF_X86_XMM_QWORDS      = 2,
-	PERF_X86_SIMD_QWORDS_MAX = PERF_X86_XMM_QWORDS,
+	PERF_X86_YMMH_QWORDS     = 2,
+	PERF_X86_YMM_QWORDS      = 4,
+	PERF_X86_SIMD_QWORDS_MAX = PERF_X86_YMM_QWORDS,
 };
 
 #endif /* _ASM_X86_PERF_REGS_H */
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 9947a6b5c260..8aa61a18fd71 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -95,6 +95,11 @@ u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
 			return 0;
 		return perf_regs->xmm_regs[idx * PERF_X86_XMM_QWORDS +
 					   qwords_idx];
+	} else if (qwords_idx < PERF_X86_YMM_QWORDS) {
+		if (!perf_regs->ymmh_regs)
+			return 0;
+		return perf_regs->ymmh_regs[idx * PERF_X86_YMMH_QWORDS +
+					    qwords_idx - PERF_X86_XMM_QWORDS];
 	}
 
 	return 0;
@@ -111,7 +116,8 @@ int perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
 		if (vec_mask)
 			return -EINVAL;
 	} else {
-		if (vec_qwords != PERF_X86_XMM_QWORDS)
+		if (vec_qwords != PERF_X86_XMM_QWORDS &&
+		    vec_qwords != PERF_X86_YMM_QWORDS)
 			return -EINVAL;
 		if (vec_mask & ~PERF_X86_SIMD_VEC_MASK)
 			return -EINVAL;
-- 
2.34.1



* [Patch v5 10/19] perf/x86: Enable ZMM sampling using sample_simd_vec_reg_* fields
  2025-12-03  6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (8 preceding siblings ...)
  2025-12-03  6:54 ` [Patch v5 09/19] perf/x86: Enable YMM " Dapeng Mi
@ 2025-12-03  6:54 ` Dapeng Mi
  2025-12-03  6:54 ` [Patch v5 11/19] perf/x86: Enable OPMASK sampling using sample_simd_pred_reg_* fields Dapeng Mi
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 86+ messages in thread
From: Dapeng Mi @ 2025-12-03  6:54 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

Add support for sampling ZMM registers via the sample_simd_vec_reg_*
fields.

Each ZMM register consists of 8 qwords (u64). Current x86 hardware
supports up to 32 ZMM registers. Registers ZMM0-ZMM15 are assembled from
three parts: XMM (the lower 2 qwords), YMMH (the middle 2 qwords), and
ZMMH (the upper 4 qwords). The perf_simd_reg_value() function assembles
these three parts into a complete ZMM register for output to userspace.

Registers ZMM16-ZMM31 are each stored as a whole and can be output
directly to userspace.

Additionally, sample_simd_vec_reg_qwords should be set to 8 to indicate
ZMM sampling.
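
Not part of the patch: extending the earlier sketch, the full ZMM
dispatch can be illustrated with flat arrays standing in for the four
XSAVE components (XMM, YMMH, ZMM_Hi256, Hi16_ZMM). The helper name is
hypothetical; the index arithmetic mirrors perf_simd_reg_value():

```c
#include <stdint.h>

/* Return qword 'q' (0..7) of ZMM register 'idx' (0..31). ZMM16-31 live
 * whole in the Hi16_ZMM component; ZMM0-15 are split across XMM (qwords
 * 0-1), YMMH (qwords 2-3) and ZMM_Hi256 (qwords 4-7). */
static uint64_t zmm_qword(const uint64_t *xmm, const uint64_t *ymmh,
			  const uint64_t *zmmh, const uint64_t *h16zmm,
			  int idx, int q)
{
	if (idx >= 16)			/* ZMM16-31: stored as a whole */
		return h16zmm[(idx - 16) * 8 + q];
	if (q < 2)
		return xmm[idx * 2 + q];
	if (q < 4)
		return ymmh[idx * 2 + (q - 2)];
	return zmmh[idx * 4 + (q - 4)];	/* upper 4 qwords */
}
```

The patch additionally returns 0 whenever the needed component pointer is
NULL (i.e. that XSAVE feature was not captured); the sketch omits that.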

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/core.c                | 16 ++++++++++++++++
 arch/x86/events/perf_event.h          | 19 +++++++++++++++++++
 arch/x86/include/asm/perf_event.h     |  8 ++++++++
 arch/x86/include/uapi/asm/perf_regs.h | 11 +++++++++--
 arch/x86/kernel/perf_regs.c           | 15 ++++++++++++++-
 5 files changed, 66 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index b1e62c061d9e..d9c2cab5dcb9 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -426,6 +426,10 @@ static void x86_pmu_get_ext_regs(struct x86_perf_regs *perf_regs, u64 mask)
 
 	if (valid_mask & XFEATURE_MASK_YMM)
 		perf_regs->ymmh = get_xsave_addr(xsave, XFEATURE_YMM);
+	if (valid_mask & XFEATURE_MASK_ZMM_Hi256)
+		perf_regs->zmmh = get_xsave_addr(xsave, XFEATURE_ZMM_Hi256);
+	if (valid_mask & XFEATURE_MASK_Hi16_ZMM)
+		perf_regs->h16zmm = get_xsave_addr(xsave, XFEATURE_Hi16_ZMM);
 }
 
 static void release_ext_regs_buffers(void)
@@ -741,6 +745,12 @@ int x86_pmu_hw_config(struct perf_event *event)
 			if (event_needs_ymm(event) &&
 			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_YMM))
 				return -EINVAL;
+			if (event_needs_low16_zmm(event) &&
+			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_ZMM_Hi256))
+				return -EINVAL;
+			if (event_needs_high16_zmm(event) &&
+			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_Hi16_ZMM))
+				return -EINVAL;
 		}
 	}
 
@@ -1821,6 +1831,8 @@ inline void x86_pmu_clear_perf_regs(struct pt_regs *regs)
 
 	perf_regs->xmm_regs = NULL;
 	perf_regs->ymmh_regs = NULL;
+	perf_regs->zmmh_regs = NULL;
+	perf_regs->h16zmm_regs = NULL;
 }
 
 static void x86_pmu_setup_basic_regs_data(struct perf_event *event,
@@ -1892,6 +1904,10 @@ static void x86_pmu_sample_ext_regs(struct perf_event *event,
 		mask |= XFEATURE_MASK_SSE;
 	if (event_needs_ymm(event))
 		mask |= XFEATURE_MASK_YMM;
+	if (event_needs_low16_zmm(event))
+		mask |= XFEATURE_MASK_ZMM_Hi256;
+	if (event_needs_high16_zmm(event))
+		mask |= XFEATURE_MASK_Hi16_ZMM;
 
 	mask &= ~ignore_mask;
 	if (mask)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 3d4577a1bb7d..9a871809a4aa 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -154,6 +154,25 @@ static inline bool event_needs_ymm(struct perf_event *event)
 	return false;
 }
 
+static inline bool event_needs_low16_zmm(struct perf_event *event)
+{
+	if (event->attr.sample_simd_regs_enabled &&
+	    event->attr.sample_simd_vec_reg_qwords >= PERF_X86_ZMM_QWORDS)
+		return true;
+
+	return false;
+}
+
+static inline bool event_needs_high16_zmm(struct perf_event *event)
+{
+	if (event->attr.sample_simd_regs_enabled &&
+	    (fls64(event->attr.sample_simd_vec_reg_intr) > PERF_X86_H16ZMM_BASE ||
+	     fls64(event->attr.sample_simd_vec_reg_user) > PERF_X86_H16ZMM_BASE))
+		return true;
+
+	return false;
+}
+
 struct amd_nb {
 	int nb_id;  /* NorthBridge id */
 	int refcnt; /* reference count */
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 25f5ae60f72f..e4d9a8ba3e95 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -713,6 +713,14 @@ struct x86_perf_regs {
 		u64	*ymmh_regs;
 		struct ymmh_struct *ymmh;
 	};
+	union {
+		u64	*zmmh_regs;
+		struct avx_512_zmm_uppers_state *zmmh;
+	};
+	union {
+		u64	*h16zmm_regs;
+		struct avx_512_hi16_state *h16zmm;
+	};
 };
 
 extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index 4fd598785f6d..96db454c7923 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -58,22 +58,29 @@ enum perf_event_x86_regs {
 enum {
 	PERF_REG_X86_XMM,
 	PERF_REG_X86_YMM,
+	PERF_REG_X86_ZMM,
 	PERF_REG_X86_MAX_SIMD_REGS,
 };
 
 enum {
 	PERF_X86_SIMD_XMM_REGS      = 16,
 	PERF_X86_SIMD_YMM_REGS      = 16,
-	PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_YMM_REGS,
+	PERF_X86_SIMD_ZMMH_REGS     = 16,
+	PERF_X86_SIMD_ZMM_REGS      = 32,
+	PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_ZMM_REGS,
 };
 
 #define PERF_X86_SIMD_VEC_MASK		GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
 
+#define PERF_X86_H16ZMM_BASE		PERF_X86_SIMD_ZMMH_REGS
+
 enum {
 	PERF_X86_XMM_QWORDS      = 2,
 	PERF_X86_YMMH_QWORDS     = 2,
 	PERF_X86_YMM_QWORDS      = 4,
-	PERF_X86_SIMD_QWORDS_MAX = PERF_X86_YMM_QWORDS,
+	PERF_X86_ZMMH_QWORDS     = 4,
+	PERF_X86_ZMM_QWORDS      = 8,
+	PERF_X86_SIMD_QWORDS_MAX = PERF_X86_ZMM_QWORDS,
 };
 
 #endif /* _ASM_X86_PERF_REGS_H */
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 8aa61a18fd71..0a3ffaaea3aa 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -90,6 +90,13 @@ u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
 			 qwords_idx >= PERF_X86_SIMD_QWORDS_MAX))
 		return 0;
 
+	if (idx >= PERF_X86_H16ZMM_BASE) {
+		if (!perf_regs->h16zmm_regs)
+			return 0;
+		return perf_regs->h16zmm_regs[(idx - PERF_X86_H16ZMM_BASE) *
+					PERF_X86_ZMM_QWORDS + qwords_idx];
+	}
+
 	if (qwords_idx < PERF_X86_XMM_QWORDS) {
 		if (!perf_regs->xmm_regs)
 			return 0;
@@ -100,6 +107,11 @@ u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
 			return 0;
 		return perf_regs->ymmh_regs[idx * PERF_X86_YMMH_QWORDS +
 					    qwords_idx - PERF_X86_XMM_QWORDS];
+	} else if (qwords_idx < PERF_X86_ZMM_QWORDS) {
+		if (!perf_regs->zmmh_regs)
+			return 0;
+		return perf_regs->zmmh_regs[idx * PERF_X86_ZMMH_QWORDS +
+					    qwords_idx - PERF_X86_YMM_QWORDS];
 	}
 
 	return 0;
@@ -117,7 +129,8 @@ int perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
 			return -EINVAL;
 	} else {
 		if (vec_qwords != PERF_X86_XMM_QWORDS &&
-		    vec_qwords != PERF_X86_YMM_QWORDS)
+		    vec_qwords != PERF_X86_YMM_QWORDS &&
+		    vec_qwords != PERF_X86_ZMM_QWORDS)
 			return -EINVAL;
 		if (vec_mask & ~PERF_X86_SIMD_VEC_MASK)
 			return -EINVAL;
-- 
2.34.1



* [Patch v5 11/19] perf/x86: Enable OPMASK sampling using sample_simd_pred_reg_* fields
  2025-12-03  6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (9 preceding siblings ...)
  2025-12-03  6:54 ` [Patch v5 10/19] perf/x86: Enable ZMM " Dapeng Mi
@ 2025-12-03  6:54 ` Dapeng Mi
  2025-12-03  6:54 ` [Patch v5 12/19] perf/x86: Enable eGPRs sampling using sample_regs_* fields Dapeng Mi
                   ` (9 subsequent siblings)
  20 siblings, 0 replies; 86+ messages in thread
From: Dapeng Mi @ 2025-12-03  6:54 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

Add support for sampling OPMASK registers via the
sample_simd_pred_reg_* fields.

Each OPMASK register consists of 1 qword (u64). Current x86 hardware
supports 8 OPMASK registers (k0-k7). The perf_simd_reg_value() function
outputs the OPMASK values to userspace.

Additionally, sample_simd_pred_reg_qwords should be set to 1 to indicate
OPMASK sampling.
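
Not part of the patch: the OPMASK validation rules stated above (1 qword
per register, at most 8 registers selectable) can be sketched standalone.
The helper name is made up, and -1 stands in for -EINVAL:

```c
#include <stdint.h>

#define OPMASK_REGS	8	/* mirrors PERF_X86_SIMD_OPMASK_REGS */
#define OPMASK_QWORDS	1	/* mirrors PERF_X86_OPMASK_QWORDS */

/* Return 0 if the predicate-register request is valid, -1 otherwise. */
static int opmask_request_ok(uint32_t pred_mask, uint16_t pred_qwords)
{
	if (!pred_mask)
		return 0;		/* nothing requested */
	if (pred_qwords != OPMASK_QWORDS)
		return -1;		/* each k-register is one u64 */
	if (pred_mask >> OPMASK_REGS)
		return -1;		/* only k0-k7 exist */
	return 0;
}
```

So sample_simd_pred_reg_qwords = 1 with a mask no wider than 0xff is the
only accepted OPMASK configuration on current hardware.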

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/core.c                |  8 ++++++++
 arch/x86/events/perf_event.h          | 10 ++++++++++
 arch/x86/include/asm/perf_event.h     |  4 ++++
 arch/x86/include/uapi/asm/perf_regs.h |  8 ++++++++
 arch/x86/kernel/perf_regs.c           | 15 ++++++++++++---
 5 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index d9c2cab5dcb9..3a4144ee0b7b 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -430,6 +430,8 @@ static void x86_pmu_get_ext_regs(struct x86_perf_regs *perf_regs, u64 mask)
 		perf_regs->zmmh = get_xsave_addr(xsave, XFEATURE_ZMM_Hi256);
 	if (valid_mask & XFEATURE_MASK_Hi16_ZMM)
 		perf_regs->h16zmm = get_xsave_addr(xsave, XFEATURE_Hi16_ZMM);
+	if (valid_mask & XFEATURE_MASK_OPMASK)
+		perf_regs->opmask = get_xsave_addr(xsave, XFEATURE_OPMASK);
 }
 
 static void release_ext_regs_buffers(void)
@@ -751,6 +753,9 @@ int x86_pmu_hw_config(struct perf_event *event)
 			if (event_needs_high16_zmm(event) &&
 			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_Hi16_ZMM))
 				return -EINVAL;
+			if (event_needs_opmask(event) &&
+			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_OPMASK))
+				return -EINVAL;
 		}
 	}
 
@@ -1833,6 +1838,7 @@ inline void x86_pmu_clear_perf_regs(struct pt_regs *regs)
 	perf_regs->ymmh_regs = NULL;
 	perf_regs->zmmh_regs = NULL;
 	perf_regs->h16zmm_regs = NULL;
+	perf_regs->opmask_regs = NULL;
 }
 
 static void x86_pmu_setup_basic_regs_data(struct perf_event *event,
@@ -1908,6 +1914,8 @@ static void x86_pmu_sample_ext_regs(struct perf_event *event,
 		mask |= XFEATURE_MASK_ZMM_Hi256;
 	if (event_needs_high16_zmm(event))
 		mask |= XFEATURE_MASK_Hi16_ZMM;
+	if (event_needs_opmask(event))
+		mask |= XFEATURE_MASK_OPMASK;
 
 	mask &= ~ignore_mask;
 	if (mask)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 9a871809a4aa..7e081a392ff8 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -173,6 +173,16 @@ static inline bool event_needs_high16_zmm(struct perf_event *event)
 	return false;
 }
 
+static inline bool event_needs_opmask(struct perf_event *event)
+{
+	if (event->attr.sample_simd_regs_enabled &&
+	    (event->attr.sample_simd_pred_reg_intr ||
+	     event->attr.sample_simd_pred_reg_user))
+		return true;
+
+	return false;
+}
+
 struct amd_nb {
 	int nb_id;  /* NorthBridge id */
 	int refcnt; /* reference count */
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index e4d9a8ba3e95..caa6df8ac1cd 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -721,6 +721,10 @@ struct x86_perf_regs {
 		u64	*h16zmm_regs;
 		struct avx_512_hi16_state *h16zmm;
 	};
+	union {
+		u64	*opmask_regs;
+		struct avx_512_opmask_state *opmask;
+	};
 };
 
 extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index 96db454c7923..6f29fd9495a2 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -60,6 +60,9 @@ enum {
 	PERF_REG_X86_YMM,
 	PERF_REG_X86_ZMM,
 	PERF_REG_X86_MAX_SIMD_REGS,
+
+	PERF_REG_X86_OPMASK = 0,
+	PERF_REG_X86_MAX_PRED_REGS = 1,
 };
 
 enum {
@@ -68,13 +71,18 @@ enum {
 	PERF_X86_SIMD_ZMMH_REGS     = 16,
 	PERF_X86_SIMD_ZMM_REGS      = 32,
 	PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_ZMM_REGS,
+
+	PERF_X86_SIMD_OPMASK_REGS   = 8,
+	PERF_X86_SIMD_PRED_REGS_MAX = PERF_X86_SIMD_OPMASK_REGS,
 };
 
+#define PERF_X86_SIMD_PRED_MASK		GENMASK(PERF_X86_SIMD_PRED_REGS_MAX - 1, 0)
 #define PERF_X86_SIMD_VEC_MASK		GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
 
 #define PERF_X86_H16ZMM_BASE		PERF_X86_SIMD_ZMMH_REGS
 
 enum {
+	PERF_X86_OPMASK_QWORDS   = 1,
 	PERF_X86_XMM_QWORDS      = 2,
 	PERF_X86_YMMH_QWORDS     = 2,
 	PERF_X86_YMM_QWORDS      = 4,
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 0a3ffaaea3aa..1ca24e2a6aa0 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -83,8 +83,14 @@ u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
 	struct x86_perf_regs *perf_regs =
 			container_of(regs, struct x86_perf_regs, regs);
 
-	if (pred)
-		return 0;
+	if (pred) {
+		if (WARN_ON_ONCE(idx >= PERF_X86_SIMD_PRED_REGS_MAX ||
+				 qwords_idx >= PERF_X86_OPMASK_QWORDS))
+			return 0;
+		if (!perf_regs->opmask_regs)
+			return 0;
+		return perf_regs->opmask_regs[idx];
+	}
 
 	if (WARN_ON_ONCE(idx >= PERF_X86_SIMD_VEC_REGS_MAX ||
 			 qwords_idx >= PERF_X86_SIMD_QWORDS_MAX))
@@ -135,7 +141,10 @@ int perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
 		if (vec_mask & ~PERF_X86_SIMD_VEC_MASK)
 			return -EINVAL;
 	}
-	if (pred_mask)
+
+	if (pred_qwords != PERF_X86_OPMASK_QWORDS)
+		return -EINVAL;
+	if (pred_mask & ~PERF_X86_SIMD_PRED_MASK)
 		return -EINVAL;
 
 	return 0;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [Patch v5 12/19] perf/x86: Enable eGPRs sampling using sample_regs_* fields
  2025-12-03  6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (10 preceding siblings ...)
  2025-12-03  6:54 ` [Patch v5 11/19] perf/x86: Enable OPMASK sampling using sample_simd_pred_reg_* fields Dapeng Mi
@ 2025-12-03  6:54 ` Dapeng Mi
  2025-12-05 12:16   ` Peter Zijlstra
  2025-12-03  6:54 ` [Patch v5 13/19] perf/x86: Enable SSP " Dapeng Mi
                   ` (8 subsequent siblings)
  20 siblings, 1 reply; 86+ messages in thread
From: Dapeng Mi @ 2025-12-03  6:54 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

Enable sampling of the APX eGPRs (R16-R31) via the sample_regs_* fields.

To sample eGPRs, the sample_simd_regs_enabled field must be set. This
allows the spare space (reclaimed from the original XMM space) in the
sample_regs_* fields to be used for representing eGPRs.

The perf_reg_value() function needs to check if the
PERF_SAMPLE_REGS_ABI_SIMD flag is set first, and then determine whether
to output eGPRs or legacy XMM registers to userspace.

The perf_reg_validate() function is enhanced to validate the eGPRs bitmap
by adding a new argument, "simd_enabled".

Currently, eGPRs sampling is only supported on the x86_64 architecture, as
APX is only available on x86_64 platforms.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/arm/kernel/perf_regs.c           |  2 +-
 arch/arm64/kernel/perf_regs.c         |  2 +-
 arch/csky/kernel/perf_regs.c          |  2 +-
 arch/loongarch/kernel/perf_regs.c     |  2 +-
 arch/mips/kernel/perf_regs.c          |  2 +-
 arch/parisc/kernel/perf_regs.c        |  2 +-
 arch/powerpc/perf/perf_regs.c         |  2 +-
 arch/riscv/kernel/perf_regs.c         |  2 +-
 arch/s390/kernel/perf_regs.c          |  2 +-
 arch/x86/events/core.c                | 41 +++++++++++++++-------
 arch/x86/events/perf_event.h          | 10 ++++++
 arch/x86/include/asm/perf_event.h     |  4 +++
 arch/x86/include/uapi/asm/perf_regs.h | 25 ++++++++++++++
 arch/x86/kernel/perf_regs.c           | 49 +++++++++++++++------------
 include/linux/perf_regs.h             |  2 +-
 kernel/events/core.c                  |  8 +++--
 16 files changed, 110 insertions(+), 47 deletions(-)

diff --git a/arch/arm/kernel/perf_regs.c b/arch/arm/kernel/perf_regs.c
index d575a4c3ca56..838d701adf4d 100644
--- a/arch/arm/kernel/perf_regs.c
+++ b/arch/arm/kernel/perf_regs.c
@@ -18,7 +18,7 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 
 #define REG_RESERVED (~((1ULL << PERF_REG_ARM_MAX) - 1))
 
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
 {
 	if (!mask || mask & REG_RESERVED)
 		return -EINVAL;
diff --git a/arch/arm64/kernel/perf_regs.c b/arch/arm64/kernel/perf_regs.c
index 70e2f13f587f..71a3e0238de4 100644
--- a/arch/arm64/kernel/perf_regs.c
+++ b/arch/arm64/kernel/perf_regs.c
@@ -77,7 +77,7 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 
 #define REG_RESERVED (~((1ULL << PERF_REG_ARM64_MAX) - 1))
 
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
 {
 	u64 reserved_mask = REG_RESERVED;
 
diff --git a/arch/csky/kernel/perf_regs.c b/arch/csky/kernel/perf_regs.c
index 94601f37b596..c932a96afc56 100644
--- a/arch/csky/kernel/perf_regs.c
+++ b/arch/csky/kernel/perf_regs.c
@@ -18,7 +18,7 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 
 #define REG_RESERVED (~((1ULL << PERF_REG_CSKY_MAX) - 1))
 
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
 {
 	if (!mask || mask & REG_RESERVED)
 		return -EINVAL;
diff --git a/arch/loongarch/kernel/perf_regs.c b/arch/loongarch/kernel/perf_regs.c
index 8dd604f01745..164514f40ae0 100644
--- a/arch/loongarch/kernel/perf_regs.c
+++ b/arch/loongarch/kernel/perf_regs.c
@@ -25,7 +25,7 @@ u64 perf_reg_abi(struct task_struct *tsk)
 }
 #endif /* CONFIG_32BIT */
 
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
 {
 	if (!mask)
 		return -EINVAL;
diff --git a/arch/mips/kernel/perf_regs.c b/arch/mips/kernel/perf_regs.c
index 7736d3c5ebd2..00a5201dbd5d 100644
--- a/arch/mips/kernel/perf_regs.c
+++ b/arch/mips/kernel/perf_regs.c
@@ -28,7 +28,7 @@ u64 perf_reg_abi(struct task_struct *tsk)
 }
 #endif /* CONFIG_32BIT */
 
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
 {
 	if (!mask)
 		return -EINVAL;
diff --git a/arch/parisc/kernel/perf_regs.c b/arch/parisc/kernel/perf_regs.c
index 87e6990569a7..169c25c054b2 100644
--- a/arch/parisc/kernel/perf_regs.c
+++ b/arch/parisc/kernel/perf_regs.c
@@ -34,7 +34,7 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 
 #define REG_RESERVED (~((1ULL << PERF_REG_PARISC_MAX) - 1))
 
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
 {
 	if (!mask || mask & REG_RESERVED)
 		return -EINVAL;
diff --git a/arch/powerpc/perf/perf_regs.c b/arch/powerpc/perf/perf_regs.c
index 350dccb0143c..a01d8a903640 100644
--- a/arch/powerpc/perf/perf_regs.c
+++ b/arch/powerpc/perf/perf_regs.c
@@ -125,7 +125,7 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 	return regs_get_register(regs, pt_regs_offset[idx]);
 }
 
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
 {
 	if (!mask || mask & REG_RESERVED)
 		return -EINVAL;
diff --git a/arch/riscv/kernel/perf_regs.c b/arch/riscv/kernel/perf_regs.c
index 3bba8deababb..1ecc8760b88b 100644
--- a/arch/riscv/kernel/perf_regs.c
+++ b/arch/riscv/kernel/perf_regs.c
@@ -18,7 +18,7 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 
 #define REG_RESERVED (~((1ULL << PERF_REG_RISCV_MAX) - 1))
 
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
 {
 	if (!mask || mask & REG_RESERVED)
 		return -EINVAL;
diff --git a/arch/s390/kernel/perf_regs.c b/arch/s390/kernel/perf_regs.c
index a6b058ee4a36..c5ad9e2f489b 100644
--- a/arch/s390/kernel/perf_regs.c
+++ b/arch/s390/kernel/perf_regs.c
@@ -34,7 +34,7 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 
 #define REG_RESERVED (~((1UL << PERF_REG_S390_MAX) - 1))
 
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
 {
 	if (!mask || mask & REG_RESERVED)
 		return -EINVAL;
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 3a4144ee0b7b..ec0838469cae 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -432,6 +432,8 @@ static void x86_pmu_get_ext_regs(struct x86_perf_regs *perf_regs, u64 mask)
 		perf_regs->h16zmm = get_xsave_addr(xsave, XFEATURE_Hi16_ZMM);
 	if (valid_mask & XFEATURE_MASK_OPMASK)
 		perf_regs->opmask = get_xsave_addr(xsave, XFEATURE_OPMASK);
+	if (valid_mask & XFEATURE_MASK_APX)
+		perf_regs->egpr = get_xsave_addr(xsave, XFEATURE_APX);
 }
 
 static void release_ext_regs_buffers(void)
@@ -719,22 +721,21 @@ int x86_pmu_hw_config(struct perf_event *event)
 	}
 
 	if (event->attr.sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)) {
-		/*
-		 * Besides the general purpose registers, XMM registers may
-		 * be collected as well.
-		 */
-		if (event_has_extended_regs(event)) {
-			if (!(event->pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS))
-				return -EINVAL;
-			if (!event->attr.precise_ip)
-				return -EINVAL;
-			if (event->attr.sample_simd_regs_enabled)
-				return -EINVAL;
-		}
-
 		if (event_has_simd_regs(event)) {
+			u64 reserved = ~GENMASK_ULL(PERF_REG_MISC_MAX - 1, 0);
+
 			if (!(event->pmu->capabilities & PERF_PMU_CAP_SIMD_REGS))
 				return -EINVAL;
+			/*
+			 * The XMM space in the perf_event_x86_regs is reclaimed
+			 * for eGPRs and other general registers.
+			 */
+			if (event->attr.sample_regs_user & reserved ||
+			    event->attr.sample_regs_intr & reserved)
+				return -EINVAL;
+			if (event_needs_egprs(event) &&
+			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_APX))
+				return -EINVAL;
 			/* Not require any vector registers but set width */
 			if (event->attr.sample_simd_vec_reg_qwords &&
 			    !event->attr.sample_simd_vec_reg_intr &&
@@ -756,6 +757,17 @@ int x86_pmu_hw_config(struct perf_event *event)
 			if (event_needs_opmask(event) &&
 			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_OPMASK))
 				return -EINVAL;
+		} else {
+			/*
+			 * Besides the general purpose registers, XMM registers may
+			 * be collected as well.
+			 */
+			if (event_has_extended_regs(event)) {
+				if (!(event->pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS))
+					return -EINVAL;
+				if (!event->attr.precise_ip)
+					return -EINVAL;
+			}
 		}
 	}
 
@@ -1839,6 +1851,7 @@ inline void x86_pmu_clear_perf_regs(struct pt_regs *regs)
 	perf_regs->zmmh_regs = NULL;
 	perf_regs->h16zmm_regs = NULL;
 	perf_regs->opmask_regs = NULL;
+	perf_regs->egpr_regs = NULL;
 }
 
 static void x86_pmu_setup_basic_regs_data(struct perf_event *event,
@@ -1916,6 +1929,8 @@ static void x86_pmu_sample_ext_regs(struct perf_event *event,
 		mask |= XFEATURE_MASK_Hi16_ZMM;
 	if (event_needs_opmask(event))
 		mask |= XFEATURE_MASK_OPMASK;
+	if (event_needs_egprs(event))
+		mask |= XFEATURE_MASK_APX;
 
 	mask &= ~ignore_mask;
 	if (mask)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 7e081a392ff8..9fb1cbbc1b76 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -183,6 +183,16 @@ static inline bool event_needs_opmask(struct perf_event *event)
 	return false;
 }
 
+static inline bool event_needs_egprs(struct perf_event *event)
+{
+	if (event->attr.sample_simd_regs_enabled &&
+	    (event->attr.sample_regs_user & PERF_X86_EGPRS_MASK ||
+	     event->attr.sample_regs_intr & PERF_X86_EGPRS_MASK))
+		return true;
+
+	return false;
+}
+
 struct amd_nb {
 	int nb_id;  /* NorthBridge id */
 	int refcnt; /* reference count */
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index caa6df8ac1cd..ca242db3720f 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -725,6 +725,10 @@ struct x86_perf_regs {
 		u64	*opmask_regs;
 		struct avx_512_opmask_state *opmask;
 	};
+	union {
+		u64	*egpr_regs;
+		struct apx_state *egpr;
+	};
 };
 
 extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index 6f29fd9495a2..f145e3b78426 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -27,9 +27,33 @@ enum perf_event_x86_regs {
 	PERF_REG_X86_R13,
 	PERF_REG_X86_R14,
 	PERF_REG_X86_R15,
+	/*
+	 * The EGPRs and XMM have overlaps. Only one can be used
+	 * at a time. For the ABI type PERF_SAMPLE_REGS_ABI_SIMD,
+	 * utilize EGPRs. For the other ABI type, XMM is used.
+	 *
+	 * Extended GPRs (EGPRs)
+	 */
+	PERF_REG_X86_R16,
+	PERF_REG_X86_R17,
+	PERF_REG_X86_R18,
+	PERF_REG_X86_R19,
+	PERF_REG_X86_R20,
+	PERF_REG_X86_R21,
+	PERF_REG_X86_R22,
+	PERF_REG_X86_R23,
+	PERF_REG_X86_R24,
+	PERF_REG_X86_R25,
+	PERF_REG_X86_R26,
+	PERF_REG_X86_R27,
+	PERF_REG_X86_R28,
+	PERF_REG_X86_R29,
+	PERF_REG_X86_R30,
+	PERF_REG_X86_R31,
 	/* These are the limits for the GPRs. */
 	PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
 	PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1,
+	PERF_REG_MISC_MAX = PERF_REG_X86_R31 + 1,
 
 	/* These all need two bits set because they are 128bit */
 	PERF_REG_X86_XMM0  = 32,
@@ -54,6 +78,7 @@ enum perf_event_x86_regs {
 };
 
 #define PERF_REG_EXTENDED_MASK	(~((1ULL << PERF_REG_X86_XMM0) - 1))
+#define PERF_X86_EGPRS_MASK	GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16)
 
 enum {
 	PERF_REG_X86_XMM,
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 1ca24e2a6aa0..e76de39e1385 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -61,14 +61,22 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 {
 	struct x86_perf_regs *perf_regs;
 
-	if (idx >= PERF_REG_X86_XMM0 && idx < PERF_REG_X86_XMM_MAX) {
+	if (idx > PERF_REG_X86_R15) {
 		perf_regs = container_of(regs, struct x86_perf_regs, regs);
-		/* SIMD registers are moved to dedicated sample_simd_vec_reg */
-		if (perf_regs->abi & PERF_SAMPLE_REGS_ABI_SIMD)
-			return 0;
-		if (!perf_regs->xmm_regs)
-			return 0;
-		return perf_regs->xmm_regs[idx - PERF_REG_X86_XMM0];
+
+		if (perf_regs->abi & PERF_SAMPLE_REGS_ABI_SIMD) {
+			if (idx <= PERF_REG_X86_R31) {
+				if (!perf_regs->egpr_regs)
+					return 0;
+				return perf_regs->egpr_regs[idx - PERF_REG_X86_R16];
+			}
+		} else {
+			if (idx >= PERF_REG_X86_XMM0 && idx < PERF_REG_X86_XMM_MAX) {
+				if (!perf_regs->xmm_regs)
+					return 0;
+				return perf_regs->xmm_regs[idx - PERF_REG_X86_XMM0];
+			}
+		}
 	}
 
 	if (WARN_ON_ONCE(idx >= ARRAY_SIZE(pt_regs_offset)))
@@ -150,20 +158,14 @@ int perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
 	return 0;
 }
 
-#define PERF_REG_X86_RESERVED	(((1ULL << PERF_REG_X86_XMM0) - 1) & \
-				 ~((1ULL << PERF_REG_X86_MAX) - 1))
+#define PERF_REG_X86_RESERVED	(GENMASK_ULL(PERF_REG_X86_XMM0 - 1, PERF_REG_X86_AX) & \
+				 ~GENMASK_ULL(PERF_REG_X86_R15, PERF_REG_X86_AX))
+#define PERF_REG_X86_EXT_RESERVED	(~GENMASK_ULL(PERF_REG_MISC_MAX - 1, PERF_REG_X86_AX))
 
 #ifdef CONFIG_X86_32
-#define REG_NOSUPPORT ((1ULL << PERF_REG_X86_R8) | \
-		       (1ULL << PERF_REG_X86_R9) | \
-		       (1ULL << PERF_REG_X86_R10) | \
-		       (1ULL << PERF_REG_X86_R11) | \
-		       (1ULL << PERF_REG_X86_R12) | \
-		       (1ULL << PERF_REG_X86_R13) | \
-		       (1ULL << PERF_REG_X86_R14) | \
-		       (1ULL << PERF_REG_X86_R15))
-
-int perf_reg_validate(u64 mask)
+#define REG_NOSUPPORT GENMASK_ULL(PERF_REG_X86_R15, PERF_REG_X86_R8)
+
+int perf_reg_validate(u64 mask, bool simd_enabled)
 {
 	if (!mask || (mask & (REG_NOSUPPORT | PERF_REG_X86_RESERVED)))
 		return -EINVAL;
@@ -182,10 +184,15 @@ u64 perf_reg_abi(struct task_struct *task)
 		       (1ULL << PERF_REG_X86_FS) | \
 		       (1ULL << PERF_REG_X86_GS))
 
-int perf_reg_validate(u64 mask)
+int perf_reg_validate(u64 mask, bool simd_enabled)
 {
 	/* The mask could be 0 if only the SIMD registers are interested */
-	if (mask & (REG_NOSUPPORT | PERF_REG_X86_RESERVED))
+	if (!simd_enabled &&
+	    (mask & (REG_NOSUPPORT | PERF_REG_X86_RESERVED)))
+		return -EINVAL;
+
+	if (simd_enabled &&
+	    (mask & (REG_NOSUPPORT | PERF_REG_X86_EXT_RESERVED)))
 		return -EINVAL;
 
 	return 0;
diff --git a/include/linux/perf_regs.h b/include/linux/perf_regs.h
index 518f28c6a7d4..09dbc2fc3859 100644
--- a/include/linux/perf_regs.h
+++ b/include/linux/perf_regs.h
@@ -10,7 +10,7 @@ struct perf_regs {
 };
 
 u64 perf_reg_value(struct pt_regs *regs, int idx);
-int perf_reg_validate(u64 mask);
+int perf_reg_validate(u64 mask, bool simd_enabled);
 u64 perf_reg_abi(struct task_struct *task);
 void perf_get_regs_user(struct perf_regs *regs_user,
 			struct pt_regs *regs);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index b19de038979e..428ff39e03c5 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7436,7 +7436,7 @@ u64 __weak perf_reg_value(struct pt_regs *regs, int idx)
 	return 0;
 }
 
-int __weak perf_reg_validate(u64 mask)
+int __weak perf_reg_validate(u64 mask, bool simd_enabled)
 {
 	return mask ? -ENOSYS : 0;
 }
@@ -13310,7 +13310,8 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
 	}
 
 	if (attr->sample_type & PERF_SAMPLE_REGS_USER) {
-		ret = perf_reg_validate(attr->sample_regs_user);
+		ret = perf_reg_validate(attr->sample_regs_user,
+					attr->sample_simd_regs_enabled);
 		if (ret)
 			return ret;
 		ret = perf_simd_reg_validate(attr->sample_simd_vec_reg_qwords,
@@ -13340,7 +13341,8 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
 		attr->sample_max_stack = sysctl_perf_event_max_stack;
 
 	if (attr->sample_type & PERF_SAMPLE_REGS_INTR) {
-		ret = perf_reg_validate(attr->sample_regs_intr);
+		ret = perf_reg_validate(attr->sample_regs_intr,
+					attr->sample_simd_regs_enabled);
 		if (ret)
 			return ret;
 		ret = perf_simd_reg_validate(attr->sample_simd_vec_reg_qwords,
-- 
2.34.1



* [Patch v5 13/19] perf/x86: Enable SSP sampling using sample_regs_* fields
  2025-12-03  6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (11 preceding siblings ...)
  2025-12-03  6:54 ` [Patch v5 12/19] perf/x86: Enable eGPRs sampling using sample_regs_* fields Dapeng Mi
@ 2025-12-03  6:54 ` Dapeng Mi
  2025-12-05 12:20   ` Peter Zijlstra
  2025-12-24  5:45   ` Ravi Bangoria
  2025-12-03  6:54 ` [Patch v5 14/19] perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS capability Dapeng Mi
                   ` (7 subsequent siblings)
  20 siblings, 2 replies; 86+ messages in thread
From: Dapeng Mi @ 2025-12-03  6:54 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

Enable sampling of the CET SSP register via the sample_regs_* fields.

To sample SSP, the sample_simd_regs_enabled field must be set. This
allows the spare space (reclaimed from the original XMM space) in the
sample_regs_* fields to be used for representing SSP.

As with eGPRs sampling, the perf_reg_value() function must first check
whether the PERF_SAMPLE_REGS_ABI_SIMD flag is set, and then decide
whether to output the SSP or the legacy XMM registers to userspace.

Additionally, arch-PEBS supports sampling SSP, which is placed into the
GPRs group. This patch also enables arch-PEBS-based SSP sampling.

Currently, SSP sampling is only supported on the x86_64 architecture, as
CET is only available on x86_64 platforms.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/core.c                |  9 +++++++++
 arch/x86/events/intel/ds.c            |  3 +++
 arch/x86/events/perf_event.h          | 10 ++++++++++
 arch/x86/include/asm/perf_event.h     |  4 ++++
 arch/x86/include/uapi/asm/perf_regs.h |  7 ++++---
 arch/x86/kernel/perf_regs.c           |  5 +++++
 6 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index ec0838469cae..b6030dae561d 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -434,6 +434,8 @@ static void x86_pmu_get_ext_regs(struct x86_perf_regs *perf_regs, u64 mask)
 		perf_regs->opmask = get_xsave_addr(xsave, XFEATURE_OPMASK);
 	if (valid_mask & XFEATURE_MASK_APX)
 		perf_regs->egpr = get_xsave_addr(xsave, XFEATURE_APX);
+	if (valid_mask & XFEATURE_MASK_CET_USER)
+		perf_regs->cet = get_xsave_addr(xsave, XFEATURE_CET_USER);
 }
 
 static void release_ext_regs_buffers(void)
@@ -736,6 +738,10 @@ int x86_pmu_hw_config(struct perf_event *event)
 			if (event_needs_egprs(event) &&
 			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_APX))
 				return -EINVAL;
+			if (event_needs_ssp(event) &&
+			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_CET_USER))
+				return -EINVAL;
+
 			/* Not require any vector registers but set width */
 			if (event->attr.sample_simd_vec_reg_qwords &&
 			    !event->attr.sample_simd_vec_reg_intr &&
@@ -1852,6 +1858,7 @@ inline void x86_pmu_clear_perf_regs(struct pt_regs *regs)
 	perf_regs->h16zmm_regs = NULL;
 	perf_regs->opmask_regs = NULL;
 	perf_regs->egpr_regs = NULL;
+	perf_regs->cet_regs = NULL;
 }
 
 static void x86_pmu_setup_basic_regs_data(struct perf_event *event,
@@ -1931,6 +1938,8 @@ static void x86_pmu_sample_ext_regs(struct perf_event *event,
 		mask |= XFEATURE_MASK_OPMASK;
 	if (event_needs_egprs(event))
 		mask |= XFEATURE_MASK_APX;
+	if (event_needs_ssp(event))
+		mask |= XFEATURE_MASK_CET_USER;
 
 	mask &= ~ignore_mask;
 	if (mask)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 79cba323eeb1..3212259d1a16 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -2409,12 +2409,15 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
 	}
 
 	if (header->gpr) {
+		ignore_mask = XFEATURE_MASK_CET_USER;
+
 		gprs = next_record;
 		next_record = gprs + 1;
 
 		__setup_pebs_gpr_group(event, data, regs,
 				       (struct pebs_gprs *)gprs,
 				       sample_type);
+		perf_regs->cet_regs = &gprs->r15;
 	}
 
 	if (header->aux) {
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 9fb1cbbc1b76..35a1837d0b77 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -193,6 +193,16 @@ static inline bool event_needs_egprs(struct perf_event *event)
 	return false;
 }
 
+static inline bool event_needs_ssp(struct perf_event *event)
+{
+	if (event->attr.sample_simd_regs_enabled &&
+	    (event->attr.sample_regs_user & BIT_ULL(PERF_REG_X86_SSP) ||
+	     event->attr.sample_regs_intr & BIT_ULL(PERF_REG_X86_SSP)))
+		return true;
+
+	return false;
+}
+
 struct amd_nb {
 	int nb_id;  /* NorthBridge id */
 	int refcnt; /* reference count */
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index ca242db3720f..c925af4160ad 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -729,6 +729,10 @@ struct x86_perf_regs {
 		u64	*egpr_regs;
 		struct apx_state *egpr;
 	};
+	union {
+		u64	*cet_regs;
+		struct cet_user_state *cet;
+	};
 };
 
 extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index f145e3b78426..f3561ed10041 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -28,9 +28,9 @@ enum perf_event_x86_regs {
 	PERF_REG_X86_R14,
 	PERF_REG_X86_R15,
 	/*
-	 * The EGPRs and XMM have overlaps. Only one can be used
+	 * The EGPRs/SSP and XMM have overlaps. Only one can be used
 	 * at a time. For the ABI type PERF_SAMPLE_REGS_ABI_SIMD,
-	 * utilize EGPRs. For the other ABI type, XMM is used.
+	 * utilize EGPRs/SSP. For the other ABI type, XMM is used.
 	 *
 	 * Extended GPRs (EGPRs)
 	 */
@@ -50,10 +50,11 @@ enum perf_event_x86_regs {
 	PERF_REG_X86_R29,
 	PERF_REG_X86_R30,
 	PERF_REG_X86_R31,
+	PERF_REG_X86_SSP,
 	/* These are the limits for the GPRs. */
 	PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
 	PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1,
-	PERF_REG_MISC_MAX = PERF_REG_X86_R31 + 1,
+	PERF_REG_MISC_MAX = PERF_REG_X86_SSP + 1,
 
 	/* These all need two bits set because they are 128bit */
 	PERF_REG_X86_XMM0  = 32,
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index e76de39e1385..518bbe577ee8 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -70,6 +70,11 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 					return 0;
 				return perf_regs->egpr_regs[idx - PERF_REG_X86_R16];
 			}
+			if (idx == PERF_REG_X86_SSP) {
+				if (!perf_regs->cet)
+					return 0;
+				return perf_regs->cet->user_ssp;
+			}
 		} else {
 			if (idx >= PERF_REG_X86_XMM0 && idx < PERF_REG_X86_XMM_MAX) {
 				if (!perf_regs->xmm_regs)
-- 
2.34.1



* [Patch v5 14/19] perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS capability
  2025-12-03  6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (12 preceding siblings ...)
  2025-12-03  6:54 ` [Patch v5 13/19] perf/x86: Enable SSP " Dapeng Mi
@ 2025-12-03  6:54 ` Dapeng Mi
  2025-12-03  6:54 ` [Patch v5 15/19] perf/x86/intel: Enable arch-PEBS based SIMD/eGPRs/SSP sampling Dapeng Mi
                   ` (6 subsequent siblings)
  20 siblings, 0 replies; 86+ messages in thread
From: Dapeng Mi @ 2025-12-03  6:54 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

Enable the PERF_PMU_CAP_SIMD_REGS capability if XSAVES support is
available for YMM, ZMM, OPMASK, eGPRs, or SSP.

Temporarily disable large PEBS sampling for these registers, as the
current arch-PEBS sampling code does not support them yet. Large PEBS
sampling for these registers will be enabled in subsequent patches.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/intel/core.c | 50 +++++++++++++++++++++++++++++++++---
 1 file changed, 46 insertions(+), 4 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index b5c89e8eabb2..d8cc7abfcdc6 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4160,10 +4160,32 @@ static unsigned long intel_pmu_large_pebs_flags(struct perf_event *event)
 		flags &= ~PERF_SAMPLE_TIME;
 	if (!event->attr.exclude_kernel)
 		flags &= ~PERF_SAMPLE_REGS_USER;
-	if (event->attr.sample_regs_user & ~PEBS_GP_REGS)
-		flags &= ~PERF_SAMPLE_REGS_USER;
-	if (event->attr.sample_regs_intr & ~PEBS_GP_REGS)
-		flags &= ~PERF_SAMPLE_REGS_INTR;
+	if (event->attr.sample_simd_regs_enabled) {
+		u64 nolarge = PERF_X86_EGPRS_MASK | BIT_ULL(PERF_REG_X86_SSP);
+
+		/*
+		 * PEBS HW can only collect the XMM0-XMM15 for now.
+		 * Disable large PEBS for other vector registers, predicate
+		 * registers, eGPRs, and SSP.
+		 */
+		if (event->attr.sample_regs_user & nolarge ||
+		    fls64(event->attr.sample_simd_vec_reg_user) > PERF_X86_H16ZMM_BASE ||
+		    event->attr.sample_simd_pred_reg_user)
+			flags &= ~PERF_SAMPLE_REGS_USER;
+
+		if (event->attr.sample_regs_intr & nolarge ||
+		    fls64(event->attr.sample_simd_vec_reg_intr) > PERF_X86_H16ZMM_BASE ||
+		    event->attr.sample_simd_pred_reg_intr)
+			flags &= ~PERF_SAMPLE_REGS_INTR;
+
+		if (event->attr.sample_simd_vec_reg_qwords > PERF_X86_XMM_QWORDS)
+			flags &= ~(PERF_SAMPLE_REGS_USER | PERF_SAMPLE_REGS_INTR);
+	} else {
+		if (event->attr.sample_regs_user & ~PEBS_GP_REGS)
+			flags &= ~PERF_SAMPLE_REGS_USER;
+		if (event->attr.sample_regs_intr & ~PEBS_GP_REGS)
+			flags &= ~PERF_SAMPLE_REGS_INTR;
+	}
 	return flags;
 }
 
@@ -5643,6 +5665,26 @@ static void intel_extended_regs_init(struct pmu *pmu)
 
 	x86_pmu.ext_regs_mask |= XFEATURE_MASK_SSE;
 	x86_get_pmu(smp_processor_id())->capabilities |= PERF_PMU_CAP_EXTENDED_REGS;
+
+	if (boot_cpu_has(X86_FEATURE_AVX) &&
+	    cpu_has_xfeatures(XFEATURE_MASK_YMM, NULL))
+		x86_pmu.ext_regs_mask |= XFEATURE_MASK_YMM;
+	if (boot_cpu_has(X86_FEATURE_APX) &&
+	    cpu_has_xfeatures(XFEATURE_MASK_APX, NULL))
+		x86_pmu.ext_regs_mask |= XFEATURE_MASK_APX;
+	if (boot_cpu_has(X86_FEATURE_AVX512F)) {
+		if (cpu_has_xfeatures(XFEATURE_MASK_OPMASK, NULL))
+			x86_pmu.ext_regs_mask |= XFEATURE_MASK_OPMASK;
+		if (cpu_has_xfeatures(XFEATURE_MASK_ZMM_Hi256, NULL))
+			x86_pmu.ext_regs_mask |= XFEATURE_MASK_ZMM_Hi256;
+		if (cpu_has_xfeatures(XFEATURE_MASK_Hi16_ZMM, NULL))
+			x86_pmu.ext_regs_mask |= XFEATURE_MASK_Hi16_ZMM;
+	}
+	if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		x86_pmu.ext_regs_mask |= XFEATURE_MASK_CET_USER;
+
+	if (x86_pmu.ext_regs_mask != XFEATURE_MASK_SSE)
+		x86_get_pmu(smp_processor_id())->capabilities |= PERF_PMU_CAP_SIMD_REGS;
 }
 
 #define counter_mask(_gp, _fixed) ((_gp) | ((u64)(_fixed) << INTEL_PMC_IDX_FIXED))
-- 
2.34.1



* [Patch v5 15/19] perf/x86/intel: Enable arch-PEBS based SIMD/eGPRs/SSP sampling
  2025-12-03  6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (13 preceding siblings ...)
  2025-12-03  6:54 ` [Patch v5 14/19] perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS capability Dapeng Mi
@ 2025-12-03  6:54 ` Dapeng Mi
  2025-12-03  6:54 ` [Patch v5 16/19] perf/x86: Activate back-to-back NMI detection for arch-PEBS induced NMIs Dapeng Mi
                   ` (5 subsequent siblings)
  20 siblings, 0 replies; 86+ messages in thread
From: Dapeng Mi @ 2025-12-03  6:54 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Dapeng Mi

Enable arch-PEBS based SIMD/eGPRs/SSP register sampling.

Arch-PEBS supports sampling of these registers, with all except SSP
placed into the XSAVE-Enabled Registers (XER) group with the layout
described below.

Field Name 	Registers Used 			Size
----------------------------------------------------------------------
XSTATE_BV	XINUSE for groups		8 B
----------------------------------------------------------------------
Reserved 	Reserved 			8 B
----------------------------------------------------------------------
SSER 		XMM0-XMM15 			16 regs * 16 B = 256 B
----------------------------------------------------------------------
YMMHIR 		Upper 128 bits of YMM0-YMM15 	16 regs * 16 B = 256 B
----------------------------------------------------------------------
EGPR 		R16-R31 			16 regs * 8 B = 128 B
----------------------------------------------------------------------
OPMASKR 	K0-K7 				8 regs * 8 B = 64 B
----------------------------------------------------------------------
ZMMHIR 		Upper 256 bits of ZMM0-ZMM15 	16 regs * 32 B = 512 B
----------------------------------------------------------------------
Hi16ZMMR 	ZMM16-ZMM31 			16 regs * 64 B = 1024 B
----------------------------------------------------------------------

Memory space in the output buffer is allocated for these sub-groups
whenever the corresponding Format.XER[55:49] bits in the PEBS record
header are set. However, the arch-PEBS hardware engine does not write a
sub-group that is unused (in INIT state); in that case the corresponding
bit in the XSTATE_BV bitmap is cleared. The XSTATE_BV field is therefore
checked for each PEBS record to determine whether the register data was
actually written. If not, the register data is not output to userspace.

The SSP register is sampled and placed into the GPRs group by arch-PEBS.

Additionally, bits [55:49] of the IA32_PMC_{GPn|FXm}_CFG_C MSRs select
which of these register groups are sampled.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/intel/core.c      | 71 +++++++++++++++++++++--------
 arch/x86/events/intel/ds.c        | 76 ++++++++++++++++++++++++++++---
 arch/x86/include/asm/msr-index.h  |  7 +++
 arch/x86/include/asm/perf_event.h |  8 +++-
 4 files changed, 137 insertions(+), 25 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index d8cc7abfcdc6..da48bcde8fce 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3008,6 +3008,21 @@ static void intel_pmu_enable_event_ext(struct perf_event *event)
 			if (pebs_data_cfg & PEBS_DATACFG_XMMS)
 				ext |= ARCH_PEBS_VECR_XMM & cap.caps;
 
+			if (pebs_data_cfg & PEBS_DATACFG_YMMHS)
+				ext |= ARCH_PEBS_VECR_YMMH & cap.caps;
+
+			if (pebs_data_cfg & PEBS_DATACFG_EGPRS)
+				ext |= ARCH_PEBS_VECR_EGPRS & cap.caps;
+
+			if (pebs_data_cfg & PEBS_DATACFG_OPMASKS)
+				ext |= ARCH_PEBS_VECR_OPMASK & cap.caps;
+
+			if (pebs_data_cfg & PEBS_DATACFG_ZMMHS)
+				ext |= ARCH_PEBS_VECR_ZMMH & cap.caps;
+
+			if (pebs_data_cfg & PEBS_DATACFG_H16ZMMS)
+				ext |= ARCH_PEBS_VECR_H16ZMM & cap.caps;
+
 			if (pebs_data_cfg & PEBS_DATACFG_LBRS)
 				ext |= ARCH_PEBS_LBR & cap.caps;
 
@@ -4152,6 +4167,30 @@ static void intel_pebs_aliases_skl(struct perf_event *event)
 	return intel_pebs_aliases_precdist(event);
 }
 
+static inline bool intel_pebs_support_regs(struct perf_event *event, u64 regs)
+{
+	struct arch_pebs_cap cap = hybrid(event->pmu, arch_pebs_cap);
+	bool supported = true;
+
+	/* SSP */
+	if (regs & PEBS_DATACFG_GP)
+		supported &= x86_pmu.arch_pebs && (ARCH_PEBS_GPR & cap.caps);
+	if (regs & PEBS_DATACFG_XMMS)
+		supported &= x86_pmu.intel_cap.pebs_format > 3;
+	if (regs & PEBS_DATACFG_YMMHS)
+		supported &= x86_pmu.arch_pebs && (ARCH_PEBS_VECR_YMMH & cap.caps);
+	if (regs & PEBS_DATACFG_EGPRS)
+		supported &= x86_pmu.arch_pebs && (ARCH_PEBS_VECR_EGPRS & cap.caps);
+	if (regs & PEBS_DATACFG_OPMASKS)
+		supported &= x86_pmu.arch_pebs && (ARCH_PEBS_VECR_OPMASK & cap.caps);
+	if (regs & PEBS_DATACFG_ZMMHS)
+		supported &= x86_pmu.arch_pebs && (ARCH_PEBS_VECR_ZMMH & cap.caps);
+	if (regs & PEBS_DATACFG_H16ZMMS)
+		supported &= x86_pmu.arch_pebs && (ARCH_PEBS_VECR_H16ZMM & cap.caps);
+
+	return supported;
+}
+
 static unsigned long intel_pmu_large_pebs_flags(struct perf_event *event)
 {
 	unsigned long flags = x86_pmu.large_pebs_flags;
@@ -4161,24 +4200,20 @@ static unsigned long intel_pmu_large_pebs_flags(struct perf_event *event)
 	if (!event->attr.exclude_kernel)
 		flags &= ~PERF_SAMPLE_REGS_USER;
 	if (event->attr.sample_simd_regs_enabled) {
-		u64 nolarge = PERF_X86_EGPRS_MASK | BIT_ULL(PERF_REG_X86_SSP);
-
-		/*
-		 * PEBS HW can only collect the XMM0-XMM15 for now.
-		 * Disable large PEBS for other vector registers, predicate
-		 * registers, eGPRs, and SSP.
-		 */
-		if (event->attr.sample_regs_user & nolarge ||
-		    fls64(event->attr.sample_simd_vec_reg_user) > PERF_X86_H16ZMM_BASE ||
-		    event->attr.sample_simd_pred_reg_user)
-			flags &= ~PERF_SAMPLE_REGS_USER;
-
-		if (event->attr.sample_regs_intr & nolarge ||
-		    fls64(event->attr.sample_simd_vec_reg_intr) > PERF_X86_H16ZMM_BASE ||
-		    event->attr.sample_simd_pred_reg_intr)
-			flags &= ~PERF_SAMPLE_REGS_INTR;
-
-		if (event->attr.sample_simd_vec_reg_qwords > PERF_X86_XMM_QWORDS)
+		if ((event_needs_ssp(event) &&
+		     !intel_pebs_support_regs(event, PEBS_DATACFG_GP)) ||
+		    (event_needs_xmm(event) &&
+		     !intel_pebs_support_regs(event, PEBS_DATACFG_XMMS)) ||
+		    (event_needs_ymm(event) &&
+		     !intel_pebs_support_regs(event, PEBS_DATACFG_YMMHS)) ||
+		    (event_needs_egprs(event) &&
+		     !intel_pebs_support_regs(event, PEBS_DATACFG_EGPRS)) ||
+		    (event_needs_opmask(event) &&
+		     !intel_pebs_support_regs(event, PEBS_DATACFG_OPMASKS)) ||
+		    (event_needs_low16_zmm(event) &&
+		     !intel_pebs_support_regs(event, PEBS_DATACFG_ZMMHS)) ||
+		    (event_needs_high16_zmm(event) &&
+		     !intel_pebs_support_regs(event, PEBS_DATACFG_H16ZMMS)))
 			flags &= ~(PERF_SAMPLE_REGS_USER | PERF_SAMPLE_REGS_INTR);
 	} else {
 		if (event->attr.sample_regs_user & ~PEBS_GP_REGS)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 3212259d1a16..a01c72c03bd6 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1470,11 +1470,21 @@ static u64 pebs_update_adaptive_cfg(struct perf_event *event)
 		     ((attr->config & INTEL_ARCH_EVENT_MASK) ==
 		      x86_pmu.rtm_abort_event);
 
-	if (gprs || (attr->precise_ip < 2) || tsx_weight)
+	if (gprs || (attr->precise_ip < 2) || tsx_weight || event_needs_ssp(event))
 		pebs_data_cfg |= PEBS_DATACFG_GP;
 
 	if (event_needs_xmm(event))
 		pebs_data_cfg |= PEBS_DATACFG_XMMS;
+	if (event_needs_ymm(event))
+		pebs_data_cfg |= PEBS_DATACFG_YMMHS;
+	if (event_needs_low16_zmm(event))
+		pebs_data_cfg |= PEBS_DATACFG_ZMMHS;
+	if (event_needs_high16_zmm(event))
+		pebs_data_cfg |= PEBS_DATACFG_H16ZMMS;
+	if (event_needs_opmask(event))
+		pebs_data_cfg |= PEBS_DATACFG_OPMASKS;
+	if (event_needs_egprs(event))
+		pebs_data_cfg |= PEBS_DATACFG_EGPRS;
 
 	if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
 		/*
@@ -2430,15 +2440,69 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
 					   meminfo->tsx_tuning, ax);
 	}
 
-	if (header->xmm) {
+	if (header->xmm || header->ymmh || header->egpr ||
+	    header->opmask || header->zmmh || header->h16zmm) {
+		struct arch_pebs_xer_header *xer_header = next_record;
 		struct pebs_xmm *xmm;
+		struct ymmh_struct *ymmh;
+		struct avx_512_zmm_uppers_state *zmmh;
+		struct avx_512_hi16_state *h16zmm;
+		struct avx_512_opmask_state *opmask;
+		struct apx_state *egpr;
 
 		next_record += sizeof(struct arch_pebs_xer_header);
 
-		ignore_mask |= XFEATURE_MASK_SSE;
-		xmm = next_record;
-		perf_regs->xmm_regs = xmm->xmm;
-		next_record = xmm + 1;
+		if (header->xmm) {
+			ignore_mask |= XFEATURE_MASK_SSE;
+			xmm = next_record;
+			/*
+			 * Only output XMM regs to user space when arch-PEBS
+			 * really writes data into xstate area.
+			 */
+			if (xer_header->xstate & XFEATURE_MASK_SSE)
+				perf_regs->xmm_regs = xmm->xmm;
+			next_record = xmm + 1;
+		}
+
+		if (header->ymmh) {
+			ignore_mask |= XFEATURE_MASK_YMM;
+			ymmh = next_record;
+			if (xer_header->xstate & XFEATURE_MASK_YMM)
+				perf_regs->ymmh = ymmh;
+			next_record = ymmh + 1;
+		}
+
+		if (header->egpr) {
+			ignore_mask |= XFEATURE_MASK_APX;
+			egpr = next_record;
+			if (xer_header->xstate & XFEATURE_MASK_APX)
+				perf_regs->egpr = egpr;
+			next_record = egpr + 1;
+		}
+
+		if (header->opmask) {
+			ignore_mask |= XFEATURE_MASK_OPMASK;
+			opmask = next_record;
+			if (xer_header->xstate & XFEATURE_MASK_OPMASK)
+				perf_regs->opmask = opmask;
+			next_record = opmask + 1;
+		}
+
+		if (header->zmmh) {
+			ignore_mask |= XFEATURE_MASK_ZMM_Hi256;
+			zmmh = next_record;
+			if (xer_header->xstate & XFEATURE_MASK_ZMM_Hi256)
+				perf_regs->zmmh = zmmh;
+			next_record = zmmh + 1;
+		}
+
+		if (header->h16zmm) {
+			ignore_mask |= XFEATURE_MASK_Hi16_ZMM;
+			h16zmm = next_record;
+			if (xer_header->xstate & XFEATURE_MASK_Hi16_ZMM)
+				perf_regs->h16zmm = h16zmm;
+			next_record = h16zmm + 1;
+		}
 	}
 
 	if (header->lbr) {
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 65cc528fbad8..3f1cc294b1e9 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -341,6 +341,13 @@
 #define ARCH_PEBS_LBR_SHIFT		40
 #define ARCH_PEBS_LBR			(0x3ull << ARCH_PEBS_LBR_SHIFT)
 #define ARCH_PEBS_VECR_XMM		BIT_ULL(49)
+#define ARCH_PEBS_VECR_YMMH		BIT_ULL(50)
+#define ARCH_PEBS_VECR_EGPRS		BIT_ULL(51)
+#define ARCH_PEBS_VECR_OPMASK		BIT_ULL(53)
+#define ARCH_PEBS_VECR_ZMMH		BIT_ULL(54)
+#define ARCH_PEBS_VECR_H16ZMM		BIT_ULL(55)
+#define ARCH_PEBS_VECR_EXT_SHIFT	50
+#define ARCH_PEBS_VECR_EXT		(0x3full << ARCH_PEBS_VECR_EXT_SHIFT)
 #define ARCH_PEBS_GPR			BIT_ULL(61)
 #define ARCH_PEBS_AUX			BIT_ULL(62)
 #define ARCH_PEBS_EN			BIT_ULL(63)
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index c925af4160ad..41668a4633df 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -146,6 +146,11 @@
 #define PEBS_DATACFG_LBRS	BIT_ULL(3)
 #define PEBS_DATACFG_CNTR	BIT_ULL(4)
 #define PEBS_DATACFG_METRICS	BIT_ULL(5)
+#define PEBS_DATACFG_YMMHS	BIT_ULL(6)
+#define PEBS_DATACFG_OPMASKS	BIT_ULL(7)
+#define PEBS_DATACFG_ZMMHS	BIT_ULL(8)
+#define PEBS_DATACFG_H16ZMMS	BIT_ULL(9)
+#define PEBS_DATACFG_EGPRS	BIT_ULL(10)
 #define PEBS_DATACFG_LBR_SHIFT	24
 #define PEBS_DATACFG_CNTR_SHIFT	32
 #define PEBS_DATACFG_CNTR_MASK	GENMASK_ULL(15, 0)
@@ -540,7 +545,8 @@ struct arch_pebs_header {
 			    rsvd3:7,
 			    xmm:1,
 			    ymmh:1,
-			    rsvd4:2,
+			    egpr:1,
+			    rsvd4:1,
 			    opmask:1,
 			    zmmh:1,
 			    h16zmm:1,
-- 
2.34.1



* [Patch v5 16/19] perf/x86: Activate back-to-back NMI detection for arch-PEBS induced NMIs
  2025-12-03  6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (14 preceding siblings ...)
  2025-12-03  6:54 ` [Patch v5 15/19] perf/x86/intel: Enable arch-PEBS based SIMD/eGPRs/SSP sampling Dapeng Mi
@ 2025-12-03  6:54 ` Dapeng Mi
  2025-12-05 12:39   ` Peter Zijlstra
  2025-12-03  6:54 ` [Patch v5 17/19] perf headers: Sync with the kernel headers Dapeng Mi
                   ` (4 subsequent siblings)
  20 siblings, 1 reply; 86+ messages in thread
From: Dapeng Mi @ 2025-12-03  6:54 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Dapeng Mi

When two or more identical PEBS events with the same sampling period are
programmed on a mix of PDIST and non-PDIST counters, multiple
back-to-back NMIs can be triggered.

The Linux PMI handler processes the first NMI and clears the
GLOBAL_STATUS MSR. If a second NMI is triggered immediately after
the first, it is recognized as a "suspicious NMI" because no bits are set
in the GLOBAL_STATUS MSR (cleared by the first NMI).

This issue does not lead to PEBS data corruption or data loss, but it
does result in an annoying warning message.

The current NMI handler supports back-to-back NMI detection, but it
requires the PMI handler to return the count of actually processed events,
which the PEBS handler does not currently do.

Modify the PEBS handler to return the count of actually processed
events, thereby activating back-to-back NMI detection and avoiding the
"suspicious NMI" warning.

Suggested-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/intel/core.c |  3 +--
 arch/x86/events/intel/ds.c   | 36 +++++++++++++++++++++++-------------
 arch/x86/events/perf_event.h |  2 +-
 3 files changed, 25 insertions(+), 16 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index da48bcde8fce..a130d3f14844 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3351,8 +3351,7 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
 	 */
 	if (__test_and_clear_bit(GLOBAL_STATUS_ARCH_PEBS_THRESHOLD_BIT,
 				 (unsigned long *)&status)) {
-		handled++;
-		static_call(x86_pmu_drain_pebs)(regs, &data);
+		handled += static_call(x86_pmu_drain_pebs)(regs, &data);
 
 		if (cpuc->events[INTEL_PMC_IDX_FIXED_SLOTS] &&
 		    is_pebs_counter_event_group(cpuc->events[INTEL_PMC_IDX_FIXED_SLOTS]))
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index a01c72c03bd6..c7cdcd585574 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -2759,7 +2759,7 @@ __intel_pmu_pebs_events(struct perf_event *event,
 	__intel_pmu_pebs_last_event(event, iregs, regs, data, at, count, setup_sample);
 }
 
-static void intel_pmu_drain_pebs_core(struct pt_regs *iregs, struct perf_sample_data *data)
+static int intel_pmu_drain_pebs_core(struct pt_regs *iregs, struct perf_sample_data *data)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	struct debug_store *ds = cpuc->ds;
@@ -2768,7 +2768,7 @@ static void intel_pmu_drain_pebs_core(struct pt_regs *iregs, struct perf_sample_
 	int n;
 
 	if (!x86_pmu.pebs_active)
-		return;
+		return 0;
 
 	at  = (struct pebs_record_core *)(unsigned long)ds->pebs_buffer_base;
 	top = (struct pebs_record_core *)(unsigned long)ds->pebs_index;
@@ -2779,22 +2779,24 @@ static void intel_pmu_drain_pebs_core(struct pt_regs *iregs, struct perf_sample_
 	ds->pebs_index = ds->pebs_buffer_base;
 
 	if (!test_bit(0, cpuc->active_mask))
-		return;
+		return 0;
 
 	WARN_ON_ONCE(!event);
 
 	if (!event->attr.precise_ip)
-		return;
+		return 0;
 
 	n = top - at;
 	if (n <= 0) {
 		if (event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD)
 			intel_pmu_save_and_restart_reload(event, 0);
-		return;
+		return 0;
 	}
 
 	__intel_pmu_pebs_events(event, iregs, data, at, top, 0, n,
 				setup_pebs_fixed_sample_data);
+
+	return 0;
 }
 
 static void intel_pmu_pebs_event_update_no_drain(struct cpu_hw_events *cpuc, u64 mask)
@@ -2817,7 +2819,7 @@ static void intel_pmu_pebs_event_update_no_drain(struct cpu_hw_events *cpuc, u64
 	}
 }
 
-static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_data *data)
+static int intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_data *data)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	struct debug_store *ds = cpuc->ds;
@@ -2830,7 +2832,7 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_d
 	u64 mask;
 
 	if (!x86_pmu.pebs_active)
-		return;
+		return 0;
 
 	base = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
 	top = (struct pebs_record_nhm *)(unsigned long)ds->pebs_index;
@@ -2846,7 +2848,7 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_d
 
 	if (unlikely(base >= top)) {
 		intel_pmu_pebs_event_update_no_drain(cpuc, mask);
-		return;
+		return 0;
 	}
 
 	for (at = base; at < top; at += x86_pmu.pebs_record_size) {
@@ -2931,6 +2933,8 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_d
 						setup_pebs_fixed_sample_data);
 		}
 	}
+
+	return 0;
 }
 
 static __always_inline void
@@ -2984,7 +2988,7 @@ __intel_pmu_handle_last_pebs_record(struct pt_regs *iregs,
 
 }
 
-static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_data *data)
+static int intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_data *data)
 {
 	short counts[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS] = {};
 	void *last[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS];
@@ -2997,7 +3001,7 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
 	u64 mask;
 
 	if (!x86_pmu.pebs_active)
-		return;
+		return 0;
 
 	base = (struct pebs_basic *)(unsigned long)ds->pebs_buffer_base;
 	top = (struct pebs_basic *)(unsigned long)ds->pebs_index;
@@ -3010,7 +3014,7 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
 
 	if (unlikely(base >= top)) {
 		intel_pmu_pebs_event_update_no_drain(cpuc, mask);
-		return;
+		return 0;
 	}
 
 	if (!iregs)
@@ -3032,9 +3036,11 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
 
 	__intel_pmu_handle_last_pebs_record(iregs, regs, data, mask, counts, last,
 					    setup_pebs_adaptive_sample_data);
+
+	return 0;
 }
 
-static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
+static int intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
 				      struct perf_sample_data *data)
 {
 	short counts[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS] = {};
@@ -3044,13 +3050,14 @@ static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
 	struct x86_perf_regs perf_regs;
 	struct pt_regs *regs = &perf_regs.regs;
 	void *base, *at, *top;
+	u64 events_bitmap = 0;
 	u64 mask;
 
 	rdmsrq(MSR_IA32_PEBS_INDEX, index.whole);
 
 	if (unlikely(!index.wr)) {
 		intel_pmu_pebs_event_update_no_drain(cpuc, X86_PMC_IDX_MAX);
-		return;
+		return 0;
 	}
 
 	base = cpuc->pebs_vaddr;
@@ -3089,6 +3096,7 @@ static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
 
 		basic = at + sizeof(struct arch_pebs_header);
 		pebs_status = mask & basic->applicable_counters;
+		events_bitmap |= pebs_status;
 		__intel_pmu_handle_pebs_record(iregs, regs, data, at,
 					       pebs_status, counts, last,
 					       setup_arch_pebs_sample_data);
@@ -3108,6 +3116,8 @@ static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
 	__intel_pmu_handle_last_pebs_record(iregs, regs, data, mask,
 					    counts, last,
 					    setup_arch_pebs_sample_data);
+
+	return hweight64(events_bitmap);
 }
 
 static void __init intel_arch_pebs_init(void)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 35a1837d0b77..98958f6d29b6 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1003,7 +1003,7 @@ struct x86_pmu {
 	int		pebs_record_size;
 	int		pebs_buffer_size;
 	u64		pebs_events_mask;
-	void		(*drain_pebs)(struct pt_regs *regs, struct perf_sample_data *data);
+	int		(*drain_pebs)(struct pt_regs *regs, struct perf_sample_data *data);
 	struct event_constraint *pebs_constraints;
 	void		(*pebs_aliases)(struct perf_event *event);
 	u64		(*pebs_latency_data)(struct perf_event *event, u64 status);
-- 
2.34.1



* [Patch v5 17/19] perf headers: Sync with the kernel headers
  2025-12-03  6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (15 preceding siblings ...)
  2025-12-03  6:54 ` [Patch v5 16/19] perf/x86: Activate back-to-back NMI detection for arch-PEBS induced NMIs Dapeng Mi
@ 2025-12-03  6:54 ` Dapeng Mi
  2025-12-03 23:43   ` Ian Rogers
                     ` (2 more replies)
  2025-12-03  6:54 ` [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format Dapeng Mi
                   ` (3 subsequent siblings)
  20 siblings, 3 replies; 86+ messages in thread
From: Dapeng Mi @ 2025-12-03  6:54 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

Update include/uapi/linux/perf_event.h and
arch/x86/include/uapi/asm/perf_regs.h to support extended regs.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 tools/arch/x86/include/uapi/asm/perf_regs.h | 62 +++++++++++++++++++++
 tools/include/uapi/linux/perf_event.h       | 45 +++++++++++++--
 2 files changed, 103 insertions(+), 4 deletions(-)

diff --git a/tools/arch/x86/include/uapi/asm/perf_regs.h b/tools/arch/x86/include/uapi/asm/perf_regs.h
index 7c9d2bb3833b..f3561ed10041 100644
--- a/tools/arch/x86/include/uapi/asm/perf_regs.h
+++ b/tools/arch/x86/include/uapi/asm/perf_regs.h
@@ -27,9 +27,34 @@ enum perf_event_x86_regs {
 	PERF_REG_X86_R13,
 	PERF_REG_X86_R14,
 	PERF_REG_X86_R15,
+	/*
+	 * The EGPRs/SSP and XMM have overlaps. Only one can be used
+	 * at a time. For the ABI type PERF_SAMPLE_REGS_ABI_SIMD,
+	 * utilize EGPRs/SSP. For the other ABI type, XMM is used.
+	 *
+	 * Extended GPRs (EGPRs)
+	 */
+	PERF_REG_X86_R16,
+	PERF_REG_X86_R17,
+	PERF_REG_X86_R18,
+	PERF_REG_X86_R19,
+	PERF_REG_X86_R20,
+	PERF_REG_X86_R21,
+	PERF_REG_X86_R22,
+	PERF_REG_X86_R23,
+	PERF_REG_X86_R24,
+	PERF_REG_X86_R25,
+	PERF_REG_X86_R26,
+	PERF_REG_X86_R27,
+	PERF_REG_X86_R28,
+	PERF_REG_X86_R29,
+	PERF_REG_X86_R30,
+	PERF_REG_X86_R31,
+	PERF_REG_X86_SSP,
 	/* These are the limits for the GPRs. */
 	PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
 	PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1,
+	PERF_REG_MISC_MAX = PERF_REG_X86_SSP + 1,
 
 	/* These all need two bits set because they are 128bit */
 	PERF_REG_X86_XMM0  = 32,
@@ -54,5 +79,42 @@ enum perf_event_x86_regs {
 };
 
 #define PERF_REG_EXTENDED_MASK	(~((1ULL << PERF_REG_X86_XMM0) - 1))
+#define PERF_X86_EGPRS_MASK	GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16)
+
+enum {
+	PERF_REG_X86_XMM,
+	PERF_REG_X86_YMM,
+	PERF_REG_X86_ZMM,
+	PERF_REG_X86_MAX_SIMD_REGS,
+
+	PERF_REG_X86_OPMASK = 0,
+	PERF_REG_X86_MAX_PRED_REGS = 1,
+};
+
+enum {
+	PERF_X86_SIMD_XMM_REGS      = 16,
+	PERF_X86_SIMD_YMM_REGS      = 16,
+	PERF_X86_SIMD_ZMMH_REGS     = 16,
+	PERF_X86_SIMD_ZMM_REGS      = 32,
+	PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_ZMM_REGS,
+
+	PERF_X86_SIMD_OPMASK_REGS   = 8,
+	PERF_X86_SIMD_PRED_REGS_MAX = PERF_X86_SIMD_OPMASK_REGS,
+};
+
+#define PERF_X86_SIMD_PRED_MASK		GENMASK(PERF_X86_SIMD_PRED_REGS_MAX - 1, 0)
+#define PERF_X86_SIMD_VEC_MASK		GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
+
+#define PERF_X86_H16ZMM_BASE		PERF_X86_SIMD_ZMMH_REGS
+
+enum {
+	PERF_X86_OPMASK_QWORDS   = 1,
+	PERF_X86_XMM_QWORDS      = 2,
+	PERF_X86_YMMH_QWORDS     = 2,
+	PERF_X86_YMM_QWORDS      = 4,
+	PERF_X86_ZMMH_QWORDS     = 4,
+	PERF_X86_ZMM_QWORDS      = 8,
+	PERF_X86_SIMD_QWORDS_MAX = PERF_X86_ZMM_QWORDS,
+};
 
 #endif /* _ASM_X86_PERF_REGS_H */
diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index d292f96bc06f..f1474da32622 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -314,8 +314,9 @@ enum {
  */
 enum perf_sample_regs_abi {
 	PERF_SAMPLE_REGS_ABI_NONE		= 0,
-	PERF_SAMPLE_REGS_ABI_32			= 1,
-	PERF_SAMPLE_REGS_ABI_64			= 2,
+	PERF_SAMPLE_REGS_ABI_32			= (1 << 0),
+	PERF_SAMPLE_REGS_ABI_64			= (1 << 1),
+	PERF_SAMPLE_REGS_ABI_SIMD		= (1 << 2),
 };
 
 /*
@@ -382,6 +383,7 @@ enum perf_event_read_format {
 #define PERF_ATTR_SIZE_VER6			120	/* Add: aux_sample_size */
 #define PERF_ATTR_SIZE_VER7			128	/* Add: sig_data */
 #define PERF_ATTR_SIZE_VER8			136	/* Add: config3 */
+#define PERF_ATTR_SIZE_VER9			168	/* Add: sample_simd_{pred,vec}_reg_* */
 
 /*
  * 'struct perf_event_attr' contains various attributes that define
@@ -545,6 +547,25 @@ struct perf_event_attr {
 	__u64	sig_data;
 
 	__u64	config3; /* extension of config2 */
+
+
+	/*
+	 * Defines set of SIMD registers to dump on samples.
+	 * The sample_simd_regs_enabled !=0 implies the
+	 * set of SIMD registers is used to config all SIMD registers.
+	 * If !sample_simd_regs_enabled, sample_regs_XXX may be used to
+	 * config some SIMD registers on X86.
+	 */
+	union {
+		__u16 sample_simd_regs_enabled;
+		__u16 sample_simd_pred_reg_qwords;
+	};
+	__u32 sample_simd_pred_reg_intr;
+	__u32 sample_simd_pred_reg_user;
+	__u16 sample_simd_vec_reg_qwords;
+	__u64 sample_simd_vec_reg_intr;
+	__u64 sample_simd_vec_reg_user;
+	__u32 __reserved_4;
 };
 
 /*
@@ -1018,7 +1039,15 @@ enum perf_event_type {
 	 *      } && PERF_SAMPLE_BRANCH_STACK
 	 *
 	 *	{ u64			abi; # enum perf_sample_regs_abi
-	 *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
+	 *	  u64			regs[weight(mask)];
+	 *	  struct {
+	 *		u16 nr_vectors;
+	 *		u16 vector_qwords;
+	 *		u16 nr_pred;
+	 *		u16 pred_qwords;
+	 *		u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
+	 *	  } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
+	 *	} && PERF_SAMPLE_REGS_USER
 	 *
 	 *	{ u64			size;
 	 *	  char			data[size];
@@ -1045,7 +1074,15 @@ enum perf_event_type {
 	 *	{ u64			data_src; } && PERF_SAMPLE_DATA_SRC
 	 *	{ u64			transaction; } && PERF_SAMPLE_TRANSACTION
 	 *	{ u64			abi; # enum perf_sample_regs_abi
-	 *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
+	 *	  u64			regs[weight(mask)];
+	 *	  struct {
+	 *		u16 nr_vectors;
+	 *		u16 vector_qwords;
+	 *		u16 nr_pred;
+	 *		u16 pred_qwords;
+	 *		u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
+	 *	  } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
+	 *	} && PERF_SAMPLE_REGS_INTR
 	 *	{ u64			phys_addr;} && PERF_SAMPLE_PHYS_ADDR
 	 *	{ u64			cgroup;} && PERF_SAMPLE_CGROUP
 	 *	{ u64			data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
-- 
2.34.1



* [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
  2025-12-03  6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (16 preceding siblings ...)
  2025-12-03  6:54 ` [Patch v5 17/19] perf headers: Sync with the kernel headers Dapeng Mi
@ 2025-12-03  6:54 ` Dapeng Mi
  2025-12-04  0:17   ` Ian Rogers
  2026-01-20  7:39   ` Ian Rogers
  2025-12-03  6:55 ` [Patch v5 19/19] perf regs: Enable dumping of SIMD registers Dapeng Mi
                   ` (2 subsequent siblings)
  20 siblings, 2 replies; 86+ messages in thread
From: Dapeng Mi @ 2025-12-03  6:54 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

Add support for the newly introduced SIMD register sampling format via
the following new functions:

uint64_t arch__intr_simd_reg_mask(void);
uint64_t arch__user_simd_reg_mask(void);
uint64_t arch__intr_pred_reg_mask(void);
uint64_t arch__user_pred_reg_mask(void);
uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);

The arch__{intr|user}_simd_reg_mask() functions retrieve the bitmap of
supported SIMD registers, such as XMM/YMM/ZMM on x86 platforms.

The arch__{intr|user}_pred_reg_mask() functions retrieve the bitmap of
supported PRED registers, such as OPMASK on x86 platforms.

The arch__{intr|user}_simd_reg_bitmap_qwords() functions provide the
exact bitmap and number of qwords for a specific type of SIMD register.
For example, for XMM registers on x86 platforms, the returned bitmap is
0xffff (XMM0 ~ XMM15) and the qwords number is 2 (128 bits for each XMM).

The arch__{intr|user}_pred_reg_bitmap_qwords() functions provide the
exact bitmap and number of qwords for a specific type of PRED register.
For example, for OPMASK registers on x86 platforms, the returned bitmap
is 0xff (OPMASK0 ~ OPMASK7) and the qwords number is 1 (64 bits for each
OPMASK).

Additionally, the function __parse_regs() is enhanced to support parsing
these newly introduced SIMD registers. Currently, each type of register
can only be sampled collectively; sampling a specific SIMD register is
not supported. For example, all XMM registers are sampled together rather
than sampling only XMM0.

When multiple overlapping register types, such as XMM and YMM, are
sampled simultaneously, only the superset (YMM registers) is sampled.

With this patch, all supported sampling registers on x86 platforms are
displayed as follows.

 $perf record -I?
 available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
 R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
 R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7

 $perf record --user-regs=?
 available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
 R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
 R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 tools/perf/arch/x86/util/perf_regs.c      | 470 +++++++++++++++++++++-
 tools/perf/util/evsel.c                   |  27 ++
 tools/perf/util/parse-regs-options.c      | 151 ++++++-
 tools/perf/util/perf_event_attr_fprintf.c |   6 +
 tools/perf/util/perf_regs.c               |  59 +++
 tools/perf/util/perf_regs.h               |  11 +
 tools/perf/util/record.h                  |   6 +
 7 files changed, 714 insertions(+), 16 deletions(-)

diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
index 12fd93f04802..db41430f3b07 100644
--- a/tools/perf/arch/x86/util/perf_regs.c
+++ b/tools/perf/arch/x86/util/perf_regs.c
@@ -13,6 +13,49 @@
 #include "../../../util/pmu.h"
 #include "../../../util/pmus.h"
 
+static const struct sample_reg sample_reg_masks_ext[] = {
+	SMPL_REG(AX, PERF_REG_X86_AX),
+	SMPL_REG(BX, PERF_REG_X86_BX),
+	SMPL_REG(CX, PERF_REG_X86_CX),
+	SMPL_REG(DX, PERF_REG_X86_DX),
+	SMPL_REG(SI, PERF_REG_X86_SI),
+	SMPL_REG(DI, PERF_REG_X86_DI),
+	SMPL_REG(BP, PERF_REG_X86_BP),
+	SMPL_REG(SP, PERF_REG_X86_SP),
+	SMPL_REG(IP, PERF_REG_X86_IP),
+	SMPL_REG(FLAGS, PERF_REG_X86_FLAGS),
+	SMPL_REG(CS, PERF_REG_X86_CS),
+	SMPL_REG(SS, PERF_REG_X86_SS),
+#ifdef HAVE_ARCH_X86_64_SUPPORT
+	SMPL_REG(R8, PERF_REG_X86_R8),
+	SMPL_REG(R9, PERF_REG_X86_R9),
+	SMPL_REG(R10, PERF_REG_X86_R10),
+	SMPL_REG(R11, PERF_REG_X86_R11),
+	SMPL_REG(R12, PERF_REG_X86_R12),
+	SMPL_REG(R13, PERF_REG_X86_R13),
+	SMPL_REG(R14, PERF_REG_X86_R14),
+	SMPL_REG(R15, PERF_REG_X86_R15),
+	SMPL_REG(R16, PERF_REG_X86_R16),
+	SMPL_REG(R17, PERF_REG_X86_R17),
+	SMPL_REG(R18, PERF_REG_X86_R18),
+	SMPL_REG(R19, PERF_REG_X86_R19),
+	SMPL_REG(R20, PERF_REG_X86_R20),
+	SMPL_REG(R21, PERF_REG_X86_R21),
+	SMPL_REG(R22, PERF_REG_X86_R22),
+	SMPL_REG(R23, PERF_REG_X86_R23),
+	SMPL_REG(R24, PERF_REG_X86_R24),
+	SMPL_REG(R25, PERF_REG_X86_R25),
+	SMPL_REG(R26, PERF_REG_X86_R26),
+	SMPL_REG(R27, PERF_REG_X86_R27),
+	SMPL_REG(R28, PERF_REG_X86_R28),
+	SMPL_REG(R29, PERF_REG_X86_R29),
+	SMPL_REG(R30, PERF_REG_X86_R30),
+	SMPL_REG(R31, PERF_REG_X86_R31),
+	SMPL_REG(SSP, PERF_REG_X86_SSP),
+#endif
+	SMPL_REG_END
+};
+
 static const struct sample_reg sample_reg_masks[] = {
 	SMPL_REG(AX, PERF_REG_X86_AX),
 	SMPL_REG(BX, PERF_REG_X86_BX),
@@ -276,27 +319,404 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
 	return SDT_ARG_VALID;
 }
 
+static bool support_simd_reg(u64 sample_type, u16 qwords, u64 mask, bool pred)
+{
+	struct perf_event_attr attr = {
+		.type				= PERF_TYPE_HARDWARE,
+		.config				= PERF_COUNT_HW_CPU_CYCLES,
+		.sample_type			= sample_type,
+		.disabled			= 1,
+		.exclude_kernel			= 1,
+		.sample_simd_regs_enabled	= 1,
+	};
+	int fd;
+
+	attr.sample_period = 1;
+
+	if (!pred) {
+		attr.sample_simd_vec_reg_qwords = qwords;
+		if (sample_type == PERF_SAMPLE_REGS_INTR)
+			attr.sample_simd_vec_reg_intr = mask;
+		else
+			attr.sample_simd_vec_reg_user = mask;
+	} else {
+		attr.sample_simd_pred_reg_qwords = PERF_X86_OPMASK_QWORDS;
+		if (sample_type == PERF_SAMPLE_REGS_INTR)
+			attr.sample_simd_pred_reg_intr = PERF_X86_SIMD_PRED_MASK;
+		else
+			attr.sample_simd_pred_reg_user = PERF_X86_SIMD_PRED_MASK;
+	}
+
+	if (perf_pmus__num_core_pmus() > 1) {
+		struct perf_pmu *pmu = NULL;
+		__u64 type = PERF_TYPE_RAW;
+
+		/*
+		 * The same register set is supported among different hybrid PMUs.
+		 * Only check the first available one.
+		 */
+		while ((pmu = perf_pmus__scan_core(pmu)) != NULL) {
+			type = pmu->type;
+			break;
+		}
+		attr.config |= type << PERF_PMU_TYPE_SHIFT;
+	}
+
+	event_attr_init(&attr);
+
+	fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
+	if (fd != -1) {
+		close(fd);
+		return true;
+	}
+
+	return false;
+}
+
+static bool __arch_simd_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
+{
+	bool supported = false;
+	u64 bits;
+
+	*mask = 0;
+	*qwords = 0;
+
+	switch (reg) {
+	case PERF_REG_X86_XMM:
+		bits = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
+		supported = support_simd_reg(sample_type, PERF_X86_XMM_QWORDS, bits, false);
+		if (supported) {
+			*mask = bits;
+			*qwords = PERF_X86_XMM_QWORDS;
+		}
+		break;
+	case PERF_REG_X86_YMM:
+		bits = BIT_ULL(PERF_X86_SIMD_YMM_REGS) - 1;
+		supported = support_simd_reg(sample_type, PERF_X86_YMM_QWORDS, bits, false);
+		if (supported) {
+			*mask = bits;
+			*qwords = PERF_X86_YMM_QWORDS;
+		}
+		break;
+	case PERF_REG_X86_ZMM:
+		bits = BIT_ULL(PERF_X86_SIMD_ZMM_REGS) - 1;
+		supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
+		if (supported) {
+			*mask = bits;
+			*qwords = PERF_X86_ZMM_QWORDS;
+			break;
+		}
+
+		bits = BIT_ULL(PERF_X86_SIMD_ZMMH_REGS) - 1;
+		supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
+		if (supported) {
+			*mask = bits;
+			*qwords = PERF_X86_ZMMH_QWORDS;
+		}
+		break;
+	default:
+		break;
+	}
+
+	return supported;
+}
+
+static bool __arch_pred_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
+{
+	bool supported = false;
+	u64 bits;
+
+	*mask = 0;
+	*qwords = 0;
+
+	switch (reg) {
+	case PERF_REG_X86_OPMASK:
+		bits = BIT_ULL(PERF_X86_SIMD_OPMASK_REGS) - 1;
+		supported = support_simd_reg(sample_type, PERF_X86_OPMASK_QWORDS, bits, true);
+		if (supported) {
+			*mask = bits;
+			*qwords = PERF_X86_OPMASK_QWORDS;
+		}
+		break;
+	default:
+		break;
+	}
+
+	return supported;
+}
+
+static bool has_cap_simd_regs(void)
+{
+	uint64_t mask = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
+	u16 qwords = PERF_X86_XMM_QWORDS;
+	static bool has_cap_simd_regs;
+	static bool cached;
+
+	if (cached)
+		return has_cap_simd_regs;
+
+	has_cap_simd_regs = __arch_simd_reg_mask(PERF_SAMPLE_REGS_INTR,
+						 PERF_REG_X86_XMM, &mask, &qwords);
+	has_cap_simd_regs |= __arch_simd_reg_mask(PERF_SAMPLE_REGS_USER,
+						 PERF_REG_X86_XMM, &mask, &qwords);
+	cached = true;
+
+	return has_cap_simd_regs;
+}
+
+bool arch_has_simd_regs(u64 mask)
+{
+	return has_cap_simd_regs() &&
+	       mask & GENMASK_ULL(PERF_REG_X86_SSP, PERF_REG_X86_R16);
+}
+
+static const struct sample_reg sample_simd_reg_masks[] = {
+	SMPL_REG(XMM, PERF_REG_X86_XMM),
+	SMPL_REG(YMM, PERF_REG_X86_YMM),
+	SMPL_REG(ZMM, PERF_REG_X86_ZMM),
+	SMPL_REG_END
+};
+
+static const struct sample_reg sample_pred_reg_masks[] = {
+	SMPL_REG(OPMASK, PERF_REG_X86_OPMASK),
+	SMPL_REG_END
+};
+
+const struct sample_reg *arch__sample_simd_reg_masks(void)
+{
+	return sample_simd_reg_masks;
+}
+
+const struct sample_reg *arch__sample_pred_reg_masks(void)
+{
+	return sample_pred_reg_masks;
+}
+
+static bool x86_intr_simd_updated;
+static u64 x86_intr_simd_reg_mask;
+static u64 x86_intr_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
+static u16 x86_intr_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
+static bool x86_user_simd_updated;
+static u64 x86_user_simd_reg_mask;
+static u64 x86_user_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
+static u16 x86_user_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
+
+static bool x86_intr_pred_updated;
+static u64 x86_intr_pred_reg_mask;
+static u64 x86_intr_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
+static u16 x86_intr_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
+static bool x86_user_pred_updated;
+static u64 x86_user_pred_reg_mask;
+static u64 x86_user_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
+static u16 x86_user_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
+
+static uint64_t __arch__simd_reg_mask(u64 sample_type)
+{
+	const struct sample_reg *r = NULL;
+	bool supported;
+	u64 mask = 0;
+	int reg;
+
+	if (!has_cap_simd_regs())
+		return 0;
+
+	if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_simd_updated)
+		return x86_intr_simd_reg_mask;
+
+	if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_simd_updated)
+		return x86_user_simd_reg_mask;
+
+	for (r = arch__sample_simd_reg_masks(); r->name; r++) {
+		supported = false;
+
+		if (!r->mask)
+			continue;
+		reg = fls64(r->mask) - 1;
+
+		if (reg >= PERF_REG_X86_MAX_SIMD_REGS)
+			break;
+		if (sample_type == PERF_SAMPLE_REGS_INTR)
+			supported = __arch_simd_reg_mask(sample_type, reg,
+							 &x86_intr_simd_mask[reg],
+							 &x86_intr_simd_qwords[reg]);
+		else if (sample_type == PERF_SAMPLE_REGS_USER)
+			supported = __arch_simd_reg_mask(sample_type, reg,
+							 &x86_user_simd_mask[reg],
+							 &x86_user_simd_qwords[reg]);
+		if (supported)
+			mask |= BIT_ULL(reg);
+	}
+
+	if (sample_type == PERF_SAMPLE_REGS_INTR) {
+		x86_intr_simd_reg_mask = mask;
+		x86_intr_simd_updated = true;
+	} else {
+		x86_user_simd_reg_mask = mask;
+		x86_user_simd_updated = true;
+	}
+
+	return mask;
+}
+
+static uint64_t __arch__pred_reg_mask(u64 sample_type)
+{
+	const struct sample_reg *r = NULL;
+	bool supported;
+	u64 mask = 0;
+	int reg;
+
+	if (!has_cap_simd_regs())
+		return 0;
+
+	if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_pred_updated)
+		return x86_intr_pred_reg_mask;
+
+	if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_pred_updated)
+		return x86_user_pred_reg_mask;
+
+	for (r = arch__sample_pred_reg_masks(); r->name; r++) {
+		supported = false;
+
+		if (!r->mask)
+			continue;
+		reg = fls64(r->mask) - 1;
+
+		if (reg >= PERF_REG_X86_MAX_PRED_REGS)
+			break;
+		if (sample_type == PERF_SAMPLE_REGS_INTR)
+			supported = __arch_pred_reg_mask(sample_type, reg,
+							 &x86_intr_pred_mask[reg],
+							 &x86_intr_pred_qwords[reg]);
+		else if (sample_type == PERF_SAMPLE_REGS_USER)
+			supported = __arch_pred_reg_mask(sample_type, reg,
+							 &x86_user_pred_mask[reg],
+							 &x86_user_pred_qwords[reg]);
+		if (supported)
+			mask |= BIT_ULL(reg);
+	}
+
+	if (sample_type == PERF_SAMPLE_REGS_INTR) {
+		x86_intr_pred_reg_mask = mask;
+		x86_intr_pred_updated = true;
+	} else {
+		x86_user_pred_reg_mask = mask;
+		x86_user_pred_updated = true;
+	}
+
+	return mask;
+}
+
+uint64_t arch__intr_simd_reg_mask(void)
+{
+	return __arch__simd_reg_mask(PERF_SAMPLE_REGS_INTR);
+}
+
+uint64_t arch__user_simd_reg_mask(void)
+{
+	return __arch__simd_reg_mask(PERF_SAMPLE_REGS_USER);
+}
+
+uint64_t arch__intr_pred_reg_mask(void)
+{
+	return __arch__pred_reg_mask(PERF_SAMPLE_REGS_INTR);
+}
+
+uint64_t arch__user_pred_reg_mask(void)
+{
+	return __arch__pred_reg_mask(PERF_SAMPLE_REGS_USER);
+}
+
+static uint64_t arch__simd_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
+{
+	uint64_t mask = 0;
+
+	*qwords = 0;
+	if (reg < PERF_REG_X86_MAX_SIMD_REGS) {
+		if (intr) {
+			*qwords = x86_intr_simd_qwords[reg];
+			mask = x86_intr_simd_mask[reg];
+		} else {
+			*qwords = x86_user_simd_qwords[reg];
+			mask = x86_user_simd_mask[reg];
+		}
+	}
+
+	return mask;
+}
+
+static uint64_t arch__pred_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
+{
+	uint64_t mask = 0;
+
+	*qwords = 0;
+	if (reg < PERF_REG_X86_MAX_PRED_REGS) {
+		if (intr) {
+			*qwords = x86_intr_pred_qwords[reg];
+			mask = x86_intr_pred_mask[reg];
+		} else {
+			*qwords = x86_user_pred_qwords[reg];
+			mask = x86_user_pred_mask[reg];
+		}
+	}
+
+	return mask;
+}
+
+uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords)
+{
+	if (!x86_intr_simd_updated)
+		arch__intr_simd_reg_mask();
+	return arch__simd_reg_bitmap_qwords(reg, qwords, true);
+}
+
+uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords)
+{
+	if (!x86_user_simd_updated)
+		arch__user_simd_reg_mask();
+	return arch__simd_reg_bitmap_qwords(reg, qwords, false);
+}
+
+uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords)
+{
+	if (!x86_intr_pred_updated)
+		arch__intr_pred_reg_mask();
+	return arch__pred_reg_bitmap_qwords(reg, qwords, true);
+}
+
+uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords)
+{
+	if (!x86_user_pred_updated)
+		arch__user_pred_reg_mask();
+	return arch__pred_reg_bitmap_qwords(reg, qwords, false);
+}
+
 const struct sample_reg *arch__sample_reg_masks(void)
 {
+	if (has_cap_simd_regs())
+		return sample_reg_masks_ext;
 	return sample_reg_masks;
 }
 
-uint64_t arch__intr_reg_mask(void)
+static uint64_t __arch__reg_mask(u64 sample_type, u64 mask, bool has_simd_regs)
 {
 	struct perf_event_attr attr = {
-		.type			= PERF_TYPE_HARDWARE,
-		.config			= PERF_COUNT_HW_CPU_CYCLES,
-		.sample_type		= PERF_SAMPLE_REGS_INTR,
-		.sample_regs_intr	= PERF_REG_EXTENDED_MASK,
-		.precise_ip		= 1,
-		.disabled 		= 1,
-		.exclude_kernel		= 1,
+		.type				= PERF_TYPE_HARDWARE,
+		.config				= PERF_COUNT_HW_CPU_CYCLES,
+		.sample_type			= sample_type,
+		.precise_ip			= 1,
+		.disabled			= 1,
+		.exclude_kernel			= 1,
+		.sample_simd_regs_enabled	= has_simd_regs,
 	};
 	int fd;
 	/*
 	 * In an unnamed union, init it here to build on older gcc versions
 	 */
 	attr.sample_period = 1;
+	if (sample_type == PERF_SAMPLE_REGS_INTR)
+		attr.sample_regs_intr = mask;
+	else
+		attr.sample_regs_user = mask;
 
 	if (perf_pmus__num_core_pmus() > 1) {
 		struct perf_pmu *pmu = NULL;
@@ -318,13 +738,41 @@ uint64_t arch__intr_reg_mask(void)
 	fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
 	if (fd != -1) {
 		close(fd);
-		return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
+		return mask;
 	}
 
-	return PERF_REGS_MASK;
+	return 0;
+}
+
+uint64_t arch__intr_reg_mask(void)
+{
+	uint64_t mask = PERF_REGS_MASK;
+
+	if (has_cap_simd_regs()) {
+		mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
+					 GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
+					 true);
+		mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
+					 BIT_ULL(PERF_REG_X86_SSP),
+					 true);
+	} else
+		mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR, PERF_REG_EXTENDED_MASK, false);
+
+	return mask;
 }
 
 uint64_t arch__user_reg_mask(void)
 {
-	return PERF_REGS_MASK;
+	uint64_t mask = PERF_REGS_MASK;
+
+	if (has_cap_simd_regs()) {
+		mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
+					 GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
+					 true);
+		mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
+					 BIT_ULL(PERF_REG_X86_SSP),
+					 true);
+	}
+
+	return mask;
 }
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 56ebefd075f2..5d1d90cf9488 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1461,12 +1461,39 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
 	if (opts->sample_intr_regs && !evsel->no_aux_samples &&
 	    !evsel__is_dummy_event(evsel)) {
 		attr->sample_regs_intr = opts->sample_intr_regs;
+		attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_intr);
+		evsel__set_sample_bit(evsel, REGS_INTR);
+	}
+
+	if ((opts->sample_intr_vec_regs || opts->sample_intr_pred_regs) &&
+	    !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
+		/* A non-zero pred qwords value implies the SIMD register set is in use */
+		if (opts->sample_pred_regs_qwords)
+			attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
+		else
+			attr->sample_simd_pred_reg_qwords = 1;
+		attr->sample_simd_vec_reg_intr = opts->sample_intr_vec_regs;
+		attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
+		attr->sample_simd_pred_reg_intr = opts->sample_intr_pred_regs;
 		evsel__set_sample_bit(evsel, REGS_INTR);
 	}
 
 	if (opts->sample_user_regs && !evsel->no_aux_samples &&
 	    !evsel__is_dummy_event(evsel)) {
 		attr->sample_regs_user |= opts->sample_user_regs;
+		attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_user);
+		evsel__set_sample_bit(evsel, REGS_USER);
+	}
+
+	if ((opts->sample_user_vec_regs || opts->sample_user_pred_regs) &&
+	    !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
+		if (opts->sample_pred_regs_qwords)
+			attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
+		else
+			attr->sample_simd_pred_reg_qwords = 1;
+		attr->sample_simd_vec_reg_user = opts->sample_user_vec_regs;
+		attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
+		attr->sample_simd_pred_reg_user = opts->sample_user_pred_regs;
 		evsel__set_sample_bit(evsel, REGS_USER);
 	}
 
diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
index cda1c620968e..0bd100392889 100644
--- a/tools/perf/util/parse-regs-options.c
+++ b/tools/perf/util/parse-regs-options.c
@@ -4,19 +4,139 @@
 #include <stdint.h>
 #include <string.h>
 #include <stdio.h>
+#include <linux/bitops.h>
 #include "util/debug.h"
 #include <subcmd/parse-options.h>
 #include "util/perf_regs.h"
 #include "util/parse-regs-options.h"
+#include "record.h"
+
+static void __print_simd_regs(bool intr, uint64_t simd_mask)
+{
+	const struct sample_reg *r = NULL;
+	uint64_t bitmap = 0;
+	u16 qwords = 0;
+	int reg_idx;
+
+	if (!simd_mask)
+		return;
+
+	for (r = arch__sample_simd_reg_masks(); r->name; r++) {
+		if (!(r->mask & simd_mask))
+			continue;
+		reg_idx = fls64(r->mask) - 1;
+		if (intr)
+			bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
+		else
+			bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
+		if (bitmap)
+			fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
+	}
+}
+
+static void __print_pred_regs(bool intr, uint64_t pred_mask)
+{
+	const struct sample_reg *r = NULL;
+	uint64_t bitmap = 0;
+	u16 qwords = 0;
+	int reg_idx;
+
+	if (!pred_mask)
+		return;
+
+	for (r = arch__sample_pred_reg_masks(); r->name; r++) {
+		if (!(r->mask & pred_mask))
+			continue;
+		reg_idx = fls64(r->mask) - 1;
+		if (intr)
+			bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
+		else
+			bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
+		if (bitmap)
+			fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
+	}
+}
+
+static bool __parse_simd_regs(struct record_opts *opts, char *s, bool intr)
+{
+	const struct sample_reg *r = NULL;
+	bool matched = false;
+	uint64_t bitmap = 0;
+	u16 qwords = 0;
+	int reg_idx;
+
+	for (r = arch__sample_simd_reg_masks(); r->name; r++) {
+		if (strcasecmp(s, r->name))
+			continue;
+		if (!fls64(r->mask))
+			continue;
+		reg_idx = fls64(r->mask) - 1;
+		if (intr)
+			bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
+		else
+			bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
+		matched = true;
+		break;
+	}
+
+	/* Just need the highest qwords */
+	if (qwords > opts->sample_vec_regs_qwords) {
+		opts->sample_vec_regs_qwords = qwords;
+		if (intr)
+			opts->sample_intr_vec_regs = bitmap;
+		else
+			opts->sample_user_vec_regs = bitmap;
+	}
+
+	return matched;
+}
+
+static bool __parse_pred_regs(struct record_opts *opts, char *s, bool intr)
+{
+	const struct sample_reg *r = NULL;
+	bool matched = false;
+	uint64_t bitmap = 0;
+	u16 qwords = 0;
+	int reg_idx;
+
+	for (r = arch__sample_pred_reg_masks(); r->name; r++) {
+		if (strcasecmp(s, r->name))
+			continue;
+		if (!fls64(r->mask))
+			continue;
+		reg_idx = fls64(r->mask) - 1;
+		if (intr)
+			bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
+		else
+			bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
+		matched = true;
+		break;
+	}
+
+	/* Just need the highest qwords */
+	if (qwords > opts->sample_pred_regs_qwords) {
+		opts->sample_pred_regs_qwords = qwords;
+		if (intr)
+			opts->sample_intr_pred_regs = bitmap;
+		else
+			opts->sample_user_pred_regs = bitmap;
+	}
+
+	return matched;
+}
 
 static int
 __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
 {
 	uint64_t *mode = (uint64_t *)opt->value;
 	const struct sample_reg *r = NULL;
+	struct record_opts *opts;
 	char *s, *os = NULL, *p;
-	int ret = -1;
+	bool has_simd_regs = false;
 	uint64_t mask;
+	uint64_t simd_mask;
+	uint64_t pred_mask;
+	int ret = -1;
 
 	if (unset)
 		return 0;
@@ -27,10 +147,17 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
 	if (*mode)
 		return -1;
 
-	if (intr)
+	if (intr) {
+		opts = container_of(opt->value, struct record_opts, sample_intr_regs);
 		mask = arch__intr_reg_mask();
-	else
+		simd_mask = arch__intr_simd_reg_mask();
+		pred_mask = arch__intr_pred_reg_mask();
+	} else {
+		opts = container_of(opt->value, struct record_opts, sample_user_regs);
 		mask = arch__user_reg_mask();
+		simd_mask = arch__user_simd_reg_mask();
+		pred_mask = arch__user_pred_reg_mask();
+	}
 
 	/* str may be NULL in case no arg is passed to -I */
 	if (str) {
@@ -50,10 +177,24 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
 					if (r->mask & mask)
 						fprintf(stderr, "%s ", r->name);
 				}
+				__print_simd_regs(intr, simd_mask);
+				__print_pred_regs(intr, pred_mask);
 				fputc('\n', stderr);
 				/* just printing available regs */
 				goto error;
 			}
+
+			if (simd_mask) {
+				has_simd_regs = __parse_simd_regs(opts, s, intr);
+				if (has_simd_regs)
+					goto next;
+			}
+			if (pred_mask) {
+				has_simd_regs = __parse_pred_regs(opts, s, intr);
+				if (has_simd_regs)
+					goto next;
+			}
+
 			for (r = arch__sample_reg_masks(); r->name; r++) {
 				if ((r->mask & mask) && !strcasecmp(s, r->name))
 					break;
@@ -65,7 +206,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
 			}
 
 			*mode |= r->mask;
-
+next:
 			if (!p)
 				break;
 
@@ -75,7 +216,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
 	ret = 0;
 
 	/* default to all possible regs */
-	if (*mode == 0)
+	if (*mode == 0 && !has_simd_regs)
 		*mode = mask;
 error:
 	free(os);
diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
index 66b666d9ce64..fb0366d050cf 100644
--- a/tools/perf/util/perf_event_attr_fprintf.c
+++ b/tools/perf/util/perf_event_attr_fprintf.c
@@ -360,6 +360,12 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
 	PRINT_ATTRf(aux_start_paused, p_unsigned);
 	PRINT_ATTRf(aux_pause, p_unsigned);
 	PRINT_ATTRf(aux_resume, p_unsigned);
+	PRINT_ATTRf(sample_simd_pred_reg_qwords, p_unsigned);
+	PRINT_ATTRf(sample_simd_pred_reg_intr, p_hex);
+	PRINT_ATTRf(sample_simd_pred_reg_user, p_hex);
+	PRINT_ATTRf(sample_simd_vec_reg_qwords, p_unsigned);
+	PRINT_ATTRf(sample_simd_vec_reg_intr, p_hex);
+	PRINT_ATTRf(sample_simd_vec_reg_user, p_hex);
 
 	return ret;
 }
diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
index 44b90bbf2d07..e8a9fabc92e6 100644
--- a/tools/perf/util/perf_regs.c
+++ b/tools/perf/util/perf_regs.c
@@ -11,6 +11,11 @@ int __weak arch_sdt_arg_parse_op(char *old_op __maybe_unused,
 	return SDT_ARG_SKIP;
 }
 
+bool __weak arch_has_simd_regs(u64 mask __maybe_unused)
+{
+	return false;
+}
+
 uint64_t __weak arch__intr_reg_mask(void)
 {
 	return 0;
@@ -21,6 +26,50 @@ uint64_t __weak arch__user_reg_mask(void)
 	return 0;
 }
 
+uint64_t __weak arch__intr_simd_reg_mask(void)
+{
+	return 0;
+}
+
+uint64_t __weak arch__user_simd_reg_mask(void)
+{
+	return 0;
+}
+
+uint64_t __weak arch__intr_pred_reg_mask(void)
+{
+	return 0;
+}
+
+uint64_t __weak arch__user_pred_reg_mask(void)
+{
+	return 0;
+}
+
+uint64_t __weak arch__intr_simd_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
+{
+	*qwords = 0;
+	return 0;
+}
+
+uint64_t __weak arch__user_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
+{
+	*qwords = 0;
+	return 0;
+}
+
+uint64_t __weak arch__intr_pred_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
+{
+	*qwords = 0;
+	return 0;
+}
+
+uint64_t __weak arch__user_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
+{
+	*qwords = 0;
+	return 0;
+}
+
 static const struct sample_reg sample_reg_masks[] = {
 	SMPL_REG_END
 };
@@ -30,6 +79,16 @@ const struct sample_reg * __weak arch__sample_reg_masks(void)
 	return sample_reg_masks;
 }
 
+const struct sample_reg * __weak arch__sample_simd_reg_masks(void)
+{
+	return sample_reg_masks;
+}
+
+const struct sample_reg * __weak arch__sample_pred_reg_masks(void)
+{
+	return sample_reg_masks;
+}
+
 const char *perf_reg_name(int id, const char *arch)
 {
 	const char *reg_name = NULL;
diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
index f2d0736d65cc..bce9c4cfd1bf 100644
--- a/tools/perf/util/perf_regs.h
+++ b/tools/perf/util/perf_regs.h
@@ -24,9 +24,20 @@ enum {
 };
 
 int arch_sdt_arg_parse_op(char *old_op, char **new_op);
+bool arch_has_simd_regs(u64 mask);
 uint64_t arch__intr_reg_mask(void);
 uint64_t arch__user_reg_mask(void);
 const struct sample_reg *arch__sample_reg_masks(void);
+const struct sample_reg *arch__sample_simd_reg_masks(void);
+const struct sample_reg *arch__sample_pred_reg_masks(void);
+uint64_t arch__intr_simd_reg_mask(void);
+uint64_t arch__user_simd_reg_mask(void);
+uint64_t arch__intr_pred_reg_mask(void);
+uint64_t arch__user_pred_reg_mask(void);
+uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
+uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
+uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
+uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
 
 const char *perf_reg_name(int id, const char *arch);
 int perf_reg_value(u64 *valp, struct regs_dump *regs, int id);
diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
index ea3a6c4657ee..825ffb4cc53f 100644
--- a/tools/perf/util/record.h
+++ b/tools/perf/util/record.h
@@ -59,7 +59,13 @@ struct record_opts {
 	unsigned int  user_freq;
 	u64	      branch_stack;
 	u64	      sample_intr_regs;
+	u64	      sample_intr_vec_regs;
 	u64	      sample_user_regs;
+	u64	      sample_user_vec_regs;
+	u16	      sample_pred_regs_qwords;
+	u16	      sample_vec_regs_qwords;
+	u16	      sample_intr_pred_regs;
+	u16	      sample_user_pred_regs;
 	u64	      default_interval;
 	u64	      user_interval;
 	size_t	      auxtrace_snapshot_size;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [Patch v5 19/19] perf regs: Enable dumping of SIMD registers
  2025-12-03  6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (17 preceding siblings ...)
  2025-12-03  6:54 ` [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format Dapeng Mi
@ 2025-12-03  6:55 ` Dapeng Mi
  2025-12-04  0:24 ` [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Ian Rogers
  2025-12-16  4:42 ` Ravi Bangoria
  20 siblings, 0 replies; 86+ messages in thread
From: Dapeng Mi @ 2025-12-03  6:55 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

Add support for dumping SIMD registers using the new
PERF_SAMPLE_REGS_ABI_SIMD ABI.

Currently, the XMM, YMM, ZMM, OPMASK, eGPRs, and SSP registers on x86
platforms are supported with the PERF_SAMPLE_REGS_ABI_SIMD ABI.

An example of the output is displayed below.

Example:

 $ perf record -e cycles:p -IXMM,YMM,OPMASK,SSP ./test
 $ perf report -D
 ... ...
 237538985992962 0x454d0 [0x480]: PERF_RECORD_SAMPLE(IP, 0x1):
 179370/179370: 0xffffffff969627fc period: 124999 addr: 0
 ... intr regs: mask 0x20000000000 ABI 64-bit
 .... SSP   0x0000000000000000
 ... SIMD ABI nr_vectors 32 vector_qwords 4 nr_pred 8 pred_qwords 1
 .... YMM  [0] 0x0000000000004000
 .... YMM  [0] 0x000055e828695270
 .... YMM  [0] 0x0000000000000000
 .... YMM  [0] 0x0000000000000000
 .... YMM  [1] 0x000055e8286990e0
 .... YMM  [1] 0x000055e828698dd0
 .... YMM  [1] 0x0000000000000000
 .... YMM  [1] 0x0000000000000000
 ... ...
 .... YMM  [31] 0x0000000000000000
 .... YMM  [31] 0x0000000000000000
 .... YMM  [31] 0x0000000000000000
 .... YMM  [31] 0x0000000000000000
 .... OPMASK[0] 0x0000000000100221
 .... OPMASK[1] 0x0000000000000020
 .... OPMASK[2] 0x000000007fffffff
 .... OPMASK[3] 0x0000000000000000
 .... OPMASK[4] 0x0000000000000000
 .... OPMASK[5] 0x0000000000000000
 .... OPMASK[6] 0x0000000000000000
 .... OPMASK[7] 0x0000000000000000
 ... ...

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 tools/perf/util/evsel.c                       | 20 +++++
 .../perf/util/perf-regs-arch/perf_regs_x86.c  | 43 ++++++++++
 tools/perf/util/sample.h                      | 10 +++
 tools/perf/util/session.c                     | 78 +++++++++++++++++--
 4 files changed, 143 insertions(+), 8 deletions(-)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 5d1d90cf9488..8f3fafe3a43f 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -3347,6 +3347,16 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
 			regs->mask = mask;
 			regs->regs = (u64 *)array;
 			array = (void *)array + sz;
+
+			if (regs->abi & PERF_SAMPLE_REGS_ABI_SIMD) {
+				regs->config = *(u64 *)array;
+				array = (void *)array + sizeof(u64);
+				regs->data = (u64 *)array;
+				sz = (regs->nr_vectors * regs->vector_qwords +
+				      regs->nr_pred * regs->pred_qwords) * sizeof(u64);
+				OVERFLOW_CHECK(array, sz, max_size);
+				array = (void *)array + sz;
+			}
 		}
 	}
 
@@ -3404,6 +3414,16 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
 			regs->mask = mask;
 			regs->regs = (u64 *)array;
 			array = (void *)array + sz;
+
+			if (regs->abi & PERF_SAMPLE_REGS_ABI_SIMD) {
+				regs->config = *(u64 *)array;
+				array = (void *)array + sizeof(u64);
+				regs->data = (u64 *)array;
+				sz = (regs->nr_vectors * regs->vector_qwords +
+				      regs->nr_pred * regs->pred_qwords) * sizeof(u64);
+				OVERFLOW_CHECK(array, sz, max_size);
+				array = (void *)array + sz;
+			}
 		}
 	}
 
diff --git a/tools/perf/util/perf-regs-arch/perf_regs_x86.c b/tools/perf/util/perf-regs-arch/perf_regs_x86.c
index 708954a9d35d..32dac438b12d 100644
--- a/tools/perf/util/perf-regs-arch/perf_regs_x86.c
+++ b/tools/perf/util/perf-regs-arch/perf_regs_x86.c
@@ -5,6 +5,49 @@
 
 const char *__perf_reg_name_x86(int id)
 {
+	if (id > PERF_REG_X86_R15 && arch__intr_simd_reg_mask()) {
+		switch (id) {
+		case PERF_REG_X86_R16:
+			return "R16";
+		case PERF_REG_X86_R17:
+			return "R17";
+		case PERF_REG_X86_R18:
+			return "R18";
+		case PERF_REG_X86_R19:
+			return "R19";
+		case PERF_REG_X86_R20:
+			return "R20";
+		case PERF_REG_X86_R21:
+			return "R21";
+		case PERF_REG_X86_R22:
+			return "R22";
+		case PERF_REG_X86_R23:
+			return "R23";
+		case PERF_REG_X86_R24:
+			return "R24";
+		case PERF_REG_X86_R25:
+			return "R25";
+		case PERF_REG_X86_R26:
+			return "R26";
+		case PERF_REG_X86_R27:
+			return "R27";
+		case PERF_REG_X86_R28:
+			return "R28";
+		case PERF_REG_X86_R29:
+			return "R29";
+		case PERF_REG_X86_R30:
+			return "R30";
+		case PERF_REG_X86_R31:
+			return "R31";
+		case PERF_REG_X86_SSP:
+			return "SSP";
+		default:
+			return NULL;
+		}
+
+		return NULL;
+	}
+
 	switch (id) {
 	case PERF_REG_X86_AX:
 		return "AX";
diff --git a/tools/perf/util/sample.h b/tools/perf/util/sample.h
index fae834144ef4..3b247e0e8242 100644
--- a/tools/perf/util/sample.h
+++ b/tools/perf/util/sample.h
@@ -12,6 +12,16 @@ struct regs_dump {
 	u64 abi;
 	u64 mask;
 	u64 *regs;
+	union {
+		u64 config;
+		struct {
+			u16 nr_vectors;
+			u16 vector_qwords;
+			u16 nr_pred;
+			u16 pred_qwords;
+		};
+	};
+	u64 *data;
 
 	/* Cached values/mask filled by first register access. */
 	u64 cache_regs[PERF_SAMPLE_REGS_CACHE_SIZE];
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 09af486c83e4..c692be265c21 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -927,18 +927,78 @@ static void regs_dump__printf(u64 mask, u64 *regs, const char *arch)
 	}
 }
 
-static const char *regs_abi[] = {
-	[PERF_SAMPLE_REGS_ABI_NONE] = "none",
-	[PERF_SAMPLE_REGS_ABI_32] = "32-bit",
-	[PERF_SAMPLE_REGS_ABI_64] = "64-bit",
-};
+static void simd_regs_dump__printf(struct regs_dump *regs, bool intr)
+{
+	const char *name = "unknown";
+	const struct sample_reg *r;
+	int i, idx = 0;
+	u16 qwords;
+	int reg_idx;
+
+	if (!(regs->abi & PERF_SAMPLE_REGS_ABI_SIMD))
+		return;
+
+	printf("... SIMD ABI nr_vectors %d vector_qwords %d nr_pred %d pred_qwords %d\n",
+	       regs->nr_vectors, regs->vector_qwords,
+	       regs->nr_pred, regs->pred_qwords);
+
+	for (r = arch__sample_simd_reg_masks(); r->name; r++) {
+		if (!fls64(r->mask))
+			continue;
+		reg_idx = fls64(r->mask) - 1;
+		if (intr)
+			arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
+		else
+			arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
+		if (regs->vector_qwords == qwords) {
+			name = r->name;
+			break;
+		}
+	}
+
+	for (i = 0; i < regs->nr_vectors; i++) {
+		printf(".... %-5s[%d] 0x%016" PRIx64 "\n", name, i, regs->data[idx++]);
+		printf(".... %-5s[%d] 0x%016" PRIx64 "\n", name, i, regs->data[idx++]);
+		if (regs->vector_qwords > 2) {
+			printf(".... %-5s[%d] 0x%016" PRIx64 "\n", name, i, regs->data[idx++]);
+			printf(".... %-5s[%d] 0x%016" PRIx64 "\n", name, i, regs->data[idx++]);
+		}
+		if (regs->vector_qwords > 4) {
+			printf(".... %-5s[%d] 0x%016" PRIx64 "\n", name, i, regs->data[idx++]);
+			printf(".... %-5s[%d] 0x%016" PRIx64 "\n", name, i, regs->data[idx++]);
+			printf(".... %-5s[%d] 0x%016" PRIx64 "\n", name, i, regs->data[idx++]);
+			printf(".... %-5s[%d] 0x%016" PRIx64 "\n", name, i, regs->data[idx++]);
+		}
+	}
+
+	name = "unknown";
+	for (r = arch__sample_pred_reg_masks(); r->name; r++) {
+		if (!fls64(r->mask))
+			continue;
+		reg_idx = fls64(r->mask) - 1;
+		if (intr)
+			arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
+		else
+			arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
+		if (regs->pred_qwords == qwords) {
+			name = r->name;
+			break;
+		}
+	}
+	for (i = 0; i < regs->nr_pred; i++)
+		printf(".... %-5s[%d] 0x%016" PRIx64 "\n", name, i, regs->data[idx++]);
+}
 
 static inline const char *regs_dump_abi(struct regs_dump *d)
 {
-	if (d->abi > PERF_SAMPLE_REGS_ABI_64)
-		return "unknown";
+	if (!d->abi)
+		return "none";
+	if (d->abi & PERF_SAMPLE_REGS_ABI_32)
+		return "32-bit";
+	else if (d->abi & PERF_SAMPLE_REGS_ABI_64)
+		return "64-bit";
 
-	return regs_abi[d->abi];
+	return "unknown";
 }
 
 static void regs__printf(const char *type, struct regs_dump *regs, const char *arch)
@@ -964,6 +1024,7 @@ static void regs_user__printf(struct perf_sample *sample, const char *arch)
 
 	if (user_regs->regs)
 		regs__printf("user", user_regs, arch);
+	simd_regs_dump__printf(user_regs, false);
 }
 
 static void regs_intr__printf(struct perf_sample *sample, const char *arch)
@@ -977,6 +1038,7 @@ static void regs_intr__printf(struct perf_sample *sample, const char *arch)
 
 	if (intr_regs->regs)
 		regs__printf("intr", intr_regs, arch);
+	simd_regs_dump__printf(intr_regs, true);
 }
 
 static void stack_user__printf(struct stack_dump *dump)
-- 
2.34.1
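For reference, the bit-flag ABI decoding introduced by the regs_dump_abi() hunk above can be exercised standalone. This is a minimal sketch using the enum values from this series, not the actual tools/perf code; regs_dump_abi() here is a local copy for illustration:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Values from the series: the ABI field becomes a bit mask so that
 * PERF_SAMPLE_REGS_ABI_SIMD can be OR'ed with the 32/64-bit flags,
 * instead of the old enum-index scheme.
 */
enum perf_sample_regs_abi {
	PERF_SAMPLE_REGS_ABI_NONE = 0,
	PERF_SAMPLE_REGS_ABI_32   = (1 << 0),
	PERF_SAMPLE_REGS_ABI_64   = (1 << 1),
	PERF_SAMPLE_REGS_ABI_SIMD = (1 << 2),
};

/* Decode the ABI flags the same way the patched regs_dump_abi() does. */
static const char *regs_dump_abi(uint64_t abi)
{
	if (!abi)
		return "none";
	if (abi & PERF_SAMPLE_REGS_ABI_32)
		return "32-bit";
	if (abi & PERF_SAMPLE_REGS_ABI_64)
		return "64-bit";
	return "unknown";
}
```

Note that the SIMD bit alone (with neither the 32- nor 64-bit flag set) still decodes as "unknown", which matches the patched function.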


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [Patch v5 17/19] perf headers: Sync with the kernel headers
  2025-12-03  6:54 ` [Patch v5 17/19] perf headers: Sync with the kernel headers Dapeng Mi
@ 2025-12-03 23:43   ` Ian Rogers
  2025-12-04  1:37     ` Mi, Dapeng
  2026-01-20  7:01   ` Ian Rogers
  2026-01-20  7:16   ` Ian Rogers
  2 siblings, 1 reply; 86+ messages in thread
From: Ian Rogers @ 2025-12-03 23:43 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang

On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
>
> From: Kan Liang <kan.liang@linux.intel.com>
>
> Update include/uapi/linux/perf_event.h and
> arch/x86/include/uapi/asm/perf_regs.h to support extended regs.
>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> ---
>  tools/arch/x86/include/uapi/asm/perf_regs.h | 62 +++++++++++++++++++++
>  tools/include/uapi/linux/perf_event.h       | 45 +++++++++++++--
>  2 files changed, 103 insertions(+), 4 deletions(-)
>
> diff --git a/tools/arch/x86/include/uapi/asm/perf_regs.h b/tools/arch/x86/include/uapi/asm/perf_regs.h
> index 7c9d2bb3833b..f3561ed10041 100644
> --- a/tools/arch/x86/include/uapi/asm/perf_regs.h
> +++ b/tools/arch/x86/include/uapi/asm/perf_regs.h
> @@ -27,9 +27,34 @@ enum perf_event_x86_regs {
>         PERF_REG_X86_R13,
>         PERF_REG_X86_R14,
>         PERF_REG_X86_R15,
> +       /*
> +        * The EGPRs/SSP and XMM have overlaps. Only one can be used
> +        * at a time. For the ABI type PERF_SAMPLE_REGS_ABI_SIMD,
> +        * utilize EGPRs/SSP. For the other ABI type, XMM is used.
> +        *
> +        * Extended GPRs (EGPRs)
> +        */
> +       PERF_REG_X86_R16,
> +       PERF_REG_X86_R17,
> +       PERF_REG_X86_R18,
> +       PERF_REG_X86_R19,
> +       PERF_REG_X86_R20,
> +       PERF_REG_X86_R21,
> +       PERF_REG_X86_R22,
> +       PERF_REG_X86_R23,
> +       PERF_REG_X86_R24,
> +       PERF_REG_X86_R25,
> +       PERF_REG_X86_R26,
> +       PERF_REG_X86_R27,
> +       PERF_REG_X86_R28,
> +       PERF_REG_X86_R29,
> +       PERF_REG_X86_R30,
> +       PERF_REG_X86_R31,
> +       PERF_REG_X86_SSP,
>         /* These are the limits for the GPRs. */
>         PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
>         PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1,
> +       PERF_REG_MISC_MAX = PERF_REG_X86_SSP + 1,

I wonder whether MISC is the most intention-revealing name. What happens
if things are extended again? Would APX be a better alternative, i.e.
PERF_REG_APX_MAX?

>
>         /* These all need two bits set because they are 128bit */
>         PERF_REG_X86_XMM0  = 32,
> @@ -54,5 +79,42 @@ enum perf_event_x86_regs {
>  };
>
>  #define PERF_REG_EXTENDED_MASK (~((1ULL << PERF_REG_X86_XMM0) - 1))
> +#define PERF_X86_EGPRS_MASK    GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16)
> +
> +enum {
> +       PERF_REG_X86_XMM,
> +       PERF_REG_X86_YMM,
> +       PERF_REG_X86_ZMM,
> +       PERF_REG_X86_MAX_SIMD_REGS,
> +
> +       PERF_REG_X86_OPMASK = 0,
> +       PERF_REG_X86_MAX_PRED_REGS = 1,
> +};
> +
> +enum {
> +       PERF_X86_SIMD_XMM_REGS      = 16,
> +       PERF_X86_SIMD_YMM_REGS      = 16,
> +       PERF_X86_SIMD_ZMMH_REGS     = 16,
> +       PERF_X86_SIMD_ZMM_REGS      = 32,
> +       PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_ZMM_REGS,
> +
> +       PERF_X86_SIMD_OPMASK_REGS   = 8,
> +       PERF_X86_SIMD_PRED_REGS_MAX = PERF_X86_SIMD_OPMASK_REGS,
> +};
> +
> +#define PERF_X86_SIMD_PRED_MASK                GENMASK(PERF_X86_SIMD_PRED_REGS_MAX - 1, 0)
> +#define PERF_X86_SIMD_VEC_MASK         GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
> +
> +#define PERF_X86_H16ZMM_BASE           PERF_X86_SIMD_ZMMH_REGS
> +
> +enum {
> +       PERF_X86_OPMASK_QWORDS   = 1,
> +       PERF_X86_XMM_QWORDS      = 2,
> +       PERF_X86_YMMH_QWORDS     = 2,
> +       PERF_X86_YMM_QWORDS      = 4,
> +       PERF_X86_ZMMH_QWORDS     = 4,
> +       PERF_X86_ZMM_QWORDS      = 8,
> +       PERF_X86_SIMD_QWORDS_MAX = PERF_X86_ZMM_QWORDS,
> +};
>
>  #endif /* _ASM_X86_PERF_REGS_H */
> diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
> index d292f96bc06f..f1474da32622 100644
> --- a/tools/include/uapi/linux/perf_event.h
> +++ b/tools/include/uapi/linux/perf_event.h
> @@ -314,8 +314,9 @@ enum {
>   */
>  enum perf_sample_regs_abi {
>         PERF_SAMPLE_REGS_ABI_NONE               = 0,
> -       PERF_SAMPLE_REGS_ABI_32                 = 1,
> -       PERF_SAMPLE_REGS_ABI_64                 = 2,
> +       PERF_SAMPLE_REGS_ABI_32                 = (1 << 0),
> +       PERF_SAMPLE_REGS_ABI_64                 = (1 << 1),
> +       PERF_SAMPLE_REGS_ABI_SIMD               = (1 << 2),
>  };
>
>  /*
> @@ -382,6 +383,7 @@ enum perf_event_read_format {
>  #define PERF_ATTR_SIZE_VER6                    120     /* Add: aux_sample_size */
>  #define PERF_ATTR_SIZE_VER7                    128     /* Add: sig_data */
>  #define PERF_ATTR_SIZE_VER8                    136     /* Add: config3 */
> +#define PERF_ATTR_SIZE_VER9                    168     /* Add: sample_simd_{pred,vec}_reg_* */

ARM have added a config4 in:
https://lore.kernel.org/lkml/20251111-james-perf-feat_spe_eft-v10-1-1e1b5bf2cd05@linaro.org/
so this will need to be VER10.

Thanks,
Ian

>
>  /*
>   * 'struct perf_event_attr' contains various attributes that define
> @@ -545,6 +547,25 @@ struct perf_event_attr {
>         __u64   sig_data;
>
>         __u64   config3; /* extension of config2 */
> +
> +
> +       /*
> +        * Defines the set of SIMD registers to dump on samples.
> +        * A non-zero sample_simd_regs_enabled implies the SIMD
> +        * register fields are used to configure all SIMD registers.
> +        * If !sample_simd_regs_enabled, sample_regs_XXX may be used to
> +        * configure some SIMD registers on x86.
> +        */
> +       union {
> +               __u16 sample_simd_regs_enabled;
> +               __u16 sample_simd_pred_reg_qwords;
> +       };
> +       __u32 sample_simd_pred_reg_intr;
> +       __u32 sample_simd_pred_reg_user;
> +       __u16 sample_simd_vec_reg_qwords;
> +       __u64 sample_simd_vec_reg_intr;
> +       __u64 sample_simd_vec_reg_user;
> +       __u32 __reserved_4;
>  };
>
>  /*
> @@ -1018,7 +1039,15 @@ enum perf_event_type {
>          *      } && PERF_SAMPLE_BRANCH_STACK
>          *
>          *      { u64                   abi; # enum perf_sample_regs_abi
> -        *        u64                   regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
> +        *        u64                   regs[weight(mask)];
> +        *        struct {
> +        *              u16 nr_vectors;
> +        *              u16 vector_qwords;
> +        *              u16 nr_pred;
> +        *              u16 pred_qwords;
> +        *              u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
> +        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
> +        *      } && PERF_SAMPLE_REGS_USER
>          *
>          *      { u64                   size;
>          *        char                  data[size];
> @@ -1045,7 +1074,15 @@ enum perf_event_type {
>          *      { u64                   data_src; } && PERF_SAMPLE_DATA_SRC
>          *      { u64                   transaction; } && PERF_SAMPLE_TRANSACTION
>          *      { u64                   abi; # enum perf_sample_regs_abi
> -        *        u64                   regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
> +        *        u64                   regs[weight(mask)];
> +        *        struct {
> +        *              u16 nr_vectors;
> +        *              u16 vector_qwords;
> +        *              u16 nr_pred;
> +        *              u16 pred_qwords;
> +        *              u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
> +        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
> +        *      } && PERF_SAMPLE_REGS_INTR
>          *      { u64                   phys_addr;} && PERF_SAMPLE_PHYS_ADDR
>          *      { u64                   cgroup;} && PERF_SAMPLE_CGROUP
>          *      { u64                   data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
> --
> 2.34.1
>
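The SIMD payload layout documented in the hunk above (nr_vectors/vector_qwords/nr_pred/pred_qwords header followed by data[]) implies a simple size computation when parsing a sample. A minimal standalone sketch; the struct and function names here are hypothetical, not from the patch:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Header of the SIMD block appended to the REGS_USER/REGS_INTR sample
 * payload when (abi & PERF_SAMPLE_REGS_ABI_SIMD), per the layout
 * documented in the uapi comment in this patch.
 */
struct simd_regs_hdr {
	uint16_t nr_vectors;
	uint16_t vector_qwords;
	uint16_t nr_pred;
	uint16_t pred_qwords;
};

/* Number of u64 words in the data[] array that follows the header. */
static uint32_t simd_data_qwords(const struct simd_regs_hdr *h)
{
	return (uint32_t)h->nr_vectors * h->vector_qwords +
	       (uint32_t)h->nr_pred * h->pred_qwords;
}
```

For example, 16 ZMM registers (8 qwords each) plus 8 OPMASK registers (1 qword each) give a 136-qword data array.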

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
  2025-12-03  6:54 ` [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format Dapeng Mi
@ 2025-12-04  0:17   ` Ian Rogers
  2025-12-04  2:58     ` Mi, Dapeng
  2026-01-20  7:39   ` Ian Rogers
  1 sibling, 1 reply; 86+ messages in thread
From: Ian Rogers @ 2025-12-04  0:17 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang

On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
>
> From: Kan Liang <kan.liang@linux.intel.com>
>
> This patch adds support for the newly introduced SIMD register sampling
> format by adding the following functions:
>
> uint64_t arch__intr_simd_reg_mask(void);
> uint64_t arch__user_simd_reg_mask(void);
> uint64_t arch__intr_pred_reg_mask(void);
> uint64_t arch__user_pred_reg_mask(void);
> uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>
> The arch__{intr|user}_simd_reg_mask() functions retrieve the bitmap of
> supported SIMD registers, such as XMM/YMM/ZMM on x86 platforms.
>
> The arch__{intr|user}_pred_reg_mask() functions retrieve the bitmap of
> supported PRED registers, such as OPMASK on x86 platforms.
>
> The arch__{intr|user}_simd_reg_bitmap_qwords() functions provide the
> exact bitmap and number of qwords for a specific type of SIMD register.
> For example, for XMM registers on x86 platforms, the returned bitmap is
> 0xffff (XMM0 ~ XMM15) and the qwords number is 2 (128 bits for each XMM).
>
> The arch__{intr|user}_pred_reg_bitmap_qwords() functions provide the
> exact bitmap and number of qwords for a specific type of PRED register.
> For example, for OPMASK registers on x86 platforms, the returned bitmap
> is 0xff (OPMASK0 ~ OPMASK7) and the qwords number is 1 (64 bits for each
> OPMASK).
>
> Additionally, the function __parse_regs() is enhanced to support parsing
> these newly introduced SIMD registers. Currently, each type of register
> can only be sampled collectively; sampling a specific SIMD register is
> not supported. For example, all XMM registers are sampled together rather
> than sampling only XMM0.
>
> When multiple overlapping register types, such as XMM and YMM, are
> sampled simultaneously, only the superset (YMM registers) is sampled.
>
> With this patch, all supported sampling registers on x86 platforms are
> displayed as follows.
>
>  $perf record -I?
>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>
>  $perf record --user-regs=?
>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> ---
>  tools/perf/arch/x86/util/perf_regs.c      | 470 +++++++++++++++++++++-
>  tools/perf/util/evsel.c                   |  27 ++
>  tools/perf/util/parse-regs-options.c      | 151 ++++++-
>  tools/perf/util/perf_event_attr_fprintf.c |   6 +
>  tools/perf/util/perf_regs.c               |  59 +++
>  tools/perf/util/perf_regs.h               |  11 +
>  tools/perf/util/record.h                  |   6 +
>  7 files changed, 714 insertions(+), 16 deletions(-)
>
> diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
> index 12fd93f04802..db41430f3b07 100644
> --- a/tools/perf/arch/x86/util/perf_regs.c
> +++ b/tools/perf/arch/x86/util/perf_regs.c
> @@ -13,6 +13,49 @@
>  #include "../../../util/pmu.h"
>  #include "../../../util/pmus.h"
>
> +static const struct sample_reg sample_reg_masks_ext[] = {
> +       SMPL_REG(AX, PERF_REG_X86_AX),
> +       SMPL_REG(BX, PERF_REG_X86_BX),
> +       SMPL_REG(CX, PERF_REG_X86_CX),
> +       SMPL_REG(DX, PERF_REG_X86_DX),
> +       SMPL_REG(SI, PERF_REG_X86_SI),
> +       SMPL_REG(DI, PERF_REG_X86_DI),
> +       SMPL_REG(BP, PERF_REG_X86_BP),
> +       SMPL_REG(SP, PERF_REG_X86_SP),
> +       SMPL_REG(IP, PERF_REG_X86_IP),
> +       SMPL_REG(FLAGS, PERF_REG_X86_FLAGS),
> +       SMPL_REG(CS, PERF_REG_X86_CS),
> +       SMPL_REG(SS, PERF_REG_X86_SS),
> +#ifdef HAVE_ARCH_X86_64_SUPPORT
> +       SMPL_REG(R8, PERF_REG_X86_R8),
> +       SMPL_REG(R9, PERF_REG_X86_R9),
> +       SMPL_REG(R10, PERF_REG_X86_R10),
> +       SMPL_REG(R11, PERF_REG_X86_R11),
> +       SMPL_REG(R12, PERF_REG_X86_R12),
> +       SMPL_REG(R13, PERF_REG_X86_R13),
> +       SMPL_REG(R14, PERF_REG_X86_R14),
> +       SMPL_REG(R15, PERF_REG_X86_R15),
> +       SMPL_REG(R16, PERF_REG_X86_R16),
> +       SMPL_REG(R17, PERF_REG_X86_R17),
> +       SMPL_REG(R18, PERF_REG_X86_R18),
> +       SMPL_REG(R19, PERF_REG_X86_R19),
> +       SMPL_REG(R20, PERF_REG_X86_R20),
> +       SMPL_REG(R21, PERF_REG_X86_R21),
> +       SMPL_REG(R22, PERF_REG_X86_R22),
> +       SMPL_REG(R23, PERF_REG_X86_R23),
> +       SMPL_REG(R24, PERF_REG_X86_R24),
> +       SMPL_REG(R25, PERF_REG_X86_R25),
> +       SMPL_REG(R26, PERF_REG_X86_R26),
> +       SMPL_REG(R27, PERF_REG_X86_R27),
> +       SMPL_REG(R28, PERF_REG_X86_R28),
> +       SMPL_REG(R29, PERF_REG_X86_R29),
> +       SMPL_REG(R30, PERF_REG_X86_R30),
> +       SMPL_REG(R31, PERF_REG_X86_R31),
> +       SMPL_REG(SSP, PERF_REG_X86_SSP),
> +#endif
> +       SMPL_REG_END
> +};
> +
>  static const struct sample_reg sample_reg_masks[] = {
>         SMPL_REG(AX, PERF_REG_X86_AX),
>         SMPL_REG(BX, PERF_REG_X86_BX),
> @@ -276,27 +319,404 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
>         return SDT_ARG_VALID;
>  }
>
> +static bool support_simd_reg(u64 sample_type, u16 qwords, u64 mask, bool pred)

To make the code easier to read, it'd be nice to document sample_type,
(In particular, which PERF_SAMPLE_* values sample_type may take, and
whether mask is a register bitmap or a qword bitmap.)
qwords and mask here.

> +{
> +       struct perf_event_attr attr = {
> +               .type                           = PERF_TYPE_HARDWARE,
> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
> +               .sample_type                    = sample_type,
> +               .disabled                       = 1,
> +               .exclude_kernel                 = 1,
> +               .sample_simd_regs_enabled       = 1,
> +       };
> +       int fd;
> +
> +       attr.sample_period = 1;
> +
> +       if (!pred) {
> +               attr.sample_simd_vec_reg_qwords = qwords;
> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> +                       attr.sample_simd_vec_reg_intr = mask;
> +               else
> +                       attr.sample_simd_vec_reg_user = mask;
> +       } else {
> +               attr.sample_simd_pred_reg_qwords = PERF_X86_OPMASK_QWORDS;
> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> +                       attr.sample_simd_pred_reg_intr = PERF_X86_SIMD_PRED_MASK;
> +               else
> +                       attr.sample_simd_pred_reg_user = PERF_X86_SIMD_PRED_MASK;
> +       }
> +
> +       if (perf_pmus__num_core_pmus() > 1) {
> +               struct perf_pmu *pmu = NULL;
> +               __u64 type = PERF_TYPE_RAW;

It should be okay to do:
__u64 type = perf_pmus__find_core_pmu()->type
rather than have the whole loop below.

> +
> +               /*
> +                * The same register set is supported among different hybrid PMUs.
> +                * Only check the first available one.
> +                */
> +               while ((pmu = perf_pmus__scan_core(pmu)) != NULL) {
> +                       type = pmu->type;
> +                       break;
> +               }
> +               attr.config |= type << PERF_PMU_TYPE_SHIFT;
> +       }
> +
> +       event_attr_init(&attr);
> +
> +       fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
> +       if (fd != -1) {
> +               close(fd);
> +               return true;
> +       }
> +
> +       return false;
> +}
> +
> +static bool __arch_simd_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
> +{
> +       bool supported = false;
> +       u64 bits;
> +
> +       *mask = 0;
> +       *qwords = 0;
> +
> +       switch (reg) {
> +       case PERF_REG_X86_XMM:
> +               bits = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
> +               supported = support_simd_reg(sample_type, PERF_X86_XMM_QWORDS, bits, false);
> +               if (supported) {
> +                       *mask = bits;
> +                       *qwords = PERF_X86_XMM_QWORDS;
> +               }
> +               break;
> +       case PERF_REG_X86_YMM:
> +               bits = BIT_ULL(PERF_X86_SIMD_YMM_REGS) - 1;
> +               supported = support_simd_reg(sample_type, PERF_X86_YMM_QWORDS, bits, false);
> +               if (supported) {
> +                       *mask = bits;
> +                       *qwords = PERF_X86_YMM_QWORDS;
> +               }
> +               break;
> +       case PERF_REG_X86_ZMM:
> +               bits = BIT_ULL(PERF_X86_SIMD_ZMM_REGS) - 1;
> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
> +               if (supported) {
> +                       *mask = bits;
> +                       *qwords = PERF_X86_ZMM_QWORDS;
> +                       break;
> +               }
> +
> +               bits = BIT_ULL(PERF_X86_SIMD_ZMMH_REGS) - 1;
> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
> +               if (supported) {
> +                       *mask = bits;
> +                       *qwords = PERF_X86_ZMMH_QWORDS;
> +               }
> +               break;
> +       default:
> +               break;
> +       }
> +
> +       return supported;
> +}
> +
> +static bool __arch_pred_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
> +{
> +       bool supported = false;
> +       u64 bits;
> +
> +       *mask = 0;
> +       *qwords = 0;
> +
> +       switch (reg) {
> +       case PERF_REG_X86_OPMASK:
> +               bits = BIT_ULL(PERF_X86_SIMD_OPMASK_REGS) - 1;
> +               supported = support_simd_reg(sample_type, PERF_X86_OPMASK_QWORDS, bits, true);
> +               if (supported) {
> +                       *mask = bits;
> +                       *qwords = PERF_X86_OPMASK_QWORDS;
> +               }
> +               break;
> +       default:
> +               break;
> +       }
> +
> +       return supported;
> +}
> +
> +static bool has_cap_simd_regs(void)
> +{
> +       uint64_t mask = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
> +       u16 qwords = PERF_X86_XMM_QWORDS;
> +       static bool has_cap_simd_regs;
> +       static bool cached;
> +
> +       if (cached)
> +               return has_cap_simd_regs;
> +
> +       has_cap_simd_regs = __arch_simd_reg_mask(PERF_SAMPLE_REGS_INTR,
> +                                                PERF_REG_X86_XMM, &mask, &qwords);
> +       has_cap_simd_regs |= __arch_simd_reg_mask(PERF_SAMPLE_REGS_USER,
> +                                                PERF_REG_X86_XMM, &mask, &qwords);
> +       cached = true;
> +
> +       return has_cap_simd_regs;
> +}
> +
> +bool arch_has_simd_regs(u64 mask)
> +{
> +       return has_cap_simd_regs() &&
> +              mask & GENMASK_ULL(PERF_REG_X86_SSP, PERF_REG_X86_R16);
> +}
> +
> +static const struct sample_reg sample_simd_reg_masks[] = {
> +       SMPL_REG(XMM, PERF_REG_X86_XMM),
> +       SMPL_REG(YMM, PERF_REG_X86_YMM),
> +       SMPL_REG(ZMM, PERF_REG_X86_ZMM),
> +       SMPL_REG_END
> +};
> +
> +static const struct sample_reg sample_pred_reg_masks[] = {
> +       SMPL_REG(OPMASK, PERF_REG_X86_OPMASK),
> +       SMPL_REG_END
> +};
> +
> +const struct sample_reg *arch__sample_simd_reg_masks(void)
> +{
> +       return sample_simd_reg_masks;
> +}
> +
> +const struct sample_reg *arch__sample_pred_reg_masks(void)
> +{
> +       return sample_pred_reg_masks;
> +}
> +
> +static bool x86_intr_simd_updated;
> +static u64 x86_intr_simd_reg_mask;
> +static u64 x86_intr_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
> +static u16 x86_intr_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];

Could we add some comments? I can kind of figure out that the
*_updated flags guard lazy initialization and what the masks are, but
qwords is an odd one. The comment could also point out that SIMD here
doesn't mean the machine supports SIMD, but that SIMD registers are
supported in perf events.

> +static bool x86_user_simd_updated;
> +static u64 x86_user_simd_reg_mask;
> +static u64 x86_user_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
> +static u16 x86_user_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
> +
> +static bool x86_intr_pred_updated;
> +static u64 x86_intr_pred_reg_mask;
> +static u64 x86_intr_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
> +static u16 x86_intr_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
> +static bool x86_user_pred_updated;
> +static u64 x86_user_pred_reg_mask;
> +static u64 x86_user_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
> +static u16 x86_user_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
> +
> +static uint64_t __arch__simd_reg_mask(u64 sample_type)
> +{
> +       const struct sample_reg *r = NULL;
> +       bool supported;
> +       u64 mask = 0;
> +       int reg;
> +
> +       if (!has_cap_simd_regs())
> +               return 0;
> +
> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_simd_updated)
> +               return x86_intr_simd_reg_mask;
> +
> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_simd_updated)
> +               return x86_user_simd_reg_mask;
> +
> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> +               supported = false;
> +
> +               if (!r->mask)
> +                       continue;
> +               reg = fls64(r->mask) - 1;
> +
> +               if (reg >= PERF_REG_X86_MAX_SIMD_REGS)
> +                       break;
> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> +                       supported = __arch_simd_reg_mask(sample_type, reg,
> +                                                        &x86_intr_simd_mask[reg],
> +                                                        &x86_intr_simd_qwords[reg]);
> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
> +                       supported = __arch_simd_reg_mask(sample_type, reg,
> +                                                        &x86_user_simd_mask[reg],
> +                                                        &x86_user_simd_qwords[reg]);
> +               if (supported)
> +                       mask |= BIT_ULL(reg);
> +       }
> +
> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
> +               x86_intr_simd_reg_mask = mask;
> +               x86_intr_simd_updated = true;
> +       } else {
> +               x86_user_simd_reg_mask = mask;
> +               x86_user_simd_updated = true;
> +       }
> +
> +       return mask;
> +}
> +
> +static uint64_t __arch__pred_reg_mask(u64 sample_type)
> +{
> +       const struct sample_reg *r = NULL;
> +       bool supported;
> +       u64 mask = 0;
> +       int reg;
> +
> +       if (!has_cap_simd_regs())
> +               return 0;
> +
> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_pred_updated)
> +               return x86_intr_pred_reg_mask;
> +
> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_pred_updated)
> +               return x86_user_pred_reg_mask;
> +
> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> +               supported = false;
> +
> +               if (!r->mask)
> +                       continue;
> +               reg = fls64(r->mask) - 1;
> +
> +               if (reg >= PERF_REG_X86_MAX_PRED_REGS)
> +                       break;
> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> +                       supported = __arch_pred_reg_mask(sample_type, reg,
> +                                                        &x86_intr_pred_mask[reg],
> +                                                        &x86_intr_pred_qwords[reg]);
> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
> +                       supported = __arch_pred_reg_mask(sample_type, reg,
> +                                                        &x86_user_pred_mask[reg],
> +                                                        &x86_user_pred_qwords[reg]);
> +               if (supported)
> +                       mask |= BIT_ULL(reg);
> +       }
> +
> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
> +               x86_intr_pred_reg_mask = mask;
> +               x86_intr_pred_updated = true;
> +       } else {
> +               x86_user_pred_reg_mask = mask;
> +               x86_user_pred_updated = true;
> +       }
> +
> +       return mask;
> +}

This feels repetitive with __arch__simd_reg_mask; could the two be
refactored together?
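A hypothetical shape for such a refactor (an untested sketch, not the patch's code): the shared table walk and per-class caching could take the register table, the per-class probe, and the output arrays as parameters. All names below (__arch__reg_class_mask, stub_probe, demo_mask) are invented for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

struct sample_reg {
	const char *name;
	uint64_t mask;
};

/*
 * Probe callback: returns true if register class 'reg' is supported
 * for 'sample_type', filling *mask (per-register bitmap) and *qwords.
 */
typedef bool (*reg_probe_fn)(uint64_t sample_type, int reg,
			     uint64_t *mask, uint16_t *qwords);

/* Position of the most significant set bit, 1-based; 0 for x == 0. */
static int fls64_(uint64_t x)
{
	int r = 0;

	while (x) {
		r++;
		x >>= 1;
	}
	return r;
}

/*
 * Common body of __arch__simd_reg_mask()/__arch__pred_reg_mask():
 * walk the table, probe each class, record per-class mask/qwords,
 * and return the bitmap of supported classes.
 */
static uint64_t __arch__reg_class_mask(uint64_t sample_type,
				       const struct sample_reg *tbl,
				       int max_regs, reg_probe_fn probe,
				       uint64_t *class_mask,
				       uint16_t *class_qwords)
{
	uint64_t mask = 0;

	for (const struct sample_reg *r = tbl; r->name; r++) {
		int reg;

		if (!r->mask)
			continue;
		reg = fls64_(r->mask) - 1;
		if (reg >= max_regs)
			break;
		if (probe(sample_type, reg, &class_mask[reg],
			  &class_qwords[reg]))
			mask |= 1ULL << reg;
	}
	return mask;
}

/* Stub probe for demonstration: pretend only class 0 (e.g. XMM) works. */
static bool stub_probe(uint64_t sample_type, int reg,
		       uint64_t *mask, uint16_t *qwords)
{
	(void)sample_type;
	if (reg != 0)
		return false;
	*mask = 0xffff;
	*qwords = 2;
	return true;
}

static uint64_t demo_mask(void)
{
	static const struct sample_reg tbl[] = {
		{ "XMM", 1ULL << 0 },
		{ "YMM", 1ULL << 1 },
		{ 0, 0 },
	};
	uint64_t cmask[2] = { 0 };
	uint16_t cqw[2] = { 0 };

	return __arch__reg_class_mask(0, tbl, 2, stub_probe, cmask, cqw);
}
```

The intr/user split and the cached-result early returns would then live in two thin wrappers around this helper.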

> +
> +uint64_t arch__intr_simd_reg_mask(void)
> +{
> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_INTR);
> +}
> +
> +uint64_t arch__user_simd_reg_mask(void)
> +{
> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_USER);
> +}
> +
> +uint64_t arch__intr_pred_reg_mask(void)
> +{
> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_INTR);
> +}
> +
> +uint64_t arch__user_pred_reg_mask(void)
> +{
> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_USER);
> +}
> +
> +static uint64_t arch__simd_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
> +{
> +       uint64_t mask = 0;
> +
> +       *qwords = 0;
> +       if (reg < PERF_REG_X86_MAX_SIMD_REGS) {
> +               if (intr) {
> +                       *qwords = x86_intr_simd_qwords[reg];
> +                       mask = x86_intr_simd_mask[reg];
> +               } else {
> +                       *qwords = x86_user_simd_qwords[reg];
> +                       mask = x86_user_simd_mask[reg];
> +               }
> +       }
> +
> +       return mask;
> +}
> +
> +static uint64_t arch__pred_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
> +{
> +       uint64_t mask = 0;
> +
> +       *qwords = 0;
> +       if (reg < PERF_REG_X86_MAX_PRED_REGS) {
> +               if (intr) {
> +                       *qwords = x86_intr_pred_qwords[reg];
> +                       mask = x86_intr_pred_mask[reg];
> +               } else {
> +                       *qwords = x86_user_pred_qwords[reg];
> +                       mask = x86_user_pred_mask[reg];
> +               }
> +       }
> +
> +       return mask;
> +}
> +
> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords)
> +{
> +       if (!x86_intr_simd_updated)
> +               arch__intr_simd_reg_mask();
> +       return arch__simd_reg_bitmap_qwords(reg, qwords, true);
> +}
> +
> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords)
> +{
> +       if (!x86_user_simd_updated)
> +               arch__user_simd_reg_mask();
> +       return arch__simd_reg_bitmap_qwords(reg, qwords, false);
> +}
> +
> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords)
> +{
> +       if (!x86_intr_pred_updated)
> +               arch__intr_pred_reg_mask();
> +       return arch__pred_reg_bitmap_qwords(reg, qwords, true);
> +}
> +
> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords)
> +{
> +       if (!x86_user_pred_updated)
> +               arch__user_pred_reg_mask();
> +       return arch__pred_reg_bitmap_qwords(reg, qwords, false);
> +}
> +
>  const struct sample_reg *arch__sample_reg_masks(void)
>  {
> +       if (has_cap_simd_regs())
> +               return sample_reg_masks_ext;
>         return sample_reg_masks;
>  }
>
> -uint64_t arch__intr_reg_mask(void)
> +static uint64_t __arch__reg_mask(u64 sample_type, u64 mask, bool has_simd_regs)
>  {
>         struct perf_event_attr attr = {
> -               .type                   = PERF_TYPE_HARDWARE,
> -               .config                 = PERF_COUNT_HW_CPU_CYCLES,
> -               .sample_type            = PERF_SAMPLE_REGS_INTR,
> -               .sample_regs_intr       = PERF_REG_EXTENDED_MASK,
> -               .precise_ip             = 1,
> -               .disabled               = 1,
> -               .exclude_kernel         = 1,
> +               .type                           = PERF_TYPE_HARDWARE,
> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
> +               .sample_type                    = sample_type,
> +               .precise_ip                     = 1,
> +               .disabled                       = 1,
> +               .exclude_kernel                 = 1,
> +               .sample_simd_regs_enabled       = has_simd_regs,
>         };
>         int fd;
>         /*
>          * In an unnamed union, init it here to build on older gcc versions
>          */
>         attr.sample_period = 1;
> +       if (sample_type == PERF_SAMPLE_REGS_INTR)
> +               attr.sample_regs_intr = mask;
> +       else
> +               attr.sample_regs_user = mask;
>
>         if (perf_pmus__num_core_pmus() > 1) {
>                 struct perf_pmu *pmu = NULL;
> @@ -318,13 +738,41 @@ uint64_t arch__intr_reg_mask(void)
>         fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
>         if (fd != -1) {
>                 close(fd);
> -               return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
> +               return mask;
>         }
>
> -       return PERF_REGS_MASK;
> +       return 0;
> +}
> +
> +uint64_t arch__intr_reg_mask(void)
> +{
> +       uint64_t mask = PERF_REGS_MASK;
> +
> +       if (has_cap_simd_regs()) {
> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
> +                                        true);

It's nice to label constant arguments like this with something like:
/*has_simd_regs=*/true);

Tools like clang-tidy even try to enforce the argument names match the comments.

> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
> +                                        BIT_ULL(PERF_REG_X86_SSP),
> +                                        true);
> +       } else
> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR, PERF_REG_EXTENDED_MASK, false);
> +
> +       return mask;
>  }
>
>  uint64_t arch__user_reg_mask(void)
>  {
> -       return PERF_REGS_MASK;
> +       uint64_t mask = PERF_REGS_MASK;
> +
> +       if (has_cap_simd_regs()) {
> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
> +                                        true);
> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
> +                                        BIT_ULL(PERF_REG_X86_SSP),
> +                                        true);
> +       }
> +
> +       return mask;

The code here is repetitive; could we refactor it into a single function
that takes a user or intr value?
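Roughly along these lines, as a sketch only (the stubs and constants below
are invented stand-ins for the real perf UAPI masks and the
open/close-based kernel probe):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants standing in for the perf UAPI values; real code
 * would use PERF_SAMPLE_REGS_INTR/USER and the PERF_REG_X86_* masks. */
#define SAMPLE_REGS_INTR 1
#define SAMPLE_REGS_USER 2
#define REGS_MASK        0xffULL
#define EGPRS_MASK       (0xffffULL << 16)   /* stand-in for R16..R31 */
#define SSP_BIT          (1ULL << 40)
#define EXTENDED_MASK    (~((1ULL << 32) - 1))

static int has_cap_simd_regs(void) { return 1; } /* stub for the sketch */

/* Stand-in probe: accept whatever is asked for.  The real helper opens a
 * perf event with the candidate mask to test kernel support. */
static uint64_t probe_mask(int sample_type, uint64_t bits, int simd)
{
        (void)sample_type; (void)simd;
        return bits;
}

/* One function replacing arch__intr_reg_mask()/arch__user_reg_mask() */
static uint64_t arch_reg_mask(int sample_type)
{
        uint64_t mask = REGS_MASK;

        if (has_cap_simd_regs()) {
                mask |= probe_mask(sample_type, EGPRS_MASK, 1);
                mask |= probe_mask(sample_type, SSP_BIT, 1);
        } else if (sample_type == SAMPLE_REGS_INTR) {
                /* legacy XMM path only exists on the intr side */
                mask |= probe_mask(sample_type, EXTENDED_MASK, 0);
        }
        return mask;
}
```

The intr/user asymmetry (the legacy XMM fallback) stays in one place
instead of being duplicated across two near-identical functions.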

>  }
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index 56ebefd075f2..5d1d90cf9488 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -1461,12 +1461,39 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
>         if (opts->sample_intr_regs && !evsel->no_aux_samples &&
>             !evsel__is_dummy_event(evsel)) {
>                 attr->sample_regs_intr = opts->sample_intr_regs;
> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_intr);
> +               evsel__set_sample_bit(evsel, REGS_INTR);
> +       }
> +
> +       if ((opts->sample_intr_vec_regs || opts->sample_intr_pred_regs) &&
> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
> +               /* A non-zero pred qwords implies the set of SIMD registers is used */
> +               if (opts->sample_pred_regs_qwords)
> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
> +               else
> +                       attr->sample_simd_pred_reg_qwords = 1;
> +               attr->sample_simd_vec_reg_intr = opts->sample_intr_vec_regs;
> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
> +               attr->sample_simd_pred_reg_intr = opts->sample_intr_pred_regs;
>                 evsel__set_sample_bit(evsel, REGS_INTR);
>         }
>
>         if (opts->sample_user_regs && !evsel->no_aux_samples &&
>             !evsel__is_dummy_event(evsel)) {
>                 attr->sample_regs_user |= opts->sample_user_regs;
> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_user);
> +               evsel__set_sample_bit(evsel, REGS_USER);
> +       }
> +
> +       if ((opts->sample_user_vec_regs || opts->sample_user_pred_regs) &&
> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
> +               if (opts->sample_pred_regs_qwords)
> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
> +               else
> +                       attr->sample_simd_pred_reg_qwords = 1;
> +               attr->sample_simd_vec_reg_user = opts->sample_user_vec_regs;
> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
> +               attr->sample_simd_pred_reg_user = opts->sample_user_pred_regs;
>                 evsel__set_sample_bit(evsel, REGS_USER);
>         }
>
> diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
> index cda1c620968e..0bd100392889 100644
> --- a/tools/perf/util/parse-regs-options.c
> +++ b/tools/perf/util/parse-regs-options.c
> @@ -4,19 +4,139 @@
>  #include <stdint.h>
>  #include <string.h>
>  #include <stdio.h>
> +#include <linux/bitops.h>
>  #include "util/debug.h"
>  #include <subcmd/parse-options.h>
>  #include "util/perf_regs.h"
>  #include "util/parse-regs-options.h"
> +#include "record.h"
> +
> +static void __print_simd_regs(bool intr, uint64_t simd_mask)
> +{
> +       const struct sample_reg *r = NULL;
> +       uint64_t bitmap = 0;
> +       u16 qwords = 0;
> +       int reg_idx;
> +
> +       if (!simd_mask)
> +               return;
> +
> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> +               if (!(r->mask & simd_mask))
> +                       continue;
> +               reg_idx = fls64(r->mask) - 1;
> +               if (intr)
> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
> +               else
> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
> +               if (bitmap)
> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
> +       }
> +}
> +
> +static void __print_pred_regs(bool intr, uint64_t pred_mask)
> +{
> +       const struct sample_reg *r = NULL;
> +       uint64_t bitmap = 0;
> +       u16 qwords = 0;
> +       int reg_idx;
> +
> +       if (!pred_mask)
> +               return;
> +
> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> +               if (!(r->mask & pred_mask))
> +                       continue;
> +               reg_idx = fls64(r->mask) - 1;
> +               if (intr)
> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
> +               else
> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
> +               if (bitmap)
> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
> +       }
> +}
> +
> +static bool __parse_simd_regs(struct record_opts *opts, char *s, bool intr)
> +{
> +       const struct sample_reg *r = NULL;
> +       bool matched = false;
> +       uint64_t bitmap = 0;
> +       u16 qwords = 0;
> +       int reg_idx;
> +
> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> +               if (strcasecmp(s, r->name))
> +                       continue;
> +               if (!fls64(r->mask))
> +                       continue;
> +               reg_idx = fls64(r->mask) - 1;
> +               if (intr)
> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
> +               else
> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
> +               matched = true;
> +               break;
> +       }
> +
> +       /* Just need the highest qwords */

I'm not following here. Does the bitmap need to handle gaps?

> +       if (qwords > opts->sample_vec_regs_qwords) {
> +               opts->sample_vec_regs_qwords = qwords;
> +               if (intr)
> +                       opts->sample_intr_vec_regs = bitmap;
> +               else
> +                       opts->sample_user_vec_regs = bitmap;
> +       }
> +
> +       return matched;
> +}
> +
> +static bool __parse_pred_regs(struct record_opts *opts, char *s, bool intr)
> +{
> +       const struct sample_reg *r = NULL;
> +       bool matched = false;
> +       uint64_t bitmap = 0;
> +       u16 qwords = 0;
> +       int reg_idx;
> +
> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> +               if (strcasecmp(s, r->name))
> +                       continue;
> +               if (!fls64(r->mask))
> +                       continue;
> +               reg_idx = fls64(r->mask) - 1;
> +               if (intr)
> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
> +               else
> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
> +               matched = true;
> +               break;
> +       }
> +
> +       /* Just need the highest qwords */

Again repetitive, could we have a single function?
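A rough sketch of what a shared helper could look like, with a callback
selecting the intr vs user bitmap lookup (all struct and function names
here are invented, not the actual perf-tools API):

```c
#include <assert.h>
#include <stdint.h>
#include <strings.h>   /* strcasecmp() */

/* Minimal stand-in for struct sample_reg */
struct reg_desc {
        const char *name;
        int idx;
};

typedef uint64_t (*bitmap_fn)(int reg, uint16_t *qwords);

/* Fake arch callback: pretends ZMM0-31 are available at 8 qwords each */
static uint64_t fake_bitmap(int reg, uint16_t *qwords)
{
        (void)reg;
        *qwords = 8;
        return 0xffffffffULL;
}

/* One parser shared by the SIMD and pred register tables */
static int parse_reg(const struct reg_desc *tbl, const char *s,
                     bitmap_fn get_bitmap, uint64_t *out, uint16_t *max_qwords)
{
        for (const struct reg_desc *r = tbl; r->name; r++) {
                if (strcasecmp(s, r->name))
                        continue;
                uint16_t qwords = 0;
                uint64_t bitmap = get_bitmap(r->idx, &qwords);
                /* keep only the widest variant requested so far */
                if (qwords > *max_qwords) {
                        *max_qwords = qwords;
                        *out = bitmap;
                }
                return 1;
        }
        return 0;
}

static int demo_parse(const char *s, uint64_t *out, uint16_t *qwords)
{
        static const struct reg_desc tbl[] = { {"zmm", 2}, {0, 0} };
        return parse_reg(tbl, s, fake_bitmap, out, qwords);
}
```

The intr/user and simd/pred variants then collapse to four thin wrappers
that pick a table and a callback.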

> +       if (qwords > opts->sample_pred_regs_qwords) {
> +               opts->sample_pred_regs_qwords = qwords;
> +               if (intr)
> +                       opts->sample_intr_pred_regs = bitmap;
> +               else
> +                       opts->sample_user_pred_regs = bitmap;
> +       }
> +
> +       return matched;
> +}
>
>  static int
>  __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>  {
>         uint64_t *mode = (uint64_t *)opt->value;
>         const struct sample_reg *r = NULL;
> +       struct record_opts *opts;
>         char *s, *os = NULL, *p;
> -       int ret = -1;
> +       bool has_simd_regs = false;
>         uint64_t mask;
> +       uint64_t simd_mask;
> +       uint64_t pred_mask;
> +       int ret = -1;
>
>         if (unset)
>                 return 0;
> @@ -27,10 +147,17 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>         if (*mode)
>                 return -1;
>
> -       if (intr)
> +       if (intr) {
> +               opts = container_of(opt->value, struct record_opts, sample_intr_regs);
>                 mask = arch__intr_reg_mask();
> -       else
> +               simd_mask = arch__intr_simd_reg_mask();
> +               pred_mask = arch__intr_pred_reg_mask();
> +       } else {
> +               opts = container_of(opt->value, struct record_opts, sample_user_regs);
>                 mask = arch__user_reg_mask();
> +               simd_mask = arch__user_simd_reg_mask();
> +               pred_mask = arch__user_pred_reg_mask();
> +       }
>
>         /* str may be NULL in case no arg is passed to -I */
>         if (str) {
> @@ -50,10 +177,24 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>                                         if (r->mask & mask)
>                                                 fprintf(stderr, "%s ", r->name);
>                                 }
> +                               __print_simd_regs(intr, simd_mask);
> +                               __print_pred_regs(intr, pred_mask);
>                                 fputc('\n', stderr);
>                                 /* just printing available regs */
>                                 goto error;
>                         }
> +
> +                       if (simd_mask) {
> +                               has_simd_regs = __parse_simd_regs(opts, s, intr);
> +                               if (has_simd_regs)
> +                                       goto next;
> +                       }
> +                       if (pred_mask) {
> +                               has_simd_regs = __parse_pred_regs(opts, s, intr);
> +                               if (has_simd_regs)
> +                                       goto next;
> +                       }
> +
>                         for (r = arch__sample_reg_masks(); r->name; r++) {
>                                 if ((r->mask & mask) && !strcasecmp(s, r->name))
>                                         break;
> @@ -65,7 +206,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>                         }
>
>                         *mode |= r->mask;
> -
> +next:
>                         if (!p)
>                                 break;
>
> @@ -75,7 +216,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>         ret = 0;
>
>         /* default to all possible regs */
> -       if (*mode == 0)
> +       if (*mode == 0 && !has_simd_regs)
>                 *mode = mask;
>  error:
>         free(os);
> diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
> index 66b666d9ce64..fb0366d050cf 100644
> --- a/tools/perf/util/perf_event_attr_fprintf.c
> +++ b/tools/perf/util/perf_event_attr_fprintf.c
> @@ -360,6 +360,12 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
>         PRINT_ATTRf(aux_start_paused, p_unsigned);
>         PRINT_ATTRf(aux_pause, p_unsigned);
>         PRINT_ATTRf(aux_resume, p_unsigned);
> +       PRINT_ATTRf(sample_simd_pred_reg_qwords, p_unsigned);
> +       PRINT_ATTRf(sample_simd_pred_reg_intr, p_hex);
> +       PRINT_ATTRf(sample_simd_pred_reg_user, p_hex);
> +       PRINT_ATTRf(sample_simd_vec_reg_qwords, p_unsigned);
> +       PRINT_ATTRf(sample_simd_vec_reg_intr, p_hex);
> +       PRINT_ATTRf(sample_simd_vec_reg_user, p_hex);
>
>         return ret;
>  }
> diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
> index 44b90bbf2d07..e8a9fabc92e6 100644
> --- a/tools/perf/util/perf_regs.c
> +++ b/tools/perf/util/perf_regs.c
> @@ -11,6 +11,11 @@ int __weak arch_sdt_arg_parse_op(char *old_op __maybe_unused,
>         return SDT_ARG_SKIP;
>  }
>
> +bool __weak arch_has_simd_regs(u64 mask __maybe_unused)
> +{
> +       return false;
> +}
> +
>  uint64_t __weak arch__intr_reg_mask(void)
>  {
>         return 0;
> @@ -21,6 +26,50 @@ uint64_t __weak arch__user_reg_mask(void)
>         return 0;
>  }
>
> +uint64_t __weak arch__intr_simd_reg_mask(void)
> +{
> +       return 0;
> +}
> +
> +uint64_t __weak arch__user_simd_reg_mask(void)
> +{
> +       return 0;
> +}
> +
> +uint64_t __weak arch__intr_pred_reg_mask(void)
> +{
> +       return 0;
> +}
> +
> +uint64_t __weak arch__user_pred_reg_mask(void)
> +{
> +       return 0;
> +}
> +
> +uint64_t __weak arch__intr_simd_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
> +{
> +       *qwords = 0;
> +       return 0;
> +}
> +
> +uint64_t __weak arch__user_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> +{
> +       *qwords = 0;
> +       return 0;
> +}
> +
> +uint64_t __weak arch__intr_pred_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
> +{
> +       *qwords = 0;
> +       return 0;
> +}
> +
> +uint64_t __weak arch__user_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> +{
> +       *qwords = 0;
> +       return 0;
> +}
> +
>  static const struct sample_reg sample_reg_masks[] = {
>         SMPL_REG_END
>  };
> @@ -30,6 +79,16 @@ const struct sample_reg * __weak arch__sample_reg_masks(void)
>         return sample_reg_masks;
>  }
>
> +const struct sample_reg * __weak arch__sample_simd_reg_masks(void)
> +{
> +       return sample_reg_masks;
> +}
> +
> +const struct sample_reg * __weak arch__sample_pred_reg_masks(void)
> +{
> +       return sample_reg_masks;
> +}

Thinking out loud. I wonder if there is a way to hide the weak
functions. It seems the support is tied to PMUs, particularly core
PMUs, perhaps we can push things into pmu and arch pmu code. Then we
ask the PMU to parse the register strings, set up the perf_event_attr,
etc. I'm somewhat scared these functions will be used on the report
rather than record side of things, thereby breaking perf.data support
when the host kernel does or doesn't have the SIMD support.

Thanks,
Ian

> +
>  const char *perf_reg_name(int id, const char *arch)
>  {
>         const char *reg_name = NULL;
> diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
> index f2d0736d65cc..bce9c4cfd1bf 100644
> --- a/tools/perf/util/perf_regs.h
> +++ b/tools/perf/util/perf_regs.h
> @@ -24,9 +24,20 @@ enum {
>  };
>
>  int arch_sdt_arg_parse_op(char *old_op, char **new_op);
> +bool arch_has_simd_regs(u64 mask);
>  uint64_t arch__intr_reg_mask(void);
>  uint64_t arch__user_reg_mask(void);
>  const struct sample_reg *arch__sample_reg_masks(void);
> +const struct sample_reg *arch__sample_simd_reg_masks(void);
> +const struct sample_reg *arch__sample_pred_reg_masks(void);
> +uint64_t arch__intr_simd_reg_mask(void);
> +uint64_t arch__user_simd_reg_mask(void);
> +uint64_t arch__intr_pred_reg_mask(void);
> +uint64_t arch__user_pred_reg_mask(void);
> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>
>  const char *perf_reg_name(int id, const char *arch);
>  int perf_reg_value(u64 *valp, struct regs_dump *regs, int id);
> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
> index ea3a6c4657ee..825ffb4cc53f 100644
> --- a/tools/perf/util/record.h
> +++ b/tools/perf/util/record.h
> @@ -59,7 +59,13 @@ struct record_opts {
>         unsigned int  user_freq;
>         u64           branch_stack;
>         u64           sample_intr_regs;
> +       u64           sample_intr_vec_regs;
>         u64           sample_user_regs;
> +       u64           sample_user_vec_regs;
> +       u16           sample_pred_regs_qwords;
> +       u16           sample_vec_regs_qwords;
> +       u16           sample_intr_pred_regs;
> +       u16           sample_user_pred_regs;
>         u64           default_interval;
>         u64           user_interval;
>         size_t        auxtrace_snapshot_size;
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf
  2025-12-03  6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (18 preceding siblings ...)
  2025-12-03  6:55 ` [Patch v5 19/19] perf regs: Enable dumping of SIMD registers Dapeng Mi
@ 2025-12-04  0:24 ` Ian Rogers
  2025-12-04  3:28   ` Mi, Dapeng
  2025-12-16  4:42 ` Ravi Bangoria
  20 siblings, 1 reply; 86+ messages in thread
From: Ian Rogers @ 2025-12-04  0:24 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao

On Tue, Dec 2, 2025 at 10:58 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
>
> Changes since V4:
> - Rewrite some function comments and commit messages (Dave)
> - Add arch-PEBS based SIMD/eGPRs/SSP sampling support (Patch 15/19)
> - Fix the "suspicious NMI" warning observed on PTL/NVL P-core and DMR by
>   activating back-to-back NMI detection mechanism (Patch 16/19)
> - Fix some minor issues on perf-tool patches (Patch 18/19)
>
> Changes since V3:
> - Drop the SIMD registers if an NMI hits kernel mode for REGS_USER.
> - Only dump the available regs, rather than zero and dump the
>   unavailable regs. It's possible that the dumped registers are a subset
>   of the requested registers.
> - Some minor updates to address Dapeng's comments in V3.
>
> Changes since V2:
> - Use the FPU format for the x86_pmu.ext_regs_mask as well
> - Add a check before invoking xsaves_nmi()
> - Add perf_simd_reg_check() to retrieve the number of available
>   registers. If the kernel fails to get the requested registers, e.g.,
>   XSAVES fails, nothing dumps to the userspace (the V2 dumps all 0s).
> - Add POC perf tool patches
>
> Changes since V1:
> - Apply the new interfaces to configure and dump the SIMD registers
> - Utilize the existing FPU functions, e.g., xstate_calculate_size,
>   get_xsave_addr().
>
> Starting from Intel Ice Lake, XMM registers can be collected in a PEBS
> record. Future Architecture PEBS will include additional registers such
> as YMM, ZMM, OPMASK, SSP and APX eGPRs, contingent on hardware support.
>
> This patch set introduces a software solution to mitigate the hardware
> requirement by utilizing the XSAVES command to retrieve the requested
> registers in the overflow handler. This feature is no longer limited to
> PEBS events or specific platforms. While the hardware solution remains
> preferable due to its lower overhead and higher accuracy, this software
> approach provides a viable alternative.
>
> The solution is theoretically compatible with all x86 platforms but is
> currently enabled on newer platforms, including Sapphire Rapids and
> later P-core server platforms, Sierra Forest and later E-core server
> platforms and recent Client platforms, like Arrow Lake, Panther Lake and
> Nova Lake.
>
> Newly supported registers include YMM, ZMM, OPMASK, SSP, and APX eGPRs.
> Due to space constraints in sample_regs_user/intr, new fields have been
> introduced in the perf_event_attr structure to accommodate these
> registers.
>
> After a long discussion in V1,
> https://lore.kernel.org/lkml/3f1c9a9e-cb63-47ff-a5e9-06555fa6cc9a@linux.intel.com/
> the following new fields were introduced.
>
> @@ -543,6 +545,25 @@ struct perf_event_attr {
>         __u64   sig_data;
>
>         __u64   config3; /* extension of config2 */
> +
> +
> +       /*
> +        * Defines set of SIMD registers to dump on samples.
> +        * The sample_simd_regs_enabled !=0 implies the
> +        * set of SIMD registers is used to config all SIMD registers.
> +        * If !sample_simd_regs_enabled, sample_regs_XXX may be used to
> +        * config some SIMD registers on X86.
> +        */
> +       union {
> +               __u16 sample_simd_regs_enabled;
> +               __u16 sample_simd_pred_reg_qwords;
> +       };
> +       __u32 sample_simd_pred_reg_intr;
> +       __u32 sample_simd_pred_reg_user;
> +       __u16 sample_simd_vec_reg_qwords;
> +       __u64 sample_simd_vec_reg_intr;
> +       __u64 sample_simd_vec_reg_user;
> +       __u32 __reserved_4;
>  };
> @@ -1016,7 +1037,15 @@ enum perf_event_type {
>          *      } && PERF_SAMPLE_BRANCH_STACK
>          *
>          *      { u64                   abi; # enum perf_sample_regs_abi
> -        *        u64                   regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
> +        *        u64                   regs[weight(mask)];
> +        *        struct {
> +        *              u16 nr_vectors;
> +        *              u16 vector_qwords;
> +        *              u16 nr_pred;
> +        *              u16 pred_qwords;
> +        *              u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
> +        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
> +        *      } && PERF_SAMPLE_REGS_USER
>          *
>          *      { u64                   size;
>          *        char                  data[size];
> @@ -1043,7 +1072,15 @@ enum perf_event_type {
>          *      { u64                   data_src; } && PERF_SAMPLE_DATA_SRC
>          *      { u64                   transaction; } && PERF_SAMPLE_TRANSACTION
>          *      { u64                   abi; # enum perf_sample_regs_abi
> -        *        u64                   regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
> +        *        u64                   regs[weight(mask)];
> +        *        struct {
> +        *              u16 nr_vectors;
> +        *              u16 vector_qwords;
> +        *              u16 nr_pred;
> +        *              u16 pred_qwords;
> +        *              u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
> +        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
> +        *      } && PERF_SAMPLE_REGS_INTR
>          *      { u64                   phys_addr;} && PERF_SAMPLE_PHYS_ADDR
>          *      { u64                   cgroup;} && PERF_SAMPLE_CGROUP
>          *      { u64                   data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
>
>
> To maintain simplicity, a single width field per register class,
> sample_simd_{vec|pred}_reg_qwords, is introduced to indicate register
> width in qwords (64-bit units). For example:
> - sample_simd_vec_reg_qwords = 2 for XMM registers (128 bits) on x86
> - sample_simd_vec_reg_qwords = 4 for YMM registers (256 bits) on x86
>
> Four additional fields, sample_simd_{vec|pred}_reg_{intr|user}, hold
> the bitmap of registers to sample. For instance, the bitmap for x86
> XMM registers is 0xffff (16 XMM registers). Although users can
> theoretically sample a subset of registers, the current perf-tool
> implementation supports sampling all registers of each type to avoid
> complexity.
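
As a concrete illustration of this encoding (a mock of just the new
fields, not the full perf_event_attr struct, with field widths taken from
the layout quoted above):

```c
#include <assert.h>
#include <stdint.h>

/* Mock of only the new SIMD sampling fields from this series */
struct simd_attr {
        uint16_t sample_simd_pred_reg_qwords;
        uint32_t sample_simd_pred_reg_intr;
        uint16_t sample_simd_vec_reg_qwords;
        uint64_t sample_simd_vec_reg_intr;
};

/* Request all 32 ZMM registers (512 bits = 8 qwords each) plus the
 * 8 OPMASK registers (64 bits = 1 qword each) on the intr side. */
static void request_zmm_opmask(struct simd_attr *attr)
{
        attr->sample_simd_vec_reg_qwords  = 8;                 /* ZMM width */
        attr->sample_simd_vec_reg_intr    = (1ULL << 32) - 1;  /* ZMM0-31   */
        attr->sample_simd_pred_reg_qwords = 1;                 /* OPMASK    */
        attr->sample_simd_pred_reg_intr   = 0xff;              /* OPMASK0-7 */
}
```

Requesting XMM instead would only change the qwords value to 2 and the
vector bitmap to 0xffff.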
>
> A new ABI, PERF_SAMPLE_REGS_ABI_SIMD, is introduced to signal user space
> tools about the presence of SIMD registers in sampling records. When this
> flag is detected, tools should recognize that extra SIMD register data
> follows the general register data. The layout of the extra SIMD register
> data is laid out as follows.
>
>    u16 nr_vectors;
>    u16 vector_qwords;
>    u16 nr_pred;
>    u16 pred_qwords;
>    u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
>
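A minimal decoder for this payload might look like the sketch below (the
struct and helper names are invented; only the size arithmetic mirrors
the ABI described above):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Header that precedes the SIMD data when abi has
 * PERF_SAMPLE_REGS_ABI_SIMD set; field names follow the cover letter. */
struct simd_hdr {
        uint16_t nr_vectors;
        uint16_t vector_qwords;
        uint16_t nr_pred;
        uint16_t pred_qwords;
};

/* Total number of u64 payload words following the header */
static size_t simd_payload_qwords(const struct simd_hdr *h)
{
        return (size_t)h->nr_vectors * h->vector_qwords +
               (size_t)h->nr_pred * h->pred_qwords;
}

/* Index of qword j of vector register i inside data[]; the pred
 * registers follow at offset nr_vectors * vector_qwords. */
static size_t vec_qword_index(const struct simd_hdr *h,
                              unsigned int i, unsigned int j)
{
        return (size_t)i * h->vector_qwords + j;
}
```

For the ZMM/OPMASK dump shown in the examples further down
(nr_vectors=32, vector_qwords=8, nr_pred=8, pred_qwords=1) this gives a
264-qword payload.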
> With this patch set, sampling for the aforementioned registers is
> supported on the Intel Nova Lake platform.
>
> Examples:
>  $perf record -I?
>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7

nit: It seems strange in this output to mix ranges like "XMM0-15" but
then list out "R8...R31". That said, we have tests that explicitly
look for the non-range pattern:
https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/tests/shell/record.sh?h=perf-tools-next#n106

Thanks,
Ian

>  $perf record --user-regs=?
>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>
>  $perf record -e branches:p -Iax,bx,r8,r16,r31,ssp,xmm,ymm,zmm,opmask -c 100000 ./test
>  $perf report -D
>
>  ... ...
>  14027761992115 0xcf30 [0x8a8]: PERF_RECORD_SAMPLE(IP, 0x1): 29964/29964:
>  0xffffffff9f085e24 period: 100000 addr: 0
>  ... intr regs: mask 0x18001010003 ABI 64-bit
>  .... AX    0xdffffc0000000000
>  .... BX    0xffff8882297685e8
>  .... R8    0x0000000000000000
>  .... R16   0x0000000000000000
>  .... R31   0x0000000000000000
>  .... SSP   0x0000000000000000
>  ... SIMD ABI nr_vectors 32 vector_qwords 8 nr_pred 8 pred_qwords 1
>  .... ZMM  [0] 0xffffffffffffffff
>  .... ZMM  [0] 0x0000000000000001
>  .... ZMM  [0] 0x0000000000000000
>  .... ZMM  [0] 0x0000000000000000
>  .... ZMM  [0] 0x0000000000000000
>  .... ZMM  [0] 0x0000000000000000
>  .... ZMM  [0] 0x0000000000000000
>  .... ZMM  [0] 0x0000000000000000
>  .... ZMM  [1] 0x003a6b6165506d56
>  ... ...
>  .... ZMM  [31] 0x0000000000000000
>  .... ZMM  [31] 0x0000000000000000
>  .... ZMM  [31] 0x0000000000000000
>  .... ZMM  [31] 0x0000000000000000
>  .... ZMM  [31] 0x0000000000000000
>  .... ZMM  [31] 0x0000000000000000
>  .... ZMM  [31] 0x0000000000000000
>  .... ZMM  [31] 0x0000000000000000
>  .... OPMASK[0] 0x00000000fffffe00
>  .... OPMASK[1] 0x0000000000ffffff
>  .... OPMASK[2] 0x000000000000007f
>  .... OPMASK[3] 0x0000000000000000
>  .... OPMASK[4] 0x0000000000010080
>  .... OPMASK[5] 0x0000000000000000
>  .... OPMASK[6] 0x0000400004000000
>  .... OPMASK[7] 0x0000000000000000
>  ... ...
>
>
> History:
>   v4: https://lore.kernel.org/all/20250925061213.178796-1-dapeng1.mi@linux.intel.com/
>   v3: https://lore.kernel.org/lkml/20250815213435.1702022-1-kan.liang@linux.intel.com/
>   v2: https://lore.kernel.org/lkml/20250626195610.405379-1-kan.liang@linux.intel.com/
>   v1: https://lore.kernel.org/lkml/20250613134943.3186517-1-kan.liang@linux.intel.com/
>
> Dapeng Mi (3):
>   perf: Eliminate duplicate arch-specific functions definations
>   perf/x86/intel: Enable arch-PEBS based SIMD/eGPRs/SSP sampling
>   perf/x86: Activate back-to-back NMI detection for arch-PEBS induced
>     NMIs
>
> Kan Liang (16):
>   perf/x86: Use x86_perf_regs in the x86 nmi handler
>   perf/x86: Introduce x86-specific x86_pmu_setup_regs_data()
>   x86/fpu/xstate: Add xsaves_nmi() helper
>   perf: Move and rename has_extended_regs() for ARCH-specific use
>   perf/x86: Add support for XMM registers in non-PEBS and REGS_USER
>   perf: Add sampling support for SIMD registers
>   perf/x86: Enable XMM sampling using sample_simd_vec_reg_* fields
>   perf/x86: Enable YMM sampling using sample_simd_vec_reg_* fields
>   perf/x86: Enable ZMM sampling using sample_simd_vec_reg_* fields
>   perf/x86: Enable OPMASK sampling using sample_simd_pred_reg_* fields
>   perf/x86: Enable eGPRs sampling using sample_regs_* fields
>   perf/x86: Enable SSP sampling using sample_regs_* fields
>   perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS capability
>   perf headers: Sync with the kernel headers
>   perf parse-regs: Support new SIMD sampling format
>   perf regs: Enable dumping of SIMD registers
>
>  arch/arm/kernel/perf_regs.c                   |   8 +-
>  arch/arm64/kernel/perf_regs.c                 |   8 +-
>  arch/csky/kernel/perf_regs.c                  |   8 +-
>  arch/loongarch/kernel/perf_regs.c             |   8 +-
>  arch/mips/kernel/perf_regs.c                  |   8 +-
>  arch/parisc/kernel/perf_regs.c                |   8 +-
>  arch/powerpc/perf/perf_regs.c                 |   2 +-
>  arch/riscv/kernel/perf_regs.c                 |   8 +-
>  arch/s390/kernel/perf_regs.c                  |   2 +-
>  arch/x86/events/core.c                        | 326 +++++++++++-
>  arch/x86/events/intel/core.c                  | 117 ++++-
>  arch/x86/events/intel/ds.c                    | 134 ++++-
>  arch/x86/events/perf_event.h                  |  85 +++-
>  arch/x86/include/asm/fpu/xstate.h             |   3 +
>  arch/x86/include/asm/msr-index.h              |   7 +
>  arch/x86/include/asm/perf_event.h             |  38 +-
>  arch/x86/include/uapi/asm/perf_regs.h         |  62 +++
>  arch/x86/kernel/fpu/xstate.c                  |  25 +-
>  arch/x86/kernel/perf_regs.c                   | 131 ++++-
>  include/linux/perf_event.h                    |  16 +
>  include/linux/perf_regs.h                     |  36 +-
>  include/uapi/linux/perf_event.h               |  45 +-
>  kernel/events/core.c                          | 132 ++++-
>  tools/arch/x86/include/uapi/asm/perf_regs.h   |  62 +++
>  tools/include/uapi/linux/perf_event.h         |  45 +-
>  tools/perf/arch/x86/util/perf_regs.c          | 470 +++++++++++++++++-
>  tools/perf/util/evsel.c                       |  47 ++
>  tools/perf/util/parse-regs-options.c          | 151 +++++-
>  .../perf/util/perf-regs-arch/perf_regs_x86.c  |  43 ++
>  tools/perf/util/perf_event_attr_fprintf.c     |   6 +
>  tools/perf/util/perf_regs.c                   |  59 +++
>  tools/perf/util/perf_regs.h                   |  11 +
>  tools/perf/util/record.h                      |   6 +
>  tools/perf/util/sample.h                      |  10 +
>  tools/perf/util/session.c                     |  78 ++-
>  35 files changed, 2012 insertions(+), 193 deletions(-)
>
>
> base-commit: 9929dffce5ed7e2988e0274f4db98035508b16d9
> prerequisite-patch-id: a15bcd62a8dcd219d17489eef88b66ea5488a2a0
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 17/19] perf headers: Sync with the kernel headers
  2025-12-03 23:43   ` Ian Rogers
@ 2025-12-04  1:37     ` Mi, Dapeng
  2025-12-04  7:28       ` Ian Rogers
  0 siblings, 1 reply; 86+ messages in thread
From: Mi, Dapeng @ 2025-12-04  1:37 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang


On 12/4/2025 7:43 AM, Ian Rogers wrote:
> On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> Update include/uapi/linux/perf_event.h and
>> arch/x86/include/uapi/asm/perf_regs.h to support extended regs.
>>
>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> ---
>>  tools/arch/x86/include/uapi/asm/perf_regs.h | 62 +++++++++++++++++++++
>>  tools/include/uapi/linux/perf_event.h       | 45 +++++++++++++--
>>  2 files changed, 103 insertions(+), 4 deletions(-)
>>
>> diff --git a/tools/arch/x86/include/uapi/asm/perf_regs.h b/tools/arch/x86/include/uapi/asm/perf_regs.h
>> index 7c9d2bb3833b..f3561ed10041 100644
>> --- a/tools/arch/x86/include/uapi/asm/perf_regs.h
>> +++ b/tools/arch/x86/include/uapi/asm/perf_regs.h
>> @@ -27,9 +27,34 @@ enum perf_event_x86_regs {
>>         PERF_REG_X86_R13,
>>         PERF_REG_X86_R14,
>>         PERF_REG_X86_R15,
>> +       /*
>> +        * The EGPRs/SSP and XMM have overlaps. Only one can be used
>> +        * at a time. For the ABI type PERF_SAMPLE_REGS_ABI_SIMD,
>> +        * utilize EGPRs/SSP. For the other ABI type, XMM is used.
>> +        *
>> +        * Extended GPRs (EGPRs)
>> +        */
>> +       PERF_REG_X86_R16,
>> +       PERF_REG_X86_R17,
>> +       PERF_REG_X86_R18,
>> +       PERF_REG_X86_R19,
>> +       PERF_REG_X86_R20,
>> +       PERF_REG_X86_R21,
>> +       PERF_REG_X86_R22,
>> +       PERF_REG_X86_R23,
>> +       PERF_REG_X86_R24,
>> +       PERF_REG_X86_R25,
>> +       PERF_REG_X86_R26,
>> +       PERF_REG_X86_R27,
>> +       PERF_REG_X86_R28,
>> +       PERF_REG_X86_R29,
>> +       PERF_REG_X86_R30,
>> +       PERF_REG_X86_R31,
>> +       PERF_REG_X86_SSP,
>>         /* These are the limits for the GPRs. */
>>         PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
>>         PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1,
>> +       PERF_REG_MISC_MAX = PERF_REG_X86_SSP + 1,
> I wonder MISC isn't the most intention revealing name. What happens if
> things are extended again? Would APX be a better alternative, so
> PERF_REG_APX_MAX ?

Hmm, I don't think PERF_REG_APX_MAX is a good name either, since SSP is
covered as well besides the APX eGPRs, and more registers could be
introduced in the future.

How about PERF_REG_X86_EXTD_MAX?


>
>>         /* These all need two bits set because they are 128bit */
>>         PERF_REG_X86_XMM0  = 32,
>> @@ -54,5 +79,42 @@ enum perf_event_x86_regs {
>>  };
>>
>>  #define PERF_REG_EXTENDED_MASK (~((1ULL << PERF_REG_X86_XMM0) - 1))
>> +#define PERF_X86_EGPRS_MASK    GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16)
>> +
>> +enum {
>> +       PERF_REG_X86_XMM,
>> +       PERF_REG_X86_YMM,
>> +       PERF_REG_X86_ZMM,
>> +       PERF_REG_X86_MAX_SIMD_REGS,
>> +
>> +       PERF_REG_X86_OPMASK = 0,
>> +       PERF_REG_X86_MAX_PRED_REGS = 1,
>> +};
>> +
>> +enum {
>> +       PERF_X86_SIMD_XMM_REGS      = 16,
>> +       PERF_X86_SIMD_YMM_REGS      = 16,
>> +       PERF_X86_SIMD_ZMMH_REGS     = 16,
>> +       PERF_X86_SIMD_ZMM_REGS      = 32,
>> +       PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_ZMM_REGS,
>> +
>> +       PERF_X86_SIMD_OPMASK_REGS   = 8,
>> +       PERF_X86_SIMD_PRED_REGS_MAX = PERF_X86_SIMD_OPMASK_REGS,
>> +};
>> +
>> +#define PERF_X86_SIMD_PRED_MASK                GENMASK(PERF_X86_SIMD_PRED_REGS_MAX - 1, 0)
>> +#define PERF_X86_SIMD_VEC_MASK         GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
>> +
>> +#define PERF_X86_H16ZMM_BASE           PERF_X86_SIMD_ZMMH_REGS
>> +
>> +enum {
>> +       PERF_X86_OPMASK_QWORDS   = 1,
>> +       PERF_X86_XMM_QWORDS      = 2,
>> +       PERF_X86_YMMH_QWORDS     = 2,
>> +       PERF_X86_YMM_QWORDS      = 4,
>> +       PERF_X86_ZMMH_QWORDS     = 4,
>> +       PERF_X86_ZMM_QWORDS      = 8,
>> +       PERF_X86_SIMD_QWORDS_MAX = PERF_X86_ZMM_QWORDS,
>> +};
>>
>>  #endif /* _ASM_X86_PERF_REGS_H */
>> diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
>> index d292f96bc06f..f1474da32622 100644
>> --- a/tools/include/uapi/linux/perf_event.h
>> +++ b/tools/include/uapi/linux/perf_event.h
>> @@ -314,8 +314,9 @@ enum {
>>   */
>>  enum perf_sample_regs_abi {
>>         PERF_SAMPLE_REGS_ABI_NONE               = 0,
>> -       PERF_SAMPLE_REGS_ABI_32                 = 1,
>> -       PERF_SAMPLE_REGS_ABI_64                 = 2,
>> +       PERF_SAMPLE_REGS_ABI_32                 = (1 << 0),
>> +       PERF_SAMPLE_REGS_ABI_64                 = (1 << 1),
>> +       PERF_SAMPLE_REGS_ABI_SIMD               = (1 << 2),
>>  };
>>
>>  /*
>> @@ -382,6 +383,7 @@ enum perf_event_read_format {
>>  #define PERF_ATTR_SIZE_VER6                    120     /* Add: aux_sample_size */
>>  #define PERF_ATTR_SIZE_VER7                    128     /* Add: sig_data */
>>  #define PERF_ATTR_SIZE_VER8                    136     /* Add: config3 */
>> +#define PERF_ATTR_SIZE_VER9                    168     /* Add: sample_simd_{pred,vec}_reg_* */
> ARM have added a config4 in:
> https://lore.kernel.org/lkml/20251111-james-perf-feat_spe_eft-v10-1-1e1b5bf2cd05@linaro.org/
> so this will need to be VER10.

Thanks. It looks like the ARM changes have been merged, so we will change
it to VER10 in the next version.


>
> Thanks,
> Ian
>
>>  /*
>>   * 'struct perf_event_attr' contains various attributes that define
>> @@ -545,6 +547,25 @@ struct perf_event_attr {
>>         __u64   sig_data;
>>
>>         __u64   config3; /* extension of config2 */
>> +
>> +
>> +       /*
>> +        * Defines set of SIMD registers to dump on samples.
>> +        * The sample_simd_regs_enabled !=0 implies the
>> +        * set of SIMD registers is used to config all SIMD registers.
>> +        * If !sample_simd_regs_enabled, sample_regs_XXX may be used to
>> +        * config some SIMD registers on X86.
>> +        */
>> +       union {
>> +               __u16 sample_simd_regs_enabled;
>> +               __u16 sample_simd_pred_reg_qwords;
>> +       };
>> +       __u32 sample_simd_pred_reg_intr;
>> +       __u32 sample_simd_pred_reg_user;
>> +       __u16 sample_simd_vec_reg_qwords;
>> +       __u64 sample_simd_vec_reg_intr;
>> +       __u64 sample_simd_vec_reg_user;
>> +       __u32 __reserved_4;
>>  };
>>
>>  /*
>> @@ -1018,7 +1039,15 @@ enum perf_event_type {
>>          *      } && PERF_SAMPLE_BRANCH_STACK
>>          *
>>          *      { u64                   abi; # enum perf_sample_regs_abi
>> -        *        u64                   regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
>> +        *        u64                   regs[weight(mask)];
>> +        *        struct {
>> +        *              u16 nr_vectors;
>> +        *              u16 vector_qwords;
>> +        *              u16 nr_pred;
>> +        *              u16 pred_qwords;
>> +        *              u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
>> +        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
>> +        *      } && PERF_SAMPLE_REGS_USER
>>          *
>>          *      { u64                   size;
>>          *        char                  data[size];
>> @@ -1045,7 +1074,15 @@ enum perf_event_type {
>>          *      { u64                   data_src; } && PERF_SAMPLE_DATA_SRC
>>          *      { u64                   transaction; } && PERF_SAMPLE_TRANSACTION
>>          *      { u64                   abi; # enum perf_sample_regs_abi
>> -        *        u64                   regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
>> +        *        u64                   regs[weight(mask)];
>> +        *        struct {
>> +        *              u16 nr_vectors;
>> +        *              u16 vector_qwords;
>> +        *              u16 nr_pred;
>> +        *              u16 pred_qwords;
>> +        *              u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
>> +        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
>> +        *      } && PERF_SAMPLE_REGS_INTR
>>          *      { u64                   phys_addr;} && PERF_SAMPLE_PHYS_ADDR
>>          *      { u64                   cgroup;} && PERF_SAMPLE_CGROUP
>>          *      { u64                   data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
>> --
>> 2.34.1
>>


* Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
  2025-12-04  0:17   ` Ian Rogers
@ 2025-12-04  2:58     ` Mi, Dapeng
  2025-12-04  7:49       ` Ian Rogers
  0 siblings, 1 reply; 86+ messages in thread
From: Mi, Dapeng @ 2025-12-04  2:58 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang


On 12/4/2025 8:17 AM, Ian Rogers wrote:
> On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> This patch adds support for the newly introduced SIMD register sampling
>> format by adding the following functions:
>>
>> uint64_t arch__intr_simd_reg_mask(void);
>> uint64_t arch__user_simd_reg_mask(void);
>> uint64_t arch__intr_pred_reg_mask(void);
>> uint64_t arch__user_pred_reg_mask(void);
>> uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>> uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>> uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>> uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>
>> The arch__{intr|user}_simd_reg_mask() functions retrieve the bitmap of
>> supported SIMD registers, such as XMM/YMM/ZMM on x86 platforms.
>>
>> The arch__{intr|user}_pred_reg_mask() functions retrieve the bitmap of
>> supported PRED registers, such as OPMASK on x86 platforms.
>>
>> The arch__{intr|user}_simd_reg_bitmap_qwords() functions provide the
>> exact bitmap and number of qwords for a specific type of SIMD register.
>> For example, for XMM registers on x86 platforms, the returned bitmap is
>> 0xffff (XMM0 ~ XMM15) and the qwords number is 2 (128 bits for each XMM).
>>
>> The arch__{intr|user}_pred_reg_bitmap_qwords() functions provide the
>> exact bitmap and number of qwords for a specific type of PRED register.
>> For example, for OPMASK registers on x86 platforms, the returned bitmap
>> is 0xff (OPMASK0 ~ OPMASK7) and the qwords number is 1 (64 bits for each
>> OPMASK).
>>
>> Additionally, the function __parse_regs() is enhanced to support parsing
>> these newly introduced SIMD registers. Currently, each type of register
>> can only be sampled collectively; sampling a specific SIMD register is
>> not supported. For example, all XMM registers are sampled together rather
>> than sampling only XMM0.
>>
>> When multiple overlapping register types, such as XMM and YMM, are
>> sampled simultaneously, only the superset (YMM registers) is sampled.
>>
>> With this patch, all supported sampling registers on x86 platforms are
>> displayed as follows.
>>
>>  $perf record -I?
>>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>>
>>  $perf record --user-regs=?
>>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>>
>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> ---
>>  tools/perf/arch/x86/util/perf_regs.c      | 470 +++++++++++++++++++++-
>>  tools/perf/util/evsel.c                   |  27 ++
>>  tools/perf/util/parse-regs-options.c      | 151 ++++++-
>>  tools/perf/util/perf_event_attr_fprintf.c |   6 +
>>  tools/perf/util/perf_regs.c               |  59 +++
>>  tools/perf/util/perf_regs.h               |  11 +
>>  tools/perf/util/record.h                  |   6 +
>>  7 files changed, 714 insertions(+), 16 deletions(-)
>>
>> diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
>> index 12fd93f04802..db41430f3b07 100644
>> --- a/tools/perf/arch/x86/util/perf_regs.c
>> +++ b/tools/perf/arch/x86/util/perf_regs.c
>> @@ -13,6 +13,49 @@
>>  #include "../../../util/pmu.h"
>>  #include "../../../util/pmus.h"
>>
>> +static const struct sample_reg sample_reg_masks_ext[] = {
>> +       SMPL_REG(AX, PERF_REG_X86_AX),
>> +       SMPL_REG(BX, PERF_REG_X86_BX),
>> +       SMPL_REG(CX, PERF_REG_X86_CX),
>> +       SMPL_REG(DX, PERF_REG_X86_DX),
>> +       SMPL_REG(SI, PERF_REG_X86_SI),
>> +       SMPL_REG(DI, PERF_REG_X86_DI),
>> +       SMPL_REG(BP, PERF_REG_X86_BP),
>> +       SMPL_REG(SP, PERF_REG_X86_SP),
>> +       SMPL_REG(IP, PERF_REG_X86_IP),
>> +       SMPL_REG(FLAGS, PERF_REG_X86_FLAGS),
>> +       SMPL_REG(CS, PERF_REG_X86_CS),
>> +       SMPL_REG(SS, PERF_REG_X86_SS),
>> +#ifdef HAVE_ARCH_X86_64_SUPPORT
>> +       SMPL_REG(R8, PERF_REG_X86_R8),
>> +       SMPL_REG(R9, PERF_REG_X86_R9),
>> +       SMPL_REG(R10, PERF_REG_X86_R10),
>> +       SMPL_REG(R11, PERF_REG_X86_R11),
>> +       SMPL_REG(R12, PERF_REG_X86_R12),
>> +       SMPL_REG(R13, PERF_REG_X86_R13),
>> +       SMPL_REG(R14, PERF_REG_X86_R14),
>> +       SMPL_REG(R15, PERF_REG_X86_R15),
>> +       SMPL_REG(R16, PERF_REG_X86_R16),
>> +       SMPL_REG(R17, PERF_REG_X86_R17),
>> +       SMPL_REG(R18, PERF_REG_X86_R18),
>> +       SMPL_REG(R19, PERF_REG_X86_R19),
>> +       SMPL_REG(R20, PERF_REG_X86_R20),
>> +       SMPL_REG(R21, PERF_REG_X86_R21),
>> +       SMPL_REG(R22, PERF_REG_X86_R22),
>> +       SMPL_REG(R23, PERF_REG_X86_R23),
>> +       SMPL_REG(R24, PERF_REG_X86_R24),
>> +       SMPL_REG(R25, PERF_REG_X86_R25),
>> +       SMPL_REG(R26, PERF_REG_X86_R26),
>> +       SMPL_REG(R27, PERF_REG_X86_R27),
>> +       SMPL_REG(R28, PERF_REG_X86_R28),
>> +       SMPL_REG(R29, PERF_REG_X86_R29),
>> +       SMPL_REG(R30, PERF_REG_X86_R30),
>> +       SMPL_REG(R31, PERF_REG_X86_R31),
>> +       SMPL_REG(SSP, PERF_REG_X86_SSP),
>> +#endif
>> +       SMPL_REG_END
>> +};
>> +
>>  static const struct sample_reg sample_reg_masks[] = {
>>         SMPL_REG(AX, PERF_REG_X86_AX),
>>         SMPL_REG(BX, PERF_REG_X86_BX),
>> @@ -276,27 +319,404 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
>>         return SDT_ARG_VALID;
>>  }
>>
>> +static bool support_simd_reg(u64 sample_type, u16 qwords, u64 mask, bool pred)
> To make the code easier to read, it'd be nice to document sample_type,
> qwords and mask here.

Sure.


>
>> +{
>> +       struct perf_event_attr attr = {
>> +               .type                           = PERF_TYPE_HARDWARE,
>> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
>> +               .sample_type                    = sample_type,
>> +               .disabled                       = 1,
>> +               .exclude_kernel                 = 1,
>> +               .sample_simd_regs_enabled       = 1,
>> +       };
>> +       int fd;
>> +
>> +       attr.sample_period = 1;
>> +
>> +       if (!pred) {
>> +               attr.sample_simd_vec_reg_qwords = qwords;
>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>> +                       attr.sample_simd_vec_reg_intr = mask;
>> +               else
>> +                       attr.sample_simd_vec_reg_user = mask;
>> +       } else {
>> +               attr.sample_simd_pred_reg_qwords = PERF_X86_OPMASK_QWORDS;
>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>> +                       attr.sample_simd_pred_reg_intr = PERF_X86_SIMD_PRED_MASK;
>> +               else
>> +                       attr.sample_simd_pred_reg_user = PERF_X86_SIMD_PRED_MASK;
>> +       }
>> +
>> +       if (perf_pmus__num_core_pmus() > 1) {
>> +               struct perf_pmu *pmu = NULL;
>> +               __u64 type = PERF_TYPE_RAW;
> It should be okay to do:
> __u64 type = perf_pmus__find_core_pmu()->type
> rather than have the whole loop below.

Sure. Thanks.


>
>> +
>> +               /*
>> +                * The same register set is supported among different hybrid PMUs.
>> +                * Only check the first available one.
>> +                */
>> +               while ((pmu = perf_pmus__scan_core(pmu)) != NULL) {
>> +                       type = pmu->type;
>> +                       break;
>> +               }
>> +               attr.config |= type << PERF_PMU_TYPE_SHIFT;
>> +       }
>> +
>> +       event_attr_init(&attr);
>> +
>> +       fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
>> +       if (fd != -1) {
>> +               close(fd);
>> +               return true;
>> +       }
>> +
>> +       return false;
>> +}
>> +
>> +static bool __arch_simd_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
>> +{
>> +       bool supported = false;
>> +       u64 bits;
>> +
>> +       *mask = 0;
>> +       *qwords = 0;
>> +
>> +       switch (reg) {
>> +       case PERF_REG_X86_XMM:
>> +               bits = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
>> +               supported = support_simd_reg(sample_type, PERF_X86_XMM_QWORDS, bits, false);
>> +               if (supported) {
>> +                       *mask = bits;
>> +                       *qwords = PERF_X86_XMM_QWORDS;
>> +               }
>> +               break;
>> +       case PERF_REG_X86_YMM:
>> +               bits = BIT_ULL(PERF_X86_SIMD_YMM_REGS) - 1;
>> +               supported = support_simd_reg(sample_type, PERF_X86_YMM_QWORDS, bits, false);
>> +               if (supported) {
>> +                       *mask = bits;
>> +                       *qwords = PERF_X86_YMM_QWORDS;
>> +               }
>> +               break;
>> +       case PERF_REG_X86_ZMM:
>> +               bits = BIT_ULL(PERF_X86_SIMD_ZMM_REGS) - 1;
>> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
>> +               if (supported) {
>> +                       *mask = bits;
>> +                       *qwords = PERF_X86_ZMM_QWORDS;
>> +                       break;
>> +               }
>> +
>> +               bits = BIT_ULL(PERF_X86_SIMD_ZMMH_REGS) - 1;
>> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
>> +               if (supported) {
>> +                       *mask = bits;
>> +                       *qwords = PERF_X86_ZMMH_QWORDS;
>> +               }
>> +               break;
>> +       default:
>> +               break;
>> +       }
>> +
>> +       return supported;
>> +}
>> +
>> +static bool __arch_pred_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
>> +{
>> +       bool supported = false;
>> +       u64 bits;
>> +
>> +       *mask = 0;
>> +       *qwords = 0;
>> +
>> +       switch (reg) {
>> +       case PERF_REG_X86_OPMASK:
>> +               bits = BIT_ULL(PERF_X86_SIMD_OPMASK_REGS) - 1;
>> +               supported = support_simd_reg(sample_type, PERF_X86_OPMASK_QWORDS, bits, true);
>> +               if (supported) {
>> +                       *mask = bits;
>> +                       *qwords = PERF_X86_OPMASK_QWORDS;
>> +               }
>> +               break;
>> +       default:
>> +               break;
>> +       }
>> +
>> +       return supported;
>> +}
>> +
>> +static bool has_cap_simd_regs(void)
>> +{
>> +       uint64_t mask = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
>> +       u16 qwords = PERF_X86_XMM_QWORDS;
>> +       static bool has_cap_simd_regs;
>> +       static bool cached;
>> +
>> +       if (cached)
>> +               return has_cap_simd_regs;
>> +
>> +       has_cap_simd_regs = __arch_simd_reg_mask(PERF_SAMPLE_REGS_INTR,
>> +                                                PERF_REG_X86_XMM, &mask, &qwords);
>> +       has_cap_simd_regs |= __arch_simd_reg_mask(PERF_SAMPLE_REGS_USER,
>> +                                                PERF_REG_X86_XMM, &mask, &qwords);
>> +       cached = true;
>> +
>> +       return has_cap_simd_regs;
>> +}
>> +
>> +bool arch_has_simd_regs(u64 mask)
>> +{
>> +       return has_cap_simd_regs() &&
>> +              mask & GENMASK_ULL(PERF_REG_X86_SSP, PERF_REG_X86_R16);
>> +}
>> +
>> +static const struct sample_reg sample_simd_reg_masks[] = {
>> +       SMPL_REG(XMM, PERF_REG_X86_XMM),
>> +       SMPL_REG(YMM, PERF_REG_X86_YMM),
>> +       SMPL_REG(ZMM, PERF_REG_X86_ZMM),
>> +       SMPL_REG_END
>> +};
>> +
>> +static const struct sample_reg sample_pred_reg_masks[] = {
>> +       SMPL_REG(OPMASK, PERF_REG_X86_OPMASK),
>> +       SMPL_REG_END
>> +};
>> +
>> +const struct sample_reg *arch__sample_simd_reg_masks(void)
>> +{
>> +       return sample_simd_reg_masks;
>> +}
>> +
>> +const struct sample_reg *arch__sample_pred_reg_masks(void)
>> +{
>> +       return sample_pred_reg_masks;
>> +}
>> +
>> +static bool x86_intr_simd_updated;
>> +static u64 x86_intr_simd_reg_mask;
>> +static u64 x86_intr_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
>> +static u16 x86_intr_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
> Could we add some comments? I can kind of figure out the updated is a
> check for lazy initialization and what masks are, qwords is an odd
> one. The comment could also point out that SIMD doesn't mean the
> machine supports SIMD, but SIMD registers are supported in perf
> events.

Sure.


>
>> +static bool x86_user_simd_updated;
>> +static u64 x86_user_simd_reg_mask;
>> +static u64 x86_user_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
>> +static u16 x86_user_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
>> +
>> +static bool x86_intr_pred_updated;
>> +static u64 x86_intr_pred_reg_mask;
>> +static u64 x86_intr_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
>> +static u16 x86_intr_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
>> +static bool x86_user_pred_updated;
>> +static u64 x86_user_pred_reg_mask;
>> +static u64 x86_user_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
>> +static u16 x86_user_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
>> +
>> +static uint64_t __arch__simd_reg_mask(u64 sample_type)
>> +{
>> +       const struct sample_reg *r = NULL;
>> +       bool supported;
>> +       u64 mask = 0;
>> +       int reg;
>> +
>> +       if (!has_cap_simd_regs())
>> +               return 0;
>> +
>> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_simd_updated)
>> +               return x86_intr_simd_reg_mask;
>> +
>> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_simd_updated)
>> +               return x86_user_simd_reg_mask;
>> +
>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>> +               supported = false;
>> +
>> +               if (!r->mask)
>> +                       continue;
>> +               reg = fls64(r->mask) - 1;
>> +
>> +               if (reg >= PERF_REG_X86_MAX_SIMD_REGS)
>> +                       break;
>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>> +                       supported = __arch_simd_reg_mask(sample_type, reg,
>> +                                                        &x86_intr_simd_mask[reg],
>> +                                                        &x86_intr_simd_qwords[reg]);
>> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
>> +                       supported = __arch_simd_reg_mask(sample_type, reg,
>> +                                                        &x86_user_simd_mask[reg],
>> +                                                        &x86_user_simd_qwords[reg]);
>> +               if (supported)
>> +                       mask |= BIT_ULL(reg);
>> +       }
>> +
>> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
>> +               x86_intr_simd_reg_mask = mask;
>> +               x86_intr_simd_updated = true;
>> +       } else {
>> +               x86_user_simd_reg_mask = mask;
>> +               x86_user_simd_updated = true;
>> +       }
>> +
>> +       return mask;
>> +}
>> +
>> +static uint64_t __arch__pred_reg_mask(u64 sample_type)
>> +{
>> +       const struct sample_reg *r = NULL;
>> +       bool supported;
>> +       u64 mask = 0;
>> +       int reg;
>> +
>> +       if (!has_cap_simd_regs())
>> +               return 0;
>> +
>> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_pred_updated)
>> +               return x86_intr_pred_reg_mask;
>> +
>> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_pred_updated)
>> +               return x86_user_pred_reg_mask;
>> +
>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>> +               supported = false;
>> +
>> +               if (!r->mask)
>> +                       continue;
>> +               reg = fls64(r->mask) - 1;
>> +
>> +               if (reg >= PERF_REG_X86_MAX_PRED_REGS)
>> +                       break;
>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>> +                       supported = __arch_pred_reg_mask(sample_type, reg,
>> +                                                        &x86_intr_pred_mask[reg],
>> +                                                        &x86_intr_pred_qwords[reg]);
>> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
>> +                       supported = __arch_pred_reg_mask(sample_type, reg,
>> +                                                        &x86_user_pred_mask[reg],
>> +                                                        &x86_user_pred_qwords[reg]);
>> +               if (supported)
>> +                       mask |= BIT_ULL(reg);
>> +       }
>> +
>> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
>> +               x86_intr_pred_reg_mask = mask;
>> +               x86_intr_pred_updated = true;
>> +       } else {
>> +               x86_user_pred_reg_mask = mask;
>> +               x86_user_pred_updated = true;
>> +       }
>> +
>> +       return mask;
>> +}
> This feels repetitive with __arch__simd_reg_mask, could they be
> refactored together?

Hmm, it looks like we can extract the for loop into a common function. The
other parts are hard to generalize since they manipulate different
variables; unifying them would require lots of "if ... else" branches and
make the code harder to read.


>
>> +
>> +uint64_t arch__intr_simd_reg_mask(void)
>> +{
>> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_INTR);
>> +}
>> +
>> +uint64_t arch__user_simd_reg_mask(void)
>> +{
>> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_USER);
>> +}
>> +
>> +uint64_t arch__intr_pred_reg_mask(void)
>> +{
>> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_INTR);
>> +}
>> +
>> +uint64_t arch__user_pred_reg_mask(void)
>> +{
>> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_USER);
>> +}
>> +
>> +static uint64_t arch__simd_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
>> +{
>> +       uint64_t mask = 0;
>> +
>> +       *qwords = 0;
>> +       if (reg < PERF_REG_X86_MAX_SIMD_REGS) {
>> +               if (intr) {
>> +                       *qwords = x86_intr_simd_qwords[reg];
>> +                       mask = x86_intr_simd_mask[reg];
>> +               } else {
>> +                       *qwords = x86_user_simd_qwords[reg];
>> +                       mask = x86_user_simd_mask[reg];
>> +               }
>> +       }
>> +
>> +       return mask;
>> +}
>> +
>> +static uint64_t arch__pred_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
>> +{
>> +       uint64_t mask = 0;
>> +
>> +       *qwords = 0;
>> +       if (reg < PERF_REG_X86_MAX_PRED_REGS) {
>> +               if (intr) {
>> +                       *qwords = x86_intr_pred_qwords[reg];
>> +                       mask = x86_intr_pred_mask[reg];
>> +               } else {
>> +                       *qwords = x86_user_pred_qwords[reg];
>> +                       mask = x86_user_pred_mask[reg];
>> +               }
>> +       }
>> +
>> +       return mask;
>> +}
>> +
>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords)
>> +{
>> +       if (!x86_intr_simd_updated)
>> +               arch__intr_simd_reg_mask();
>> +       return arch__simd_reg_bitmap_qwords(reg, qwords, true);
>> +}
>> +
>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords)
>> +{
>> +       if (!x86_user_simd_updated)
>> +               arch__user_simd_reg_mask();
>> +       return arch__simd_reg_bitmap_qwords(reg, qwords, false);
>> +}
>> +
>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords)
>> +{
>> +       if (!x86_intr_pred_updated)
>> +               arch__intr_pred_reg_mask();
>> +       return arch__pred_reg_bitmap_qwords(reg, qwords, true);
>> +}
>> +
>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords)
>> +{
>> +       if (!x86_user_pred_updated)
>> +               arch__user_pred_reg_mask();
>> +       return arch__pred_reg_bitmap_qwords(reg, qwords, false);
>> +}
>> +
>>  const struct sample_reg *arch__sample_reg_masks(void)
>>  {
>> +       if (has_cap_simd_regs())
>> +               return sample_reg_masks_ext;
>>         return sample_reg_masks;
>>  }
>>
>> -uint64_t arch__intr_reg_mask(void)
>> +static uint64_t __arch__reg_mask(u64 sample_type, u64 mask, bool has_simd_regs)
>>  {
>>         struct perf_event_attr attr = {
>> -               .type                   = PERF_TYPE_HARDWARE,
>> -               .config                 = PERF_COUNT_HW_CPU_CYCLES,
>> -               .sample_type            = PERF_SAMPLE_REGS_INTR,
>> -               .sample_regs_intr       = PERF_REG_EXTENDED_MASK,
>> -               .precise_ip             = 1,
>> -               .disabled               = 1,
>> -               .exclude_kernel         = 1,
>> +               .type                           = PERF_TYPE_HARDWARE,
>> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
>> +               .sample_type                    = sample_type,
>> +               .precise_ip                     = 1,
>> +               .disabled                       = 1,
>> +               .exclude_kernel                 = 1,
>> +               .sample_simd_regs_enabled       = has_simd_regs,
>>         };
>>         int fd;
>>         /*
>>          * In an unnamed union, init it here to build on older gcc versions
>>          */
>>         attr.sample_period = 1;
>> +       if (sample_type == PERF_SAMPLE_REGS_INTR)
>> +               attr.sample_regs_intr = mask;
>> +       else
>> +               attr.sample_regs_user = mask;
>>
>>         if (perf_pmus__num_core_pmus() > 1) {
>>                 struct perf_pmu *pmu = NULL;
>> @@ -318,13 +738,41 @@ uint64_t arch__intr_reg_mask(void)
>>         fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
>>         if (fd != -1) {
>>                 close(fd);
>> -               return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
>> +               return mask;
>>         }
>>
>> -       return PERF_REGS_MASK;
>> +       return 0;
>> +}
>> +
>> +uint64_t arch__intr_reg_mask(void)
>> +{
>> +       uint64_t mask = PERF_REGS_MASK;
>> +
>> +       if (has_cap_simd_regs()) {
>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
>> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
>> +                                        true);
> It's nice to label constant arguments like this something like:
> /*has_simd_regs=*/true);
>
> Tools like clang-tidy even try to enforce the argument names match the comments.

Sure.


>
>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
>> +                                        BIT_ULL(PERF_REG_X86_SSP),
>> +                                        true);
>> +       } else
>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR, PERF_REG_EXTENDED_MASK, false);
>> +
>> +       return mask;
>>  }
>>
>>  uint64_t arch__user_reg_mask(void)
>>  {
>> -       return PERF_REGS_MASK;
>> +       uint64_t mask = PERF_REGS_MASK;
>> +
>> +       if (has_cap_simd_regs()) {
>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
>> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
>> +                                        true);
>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
>> +                                        BIT_ULL(PERF_REG_X86_SSP),
>> +                                        true);
>> +       }
>> +
>> +       return mask;
> The code is repetitive here, could we refactor into a single function
> passing in a user or instr value?

Sure. I'll extract the common part.


>
>>  }
>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>> index 56ebefd075f2..5d1d90cf9488 100644
>> --- a/tools/perf/util/evsel.c
>> +++ b/tools/perf/util/evsel.c
>> @@ -1461,12 +1461,39 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
>>         if (opts->sample_intr_regs && !evsel->no_aux_samples &&
>>             !evsel__is_dummy_event(evsel)) {
>>                 attr->sample_regs_intr = opts->sample_intr_regs;
>> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_intr);
>> +               evsel__set_sample_bit(evsel, REGS_INTR);
>> +       }
>> +
>> +       if ((opts->sample_intr_vec_regs || opts->sample_intr_pred_regs) &&
>> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
>> +               /* The pred qwords is to implies the set of SIMD registers is used */
>> +               if (opts->sample_pred_regs_qwords)
>> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
>> +               else
>> +                       attr->sample_simd_pred_reg_qwords = 1;
>> +               attr->sample_simd_vec_reg_intr = opts->sample_intr_vec_regs;
>> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
>> +               attr->sample_simd_pred_reg_intr = opts->sample_intr_pred_regs;
>>                 evsel__set_sample_bit(evsel, REGS_INTR);
>>         }
>>
>>         if (opts->sample_user_regs && !evsel->no_aux_samples &&
>>             !evsel__is_dummy_event(evsel)) {
>>                 attr->sample_regs_user |= opts->sample_user_regs;
>> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_user);
>> +               evsel__set_sample_bit(evsel, REGS_USER);
>> +       }
>> +
>> +       if ((opts->sample_user_vec_regs || opts->sample_user_pred_regs) &&
>> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
>> +               if (opts->sample_pred_regs_qwords)
>> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
>> +               else
>> +                       attr->sample_simd_pred_reg_qwords = 1;
>> +               attr->sample_simd_vec_reg_user = opts->sample_user_vec_regs;
>> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
>> +               attr->sample_simd_pred_reg_user = opts->sample_user_pred_regs;
>>                 evsel__set_sample_bit(evsel, REGS_USER);
>>         }
>>
>> diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
>> index cda1c620968e..0bd100392889 100644
>> --- a/tools/perf/util/parse-regs-options.c
>> +++ b/tools/perf/util/parse-regs-options.c
>> @@ -4,19 +4,139 @@
>>  #include <stdint.h>
>>  #include <string.h>
>>  #include <stdio.h>
>> +#include <linux/bitops.h>
>>  #include "util/debug.h"
>>  #include <subcmd/parse-options.h>
>>  #include "util/perf_regs.h"
>>  #include "util/parse-regs-options.h"
>> +#include "record.h"
>> +
>> +static void __print_simd_regs(bool intr, uint64_t simd_mask)
>> +{
>> +       const struct sample_reg *r = NULL;
>> +       uint64_t bitmap = 0;
>> +       u16 qwords = 0;
>> +       int reg_idx;
>> +
>> +       if (!simd_mask)
>> +               return;
>> +
>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>> +               if (!(r->mask & simd_mask))
>> +                       continue;
>> +               reg_idx = fls64(r->mask) - 1;
>> +               if (intr)
>> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
>> +               else
>> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
>> +               if (bitmap)
>> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
>> +       }
>> +}
>> +
>> +static void __print_pred_regs(bool intr, uint64_t pred_mask)
>> +{
>> +       const struct sample_reg *r = NULL;
>> +       uint64_t bitmap = 0;
>> +       u16 qwords = 0;
>> +       int reg_idx;
>> +
>> +       if (!pred_mask)
>> +               return;
>> +
>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>> +               if (!(r->mask & pred_mask))
>> +                       continue;
>> +               reg_idx = fls64(r->mask) - 1;
>> +               if (intr)
>> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
>> +               else
>> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
>> +               if (bitmap)
>> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
>> +       }
>> +}
>> +
>> +static bool __parse_simd_regs(struct record_opts *opts, char *s, bool intr)
>> +{
>> +       const struct sample_reg *r = NULL;
>> +       bool matched = false;
>> +       uint64_t bitmap = 0;
>> +       u16 qwords = 0;
>> +       int reg_idx;
>> +
>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>> +               if (strcasecmp(s, r->name))
>> +                       continue;
>> +               if (!fls64(r->mask))
>> +                       continue;
>> +               reg_idx = fls64(r->mask) - 1;
>> +               if (intr)
>> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
>> +               else
>> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
>> +               matched = true;
>> +               break;
>> +       }
>> +
>> +       /* Just need the highest qwords */
> I'm not following here. Does the bitmap need to handle gaps?

Currently no. In theory, the kernel allows user space to sample only a
subset of the SIMD registers, e.g., 0xff or 0xf0f for the XMM registers
(HW supports 16 XMM registers), but the perf tool doesn't support that,
to avoid introducing too much complexity. Moreover, I don't think end
users have such a requirement. In most cases, users only know which
kinds of SIMD registers their programs use, but usually neither know nor
care which exact SIMD register is used.
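For the record, the full-set restriction means the tool only ever generates contiguous low bitmaps, and the appended sample size follows directly from the register/qword counts. A standalone sketch (function names are illustrative, not perf-tool code):

```c
#include <stdint.h>

/* Full-set bitmap for nr_regs registers, e.g. 0xffff for the 16 XMM
 * registers; a sparse subset such as 0xf0f would be legal for the
 * kernel ABI but is never generated by the tool. */
static uint64_t full_reg_bitmap(unsigned int nr_regs)
{
	return nr_regs >= 64 ? ~0ULL : (1ULL << nr_regs) - 1;
}

/* Qwords of SIMD data appended to a sample, per the documented
 * layout: nr_vectors * vector_qwords + nr_pred * pred_qwords. */
static uint64_t simd_payload_qwords(uint16_t nr_vectors, uint16_t vector_qwords,
				    uint16_t nr_pred, uint16_t pred_qwords)
{
	return (uint64_t)nr_vectors * vector_qwords +
	       (uint64_t)nr_pred * pred_qwords;
}
```

With the ZMM/OPMASK example from the cover letter (32 vectors of 8 qwords plus 8 predicates of 1 qword) this gives 264 qwords of extra data per sample.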


>
>> +       if (qwords > opts->sample_vec_regs_qwords) {
>> +               opts->sample_vec_regs_qwords = qwords;
>> +               if (intr)
>> +                       opts->sample_intr_vec_regs = bitmap;
>> +               else
>> +                       opts->sample_user_vec_regs = bitmap;
>> +       }
>> +
>> +       return matched;
>> +}
>> +
>> +static bool __parse_pred_regs(struct record_opts *opts, char *s, bool intr)
>> +{
>> +       const struct sample_reg *r = NULL;
>> +       bool matched = false;
>> +       uint64_t bitmap = 0;
>> +       u16 qwords = 0;
>> +       int reg_idx;
>> +
>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>> +               if (strcasecmp(s, r->name))
>> +                       continue;
>> +               if (!fls64(r->mask))
>> +                       continue;
>> +               reg_idx = fls64(r->mask) - 1;
>> +               if (intr)
>> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
>> +               else
>> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
>> +               matched = true;
>> +               break;
>> +       }
>> +
>> +       /* Just need the highest qwords */
> Again repetitive, could we have a single function?

Yes, I suppose the for loop at least can be extracted as a common function.
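A standalone sketch of that common lookup, with a minimal stand-in for perf's `struct sample_reg` table and a local `fls64` so it compiles on its own (names are illustrative; the per-type bitmap/qwords lookup would stay in the callers):

```c
#include <stdint.h>
#include <string.h>
#include <strings.h>

/* Minimal stand-in for perf's struct sample_reg table entry. */
struct sample_reg {
	const char *name;
	uint64_t mask;
};

/* Local find-last-set, mirroring the kernel's fls64() semantics. */
static int fls64_local(uint64_t x)
{
	int r = 0;

	while (x) {
		r++;
		x >>= 1;
	}
	return r;
}

/* Common part of __parse_simd_regs()/__parse_pred_regs(): find the
 * table entry matching the name and return its register index, or -1
 * when nothing matches. */
static int parse_reg_name(const struct sample_reg *tbl, const char *s)
{
	for (; tbl->name; tbl++) {
		if (strcasecmp(s, tbl->name))
			continue;
		if (!fls64_local(tbl->mask))
			continue;
		return fls64_local(tbl->mask) - 1;
	}
	return -1;
}

/* Demo table, for illustration only. */
static const struct sample_reg demo_regs[] = {
	{ "XMM", 1ULL << 3 },
	{ "YMM", 1ULL << 4 },
	{ NULL, 0 },
};
```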


>
>> +       if (qwords > opts->sample_pred_regs_qwords) {
>> +               opts->sample_pred_regs_qwords = qwords;
>> +               if (intr)
>> +                       opts->sample_intr_pred_regs = bitmap;
>> +               else
>> +                       opts->sample_user_pred_regs = bitmap;
>> +       }
>> +
>> +       return matched;
>> +}
>>
>>  static int
>>  __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>  {
>>         uint64_t *mode = (uint64_t *)opt->value;
>>         const struct sample_reg *r = NULL;
>> +       struct record_opts *opts;
>>         char *s, *os = NULL, *p;
>> -       int ret = -1;
>> +       bool has_simd_regs = false;
>>         uint64_t mask;
>> +       uint64_t simd_mask;
>> +       uint64_t pred_mask;
>> +       int ret = -1;
>>
>>         if (unset)
>>                 return 0;
>> @@ -27,10 +147,17 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>         if (*mode)
>>                 return -1;
>>
>> -       if (intr)
>> +       if (intr) {
>> +               opts = container_of(opt->value, struct record_opts, sample_intr_regs);
>>                 mask = arch__intr_reg_mask();
>> -       else
>> +               simd_mask = arch__intr_simd_reg_mask();
>> +               pred_mask = arch__intr_pred_reg_mask();
>> +       } else {
>> +               opts = container_of(opt->value, struct record_opts, sample_user_regs);
>>                 mask = arch__user_reg_mask();
>> +               simd_mask = arch__user_simd_reg_mask();
>> +               pred_mask = arch__user_pred_reg_mask();
>> +       }
>>
>>         /* str may be NULL in case no arg is passed to -I */
>>         if (str) {
>> @@ -50,10 +177,24 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>                                         if (r->mask & mask)
>>                                                 fprintf(stderr, "%s ", r->name);
>>                                 }
>> +                               __print_simd_regs(intr, simd_mask);
>> +                               __print_pred_regs(intr, pred_mask);
>>                                 fputc('\n', stderr);
>>                                 /* just printing available regs */
>>                                 goto error;
>>                         }
>> +
>> +                       if (simd_mask) {
>> +                               has_simd_regs = __parse_simd_regs(opts, s, intr);
>> +                               if (has_simd_regs)
>> +                                       goto next;
>> +                       }
>> +                       if (pred_mask) {
>> +                               has_simd_regs = __parse_pred_regs(opts, s, intr);
>> +                               if (has_simd_regs)
>> +                                       goto next;
>> +                       }
>> +
>>                         for (r = arch__sample_reg_masks(); r->name; r++) {
>>                                 if ((r->mask & mask) && !strcasecmp(s, r->name))
>>                                         break;
>> @@ -65,7 +206,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>                         }
>>
>>                         *mode |= r->mask;
>> -
>> +next:
>>                         if (!p)
>>                                 break;
>>
>> @@ -75,7 +216,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>         ret = 0;
>>
>>         /* default to all possible regs */
>> -       if (*mode == 0)
>> +       if (*mode == 0 && !has_simd_regs)
>>                 *mode = mask;
>>  error:
>>         free(os);
>> diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
>> index 66b666d9ce64..fb0366d050cf 100644
>> --- a/tools/perf/util/perf_event_attr_fprintf.c
>> +++ b/tools/perf/util/perf_event_attr_fprintf.c
>> @@ -360,6 +360,12 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
>>         PRINT_ATTRf(aux_start_paused, p_unsigned);
>>         PRINT_ATTRf(aux_pause, p_unsigned);
>>         PRINT_ATTRf(aux_resume, p_unsigned);
>> +       PRINT_ATTRf(sample_simd_pred_reg_qwords, p_unsigned);
>> +       PRINT_ATTRf(sample_simd_pred_reg_intr, p_hex);
>> +       PRINT_ATTRf(sample_simd_pred_reg_user, p_hex);
>> +       PRINT_ATTRf(sample_simd_vec_reg_qwords, p_unsigned);
>> +       PRINT_ATTRf(sample_simd_vec_reg_intr, p_hex);
>> +       PRINT_ATTRf(sample_simd_vec_reg_user, p_hex);
>>
>>         return ret;
>>  }
>> diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
>> index 44b90bbf2d07..e8a9fabc92e6 100644
>> --- a/tools/perf/util/perf_regs.c
>> +++ b/tools/perf/util/perf_regs.c
>> @@ -11,6 +11,11 @@ int __weak arch_sdt_arg_parse_op(char *old_op __maybe_unused,
>>         return SDT_ARG_SKIP;
>>  }
>>
>> +bool __weak arch_has_simd_regs(u64 mask __maybe_unused)
>> +{
>> +       return false;
>> +}
>> +
>>  uint64_t __weak arch__intr_reg_mask(void)
>>  {
>>         return 0;
>> @@ -21,6 +26,50 @@ uint64_t __weak arch__user_reg_mask(void)
>>         return 0;
>>  }
>>
>> +uint64_t __weak arch__intr_simd_reg_mask(void)
>> +{
>> +       return 0;
>> +}
>> +
>> +uint64_t __weak arch__user_simd_reg_mask(void)
>> +{
>> +       return 0;
>> +}
>> +
>> +uint64_t __weak arch__intr_pred_reg_mask(void)
>> +{
>> +       return 0;
>> +}
>> +
>> +uint64_t __weak arch__user_pred_reg_mask(void)
>> +{
>> +       return 0;
>> +}
>> +
>> +uint64_t __weak arch__intr_simd_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
>> +{
>> +       *qwords = 0;
>> +       return 0;
>> +}
>> +
>> +uint64_t __weak arch__user_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>> +{
>> +       *qwords = 0;
>> +       return 0;
>> +}
>> +
>> +uint64_t __weak arch__intr_pred_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
>> +{
>> +       *qwords = 0;
>> +       return 0;
>> +}
>> +
>> +uint64_t __weak arch__user_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>> +{
>> +       *qwords = 0;
>> +       return 0;
>> +}
>> +
>>  static const struct sample_reg sample_reg_masks[] = {
>>         SMPL_REG_END
>>  };
>> @@ -30,6 +79,16 @@ const struct sample_reg * __weak arch__sample_reg_masks(void)
>>         return sample_reg_masks;
>>  }
>>
>> +const struct sample_reg * __weak arch__sample_simd_reg_masks(void)
>> +{
>> +       return sample_reg_masks;
>> +}
>> +
>> +const struct sample_reg * __weak arch__sample_pred_reg_masks(void)
>> +{
>> +       return sample_reg_masks;
>> +}
> Thinking out loud. I wonder if there is a way to hide the weak
> functions. It seems the support is tied to PMUs, particularly core
> PMUs, perhaps we can push things into pmu and arch pmu code. Then we
> ask the PMU to parse the register strings, set up the perf_event_attr,
> etc. I'm somewhat scared these functions will be used on the report
> rather than record side of things, thereby breaking perf.data support
> when the host kernel does or doesn't have the SIMD support.

Ian, I don't quite follow you.

I don't quite understand what we should do to "push things into pmu and
arch pmu code". The current SIMD registers support follows the same
approach as the general registers support. If we intend to change the
approach entirely, we'd better do that in an independent patch-set.

Why would these functions break perf.data report? perf-report checks
whether the PERF_SAMPLE_REGS_ABI_SIMD flag is set for each record; only
when the flag is set (indicating SIMD register data is appended to the
record) does perf-report try to parse the SIMD register data.
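A minimal sketch of that consumer-side check, assuming the record layout from the cover letter. The struct and flag names here are illustrative; the actual ABI bit value comes from the uapi header:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative flag value; the real one is defined in the uapi header. */
#define SAMPLE_REGS_ABI_SIMD_FLAG (1ULL << 2)

/* Header preceding the SIMD register data in a sample record. */
struct simd_regs_hdr {
	uint16_t nr_vectors;
	uint16_t vector_qwords;
	uint16_t nr_pred;
	uint16_t pred_qwords;
};

/* Bytes of SIMD data appended to the record, or 0 when the ABI flag
 * is clear and no SIMD section follows the general registers. */
static size_t simd_data_bytes(uint64_t abi, const struct simd_regs_hdr *hdr)
{
	if (!(abi & SAMPLE_REGS_ABI_SIMD_FLAG))
		return 0;
	return ((size_t)hdr->nr_vectors * hdr->vector_qwords +
		(size_t)hdr->nr_pred * hdr->pred_qwords) * sizeof(uint64_t);
}
```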


>
> Thanks,
> Ian
>
>> +
>>  const char *perf_reg_name(int id, const char *arch)
>>  {
>>         const char *reg_name = NULL;
>> diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
>> index f2d0736d65cc..bce9c4cfd1bf 100644
>> --- a/tools/perf/util/perf_regs.h
>> +++ b/tools/perf/util/perf_regs.h
>> @@ -24,9 +24,20 @@ enum {
>>  };
>>
>>  int arch_sdt_arg_parse_op(char *old_op, char **new_op);
>> +bool arch_has_simd_regs(u64 mask);
>>  uint64_t arch__intr_reg_mask(void);
>>  uint64_t arch__user_reg_mask(void);
>>  const struct sample_reg *arch__sample_reg_masks(void);
>> +const struct sample_reg *arch__sample_simd_reg_masks(void);
>> +const struct sample_reg *arch__sample_pred_reg_masks(void);
>> +uint64_t arch__intr_simd_reg_mask(void);
>> +uint64_t arch__user_simd_reg_mask(void);
>> +uint64_t arch__intr_pred_reg_mask(void);
>> +uint64_t arch__user_pred_reg_mask(void);
>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>
>>  const char *perf_reg_name(int id, const char *arch);
>>  int perf_reg_value(u64 *valp, struct regs_dump *regs, int id);
>> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
>> index ea3a6c4657ee..825ffb4cc53f 100644
>> --- a/tools/perf/util/record.h
>> +++ b/tools/perf/util/record.h
>> @@ -59,7 +59,13 @@ struct record_opts {
>>         unsigned int  user_freq;
>>         u64           branch_stack;
>>         u64           sample_intr_regs;
>> +       u64           sample_intr_vec_regs;
>>         u64           sample_user_regs;
>> +       u64           sample_user_vec_regs;
>> +       u16           sample_pred_regs_qwords;
>> +       u16           sample_vec_regs_qwords;
>> +       u16           sample_intr_pred_regs;
>> +       u16           sample_user_pred_regs;
>>         u64           default_interval;
>>         u64           user_interval;
>>         size_t        auxtrace_snapshot_size;
>> --
>> 2.34.1
>>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf
  2025-12-04  0:24 ` [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Ian Rogers
@ 2025-12-04  3:28   ` Mi, Dapeng
  0 siblings, 0 replies; 86+ messages in thread
From: Mi, Dapeng @ 2025-12-04  3:28 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao


On 12/4/2025 8:24 AM, Ian Rogers wrote:
> On Tue, Dec 2, 2025 at 10:58 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
>> Changes since V4:
>> - Rewrite some functions comments and commit messages (Dave)
>> - Add arch-PEBS based SIMD/eGPRs/SSP sampling support (Patch 15/19)
>> - Fix "suspicious NMI" warning observed on PTL/NVL P-core and DMR by
>>   activating back-to-back NMI detection mechanism (Patch 16/19)
>> - Fix some minor issues on perf-tool patches (Patch 18/19)
>>
>> Changes since V3:
>> - Drop the SIMD registers if an NMI hits kernel mode for REGS_USER.
>> - Only dump the available regs, rather than zero and dump the
>>   unavailable regs. It's possible that the dumped registers are a subset
>>   of the requested registers.
>> - Some minor updates to address Dapeng's comments in V3.
>>
>> Changes since V2:
>> - Use the FPU format for the x86_pmu.ext_regs_mask as well
>> - Add a check before invoking xsaves_nmi()
>> - Add perf_simd_reg_check() to retrieve the number of available
>>   registers. If the kernel fails to get the requested registers, e.g.,
>>   XSAVES fails, nothing dumps to the userspace (the V2 dumps all 0s).
>> - Add POC perf tool patches
>>
>> Changes since V1:
>> - Apply the new interfaces to configure and dump the SIMD registers
>> - Utilize the existing FPU functions, e.g., xstate_calculate_size,
>>   get_xsave_addr().
>>
>> Starting from Intel Ice Lake, XMM registers can be collected in a PEBS
>> record. Future Architecture PEBS will include additional registers such
>> as YMM, ZMM, OPMASK, SSP and APX eGPRs, contingent on hardware support.
>>
>> This patch set introduces a software solution to mitigate the hardware
>> requirement by utilizing the XSAVES command to retrieve the requested
>> registers in the overflow handler. This feature is no longer limited to
>> PEBS events or specific platforms. While the hardware solution remains
>> preferable due to its lower overhead and higher accuracy, this software
>> approach provides a viable alternative.
>>
>> The solution is theoretically compatible with all x86 platforms but is
>> currently enabled on newer platforms, including Sapphire Rapids and
>> later P-core server platforms, Sierra Forest and later E-core server
>> platforms and recent Client platforms, like Arrow Lake, Panther Lake and
>> Nova Lake.
>>
>> Newly supported registers include YMM, ZMM, OPMASK, SSP, and APX eGPRs.
>> Due to space constraints in sample_regs_user/intr, new fields have been
>> introduced in the perf_event_attr structure to accommodate these
>> registers.
>>
>> After a long discussion in V1,
>> https://lore.kernel.org/lkml/3f1c9a9e-cb63-47ff-a5e9-06555fa6cc9a@linux.intel.com/
>> The below new fields are introduced.
>>
>> @@ -543,6 +545,25 @@ struct perf_event_attr {
>>         __u64   sig_data;
>>
>>         __u64   config3; /* extension of config2 */
>> +
>> +
>> +       /*
>> +        * Defines set of SIMD registers to dump on samples.
>> +        * The sample_simd_regs_enabled !=0 implies the
>> +        * set of SIMD registers is used to config all SIMD registers.
>> +        * If !sample_simd_regs_enabled, sample_regs_XXX may be used to
>> +        * config some SIMD registers on X86.
>> +        */
>> +       union {
>> +               __u16 sample_simd_regs_enabled;
>> +               __u16 sample_simd_pred_reg_qwords;
>> +       };
>> +       __u32 sample_simd_pred_reg_intr;
>> +       __u32 sample_simd_pred_reg_user;
>> +       __u16 sample_simd_vec_reg_qwords;
>> +       __u64 sample_simd_vec_reg_intr;
>> +       __u64 sample_simd_vec_reg_user;
>> +       __u32 __reserved_4;
>>  };
>> @@ -1016,7 +1037,15 @@ enum perf_event_type {
>>          *      } && PERF_SAMPLE_BRANCH_STACK
>>          *
>>          *      { u64                   abi; # enum perf_sample_regs_abi
>> -        *        u64                   regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
>> +        *        u64                   regs[weight(mask)];
>> +        *        struct {
>> +        *              u16 nr_vectors;
>> +        *              u16 vector_qwords;
>> +        *              u16 nr_pred;
>> +        *              u16 pred_qwords;
>> +        *              u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
>> +        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
>> +        *      } && PERF_SAMPLE_REGS_USER
>>          *
>>          *      { u64                   size;
>>          *        char                  data[size];
>> @@ -1043,7 +1072,15 @@ enum perf_event_type {
>>          *      { u64                   data_src; } && PERF_SAMPLE_DATA_SRC
>>          *      { u64                   transaction; } && PERF_SAMPLE_TRANSACTION
>>          *      { u64                   abi; # enum perf_sample_regs_abi
>> -        *        u64                   regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
>> +        *        u64                   regs[weight(mask)];
>> +        *        struct {
>> +        *              u16 nr_vectors;
>> +        *              u16 vector_qwords;
>> +        *              u16 nr_pred;
>> +        *              u16 pred_qwords;
>> +        *              u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
>> +        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
>> +        *      } && PERF_SAMPLE_REGS_INTR
>>          *      { u64                   phys_addr;} && PERF_SAMPLE_PHYS_ADDR
>>          *      { u64                   cgroup;} && PERF_SAMPLE_CGROUP
>>          *      { u64                   data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
>>
>>
>> To maintain simplicity, a single field, sample_simd_{vec|pred}_reg_qwords,
>> is introduced to indicate register width. For example:
>> - sample_simd_vec_reg_qwords = 2 for XMM registers (128 bits) on x86
>> - sample_simd_vec_reg_qwords = 4 for YMM registers (256 bits) on x86
>>
>> Four additional fields, sample_simd_{vec|pred}_reg_{intr|user}, represent
>> the bitmap of sampling registers. For instance, the bitmap for x86
>> XMM registers is 0xffff (16 XMM registers). Although users can
>> theoretically sample a subset of registers, the current perf-tool
>> implementation supports sampling all registers of each type to avoid
>> complexity.
>>
>> A new ABI, PERF_SAMPLE_REGS_ABI_SIMD, is introduced to signal user space
>> tools about the presence of SIMD registers in sampling records. When this
>> flag is detected, tools should recognize that extra SIMD register data
>> follows the general register data. The layout of the extra SIMD register
>> data is displayed as follows.
>>
>>    u16 nr_vectors;
>>    u16 vector_qwords;
>>    u16 nr_pred;
>>    u16 pred_qwords;
>>    u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
>>
>> With this patch set, sampling for the aforementioned registers is
>> supported on the Intel Nova Lake platform.
>>
>> Examples:
>>  $perf record -I?
>>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
> nit: It seems strange in this output to mix ranges like "XMM0-15" but
> then list out "R8....R31". That said we have tests that explicitly
> look for the non-range pattern:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/tests/shell/record.sh?h=perf-tools-next#n106

The reason we list each GPR separately is that each GPR, including R16 ~
R31, can be sampled independently, although the kernel reads the eGPRs
(R16 ~ R31) as a whole by leveraging the xsaves instruction. SIMD
registers, however, can only be sampled and shown as a whole.

That's why we display the registers in the current format.


>
> Thanks,
> Ian
>
>>  $perf record --user-regs=?
>>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>>
>>  $perf record -e branches:p -Iax,bx,r8,r16,r31,ssp,xmm,ymm,zmm,opmask -c 100000 ./test
>>  $perf report -D
>>
>>  ... ...
>>  14027761992115 0xcf30 [0x8a8]: PERF_RECORD_SAMPLE(IP, 0x1): 29964/29964:
>>  0xffffffff9f085e24 period: 100000 addr: 0
>>  ... intr regs: mask 0x18001010003 ABI 64-bit
>>  .... AX    0xdffffc0000000000
>>  .... BX    0xffff8882297685e8
>>  .... R8    0x0000000000000000
>>  .... R16   0x0000000000000000
>>  .... R31   0x0000000000000000
>>  .... SSP   0x0000000000000000
>>  ... SIMD ABI nr_vectors 32 vector_qwords 8 nr_pred 8 pred_qwords 1
>>  .... ZMM  [0] 0xffffffffffffffff
>>  .... ZMM  [0] 0x0000000000000001
>>  .... ZMM  [0] 0x0000000000000000
>>  .... ZMM  [0] 0x0000000000000000
>>  .... ZMM  [0] 0x0000000000000000
>>  .... ZMM  [0] 0x0000000000000000
>>  .... ZMM  [0] 0x0000000000000000
>>  .... ZMM  [0] 0x0000000000000000
>>  .... ZMM  [1] 0x003a6b6165506d56
>>  ... ...
>>  .... ZMM  [31] 0x0000000000000000
>>  .... ZMM  [31] 0x0000000000000000
>>  .... ZMM  [31] 0x0000000000000000
>>  .... ZMM  [31] 0x0000000000000000
>>  .... ZMM  [31] 0x0000000000000000
>>  .... ZMM  [31] 0x0000000000000000
>>  .... ZMM  [31] 0x0000000000000000
>>  .... ZMM  [31] 0x0000000000000000
>>  .... OPMASK[0] 0x00000000fffffe00
>>  .... OPMASK[1] 0x0000000000ffffff
>>  .... OPMASK[2] 0x000000000000007f
>>  .... OPMASK[3] 0x0000000000000000
>>  .... OPMASK[4] 0x0000000000010080
>>  .... OPMASK[5] 0x0000000000000000
>>  .... OPMASK[6] 0x0000400004000000
>>  .... OPMASK[7] 0x0000000000000000
>>  ... ...
>>
>>
>> History:
>>   v4: https://lore.kernel.org/all/20250925061213.178796-1-dapeng1.mi@linux.intel.com/
>>   v3: https://lore.kernel.org/lkml/20250815213435.1702022-1-kan.liang@linux.intel.com/
>>   v2: https://lore.kernel.org/lkml/20250626195610.405379-1-kan.liang@linux.intel.com/
>>   v1: https://lore.kernel.org/lkml/20250613134943.3186517-1-kan.liang@linux.intel.com/
>>
>> Dapeng Mi (3):
>>   perf: Eliminate duplicate arch-specific functions definations
>>   perf/x86/intel: Enable arch-PEBS based SIMD/eGPRs/SSP sampling
>>   perf/x86: Activate back-to-back NMI detection for arch-PEBS induced
>>     NMIs
>>
>> Kan Liang (16):
>>   perf/x86: Use x86_perf_regs in the x86 nmi handler
>>   perf/x86: Introduce x86-specific x86_pmu_setup_regs_data()
>>   x86/fpu/xstate: Add xsaves_nmi() helper
>>   perf: Move and rename has_extended_regs() for ARCH-specific use
>>   perf/x86: Add support for XMM registers in non-PEBS and REGS_USER
>>   perf: Add sampling support for SIMD registers
>>   perf/x86: Enable XMM sampling using sample_simd_vec_reg_* fields
>>   perf/x86: Enable YMM sampling using sample_simd_vec_reg_* fields
>>   perf/x86: Enable ZMM sampling using sample_simd_vec_reg_* fields
>>   perf/x86: Enable OPMASK sampling using sample_simd_pred_reg_* fields
>>   perf/x86: Enable eGPRs sampling using sample_regs_* fields
>>   perf/x86: Enable SSP sampling using sample_regs_* fields
>>   perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS capability
>>   perf headers: Sync with the kernel headers
>>   perf parse-regs: Support new SIMD sampling format
>>   perf regs: Enable dumping of SIMD registers
>>
>>  arch/arm/kernel/perf_regs.c                   |   8 +-
>>  arch/arm64/kernel/perf_regs.c                 |   8 +-
>>  arch/csky/kernel/perf_regs.c                  |   8 +-
>>  arch/loongarch/kernel/perf_regs.c             |   8 +-
>>  arch/mips/kernel/perf_regs.c                  |   8 +-
>>  arch/parisc/kernel/perf_regs.c                |   8 +-
>>  arch/powerpc/perf/perf_regs.c                 |   2 +-
>>  arch/riscv/kernel/perf_regs.c                 |   8 +-
>>  arch/s390/kernel/perf_regs.c                  |   2 +-
>>  arch/x86/events/core.c                        | 326 +++++++++++-
>>  arch/x86/events/intel/core.c                  | 117 ++++-
>>  arch/x86/events/intel/ds.c                    | 134 ++++-
>>  arch/x86/events/perf_event.h                  |  85 +++-
>>  arch/x86/include/asm/fpu/xstate.h             |   3 +
>>  arch/x86/include/asm/msr-index.h              |   7 +
>>  arch/x86/include/asm/perf_event.h             |  38 +-
>>  arch/x86/include/uapi/asm/perf_regs.h         |  62 +++
>>  arch/x86/kernel/fpu/xstate.c                  |  25 +-
>>  arch/x86/kernel/perf_regs.c                   | 131 ++++-
>>  include/linux/perf_event.h                    |  16 +
>>  include/linux/perf_regs.h                     |  36 +-
>>  include/uapi/linux/perf_event.h               |  45 +-
>>  kernel/events/core.c                          | 132 ++++-
>>  tools/arch/x86/include/uapi/asm/perf_regs.h   |  62 +++
>>  tools/include/uapi/linux/perf_event.h         |  45 +-
>>  tools/perf/arch/x86/util/perf_regs.c          | 470 +++++++++++++++++-
>>  tools/perf/util/evsel.c                       |  47 ++
>>  tools/perf/util/parse-regs-options.c          | 151 +++++-
>>  .../perf/util/perf-regs-arch/perf_regs_x86.c  |  43 ++
>>  tools/perf/util/perf_event_attr_fprintf.c     |   6 +
>>  tools/perf/util/perf_regs.c                   |  59 +++
>>  tools/perf/util/perf_regs.h                   |  11 +
>>  tools/perf/util/record.h                      |   6 +
>>  tools/perf/util/sample.h                      |  10 +
>>  tools/perf/util/session.c                     |  78 ++-
>>  35 files changed, 2012 insertions(+), 193 deletions(-)
>>
>>
>> base-commit: 9929dffce5ed7e2988e0274f4db98035508b16d9
>> prerequisite-patch-id: a15bcd62a8dcd219d17489eef88b66ea5488a2a0
>> --
>> 2.34.1
>>


* Re: [Patch v5 17/19] perf headers: Sync with the kernel headers
  2025-12-04  1:37     ` Mi, Dapeng
@ 2025-12-04  7:28       ` Ian Rogers
  0 siblings, 0 replies; 86+ messages in thread
From: Ian Rogers @ 2025-12-04  7:28 UTC (permalink / raw)
  To: Mi, Dapeng
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang

On Wed, Dec 3, 2025 at 5:38 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>
>
> On 12/4/2025 7:43 AM, Ian Rogers wrote:
> > On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
> >> From: Kan Liang <kan.liang@linux.intel.com>
> >>
> >> Update include/uapi/linux/perf_event.h and
> >> arch/x86/include/uapi/asm/perf_regs.h to support extended regs.
> >>
> >> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> >> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >> ---
> >>  tools/arch/x86/include/uapi/asm/perf_regs.h | 62 +++++++++++++++++++++
> >>  tools/include/uapi/linux/perf_event.h       | 45 +++++++++++++--
> >>  2 files changed, 103 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/tools/arch/x86/include/uapi/asm/perf_regs.h b/tools/arch/x86/include/uapi/asm/perf_regs.h
> >> index 7c9d2bb3833b..f3561ed10041 100644
> >> --- a/tools/arch/x86/include/uapi/asm/perf_regs.h
> >> +++ b/tools/arch/x86/include/uapi/asm/perf_regs.h
> >> @@ -27,9 +27,34 @@ enum perf_event_x86_regs {
> >>         PERF_REG_X86_R13,
> >>         PERF_REG_X86_R14,
> >>         PERF_REG_X86_R15,
> >> +       /*
> >> +        * The EGPRs/SSP and XMM have overlaps. Only one can be used
> >> +        * at a time. For the ABI type PERF_SAMPLE_REGS_ABI_SIMD,
> >> +        * utilize EGPRs/SSP. For the other ABI type, XMM is used.
> >> +        *
> >> +        * Extended GPRs (EGPRs)
> >> +        */
> >> +       PERF_REG_X86_R16,
> >> +       PERF_REG_X86_R17,
> >> +       PERF_REG_X86_R18,
> >> +       PERF_REG_X86_R19,
> >> +       PERF_REG_X86_R20,
> >> +       PERF_REG_X86_R21,
> >> +       PERF_REG_X86_R22,
> >> +       PERF_REG_X86_R23,
> >> +       PERF_REG_X86_R24,
> >> +       PERF_REG_X86_R25,
> >> +       PERF_REG_X86_R26,
> >> +       PERF_REG_X86_R27,
> >> +       PERF_REG_X86_R28,
> >> +       PERF_REG_X86_R29,
> >> +       PERF_REG_X86_R30,
> >> +       PERF_REG_X86_R31,
> >> +       PERF_REG_X86_SSP,
> >>         /* These are the limits for the GPRs. */
> >>         PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
> >>         PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1,
> >> +       PERF_REG_MISC_MAX = PERF_REG_X86_SSP + 1,
> > I wonder whether MISC is the most intention-revealing name. What happens
> > if things are extended again? Would APX be a better alternative, i.e.
> > PERF_REG_APX_MAX?
>
> Hmm, I don't think PERF_REG_APX_MAX is a good name either since there is
> SSP as well besides the APX eGPRs, and there could be more registers
> introduced in the future.
>
> How about PERF_REG_X86_EXTD_MAX?

Sounds good to me, especially with the eGPR already using the term extended.

Thanks,
Ian
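
As a minimal sketch of the index arithmetic behind the naming discussion above: the bit positions below assume the enum ordering shown in the quoted hunk (AX..GS = 0..15, R8..R15 = 16..23, R16..R31 = 24..39, SSP = 40), and the GENMASK_ULL definition mirrors the kernel's include/linux/bits.h; the values here are illustrative, not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Sets bits l..h inclusive, as in the kernel's bits.h. */
#define GENMASK_ULL(h, l) \
	(((~0ULL) << (l)) & (~0ULL >> (63 - (h))))

/* Bit positions assumed from the quoted enum ordering. */
#define PERF_REG_X86_R16	24
#define PERF_REG_X86_R31	39
#define PERF_REG_X86_SSP	40
#define PERF_REG_X86_XMM0	32

#define PERF_X86_EGPRS_MASK	GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16)
```

With these positions, PERF_X86_EGPRS_MASK covers bits 24..39 (0x000000ffff000000), and the XMM block starting at bit 32 falls inside the EGPR/SSP range, which is why the quoted comment says only one ABI flavour can be encoded at a time.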

> >
> >>         /* These all need two bits set because they are 128bit */
> >>         PERF_REG_X86_XMM0  = 32,
> >> @@ -54,5 +79,42 @@ enum perf_event_x86_regs {
> >>  };
> >>
> >>  #define PERF_REG_EXTENDED_MASK (~((1ULL << PERF_REG_X86_XMM0) - 1))
> >> +#define PERF_X86_EGPRS_MASK    GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16)
> >> +
> >> +enum {
> >> +       PERF_REG_X86_XMM,
> >> +       PERF_REG_X86_YMM,
> >> +       PERF_REG_X86_ZMM,
> >> +       PERF_REG_X86_MAX_SIMD_REGS,
> >> +
> >> +       PERF_REG_X86_OPMASK = 0,
> >> +       PERF_REG_X86_MAX_PRED_REGS = 1,
> >> +};
> >> +
> >> +enum {
> >> +       PERF_X86_SIMD_XMM_REGS      = 16,
> >> +       PERF_X86_SIMD_YMM_REGS      = 16,
> >> +       PERF_X86_SIMD_ZMMH_REGS     = 16,
> >> +       PERF_X86_SIMD_ZMM_REGS      = 32,
> >> +       PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_ZMM_REGS,
> >> +
> >> +       PERF_X86_SIMD_OPMASK_REGS   = 8,
> >> +       PERF_X86_SIMD_PRED_REGS_MAX = PERF_X86_SIMD_OPMASK_REGS,
> >> +};
> >> +
> >> +#define PERF_X86_SIMD_PRED_MASK                GENMASK(PERF_X86_SIMD_PRED_REGS_MAX - 1, 0)
> >> +#define PERF_X86_SIMD_VEC_MASK         GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
> >> +
> >> +#define PERF_X86_H16ZMM_BASE           PERF_X86_SIMD_ZMMH_REGS
> >> +
> >> +enum {
> >> +       PERF_X86_OPMASK_QWORDS   = 1,
> >> +       PERF_X86_XMM_QWORDS      = 2,
> >> +       PERF_X86_YMMH_QWORDS     = 2,
> >> +       PERF_X86_YMM_QWORDS      = 4,
> >> +       PERF_X86_ZMMH_QWORDS     = 4,
> >> +       PERF_X86_ZMM_QWORDS      = 8,
> >> +       PERF_X86_SIMD_QWORDS_MAX = PERF_X86_ZMM_QWORDS,
> >> +};
> >>
> >>  #endif /* _ASM_X86_PERF_REGS_H */
> >> diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
> >> index d292f96bc06f..f1474da32622 100644
> >> --- a/tools/include/uapi/linux/perf_event.h
> >> +++ b/tools/include/uapi/linux/perf_event.h
> >> @@ -314,8 +314,9 @@ enum {
> >>   */
> >>  enum perf_sample_regs_abi {
> >>         PERF_SAMPLE_REGS_ABI_NONE               = 0,
> >> -       PERF_SAMPLE_REGS_ABI_32                 = 1,
> >> -       PERF_SAMPLE_REGS_ABI_64                 = 2,
> >> +       PERF_SAMPLE_REGS_ABI_32                 = (1 << 0),
> >> +       PERF_SAMPLE_REGS_ABI_64                 = (1 << 1),
> >> +       PERF_SAMPLE_REGS_ABI_SIMD               = (1 << 2),
> >>  };
> >>
> >>  /*
> >> @@ -382,6 +383,7 @@ enum perf_event_read_format {
> >>  #define PERF_ATTR_SIZE_VER6                    120     /* Add: aux_sample_size */
> >>  #define PERF_ATTR_SIZE_VER7                    128     /* Add: sig_data */
> >>  #define PERF_ATTR_SIZE_VER8                    136     /* Add: config3 */
> >> +#define PERF_ATTR_SIZE_VER9                    168     /* Add: sample_simd_{pred,vec}_reg_* */
> > ARM have added a config4 in:
> > https://lore.kernel.org/lkml/20251111-james-perf-feat_spe_eft-v10-1-1e1b5bf2cd05@linaro.org/
> > so this will need to be VER10.
>
> Thanks. It looks like the ARM changes have been merged, so we can change
> it to VER10 in the next version.
>
>
> >
> > Thanks,
> > Ian
> >
> >>  /*
> >>   * 'struct perf_event_attr' contains various attributes that define
> >> @@ -545,6 +547,25 @@ struct perf_event_attr {
> >>         __u64   sig_data;
> >>
> >>         __u64   config3; /* extension of config2 */
> >> +
> >> +
> >> +       /*
> >> +        * Defines set of SIMD registers to dump on samples.
> >> +        * The sample_simd_regs_enabled !=0 implies the
> >> +        * set of SIMD registers is used to config all SIMD registers.
> >> +        * If !sample_simd_regs_enabled, sample_regs_XXX may be used to
> >> +        * config some SIMD registers on X86.
> >> +        */
> >> +       union {
> >> +               __u16 sample_simd_regs_enabled;
> >> +               __u16 sample_simd_pred_reg_qwords;
> >> +       };
> >> +       __u32 sample_simd_pred_reg_intr;
> >> +       __u32 sample_simd_pred_reg_user;
> >> +       __u16 sample_simd_vec_reg_qwords;
> >> +       __u64 sample_simd_vec_reg_intr;
> >> +       __u64 sample_simd_vec_reg_user;
> >> +       __u32 __reserved_4;
> >>  };
> >>
> >>  /*
> >> @@ -1018,7 +1039,15 @@ enum perf_event_type {
> >>          *      } && PERF_SAMPLE_BRANCH_STACK
> >>          *
> >>          *      { u64                   abi; # enum perf_sample_regs_abi
> >> -        *        u64                   regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
> >> +        *        u64                   regs[weight(mask)];
> >> +        *        struct {
> >> +        *              u16 nr_vectors;
> >> +        *              u16 vector_qwords;
> >> +        *              u16 nr_pred;
> >> +        *              u16 pred_qwords;
> >> +        *              u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
> >> +        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
> >> +        *      } && PERF_SAMPLE_REGS_USER
> >>          *
> >>          *      { u64                   size;
> >>          *        char                  data[size];
> >> @@ -1045,7 +1074,15 @@ enum perf_event_type {
> >>          *      { u64                   data_src; } && PERF_SAMPLE_DATA_SRC
> >>          *      { u64                   transaction; } && PERF_SAMPLE_TRANSACTION
> >>          *      { u64                   abi; # enum perf_sample_regs_abi
> >> -        *        u64                   regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
> >> +        *        u64                   regs[weight(mask)];
> >> +        *        struct {
> >> +        *              u16 nr_vectors;
> >> +        *              u16 vector_qwords;
> >> +        *              u16 nr_pred;
> >> +        *              u16 pred_qwords;
> >> +        *              u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
> >> +        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
> >> +        *      } && PERF_SAMPLE_REGS_INTR
> >>          *      { u64                   phys_addr;} && PERF_SAMPLE_PHYS_ADDR
> >>          *      { u64                   cgroup;} && PERF_SAMPLE_CGROUP
> >>          *      { u64                   data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
> >> --
> >> 2.34.1
> >>
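
As a minimal sketch of how a consumer might size the SIMD portion of a sample record, following the layout in the quoted perf_event.h comment (u16 nr_vectors, u16 vector_qwords, u16 nr_pred, u16 pred_qwords, then data[]); the helper name is made up for illustration and is not part of the patch.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Number of u64 entries in the data[] array that follows the four u16
 * header fields of the PERF_SAMPLE_REGS_ABI_SIMD record:
 *   data[nr_vectors * vector_qwords + nr_pred * pred_qwords]
 */
static inline size_t simd_payload_qwords(uint16_t nr_vectors,
					 uint16_t vector_qwords,
					 uint16_t nr_pred,
					 uint16_t pred_qwords)
{
	return (size_t)nr_vectors * vector_qwords +
	       (size_t)nr_pred * pred_qwords;
}
```

For example, 16 XMM registers at 2 qwords each plus 8 OPMASK registers at 1 qword each would give a 40-entry data[] array, matching the PERF_X86_*_QWORDS constants in the quoted perf_regs.h hunk.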

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
  2025-12-04  2:58     ` Mi, Dapeng
@ 2025-12-04  7:49       ` Ian Rogers
  2025-12-04  9:20         ` Mi, Dapeng
  0 siblings, 1 reply; 86+ messages in thread
From: Ian Rogers @ 2025-12-04  7:49 UTC (permalink / raw)
  To: Mi, Dapeng
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang

On Wed, Dec 3, 2025 at 6:58 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>
>
> On 12/4/2025 8:17 AM, Ian Rogers wrote:
> > On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
> >> From: Kan Liang <kan.liang@linux.intel.com>
> >>
> >> This patch adds support for the newly introduced SIMD register sampling
> >> format by adding the following functions:
> >>
> >> uint64_t arch__intr_simd_reg_mask(void);
> >> uint64_t arch__user_simd_reg_mask(void);
> >> uint64_t arch__intr_pred_reg_mask(void);
> >> uint64_t arch__user_pred_reg_mask(void);
> >> uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >> uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >> uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >> uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>
> >> The arch__{intr|user}_simd_reg_mask() functions retrieve the bitmap of
> >> supported SIMD registers, such as XMM/YMM/ZMM on x86 platforms.
> >>
> >> The arch__{intr|user}_pred_reg_mask() functions retrieve the bitmap of
> >> supported PRED registers, such as OPMASK on x86 platforms.
> >>
> >> The arch__{intr|user}_simd_reg_bitmap_qwords() functions provide the
> >> exact bitmap and number of qwords for a specific type of SIMD register.
> >> For example, for XMM registers on x86 platforms, the returned bitmap is
> >> 0xffff (XMM0 ~ XMM15) and the qwords number is 2 (128 bits for each XMM).
> >>
> >> The arch__{intr|user}_pred_reg_bitmap_qwords() functions provide the
> >> exact bitmap and number of qwords for a specific type of PRED register.
> >> For example, for OPMASK registers on x86 platforms, the returned bitmap
> >> is 0xff (OPMASK0 ~ OPMASK7) and the qwords number is 1 (64 bits for each
> >> OPMASK).
> >>
> >> Additionally, the function __parse_regs() is enhanced to support parsing
> >> these newly introduced SIMD registers. Currently, each type of register
> >> can only be sampled collectively; sampling a specific SIMD register is
> >> not supported. For example, all XMM registers are sampled together rather
> >> than sampling only XMM0.
> >>
> >> When multiple overlapping register types, such as XMM and YMM, are
> >> sampled simultaneously, only the superset (YMM registers) is sampled.
> >>
> >> With this patch, all supported sampling registers on x86 platforms are
> >> displayed as follows.
> >>
> >>  $perf record -I?
> >>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
> >>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
> >>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
> >>
> >>  $perf record --user-regs=?
> >>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
> >>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
> >>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
> >>
> >> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> >> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >> ---
> >>  tools/perf/arch/x86/util/perf_regs.c      | 470 +++++++++++++++++++++-
> >>  tools/perf/util/evsel.c                   |  27 ++
> >>  tools/perf/util/parse-regs-options.c      | 151 ++++++-
> >>  tools/perf/util/perf_event_attr_fprintf.c |   6 +
> >>  tools/perf/util/perf_regs.c               |  59 +++
> >>  tools/perf/util/perf_regs.h               |  11 +
> >>  tools/perf/util/record.h                  |   6 +
> >>  7 files changed, 714 insertions(+), 16 deletions(-)
> >>
> >> diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
> >> index 12fd93f04802..db41430f3b07 100644
> >> --- a/tools/perf/arch/x86/util/perf_regs.c
> >> +++ b/tools/perf/arch/x86/util/perf_regs.c
> >> @@ -13,6 +13,49 @@
> >>  #include "../../../util/pmu.h"
> >>  #include "../../../util/pmus.h"
> >>
> >> +static const struct sample_reg sample_reg_masks_ext[] = {
> >> +       SMPL_REG(AX, PERF_REG_X86_AX),
> >> +       SMPL_REG(BX, PERF_REG_X86_BX),
> >> +       SMPL_REG(CX, PERF_REG_X86_CX),
> >> +       SMPL_REG(DX, PERF_REG_X86_DX),
> >> +       SMPL_REG(SI, PERF_REG_X86_SI),
> >> +       SMPL_REG(DI, PERF_REG_X86_DI),
> >> +       SMPL_REG(BP, PERF_REG_X86_BP),
> >> +       SMPL_REG(SP, PERF_REG_X86_SP),
> >> +       SMPL_REG(IP, PERF_REG_X86_IP),
> >> +       SMPL_REG(FLAGS, PERF_REG_X86_FLAGS),
> >> +       SMPL_REG(CS, PERF_REG_X86_CS),
> >> +       SMPL_REG(SS, PERF_REG_X86_SS),
> >> +#ifdef HAVE_ARCH_X86_64_SUPPORT
> >> +       SMPL_REG(R8, PERF_REG_X86_R8),
> >> +       SMPL_REG(R9, PERF_REG_X86_R9),
> >> +       SMPL_REG(R10, PERF_REG_X86_R10),
> >> +       SMPL_REG(R11, PERF_REG_X86_R11),
> >> +       SMPL_REG(R12, PERF_REG_X86_R12),
> >> +       SMPL_REG(R13, PERF_REG_X86_R13),
> >> +       SMPL_REG(R14, PERF_REG_X86_R14),
> >> +       SMPL_REG(R15, PERF_REG_X86_R15),
> >> +       SMPL_REG(R16, PERF_REG_X86_R16),
> >> +       SMPL_REG(R17, PERF_REG_X86_R17),
> >> +       SMPL_REG(R18, PERF_REG_X86_R18),
> >> +       SMPL_REG(R19, PERF_REG_X86_R19),
> >> +       SMPL_REG(R20, PERF_REG_X86_R20),
> >> +       SMPL_REG(R21, PERF_REG_X86_R21),
> >> +       SMPL_REG(R22, PERF_REG_X86_R22),
> >> +       SMPL_REG(R23, PERF_REG_X86_R23),
> >> +       SMPL_REG(R24, PERF_REG_X86_R24),
> >> +       SMPL_REG(R25, PERF_REG_X86_R25),
> >> +       SMPL_REG(R26, PERF_REG_X86_R26),
> >> +       SMPL_REG(R27, PERF_REG_X86_R27),
> >> +       SMPL_REG(R28, PERF_REG_X86_R28),
> >> +       SMPL_REG(R29, PERF_REG_X86_R29),
> >> +       SMPL_REG(R30, PERF_REG_X86_R30),
> >> +       SMPL_REG(R31, PERF_REG_X86_R31),
> >> +       SMPL_REG(SSP, PERF_REG_X86_SSP),
> >> +#endif
> >> +       SMPL_REG_END
> >> +};
> >> +
> >>  static const struct sample_reg sample_reg_masks[] = {
> >>         SMPL_REG(AX, PERF_REG_X86_AX),
> >>         SMPL_REG(BX, PERF_REG_X86_BX),
> >> @@ -276,27 +319,404 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
> >>         return SDT_ARG_VALID;
> >>  }
> >>
> >> +static bool support_simd_reg(u64 sample_type, u16 qwords, u64 mask, bool pred)
> > To make the code easier to read, it'd be nice to document sample_type,
> > qwords and mask here.
>
> Sure.
>
>
> >
> >> +{
> >> +       struct perf_event_attr attr = {
> >> +               .type                           = PERF_TYPE_HARDWARE,
> >> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
> >> +               .sample_type                    = sample_type,
> >> +               .disabled                       = 1,
> >> +               .exclude_kernel                 = 1,
> >> +               .sample_simd_regs_enabled       = 1,
> >> +       };
> >> +       int fd;
> >> +
> >> +       attr.sample_period = 1;
> >> +
> >> +       if (!pred) {
> >> +               attr.sample_simd_vec_reg_qwords = qwords;
> >> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> >> +                       attr.sample_simd_vec_reg_intr = mask;
> >> +               else
> >> +                       attr.sample_simd_vec_reg_user = mask;
> >> +       } else {
> >> +               attr.sample_simd_pred_reg_qwords = PERF_X86_OPMASK_QWORDS;
> >> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> >> +                       attr.sample_simd_pred_reg_intr = PERF_X86_SIMD_PRED_MASK;
> >> +               else
> >> +                       attr.sample_simd_pred_reg_user = PERF_X86_SIMD_PRED_MASK;
> >> +       }
> >> +
> >> +       if (perf_pmus__num_core_pmus() > 1) {
> >> +               struct perf_pmu *pmu = NULL;
> >> +               __u64 type = PERF_TYPE_RAW;
> > It should be okay to do:
> > __u64 type = perf_pmus__find_core_pmu()->type
> > rather than have the whole loop below.
>
> Sure. Thanks.
>
>
> >
> >> +
> >> +               /*
> >> +                * The same register set is supported among different hybrid PMUs.
> >> +                * Only check the first available one.
> >> +                */
> >> +               while ((pmu = perf_pmus__scan_core(pmu)) != NULL) {
> >> +                       type = pmu->type;
> >> +                       break;
> >> +               }
> >> +               attr.config |= type << PERF_PMU_TYPE_SHIFT;
> >> +       }
> >> +
> >> +       event_attr_init(&attr);
> >> +
> >> +       fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
> >> +       if (fd != -1) {
> >> +               close(fd);
> >> +               return true;
> >> +       }
> >> +
> >> +       return false;
> >> +}
> >> +
> >> +static bool __arch_simd_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
> >> +{
> >> +       bool supported = false;
> >> +       u64 bits;
> >> +
> >> +       *mask = 0;
> >> +       *qwords = 0;
> >> +
> >> +       switch (reg) {
> >> +       case PERF_REG_X86_XMM:
> >> +               bits = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
> >> +               supported = support_simd_reg(sample_type, PERF_X86_XMM_QWORDS, bits, false);
> >> +               if (supported) {
> >> +                       *mask = bits;
> >> +                       *qwords = PERF_X86_XMM_QWORDS;
> >> +               }
> >> +               break;
> >> +       case PERF_REG_X86_YMM:
> >> +               bits = BIT_ULL(PERF_X86_SIMD_YMM_REGS) - 1;
> >> +               supported = support_simd_reg(sample_type, PERF_X86_YMM_QWORDS, bits, false);
> >> +               if (supported) {
> >> +                       *mask = bits;
> >> +                       *qwords = PERF_X86_YMM_QWORDS;
> >> +               }
> >> +               break;
> >> +       case PERF_REG_X86_ZMM:
> >> +               bits = BIT_ULL(PERF_X86_SIMD_ZMM_REGS) - 1;
> >> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
> >> +               if (supported) {
> >> +                       *mask = bits;
> >> +                       *qwords = PERF_X86_ZMM_QWORDS;
> >> +                       break;
> >> +               }
> >> +
> >> +               bits = BIT_ULL(PERF_X86_SIMD_ZMMH_REGS) - 1;
> >> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
> >> +               if (supported) {
> >> +                       *mask = bits;
> >> +                       *qwords = PERF_X86_ZMMH_QWORDS;
> >> +               }
> >> +               break;
> >> +       default:
> >> +               break;
> >> +       }
> >> +
> >> +       return supported;
> >> +}
> >> +
> >> +static bool __arch_pred_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
> >> +{
> >> +       bool supported = false;
> >> +       u64 bits;
> >> +
> >> +       *mask = 0;
> >> +       *qwords = 0;
> >> +
> >> +       switch (reg) {
> >> +       case PERF_REG_X86_OPMASK:
> >> +               bits = BIT_ULL(PERF_X86_SIMD_OPMASK_REGS) - 1;
> >> +               supported = support_simd_reg(sample_type, PERF_X86_OPMASK_QWORDS, bits, true);
> >> +               if (supported) {
> >> +                       *mask = bits;
> >> +                       *qwords = PERF_X86_OPMASK_QWORDS;
> >> +               }
> >> +               break;
> >> +       default:
> >> +               break;
> >> +       }
> >> +
> >> +       return supported;
> >> +}
> >> +
> >> +static bool has_cap_simd_regs(void)
> >> +{
> >> +       uint64_t mask = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
> >> +       u16 qwords = PERF_X86_XMM_QWORDS;
> >> +       static bool has_cap_simd_regs;
> >> +       static bool cached;
> >> +
> >> +       if (cached)
> >> +               return has_cap_simd_regs;
> >> +
> >> +       has_cap_simd_regs = __arch_simd_reg_mask(PERF_SAMPLE_REGS_INTR,
> >> +                                                PERF_REG_X86_XMM, &mask, &qwords);
> >> +       has_cap_simd_regs |= __arch_simd_reg_mask(PERF_SAMPLE_REGS_USER,
> >> +                                                PERF_REG_X86_XMM, &mask, &qwords);
> >> +       cached = true;
> >> +
> >> +       return has_cap_simd_regs;
> >> +}
> >> +
> >> +bool arch_has_simd_regs(u64 mask)
> >> +{
> >> +       return has_cap_simd_regs() &&
> >> +              mask & GENMASK_ULL(PERF_REG_X86_SSP, PERF_REG_X86_R16);
> >> +}
> >> +
> >> +static const struct sample_reg sample_simd_reg_masks[] = {
> >> +       SMPL_REG(XMM, PERF_REG_X86_XMM),
> >> +       SMPL_REG(YMM, PERF_REG_X86_YMM),
> >> +       SMPL_REG(ZMM, PERF_REG_X86_ZMM),
> >> +       SMPL_REG_END
> >> +};
> >> +
> >> +static const struct sample_reg sample_pred_reg_masks[] = {
> >> +       SMPL_REG(OPMASK, PERF_REG_X86_OPMASK),
> >> +       SMPL_REG_END
> >> +};
> >> +
> >> +const struct sample_reg *arch__sample_simd_reg_masks(void)
> >> +{
> >> +       return sample_simd_reg_masks;
> >> +}
> >> +
> >> +const struct sample_reg *arch__sample_pred_reg_masks(void)
> >> +{
> >> +       return sample_pred_reg_masks;
> >> +}
> >> +
> >> +static bool x86_intr_simd_updated;
> >> +static u64 x86_intr_simd_reg_mask;
> >> +static u64 x86_intr_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
> >> +static u16 x86_intr_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
> > Could we add some comments? I can kind of figure out that "updated" is a
> > check for lazy initialization and what the masks are, but "qwords" is an
> > odd one. The comment could also point out that SIMD here doesn't mean
> > the machine supports SIMD, but that SIMD registers are supported in
> > perf events.
>
> Sure.
>
>
> >
> >> +static bool x86_user_simd_updated;
> >> +static u64 x86_user_simd_reg_mask;
> >> +static u64 x86_user_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
> >> +static u16 x86_user_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
> >> +
> >> +static bool x86_intr_pred_updated;
> >> +static u64 x86_intr_pred_reg_mask;
> >> +static u64 x86_intr_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
> >> +static u16 x86_intr_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
> >> +static bool x86_user_pred_updated;
> >> +static u64 x86_user_pred_reg_mask;
> >> +static u64 x86_user_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
> >> +static u16 x86_user_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
> >> +
> >> +static uint64_t __arch__simd_reg_mask(u64 sample_type)
> >> +{
> >> +       const struct sample_reg *r = NULL;
> >> +       bool supported;
> >> +       u64 mask = 0;
> >> +       int reg;
> >> +
> >> +       if (!has_cap_simd_regs())
> >> +               return 0;
> >> +
> >> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_simd_updated)
> >> +               return x86_intr_simd_reg_mask;
> >> +
> >> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_simd_updated)
> >> +               return x86_user_simd_reg_mask;
> >> +
> >> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> >> +               supported = false;
> >> +
> >> +               if (!r->mask)
> >> +                       continue;
> >> +               reg = fls64(r->mask) - 1;
> >> +
> >> +               if (reg >= PERF_REG_X86_MAX_SIMD_REGS)
> >> +                       break;
> >> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> >> +                       supported = __arch_simd_reg_mask(sample_type, reg,
> >> +                                                        &x86_intr_simd_mask[reg],
> >> +                                                        &x86_intr_simd_qwords[reg]);
> >> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
> >> +                       supported = __arch_simd_reg_mask(sample_type, reg,
> >> +                                                        &x86_user_simd_mask[reg],
> >> +                                                        &x86_user_simd_qwords[reg]);
> >> +               if (supported)
> >> +                       mask |= BIT_ULL(reg);
> >> +       }
> >> +
> >> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
> >> +               x86_intr_simd_reg_mask = mask;
> >> +               x86_intr_simd_updated = true;
> >> +       } else {
> >> +               x86_user_simd_reg_mask = mask;
> >> +               x86_user_simd_updated = true;
> >> +       }
> >> +
> >> +       return mask;
> >> +}
> >> +
> >> +static uint64_t __arch__pred_reg_mask(u64 sample_type)
> >> +{
> >> +       const struct sample_reg *r = NULL;
> >> +       bool supported;
> >> +       u64 mask = 0;
> >> +       int reg;
> >> +
> >> +       if (!has_cap_simd_regs())
> >> +               return 0;
> >> +
> >> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_pred_updated)
> >> +               return x86_intr_pred_reg_mask;
> >> +
> >> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_pred_updated)
> >> +               return x86_user_pred_reg_mask;
> >> +
> >> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> >> +               supported = false;
> >> +
> >> +               if (!r->mask)
> >> +                       continue;
> >> +               reg = fls64(r->mask) - 1;
> >> +
> >> +               if (reg >= PERF_REG_X86_MAX_PRED_REGS)
> >> +                       break;
> >> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> >> +                       supported = __arch_pred_reg_mask(sample_type, reg,
> >> +                                                        &x86_intr_pred_mask[reg],
> >> +                                                        &x86_intr_pred_qwords[reg]);
> >> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
> >> +                       supported = __arch_pred_reg_mask(sample_type, reg,
> >> +                                                        &x86_user_pred_mask[reg],
> >> +                                                        &x86_user_pred_qwords[reg]);
> >> +               if (supported)
> >> +                       mask |= BIT_ULL(reg);
> >> +       }
> >> +
> >> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
> >> +               x86_intr_pred_reg_mask = mask;
> >> +               x86_intr_pred_updated = true;
> >> +       } else {
> >> +               x86_user_pred_reg_mask = mask;
> >> +               x86_user_pred_updated = true;
> >> +       }
> >> +
> >> +       return mask;
> >> +}
> > This feels repetitive with __arch__simd_reg_mask, could they be
> > refactored together?
>
> Hmm, it looks like we can extract the for loop into a common function. The
> other parts are hard to generalize since they manipulate different
> variables. If we want to generalize them, we have to introduce lots of "if
> ... else" branches, and that would make the code hard to read.
>
>
> >
> >> +
> >> +uint64_t arch__intr_simd_reg_mask(void)
> >> +{
> >> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_INTR);
> >> +}
> >> +
> >> +uint64_t arch__user_simd_reg_mask(void)
> >> +{
> >> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_USER);
> >> +}
> >> +
> >> +uint64_t arch__intr_pred_reg_mask(void)
> >> +{
> >> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_INTR);
> >> +}
> >> +
> >> +uint64_t arch__user_pred_reg_mask(void)
> >> +{
> >> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_USER);
> >> +}
> >> +
> >> +static uint64_t arch__simd_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
> >> +{
> >> +       uint64_t mask = 0;
> >> +
> >> +       *qwords = 0;
> >> +       if (reg < PERF_REG_X86_MAX_SIMD_REGS) {
> >> +               if (intr) {
> >> +                       *qwords = x86_intr_simd_qwords[reg];
> >> +                       mask = x86_intr_simd_mask[reg];
> >> +               } else {
> >> +                       *qwords = x86_user_simd_qwords[reg];
> >> +                       mask = x86_user_simd_mask[reg];
> >> +               }
> >> +       }
> >> +
> >> +       return mask;
> >> +}
> >> +
> >> +static uint64_t arch__pred_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
> >> +{
> >> +       uint64_t mask = 0;
> >> +
> >> +       *qwords = 0;
> >> +       if (reg < PERF_REG_X86_MAX_PRED_REGS) {
> >> +               if (intr) {
> >> +                       *qwords = x86_intr_pred_qwords[reg];
> >> +                       mask = x86_intr_pred_mask[reg];
> >> +               } else {
> >> +                       *qwords = x86_user_pred_qwords[reg];
> >> +                       mask = x86_user_pred_mask[reg];
> >> +               }
> >> +       }
> >> +
> >> +       return mask;
> >> +}
> >> +
> >> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords)
> >> +{
> >> +       if (!x86_intr_simd_updated)
> >> +               arch__intr_simd_reg_mask();
> >> +       return arch__simd_reg_bitmap_qwords(reg, qwords, true);
> >> +}
> >> +
> >> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords)
> >> +{
> >> +       if (!x86_user_simd_updated)
> >> +               arch__user_simd_reg_mask();
> >> +       return arch__simd_reg_bitmap_qwords(reg, qwords, false);
> >> +}
> >> +
> >> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords)
> >> +{
> >> +       if (!x86_intr_pred_updated)
> >> +               arch__intr_pred_reg_mask();
> >> +       return arch__pred_reg_bitmap_qwords(reg, qwords, true);
> >> +}
> >> +
> >> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords)
> >> +{
> >> +       if (!x86_user_pred_updated)
> >> +               arch__user_pred_reg_mask();
> >> +       return arch__pred_reg_bitmap_qwords(reg, qwords, false);
> >> +}
> >> +
> >>  const struct sample_reg *arch__sample_reg_masks(void)
> >>  {
> >> +       if (has_cap_simd_regs())
> >> +               return sample_reg_masks_ext;
> >>         return sample_reg_masks;
> >>  }
> >>
> >> -uint64_t arch__intr_reg_mask(void)
> >> +static uint64_t __arch__reg_mask(u64 sample_type, u64 mask, bool has_simd_regs)
> >>  {
> >>         struct perf_event_attr attr = {
> >> -               .type                   = PERF_TYPE_HARDWARE,
> >> -               .config                 = PERF_COUNT_HW_CPU_CYCLES,
> >> -               .sample_type            = PERF_SAMPLE_REGS_INTR,
> >> -               .sample_regs_intr       = PERF_REG_EXTENDED_MASK,
> >> -               .precise_ip             = 1,
> >> -               .disabled               = 1,
> >> -               .exclude_kernel         = 1,
> >> +               .type                           = PERF_TYPE_HARDWARE,
> >> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
> >> +               .sample_type                    = sample_type,
> >> +               .precise_ip                     = 1,
> >> +               .disabled                       = 1,
> >> +               .exclude_kernel                 = 1,
> >> +               .sample_simd_regs_enabled       = has_simd_regs,
> >>         };
> >>         int fd;
> >>         /*
> >>          * In an unnamed union, init it here to build on older gcc versions
> >>          */
> >>         attr.sample_period = 1;
> >> +       if (sample_type == PERF_SAMPLE_REGS_INTR)
> >> +               attr.sample_regs_intr = mask;
> >> +       else
> >> +               attr.sample_regs_user = mask;
> >>
> >>         if (perf_pmus__num_core_pmus() > 1) {
> >>                 struct perf_pmu *pmu = NULL;
> >> @@ -318,13 +738,41 @@ uint64_t arch__intr_reg_mask(void)
> >>         fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
> >>         if (fd != -1) {
> >>                 close(fd);
> >> -               return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
> >> +               return mask;
> >>         }
> >>
> >> -       return PERF_REGS_MASK;
> >> +       return 0;
> >> +}
> >> +
> >> +uint64_t arch__intr_reg_mask(void)
> >> +{
> >> +       uint64_t mask = PERF_REGS_MASK;
> >> +
> >> +       if (has_cap_simd_regs()) {
> >> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
> >> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
> >> +                                        true);
> > It's nice to label constant arguments like this something like:
> > /*has_simd_regs=*/true);
> >
> > Tools like clang-tidy even try to enforce the argument names match the comments.
>
> Sure.
>
>
> >
> >> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
> >> +                                        BIT_ULL(PERF_REG_X86_SSP),
> >> +                                        true);
> >> +       } else
> >> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR, PERF_REG_EXTENDED_MASK, false);
> >> +
> >> +       return mask;
> >>  }
> >>
> >>  uint64_t arch__user_reg_mask(void)
> >>  {
> >> -       return PERF_REGS_MASK;
> >> +       uint64_t mask = PERF_REGS_MASK;
> >> +
> >> +       if (has_cap_simd_regs()) {
> >> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
> >> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
> >> +                                        true);
> >> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
> >> +                                        BIT_ULL(PERF_REG_X86_SSP),
> >> +                                        true);
> >> +       }
> >> +
> >> +       return mask;
> > The code is repetitive here, could we refactor into a single function
> > passing in a user or instr value?
>
> Sure. I'll extract the common part.
>
>
> >
> >>  }
> >> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> >> index 56ebefd075f2..5d1d90cf9488 100644
> >> --- a/tools/perf/util/evsel.c
> >> +++ b/tools/perf/util/evsel.c
> >> @@ -1461,12 +1461,39 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
> >>         if (opts->sample_intr_regs && !evsel->no_aux_samples &&
> >>             !evsel__is_dummy_event(evsel)) {
> >>                 attr->sample_regs_intr = opts->sample_intr_regs;
> >> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_intr);
> >> +               evsel__set_sample_bit(evsel, REGS_INTR);
> >> +       }
> >> +
> >> +       if ((opts->sample_intr_vec_regs || opts->sample_intr_pred_regs) &&
> >> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
> >> +               /* A non-zero pred qwords implies the set of SIMD registers is used */
> >> +               if (opts->sample_pred_regs_qwords)
> >> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
> >> +               else
> >> +                       attr->sample_simd_pred_reg_qwords = 1;
> >> +               attr->sample_simd_vec_reg_intr = opts->sample_intr_vec_regs;
> >> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
> >> +               attr->sample_simd_pred_reg_intr = opts->sample_intr_pred_regs;
> >>                 evsel__set_sample_bit(evsel, REGS_INTR);
> >>         }
> >>
> >>         if (opts->sample_user_regs && !evsel->no_aux_samples &&
> >>             !evsel__is_dummy_event(evsel)) {
> >>                 attr->sample_regs_user |= opts->sample_user_regs;
> >> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_user);
> >> +               evsel__set_sample_bit(evsel, REGS_USER);
> >> +       }
> >> +
> >> +       if ((opts->sample_user_vec_regs || opts->sample_user_pred_regs) &&
> >> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
> >> +               if (opts->sample_pred_regs_qwords)
> >> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
> >> +               else
> >> +                       attr->sample_simd_pred_reg_qwords = 1;
> >> +               attr->sample_simd_vec_reg_user = opts->sample_user_vec_regs;
> >> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
> >> +               attr->sample_simd_pred_reg_user = opts->sample_user_pred_regs;
> >>                 evsel__set_sample_bit(evsel, REGS_USER);
> >>         }
> >>
> >> diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
> >> index cda1c620968e..0bd100392889 100644
> >> --- a/tools/perf/util/parse-regs-options.c
> >> +++ b/tools/perf/util/parse-regs-options.c
> >> @@ -4,19 +4,139 @@
> >>  #include <stdint.h>
> >>  #include <string.h>
> >>  #include <stdio.h>
> >> +#include <linux/bitops.h>
> >>  #include "util/debug.h"
> >>  #include <subcmd/parse-options.h>
> >>  #include "util/perf_regs.h"
> >>  #include "util/parse-regs-options.h"
> >> +#include "record.h"
> >> +
> >> +static void __print_simd_regs(bool intr, uint64_t simd_mask)
> >> +{
> >> +       const struct sample_reg *r = NULL;
> >> +       uint64_t bitmap = 0;
> >> +       u16 qwords = 0;
> >> +       int reg_idx;
> >> +
> >> +       if (!simd_mask)
> >> +               return;
> >> +
> >> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> >> +               if (!(r->mask & simd_mask))
> >> +                       continue;
> >> +               reg_idx = fls64(r->mask) - 1;
> >> +               if (intr)
> >> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >> +               else
> >> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >> +               if (bitmap)
> >> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
> >> +       }
> >> +}
> >> +
> >> +static void __print_pred_regs(bool intr, uint64_t pred_mask)
> >> +{
> >> +       const struct sample_reg *r = NULL;
> >> +       uint64_t bitmap = 0;
> >> +       u16 qwords = 0;
> >> +       int reg_idx;
> >> +
> >> +       if (!pred_mask)
> >> +               return;
> >> +
> >> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> >> +               if (!(r->mask & pred_mask))
> >> +                       continue;
> >> +               reg_idx = fls64(r->mask) - 1;
> >> +               if (intr)
> >> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >> +               else
> >> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >> +               if (bitmap)
> >> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
> >> +       }
> >> +}
> >> +
> >> +static bool __parse_simd_regs(struct record_opts *opts, char *s, bool intr)
> >> +{
> >> +       const struct sample_reg *r = NULL;
> >> +       bool matched = false;
> >> +       uint64_t bitmap = 0;
> >> +       u16 qwords = 0;
> >> +       int reg_idx;
> >> +
> >> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> >> +               if (strcasecmp(s, r->name))
> >> +                       continue;
> >> +               if (!fls64(r->mask))
> >> +                       continue;
> >> +               reg_idx = fls64(r->mask) - 1;
> >> +               if (intr)
> >> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >> +               else
> >> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >> +               matched = true;
> >> +               break;
> >> +       }
> >> +
> >> +       /* Just need the highest qwords */
> > I'm not following here. Does the bitmap need to handle gaps?
>
> Currently no. In theory, the kernel allows user space to sample only a
> subset of the SIMD registers, e.g., 0xff or 0xf0f for the XMM registers
> (the HW supports 16 XMM registers), but perf tools doesn't support that,
> to avoid introducing too much complexity. Moreover, I don't think end
> users have such a requirement. In most cases, users only know which
> kinds of SIMD registers their programs use; they usually don't know or
> care which exact SIMD register is used.
>
>
> >
> >> +       if (qwords > opts->sample_vec_regs_qwords) {
> >> +               opts->sample_vec_regs_qwords = qwords;
> >> +               if (intr)
> >> +                       opts->sample_intr_vec_regs = bitmap;
> >> +               else
> >> +                       opts->sample_user_vec_regs = bitmap;
> >> +       }
> >> +
> >> +       return matched;
> >> +}
> >> +
> >> +static bool __parse_pred_regs(struct record_opts *opts, char *s, bool intr)
> >> +{
> >> +       const struct sample_reg *r = NULL;
> >> +       bool matched = false;
> >> +       uint64_t bitmap = 0;
> >> +       u16 qwords = 0;
> >> +       int reg_idx;
> >> +
> >> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> >> +               if (strcasecmp(s, r->name))
> >> +                       continue;
> >> +               if (!fls64(r->mask))
> >> +                       continue;
> >> +               reg_idx = fls64(r->mask) - 1;
> >> +               if (intr)
> >> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >> +               else
> >> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >> +               matched = true;
> >> +               break;
> >> +       }
> >> +
> >> +       /* Just need the highest qwords */
> > Again repetitive, could we have a single function?
>
> Yes, I suppose the for loop at least can be extracted as a common function.
>
>
> >
> >> +       if (qwords > opts->sample_pred_regs_qwords) {
> >> +               opts->sample_pred_regs_qwords = qwords;
> >> +               if (intr)
> >> +                       opts->sample_intr_pred_regs = bitmap;
> >> +               else
> >> +                       opts->sample_user_pred_regs = bitmap;
> >> +       }
> >> +
> >> +       return matched;
> >> +}
> >>
> >>  static int
> >>  __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>  {
> >>         uint64_t *mode = (uint64_t *)opt->value;
> >>         const struct sample_reg *r = NULL;
> >> +       struct record_opts *opts;
> >>         char *s, *os = NULL, *p;
> >> -       int ret = -1;
> >> +       bool has_simd_regs = false;
> >>         uint64_t mask;
> >> +       uint64_t simd_mask;
> >> +       uint64_t pred_mask;
> >> +       int ret = -1;
> >>
> >>         if (unset)
> >>                 return 0;
> >> @@ -27,10 +147,17 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>         if (*mode)
> >>                 return -1;
> >>
> >> -       if (intr)
> >> +       if (intr) {
> >> +               opts = container_of(opt->value, struct record_opts, sample_intr_regs);
> >>                 mask = arch__intr_reg_mask();
> >> -       else
> >> +               simd_mask = arch__intr_simd_reg_mask();
> >> +               pred_mask = arch__intr_pred_reg_mask();
> >> +       } else {
> >> +               opts = container_of(opt->value, struct record_opts, sample_user_regs);
> >>                 mask = arch__user_reg_mask();
> >> +               simd_mask = arch__user_simd_reg_mask();
> >> +               pred_mask = arch__user_pred_reg_mask();
> >> +       }
> >>
> >>         /* str may be NULL in case no arg is passed to -I */
> >>         if (str) {
> >> @@ -50,10 +177,24 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>                                         if (r->mask & mask)
> >>                                                 fprintf(stderr, "%s ", r->name);
> >>                                 }
> >> +                               __print_simd_regs(intr, simd_mask);
> >> +                               __print_pred_regs(intr, pred_mask);
> >>                                 fputc('\n', stderr);
> >>                                 /* just printing available regs */
> >>                                 goto error;
> >>                         }
> >> +
> >> +                       if (simd_mask) {
> >> +                               has_simd_regs = __parse_simd_regs(opts, s, intr);
> >> +                               if (has_simd_regs)
> >> +                                       goto next;
> >> +                       }
> >> +                       if (pred_mask) {
> >> +                               has_simd_regs = __parse_pred_regs(opts, s, intr);
> >> +                               if (has_simd_regs)
> >> +                                       goto next;
> >> +                       }
> >> +
> >>                         for (r = arch__sample_reg_masks(); r->name; r++) {
> >>                                 if ((r->mask & mask) && !strcasecmp(s, r->name))
> >>                                         break;
> >> @@ -65,7 +206,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>                         }
> >>
> >>                         *mode |= r->mask;
> >> -
> >> +next:
> >>                         if (!p)
> >>                                 break;
> >>
> >> @@ -75,7 +216,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>         ret = 0;
> >>
> >>         /* default to all possible regs */
> >> -       if (*mode == 0)
> >> +       if (*mode == 0 && !has_simd_regs)
> >>                 *mode = mask;
> >>  error:
> >>         free(os);
> >> diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
> >> index 66b666d9ce64..fb0366d050cf 100644
> >> --- a/tools/perf/util/perf_event_attr_fprintf.c
> >> +++ b/tools/perf/util/perf_event_attr_fprintf.c
> >> @@ -360,6 +360,12 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
> >>         PRINT_ATTRf(aux_start_paused, p_unsigned);
> >>         PRINT_ATTRf(aux_pause, p_unsigned);
> >>         PRINT_ATTRf(aux_resume, p_unsigned);
> >> +       PRINT_ATTRf(sample_simd_pred_reg_qwords, p_unsigned);
> >> +       PRINT_ATTRf(sample_simd_pred_reg_intr, p_hex);
> >> +       PRINT_ATTRf(sample_simd_pred_reg_user, p_hex);
> >> +       PRINT_ATTRf(sample_simd_vec_reg_qwords, p_unsigned);
> >> +       PRINT_ATTRf(sample_simd_vec_reg_intr, p_hex);
> >> +       PRINT_ATTRf(sample_simd_vec_reg_user, p_hex);
> >>
> >>         return ret;
> >>  }
> >> diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
> >> index 44b90bbf2d07..e8a9fabc92e6 100644
> >> --- a/tools/perf/util/perf_regs.c
> >> +++ b/tools/perf/util/perf_regs.c
> >> @@ -11,6 +11,11 @@ int __weak arch_sdt_arg_parse_op(char *old_op __maybe_unused,
> >>         return SDT_ARG_SKIP;
> >>  }
> >>
> >> +bool __weak arch_has_simd_regs(u64 mask __maybe_unused)
> >> +{
> >> +       return false;
> >> +}
> >> +
> >>  uint64_t __weak arch__intr_reg_mask(void)
> >>  {
> >>         return 0;
> >> @@ -21,6 +26,50 @@ uint64_t __weak arch__user_reg_mask(void)
> >>         return 0;
> >>  }
> >>
> >> +uint64_t __weak arch__intr_simd_reg_mask(void)
> >> +{
> >> +       return 0;
> >> +}
> >> +
> >> +uint64_t __weak arch__user_simd_reg_mask(void)
> >> +{
> >> +       return 0;
> >> +}
> >> +
> >> +uint64_t __weak arch__intr_pred_reg_mask(void)
> >> +{
> >> +       return 0;
> >> +}
> >> +
> >> +uint64_t __weak arch__user_pred_reg_mask(void)
> >> +{
> >> +       return 0;
> >> +}
> >> +
> >> +uint64_t __weak arch__intr_simd_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
> >> +{
> >> +       *qwords = 0;
> >> +       return 0;
> >> +}
> >> +
> >> +uint64_t __weak arch__user_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> >> +{
> >> +       *qwords = 0;
> >> +       return 0;
> >> +}
> >> +
> >> +uint64_t __weak arch__intr_pred_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
> >> +{
> >> +       *qwords = 0;
> >> +       return 0;
> >> +}
> >> +
> >> +uint64_t __weak arch__user_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> >> +{
> >> +       *qwords = 0;
> >> +       return 0;
> >> +}
> >> +
> >>  static const struct sample_reg sample_reg_masks[] = {
> >>         SMPL_REG_END
> >>  };
> >> @@ -30,6 +79,16 @@ const struct sample_reg * __weak arch__sample_reg_masks(void)
> >>         return sample_reg_masks;
> >>  }
> >>
> >> +const struct sample_reg * __weak arch__sample_simd_reg_masks(void)
> >> +{
> >> +       return sample_reg_masks;
> >> +}
> >> +
> >> +const struct sample_reg * __weak arch__sample_pred_reg_masks(void)
> >> +{
> >> +       return sample_reg_masks;
> >> +}
> > Thinking out loud. I wonder if there is a way to hide the weak
> > functions. It seems the support is tied to PMUs, particularly core
> > PMUs, perhaps we can push things into pmu and arch pmu code. Then we
> > ask the PMU to parse the register strings, set up the perf_event_attr,
> > etc. I'm somewhat scared these functions will be used on the report
> > rather than record side of things, thereby breaking perf.data support
> > when the host kernel does or doesn't have the SIMD support.
>
> Ian, I don't quite follow you.
>
> I don't quite understand what we should do to "push things into pmu and
> arch pmu code". The current SIMD register support follows the same
> approach as the general register support. If we intend to change the
> approach entirely, we'd better do it in an independent patch-set.
>
> Why would these functions break perf.data reporting? perf-report checks
> whether the PERF_SAMPLE_REGS_ABI_SIMD flag is set for each record; only
> when the flag is set (indicating that SIMD register data is appended to
> the record) does perf-report try to parse the SIMD register data.

Thanks Dapeng, sorry I wasn't clear. So, I've landed clean ups to
remove weak symbols like:
https://lore.kernel.org/lkml/20250724163302.596743-21-irogers@google.com/#t

For these patches what I'm imagining is that there is a Nova Lake
generated perf.data file. Using perf report, script, etc. on the Nova
Lake should expose all of the same mask, qword, etc. values as when
the perf.data was generated and so things will work. If the perf.data
file was taken to, say, my Alderlake, then what will happen? Generally
using the arch directory and weak symbols is a code smell that cross
platform things are going to break - there should be sufficient data
in the event and the perf_event_attr to fully decode what's going on.
Sometimes tying things to a PMU name can avoid the use of the arch
directory. We were able to avoid the arch directory to a good extent
for the TPEBS code, even though it is a very modern Intel feature.

Thanks,
Ian



> >
> > Thanks,
> > Ian
> >
> >> +
> >>  const char *perf_reg_name(int id, const char *arch)
> >>  {
> >>         const char *reg_name = NULL;
> >> diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
> >> index f2d0736d65cc..bce9c4cfd1bf 100644
> >> --- a/tools/perf/util/perf_regs.h
> >> +++ b/tools/perf/util/perf_regs.h
> >> @@ -24,9 +24,20 @@ enum {
> >>  };
> >>
> >>  int arch_sdt_arg_parse_op(char *old_op, char **new_op);
> >> +bool arch_has_simd_regs(u64 mask);
> >>  uint64_t arch__intr_reg_mask(void);
> >>  uint64_t arch__user_reg_mask(void);
> >>  const struct sample_reg *arch__sample_reg_masks(void);
> >> +const struct sample_reg *arch__sample_simd_reg_masks(void);
> >> +const struct sample_reg *arch__sample_pred_reg_masks(void);
> >> +uint64_t arch__intr_simd_reg_mask(void);
> >> +uint64_t arch__user_simd_reg_mask(void);
> >> +uint64_t arch__intr_pred_reg_mask(void);
> >> +uint64_t arch__user_pred_reg_mask(void);
> >> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>
> >>  const char *perf_reg_name(int id, const char *arch);
> >>  int perf_reg_value(u64 *valp, struct regs_dump *regs, int id);
> >> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
> >> index ea3a6c4657ee..825ffb4cc53f 100644
> >> --- a/tools/perf/util/record.h
> >> +++ b/tools/perf/util/record.h
> >> @@ -59,7 +59,13 @@ struct record_opts {
> >>         unsigned int  user_freq;
> >>         u64           branch_stack;
> >>         u64           sample_intr_regs;
> >> +       u64           sample_intr_vec_regs;
> >>         u64           sample_user_regs;
> >> +       u64           sample_user_vec_regs;
> >> +       u16           sample_pred_regs_qwords;
> >> +       u16           sample_vec_regs_qwords;
> >> +       u16           sample_intr_pred_regs;
> >> +       u16           sample_user_pred_regs;
> >>         u64           default_interval;
> >>         u64           user_interval;
> >>         size_t        auxtrace_snapshot_size;
> >> --
> >> 2.34.1
> >>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
  2025-12-04  7:49       ` Ian Rogers
@ 2025-12-04  9:20         ` Mi, Dapeng
  2025-12-04 16:16           ` Ian Rogers
  0 siblings, 1 reply; 86+ messages in thread
From: Mi, Dapeng @ 2025-12-04  9:20 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang


On 12/4/2025 3:49 PM, Ian Rogers wrote:
> On Wed, Dec 3, 2025 at 6:58 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>
>> On 12/4/2025 8:17 AM, Ian Rogers wrote:
>>> On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
>>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>>
>>>> This patch adds support for the newly introduced SIMD register sampling
>>>> format by adding the following functions:
>>>>
>>>> uint64_t arch__intr_simd_reg_mask(void);
>>>> uint64_t arch__user_simd_reg_mask(void);
>>>> uint64_t arch__intr_pred_reg_mask(void);
>>>> uint64_t arch__user_pred_reg_mask(void);
>>>> uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>> uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>> uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>> uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>
>>>> The arch__{intr|user}_simd_reg_mask() functions retrieve the bitmap of
>>>> supported SIMD registers, such as XMM/YMM/ZMM on x86 platforms.
>>>>
>>>> The arch__{intr|user}_pred_reg_mask() functions retrieve the bitmap of
>>>> supported PRED registers, such as OPMASK on x86 platforms.
>>>>
>>>> The arch__{intr|user}_simd_reg_bitmap_qwords() functions provide the
>>>> exact bitmap and number of qwords for a specific type of SIMD register.
>>>> For example, for XMM registers on x86 platforms, the returned bitmap is
>>>> 0xffff (XMM0 ~ XMM15) and the qwords number is 2 (128 bits for each XMM).
>>>>
>>>> The arch__{intr|user}_pred_reg_bitmap_qwords() functions provide the
>>>> exact bitmap and number of qwords for a specific type of PRED register.
>>>> For example, for OPMASK registers on x86 platforms, the returned bitmap
>>>> is 0xff (OPMASK0 ~ OPMASK7) and the qwords number is 1 (64 bits for each
>>>> OPMASK).
>>>>
>>>> Additionally, the function __parse_regs() is enhanced to support parsing
>>>> these newly introduced SIMD registers. Currently, each type of register
>>>> can only be sampled collectively; sampling a specific SIMD register is
>>>> not supported. For example, all XMM registers are sampled together rather
>>>> than sampling only XMM0.
>>>>
>>>> When multiple overlapping register types, such as XMM and YMM, are
>>>> sampled simultaneously, only the superset (YMM registers) is sampled.
>>>>
>>>> With this patch, all supported sampling registers on x86 platforms are
>>>> displayed as follows.
>>>>
>>>>  $perf record -I?
>>>>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>>>>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>>>>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>>>>
>>>>  $perf record --user-regs=?
>>>>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>>>>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>>>>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>>>>
>>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>>>> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>>> ---
>>>>  tools/perf/arch/x86/util/perf_regs.c      | 470 +++++++++++++++++++++-
>>>>  tools/perf/util/evsel.c                   |  27 ++
>>>>  tools/perf/util/parse-regs-options.c      | 151 ++++++-
>>>>  tools/perf/util/perf_event_attr_fprintf.c |   6 +
>>>>  tools/perf/util/perf_regs.c               |  59 +++
>>>>  tools/perf/util/perf_regs.h               |  11 +
>>>>  tools/perf/util/record.h                  |   6 +
>>>>  7 files changed, 714 insertions(+), 16 deletions(-)
>>>>
>>>> diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
>>>> index 12fd93f04802..db41430f3b07 100644
>>>> --- a/tools/perf/arch/x86/util/perf_regs.c
>>>> +++ b/tools/perf/arch/x86/util/perf_regs.c
>>>> @@ -13,6 +13,49 @@
>>>>  #include "../../../util/pmu.h"
>>>>  #include "../../../util/pmus.h"
>>>>
>>>> +static const struct sample_reg sample_reg_masks_ext[] = {
>>>> +       SMPL_REG(AX, PERF_REG_X86_AX),
>>>> +       SMPL_REG(BX, PERF_REG_X86_BX),
>>>> +       SMPL_REG(CX, PERF_REG_X86_CX),
>>>> +       SMPL_REG(DX, PERF_REG_X86_DX),
>>>> +       SMPL_REG(SI, PERF_REG_X86_SI),
>>>> +       SMPL_REG(DI, PERF_REG_X86_DI),
>>>> +       SMPL_REG(BP, PERF_REG_X86_BP),
>>>> +       SMPL_REG(SP, PERF_REG_X86_SP),
>>>> +       SMPL_REG(IP, PERF_REG_X86_IP),
>>>> +       SMPL_REG(FLAGS, PERF_REG_X86_FLAGS),
>>>> +       SMPL_REG(CS, PERF_REG_X86_CS),
>>>> +       SMPL_REG(SS, PERF_REG_X86_SS),
>>>> +#ifdef HAVE_ARCH_X86_64_SUPPORT
>>>> +       SMPL_REG(R8, PERF_REG_X86_R8),
>>>> +       SMPL_REG(R9, PERF_REG_X86_R9),
>>>> +       SMPL_REG(R10, PERF_REG_X86_R10),
>>>> +       SMPL_REG(R11, PERF_REG_X86_R11),
>>>> +       SMPL_REG(R12, PERF_REG_X86_R12),
>>>> +       SMPL_REG(R13, PERF_REG_X86_R13),
>>>> +       SMPL_REG(R14, PERF_REG_X86_R14),
>>>> +       SMPL_REG(R15, PERF_REG_X86_R15),
>>>> +       SMPL_REG(R16, PERF_REG_X86_R16),
>>>> +       SMPL_REG(R17, PERF_REG_X86_R17),
>>>> +       SMPL_REG(R18, PERF_REG_X86_R18),
>>>> +       SMPL_REG(R19, PERF_REG_X86_R19),
>>>> +       SMPL_REG(R20, PERF_REG_X86_R20),
>>>> +       SMPL_REG(R21, PERF_REG_X86_R21),
>>>> +       SMPL_REG(R22, PERF_REG_X86_R22),
>>>> +       SMPL_REG(R23, PERF_REG_X86_R23),
>>>> +       SMPL_REG(R24, PERF_REG_X86_R24),
>>>> +       SMPL_REG(R25, PERF_REG_X86_R25),
>>>> +       SMPL_REG(R26, PERF_REG_X86_R26),
>>>> +       SMPL_REG(R27, PERF_REG_X86_R27),
>>>> +       SMPL_REG(R28, PERF_REG_X86_R28),
>>>> +       SMPL_REG(R29, PERF_REG_X86_R29),
>>>> +       SMPL_REG(R30, PERF_REG_X86_R30),
>>>> +       SMPL_REG(R31, PERF_REG_X86_R31),
>>>> +       SMPL_REG(SSP, PERF_REG_X86_SSP),
>>>> +#endif
>>>> +       SMPL_REG_END
>>>> +};
>>>> +
>>>>  static const struct sample_reg sample_reg_masks[] = {
>>>>         SMPL_REG(AX, PERF_REG_X86_AX),
>>>>         SMPL_REG(BX, PERF_REG_X86_BX),
>>>> @@ -276,27 +319,404 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
>>>>         return SDT_ARG_VALID;
>>>>  }
>>>>
>>>> +static bool support_simd_reg(u64 sample_type, u16 qwords, u64 mask, bool pred)
>>> To make the code easier to read, it'd be nice to document sample_type,
>>> qwords and mask here.
>> Sure.
>>
>>
>>>> +{
>>>> +       struct perf_event_attr attr = {
>>>> +               .type                           = PERF_TYPE_HARDWARE,
>>>> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
>>>> +               .sample_type                    = sample_type,
>>>> +               .disabled                       = 1,
>>>> +               .exclude_kernel                 = 1,
>>>> +               .sample_simd_regs_enabled       = 1,
>>>> +       };
>>>> +       int fd;
>>>> +
>>>> +       attr.sample_period = 1;
>>>> +
>>>> +       if (!pred) {
>>>> +               attr.sample_simd_vec_reg_qwords = qwords;
>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>> +                       attr.sample_simd_vec_reg_intr = mask;
>>>> +               else
>>>> +                       attr.sample_simd_vec_reg_user = mask;
>>>> +       } else {
>>>> +               attr.sample_simd_pred_reg_qwords = PERF_X86_OPMASK_QWORDS;
>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>> +                       attr.sample_simd_pred_reg_intr = PERF_X86_SIMD_PRED_MASK;
>>>> +               else
>>>> +                       attr.sample_simd_pred_reg_user = PERF_X86_SIMD_PRED_MASK;
>>>> +       }
>>>> +
>>>> +       if (perf_pmus__num_core_pmus() > 1) {
>>>> +               struct perf_pmu *pmu = NULL;
>>>> +               __u64 type = PERF_TYPE_RAW;
>>> It should be okay to do:
>>> __u64 type = perf_pmus__find_core_pmu()->type
>>> rather than have the whole loop below.
>> Sure. Thanks.
>>
>>
>>>> +
>>>> +               /*
>>>> +                * The same register set is supported among different hybrid PMUs.
>>>> +                * Only check the first available one.
>>>> +                */
>>>> +               while ((pmu = perf_pmus__scan_core(pmu)) != NULL) {
>>>> +                       type = pmu->type;
>>>> +                       break;
>>>> +               }
>>>> +               attr.config |= type << PERF_PMU_TYPE_SHIFT;
>>>> +       }
>>>> +
>>>> +       event_attr_init(&attr);
>>>> +
>>>> +       fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
>>>> +       if (fd != -1) {
>>>> +               close(fd);
>>>> +               return true;
>>>> +       }
>>>> +
>>>> +       return false;
>>>> +}
>>>> +
>>>> +static bool __arch_simd_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
>>>> +{
>>>> +       bool supported = false;
>>>> +       u64 bits;
>>>> +
>>>> +       *mask = 0;
>>>> +       *qwords = 0;
>>>> +
>>>> +       switch (reg) {
>>>> +       case PERF_REG_X86_XMM:
>>>> +               bits = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
>>>> +               supported = support_simd_reg(sample_type, PERF_X86_XMM_QWORDS, bits, false);
>>>> +               if (supported) {
>>>> +                       *mask = bits;
>>>> +                       *qwords = PERF_X86_XMM_QWORDS;
>>>> +               }
>>>> +               break;
>>>> +       case PERF_REG_X86_YMM:
>>>> +               bits = BIT_ULL(PERF_X86_SIMD_YMM_REGS) - 1;
>>>> +               supported = support_simd_reg(sample_type, PERF_X86_YMM_QWORDS, bits, false);
>>>> +               if (supported) {
>>>> +                       *mask = bits;
>>>> +                       *qwords = PERF_X86_YMM_QWORDS;
>>>> +               }
>>>> +               break;
>>>> +       case PERF_REG_X86_ZMM:
>>>> +               bits = BIT_ULL(PERF_X86_SIMD_ZMM_REGS) - 1;
>>>> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
>>>> +               if (supported) {
>>>> +                       *mask = bits;
>>>> +                       *qwords = PERF_X86_ZMM_QWORDS;
>>>> +                       break;
>>>> +               }
>>>> +
>>>> +               bits = BIT_ULL(PERF_X86_SIMD_ZMMH_REGS) - 1;
>>>> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
>>>> +               if (supported) {
>>>> +                       *mask = bits;
>>>> +                       *qwords = PERF_X86_ZMMH_QWORDS;
>>>> +               }
>>>> +               break;
>>>> +       default:
>>>> +               break;
>>>> +       }
>>>> +
>>>> +       return supported;
>>>> +}
>>>> +
>>>> +static bool __arch_pred_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
>>>> +{
>>>> +       bool supported = false;
>>>> +       u64 bits;
>>>> +
>>>> +       *mask = 0;
>>>> +       *qwords = 0;
>>>> +
>>>> +       switch (reg) {
>>>> +       case PERF_REG_X86_OPMASK:
>>>> +               bits = BIT_ULL(PERF_X86_SIMD_OPMASK_REGS) - 1;
>>>> +               supported = support_simd_reg(sample_type, PERF_X86_OPMASK_QWORDS, bits, true);
>>>> +               if (supported) {
>>>> +                       *mask = bits;
>>>> +                       *qwords = PERF_X86_OPMASK_QWORDS;
>>>> +               }
>>>> +               break;
>>>> +       default:
>>>> +               break;
>>>> +       }
>>>> +
>>>> +       return supported;
>>>> +}
>>>> +
>>>> +static bool has_cap_simd_regs(void)
>>>> +{
>>>> +       uint64_t mask = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
>>>> +       u16 qwords = PERF_X86_XMM_QWORDS;
>>>> +       static bool has_cap_simd_regs;
>>>> +       static bool cached;
>>>> +
>>>> +       if (cached)
>>>> +               return has_cap_simd_regs;
>>>> +
>>>> +       has_cap_simd_regs = __arch_simd_reg_mask(PERF_SAMPLE_REGS_INTR,
>>>> +                                                PERF_REG_X86_XMM, &mask, &qwords);
>>>> +       has_cap_simd_regs |= __arch_simd_reg_mask(PERF_SAMPLE_REGS_USER,
>>>> +                                                PERF_REG_X86_XMM, &mask, &qwords);
>>>> +       cached = true;
>>>> +
>>>> +       return has_cap_simd_regs;
>>>> +}
>>>> +
>>>> +bool arch_has_simd_regs(u64 mask)
>>>> +{
>>>> +       return has_cap_simd_regs() &&
>>>> +              mask & GENMASK_ULL(PERF_REG_X86_SSP, PERF_REG_X86_R16);
>>>> +}
>>>> +
>>>> +static const struct sample_reg sample_simd_reg_masks[] = {
>>>> +       SMPL_REG(XMM, PERF_REG_X86_XMM),
>>>> +       SMPL_REG(YMM, PERF_REG_X86_YMM),
>>>> +       SMPL_REG(ZMM, PERF_REG_X86_ZMM),
>>>> +       SMPL_REG_END
>>>> +};
>>>> +
>>>> +static const struct sample_reg sample_pred_reg_masks[] = {
>>>> +       SMPL_REG(OPMASK, PERF_REG_X86_OPMASK),
>>>> +       SMPL_REG_END
>>>> +};
>>>> +
>>>> +const struct sample_reg *arch__sample_simd_reg_masks(void)
>>>> +{
>>>> +       return sample_simd_reg_masks;
>>>> +}
>>>> +
>>>> +const struct sample_reg *arch__sample_pred_reg_masks(void)
>>>> +{
>>>> +       return sample_pred_reg_masks;
>>>> +}
>>>> +
>>>> +static bool x86_intr_simd_updated;
>>>> +static u64 x86_intr_simd_reg_mask;
>>>> +static u64 x86_intr_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
>>>> +static u16 x86_intr_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
>>> Could we add some comments? I can kind of figure out that "updated" is
>>> a lazy-initialization flag and what the masks are, but "qwords" is an
>>> odd one. The comment could also point out that SIMD here doesn't mean
>>> the machine supports SIMD, but that SIMD registers are supported in
>>> perf events.
>> Sure.
>>
>>
>>>> +static bool x86_user_simd_updated;
>>>> +static u64 x86_user_simd_reg_mask;
>>>> +static u64 x86_user_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
>>>> +static u16 x86_user_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
>>>> +
>>>> +static bool x86_intr_pred_updated;
>>>> +static u64 x86_intr_pred_reg_mask;
>>>> +static u64 x86_intr_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
>>>> +static u16 x86_intr_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
>>>> +static bool x86_user_pred_updated;
>>>> +static u64 x86_user_pred_reg_mask;
>>>> +static u64 x86_user_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
>>>> +static u16 x86_user_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
>>>> +
>>>> +static uint64_t __arch__simd_reg_mask(u64 sample_type)
>>>> +{
>>>> +       const struct sample_reg *r = NULL;
>>>> +       bool supported;
>>>> +       u64 mask = 0;
>>>> +       int reg;
>>>> +
>>>> +       if (!has_cap_simd_regs())
>>>> +               return 0;
>>>> +
>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_simd_updated)
>>>> +               return x86_intr_simd_reg_mask;
>>>> +
>>>> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_simd_updated)
>>>> +               return x86_user_simd_reg_mask;
>>>> +
>>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>>>> +               supported = false;
>>>> +
>>>> +               if (!r->mask)
>>>> +                       continue;
>>>> +               reg = fls64(r->mask) - 1;
>>>> +
>>>> +               if (reg >= PERF_REG_X86_MAX_SIMD_REGS)
>>>> +                       break;
>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>> +                       supported = __arch_simd_reg_mask(sample_type, reg,
>>>> +                                                        &x86_intr_simd_mask[reg],
>>>> +                                                        &x86_intr_simd_qwords[reg]);
>>>> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
>>>> +                       supported = __arch_simd_reg_mask(sample_type, reg,
>>>> +                                                        &x86_user_simd_mask[reg],
>>>> +                                                        &x86_user_simd_qwords[reg]);
>>>> +               if (supported)
>>>> +                       mask |= BIT_ULL(reg);
>>>> +       }
>>>> +
>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
>>>> +               x86_intr_simd_reg_mask = mask;
>>>> +               x86_intr_simd_updated = true;
>>>> +       } else {
>>>> +               x86_user_simd_reg_mask = mask;
>>>> +               x86_user_simd_updated = true;
>>>> +       }
>>>> +
>>>> +       return mask;
>>>> +}
>>>> +
>>>> +static uint64_t __arch__pred_reg_mask(u64 sample_type)
>>>> +{
>>>> +       const struct sample_reg *r = NULL;
>>>> +       bool supported;
>>>> +       u64 mask = 0;
>>>> +       int reg;
>>>> +
>>>> +       if (!has_cap_simd_regs())
>>>> +               return 0;
>>>> +
>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_pred_updated)
>>>> +               return x86_intr_pred_reg_mask;
>>>> +
>>>> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_pred_updated)
>>>> +               return x86_user_pred_reg_mask;
>>>> +
>>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>>>> +               supported = false;
>>>> +
>>>> +               if (!r->mask)
>>>> +                       continue;
>>>> +               reg = fls64(r->mask) - 1;
>>>> +
>>>> +               if (reg >= PERF_REG_X86_MAX_PRED_REGS)
>>>> +                       break;
>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>> +                       supported = __arch_pred_reg_mask(sample_type, reg,
>>>> +                                                        &x86_intr_pred_mask[reg],
>>>> +                                                        &x86_intr_pred_qwords[reg]);
>>>> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
>>>> +                       supported = __arch_pred_reg_mask(sample_type, reg,
>>>> +                                                        &x86_user_pred_mask[reg],
>>>> +                                                        &x86_user_pred_qwords[reg]);
>>>> +               if (supported)
>>>> +                       mask |= BIT_ULL(reg);
>>>> +       }
>>>> +
>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
>>>> +               x86_intr_pred_reg_mask = mask;
>>>> +               x86_intr_pred_updated = true;
>>>> +       } else {
>>>> +               x86_user_pred_reg_mask = mask;
>>>> +               x86_user_pred_updated = true;
>>>> +       }
>>>> +
>>>> +       return mask;
>>>> +}
>>> This feels repetitive with __arch__simd_reg_mask, could they be
>>> refactored together?
>> Hmm, it looks like we can extract the for loop into a common helper. The
>> other parts are hard to generalize since they manipulate different
>> variables; unifying them would need lots of "if ... else" branches and
>> make the code harder to read.
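For illustration, the shared loop could look roughly like this. It is a self-contained sketch, not the patch's code: the callback type, `highest_bit()`, `collect_reg_mask()` and `toy_probe()` names are invented, and `toy_probe()` only exists so the sketch runs on its own.

```c
#include <stdint.h>
#include <stdbool.h>

/* Stand-in for the perf-tool type used by the SMPL_REG tables. */
struct sample_reg { const char *name; uint64_t mask; };

typedef bool (*reg_probe_fn)(uint64_t sample_type, int reg,
			     uint64_t *mask, uint16_t *qwords);

/* fls64(mask) - 1 without the kernel helper; caller guarantees mask != 0. */
static int highest_bit(uint64_t mask)
{
	return 63 - __builtin_clzll(mask);
}

/* Walk a SMPL_REG-style table, probe each register group, and build a
 * bitmap of the supported groups while filling per-group mask/qwords. */
static uint64_t collect_reg_mask(const struct sample_reg *regs, int max_regs,
				 uint64_t sample_type, reg_probe_fn probe,
				 uint64_t *masks, uint16_t *qwords)
{
	uint64_t supported = 0;

	for (const struct sample_reg *r = regs; r->name; r++) {
		if (!r->mask)
			continue;
		int reg = highest_bit(r->mask);

		if (reg >= max_regs)
			break;
		if (probe(sample_type, reg, &masks[reg], &qwords[reg]))
			supported |= 1ULL << reg;
	}
	return supported;
}

/* Toy probe for demonstration only: pretend group 1 is supported, with
 * 16 registers of 2 qwords each. */
static bool toy_probe(uint64_t sample_type, int reg,
		      uint64_t *mask, uint16_t *qwords)
{
	(void)sample_type;
	*mask = 0;
	*qwords = 0;
	if (reg != 1)
		return false;
	*mask = 0xffff;
	*qwords = 2;
	return true;
}
```

The intr and user callers would then differ only in which probe and which per-type arrays they pass in, which is the part that is hard to fold further without "if ... else" noise.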
>>
>>
>>>> +
>>>> +uint64_t arch__intr_simd_reg_mask(void)
>>>> +{
>>>> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_INTR);
>>>> +}
>>>> +
>>>> +uint64_t arch__user_simd_reg_mask(void)
>>>> +{
>>>> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_USER);
>>>> +}
>>>> +
>>>> +uint64_t arch__intr_pred_reg_mask(void)
>>>> +{
>>>> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_INTR);
>>>> +}
>>>> +
>>>> +uint64_t arch__user_pred_reg_mask(void)
>>>> +{
>>>> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_USER);
>>>> +}
>>>> +
>>>> +static uint64_t arch__simd_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
>>>> +{
>>>> +       uint64_t mask = 0;
>>>> +
>>>> +       *qwords = 0;
>>>> +       if (reg < PERF_REG_X86_MAX_SIMD_REGS) {
>>>> +               if (intr) {
>>>> +                       *qwords = x86_intr_simd_qwords[reg];
>>>> +                       mask = x86_intr_simd_mask[reg];
>>>> +               } else {
>>>> +                       *qwords = x86_user_simd_qwords[reg];
>>>> +                       mask = x86_user_simd_mask[reg];
>>>> +               }
>>>> +       }
>>>> +
>>>> +       return mask;
>>>> +}
>>>> +
>>>> +static uint64_t arch__pred_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
>>>> +{
>>>> +       uint64_t mask = 0;
>>>> +
>>>> +       *qwords = 0;
>>>> +       if (reg < PERF_REG_X86_MAX_PRED_REGS) {
>>>> +               if (intr) {
>>>> +                       *qwords = x86_intr_pred_qwords[reg];
>>>> +                       mask = x86_intr_pred_mask[reg];
>>>> +               } else {
>>>> +                       *qwords = x86_user_pred_qwords[reg];
>>>> +                       mask = x86_user_pred_mask[reg];
>>>> +               }
>>>> +       }
>>>> +
>>>> +       return mask;
>>>> +}
>>>> +
>>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords)
>>>> +{
>>>> +       if (!x86_intr_simd_updated)
>>>> +               arch__intr_simd_reg_mask();
>>>> +       return arch__simd_reg_bitmap_qwords(reg, qwords, true);
>>>> +}
>>>> +
>>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords)
>>>> +{
>>>> +       if (!x86_user_simd_updated)
>>>> +               arch__user_simd_reg_mask();
>>>> +       return arch__simd_reg_bitmap_qwords(reg, qwords, false);
>>>> +}
>>>> +
>>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords)
>>>> +{
>>>> +       if (!x86_intr_pred_updated)
>>>> +               arch__intr_pred_reg_mask();
>>>> +       return arch__pred_reg_bitmap_qwords(reg, qwords, true);
>>>> +}
>>>> +
>>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords)
>>>> +{
>>>> +       if (!x86_user_pred_updated)
>>>> +               arch__user_pred_reg_mask();
>>>> +       return arch__pred_reg_bitmap_qwords(reg, qwords, false);
>>>> +}
>>>> +
>>>>  const struct sample_reg *arch__sample_reg_masks(void)
>>>>  {
>>>> +       if (has_cap_simd_regs())
>>>> +               return sample_reg_masks_ext;
>>>>         return sample_reg_masks;
>>>>  }
>>>>
>>>> -uint64_t arch__intr_reg_mask(void)
>>>> +static uint64_t __arch__reg_mask(u64 sample_type, u64 mask, bool has_simd_regs)
>>>>  {
>>>>         struct perf_event_attr attr = {
>>>> -               .type                   = PERF_TYPE_HARDWARE,
>>>> -               .config                 = PERF_COUNT_HW_CPU_CYCLES,
>>>> -               .sample_type            = PERF_SAMPLE_REGS_INTR,
>>>> -               .sample_regs_intr       = PERF_REG_EXTENDED_MASK,
>>>> -               .precise_ip             = 1,
>>>> -               .disabled               = 1,
>>>> -               .exclude_kernel         = 1,
>>>> +               .type                           = PERF_TYPE_HARDWARE,
>>>> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
>>>> +               .sample_type                    = sample_type,
>>>> +               .precise_ip                     = 1,
>>>> +               .disabled                       = 1,
>>>> +               .exclude_kernel                 = 1,
>>>> +               .sample_simd_regs_enabled       = has_simd_regs,
>>>>         };
>>>>         int fd;
>>>>         /*
>>>>          * In an unnamed union, init it here to build on older gcc versions
>>>>          */
>>>>         attr.sample_period = 1;
>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>> +               attr.sample_regs_intr = mask;
>>>> +       else
>>>> +               attr.sample_regs_user = mask;
>>>>
>>>>         if (perf_pmus__num_core_pmus() > 1) {
>>>>                 struct perf_pmu *pmu = NULL;
>>>> @@ -318,13 +738,41 @@ uint64_t arch__intr_reg_mask(void)
>>>>         fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
>>>>         if (fd != -1) {
>>>>                 close(fd);
>>>> -               return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
>>>> +               return mask;
>>>>         }
>>>>
>>>> -       return PERF_REGS_MASK;
>>>> +       return 0;
>>>> +}
>>>> +
>>>> +uint64_t arch__intr_reg_mask(void)
>>>> +{
>>>> +       uint64_t mask = PERF_REGS_MASK;
>>>> +
>>>> +       if (has_cap_simd_regs()) {
>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
>>>> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
>>>> +                                        true);
>>> It's nice to label constant arguments like this with something like:
>>> /*has_simd_regs=*/true);
>>>
>>> Tools like clang-tidy even try to enforce the argument names match the comments.
>> Sure.
>>
>>
>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
>>>> +                                        BIT_ULL(PERF_REG_X86_SSP),
>>>> +                                        true);
>>>> +       } else
>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR, PERF_REG_EXTENDED_MASK, false);
>>>> +
>>>> +       return mask;
>>>>  }
>>>>
>>>>  uint64_t arch__user_reg_mask(void)
>>>>  {
>>>> -       return PERF_REGS_MASK;
>>>> +       uint64_t mask = PERF_REGS_MASK;
>>>> +
>>>> +       if (has_cap_simd_regs()) {
>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
>>>> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
>>>> +                                        true);
>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
>>>> +                                        BIT_ULL(PERF_REG_X86_SSP),
>>>> +                                        true);
>>>> +       }
>>>> +
>>>> +       return mask;
>>> The code is repetitive here, could we refactor into a single function
>>> passing in a user or instr value?
>> Sure. Would extract the common part.
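A possible shape for that common part, sketched self-contained: the probe stands in for the throwaway sys_perf_event_open() check, and every `SKETCH_*` constant and function name here is an invented placeholder, not a real perf symbol. The user-regs caller would simply skip the legacy extended-mask branch.

```c
#include <stdint.h>
#include <stdbool.h>

#define SKETCH_BASE_MASK	0x00000000000000ffULL	/* PERF_REGS_MASK stand-in */
#define SKETCH_EGPR_MASK	0x000000000000ff00ULL	/* R16..R31 stand-in */
#define SKETCH_SSP_MASK		0x0000000000010000ULL	/* SSP stand-in */
#define SKETCH_LEGACY_MASK	0x00000000000e0000ULL	/* PERF_REG_EXTENDED_MASK stand-in */

typedef bool (*mask_probe_fn)(uint64_t sample_type, uint64_t mask);

/* One body for both arch__intr_reg_mask() and arch__user_reg_mask():
 * probe eGPRs and SSP separately when SIMD regs are available, else
 * fall back to the legacy extended mask (intr side only). */
static uint64_t sketch_reg_mask(uint64_t sample_type, bool has_simd_regs,
				mask_probe_fn probe)
{
	uint64_t mask = SKETCH_BASE_MASK;

	if (has_simd_regs) {
		if (probe(sample_type, SKETCH_EGPR_MASK))
			mask |= SKETCH_EGPR_MASK;
		if (probe(sample_type, SKETCH_SSP_MASK))
			mask |= SKETCH_SSP_MASK;
	} else if (probe(sample_type, SKETCH_LEGACY_MASK)) {
		mask |= SKETCH_LEGACY_MASK;
	}
	return mask;
}

/* Demo probe: accept everything except SSP, as if the kernel rejected
 * that candidate event. */
static bool demo_probe(uint64_t sample_type, uint64_t mask)
{
	(void)sample_type;
	return mask != SKETCH_SSP_MASK;
}
```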
>>
>>
>>>>  }
>>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>>>> index 56ebefd075f2..5d1d90cf9488 100644
>>>> --- a/tools/perf/util/evsel.c
>>>> +++ b/tools/perf/util/evsel.c
>>>> @@ -1461,12 +1461,39 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
>>>>         if (opts->sample_intr_regs && !evsel->no_aux_samples &&
>>>>             !evsel__is_dummy_event(evsel)) {
>>>>                 attr->sample_regs_intr = opts->sample_intr_regs;
>>>> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_intr);
>>>> +               evsel__set_sample_bit(evsel, REGS_INTR);
>>>> +       }
>>>> +
>>>> +       if ((opts->sample_intr_vec_regs || opts->sample_intr_pred_regs) &&
>>>> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
>>>> +               /* A non-zero pred qwords implies the SIMD register set is in use */
>>>> +               if (opts->sample_pred_regs_qwords)
>>>> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
>>>> +               else
>>>> +                       attr->sample_simd_pred_reg_qwords = 1;
>>>> +               attr->sample_simd_vec_reg_intr = opts->sample_intr_vec_regs;
>>>> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
>>>> +               attr->sample_simd_pred_reg_intr = opts->sample_intr_pred_regs;
>>>>                 evsel__set_sample_bit(evsel, REGS_INTR);
>>>>         }
>>>>
>>>>         if (opts->sample_user_regs && !evsel->no_aux_samples &&
>>>>             !evsel__is_dummy_event(evsel)) {
>>>>                 attr->sample_regs_user |= opts->sample_user_regs;
>>>> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_user);
>>>> +               evsel__set_sample_bit(evsel, REGS_USER);
>>>> +       }
>>>> +
>>>> +       if ((opts->sample_user_vec_regs || opts->sample_user_pred_regs) &&
>>>> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
>>>> +               if (opts->sample_pred_regs_qwords)
>>>> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
>>>> +               else
>>>> +                       attr->sample_simd_pred_reg_qwords = 1;
>>>> +               attr->sample_simd_vec_reg_user = opts->sample_user_vec_regs;
>>>> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
>>>> +               attr->sample_simd_pred_reg_user = opts->sample_user_pred_regs;
>>>>                 evsel__set_sample_bit(evsel, REGS_USER);
>>>>         }
>>>>
>>>> diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
>>>> index cda1c620968e..0bd100392889 100644
>>>> --- a/tools/perf/util/parse-regs-options.c
>>>> +++ b/tools/perf/util/parse-regs-options.c
>>>> @@ -4,19 +4,139 @@
>>>>  #include <stdint.h>
>>>>  #include <string.h>
>>>>  #include <stdio.h>
>>>> +#include <linux/bitops.h>
>>>>  #include "util/debug.h"
>>>>  #include <subcmd/parse-options.h>
>>>>  #include "util/perf_regs.h"
>>>>  #include "util/parse-regs-options.h"
>>>> +#include "record.h"
>>>> +
>>>> +static void __print_simd_regs(bool intr, uint64_t simd_mask)
>>>> +{
>>>> +       const struct sample_reg *r = NULL;
>>>> +       uint64_t bitmap = 0;
>>>> +       u16 qwords = 0;
>>>> +       int reg_idx;
>>>> +
>>>> +       if (!simd_mask)
>>>> +               return;
>>>> +
>>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>>>> +               if (!(r->mask & simd_mask))
>>>> +                       continue;
>>>> +               reg_idx = fls64(r->mask) - 1;
>>>> +               if (intr)
>>>> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>> +               else
>>>> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>> +               if (bitmap)
>>>> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
>>>> +       }
>>>> +}
>>>> +
>>>> +static void __print_pred_regs(bool intr, uint64_t pred_mask)
>>>> +{
>>>> +       const struct sample_reg *r = NULL;
>>>> +       uint64_t bitmap = 0;
>>>> +       u16 qwords = 0;
>>>> +       int reg_idx;
>>>> +
>>>> +       if (!pred_mask)
>>>> +               return;
>>>> +
>>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>>>> +               if (!(r->mask & pred_mask))
>>>> +                       continue;
>>>> +               reg_idx = fls64(r->mask) - 1;
>>>> +               if (intr)
>>>> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>> +               else
>>>> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>> +               if (bitmap)
>>>> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
>>>> +       }
>>>> +}
>>>> +
>>>> +static bool __parse_simd_regs(struct record_opts *opts, char *s, bool intr)
>>>> +{
>>>> +       const struct sample_reg *r = NULL;
>>>> +       bool matched = false;
>>>> +       uint64_t bitmap = 0;
>>>> +       u16 qwords = 0;
>>>> +       int reg_idx;
>>>> +
>>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>>>> +               if (strcasecmp(s, r->name))
>>>> +                       continue;
>>>> +               if (!fls64(r->mask))
>>>> +                       continue;
>>>> +               reg_idx = fls64(r->mask) - 1;
>>>> +               if (intr)
>>>> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>> +               else
>>>> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>> +               matched = true;
>>>> +               break;
>>>> +       }
>>>> +
>>>> +       /* Just need the highest qwords */
>>> I'm not following here. Does the bitmap need to handle gaps?
>> Currently no. In theory the kernel allows user space to sample only a
>> subset of the SIMD registers, e.g., 0xff or 0xf0f for XMM registers (HW
>> provides 16 XMM registers), but that isn't supported here to avoid
>> introducing too much complexity in the perf tools. Moreover, I don't
>> think end users have such a requirement. In most cases users know which
>> kinds of SIMD registers their programs use, but usually don't know or
>> care which exact SIMD register is used.
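The "just need the highest qwords" rule from the hunk below can be modeled in isolation. This is a minimal sketch with invented struct and function names: across all requested vector-register groups, only the widest one sets the qwords, and its bitmap replaces any narrower group's bitmap.

```c
#include <stdint.h>

/* Stand-in for the vector-register fields of struct record_opts. */
struct vec_opts {
	uint64_t sample_vec_regs;
	uint16_t sample_vec_regs_qwords;
};

/* Keep only the widest requested group: a wider group (e.g. ZMM after
 * XMM) overrides both the qwords and the register bitmap; a narrower
 * one is ignored. */
static void update_vec_regs(struct vec_opts *opts,
			    uint64_t bitmap, uint16_t qwords)
{
	if (qwords > opts->sample_vec_regs_qwords) {
		opts->sample_vec_regs_qwords = qwords;
		opts->sample_vec_regs = bitmap;
	}
}
```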
>>
>>
>>>> +       if (qwords > opts->sample_vec_regs_qwords) {
>>>> +               opts->sample_vec_regs_qwords = qwords;
>>>> +               if (intr)
>>>> +                       opts->sample_intr_vec_regs = bitmap;
>>>> +               else
>>>> +                       opts->sample_user_vec_regs = bitmap;
>>>> +       }
>>>> +
>>>> +       return matched;
>>>> +}
>>>> +
>>>> +static bool __parse_pred_regs(struct record_opts *opts, char *s, bool intr)
>>>> +{
>>>> +       const struct sample_reg *r = NULL;
>>>> +       bool matched = false;
>>>> +       uint64_t bitmap = 0;
>>>> +       u16 qwords = 0;
>>>> +       int reg_idx;
>>>> +
>>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>>>> +               if (strcasecmp(s, r->name))
>>>> +                       continue;
>>>> +               if (!fls64(r->mask))
>>>> +                       continue;
>>>> +               reg_idx = fls64(r->mask) - 1;
>>>> +               if (intr)
>>>> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>> +               else
>>>> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>> +               matched = true;
>>>> +               break;
>>>> +       }
>>>> +
>>>> +       /* Just need the highest qwords */
>>> Again repetitive, could we have a single function?
>> Yes, at least the for loop can be extracted into a common function.
>>
>>
>>>> +       if (qwords > opts->sample_pred_regs_qwords) {
>>>> +               opts->sample_pred_regs_qwords = qwords;
>>>> +               if (intr)
>>>> +                       opts->sample_intr_pred_regs = bitmap;
>>>> +               else
>>>> +                       opts->sample_user_pred_regs = bitmap;
>>>> +       }
>>>> +
>>>> +       return matched;
>>>> +}
>>>>
>>>>  static int
>>>>  __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>  {
>>>>         uint64_t *mode = (uint64_t *)opt->value;
>>>>         const struct sample_reg *r = NULL;
>>>> +       struct record_opts *opts;
>>>>         char *s, *os = NULL, *p;
>>>> -       int ret = -1;
>>>> +       bool has_simd_regs = false;
>>>>         uint64_t mask;
>>>> +       uint64_t simd_mask;
>>>> +       uint64_t pred_mask;
>>>> +       int ret = -1;
>>>>
>>>>         if (unset)
>>>>                 return 0;
>>>> @@ -27,10 +147,17 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>         if (*mode)
>>>>                 return -1;
>>>>
>>>> -       if (intr)
>>>> +       if (intr) {
>>>> +               opts = container_of(opt->value, struct record_opts, sample_intr_regs);
>>>>                 mask = arch__intr_reg_mask();
>>>> -       else
>>>> +               simd_mask = arch__intr_simd_reg_mask();
>>>> +               pred_mask = arch__intr_pred_reg_mask();
>>>> +       } else {
>>>> +               opts = container_of(opt->value, struct record_opts, sample_user_regs);
>>>>                 mask = arch__user_reg_mask();
>>>> +               simd_mask = arch__user_simd_reg_mask();
>>>> +               pred_mask = arch__user_pred_reg_mask();
>>>> +       }
>>>>
>>>>         /* str may be NULL in case no arg is passed to -I */
>>>>         if (str) {
>>>> @@ -50,10 +177,24 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>                                         if (r->mask & mask)
>>>>                                                 fprintf(stderr, "%s ", r->name);
>>>>                                 }
>>>> +                               __print_simd_regs(intr, simd_mask);
>>>> +                               __print_pred_regs(intr, pred_mask);
>>>>                                 fputc('\n', stderr);
>>>>                                 /* just printing available regs */
>>>>                                 goto error;
>>>>                         }
>>>> +
>>>> +                       if (simd_mask) {
>>>> +                               has_simd_regs = __parse_simd_regs(opts, s, intr);
>>>> +                               if (has_simd_regs)
>>>> +                                       goto next;
>>>> +                       }
>>>> +                       if (pred_mask) {
>>>> +                               has_simd_regs = __parse_pred_regs(opts, s, intr);
>>>> +                               if (has_simd_regs)
>>>> +                                       goto next;
>>>> +                       }
>>>> +
>>>>                         for (r = arch__sample_reg_masks(); r->name; r++) {
>>>>                                 if ((r->mask & mask) && !strcasecmp(s, r->name))
>>>>                                         break;
>>>> @@ -65,7 +206,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>                         }
>>>>
>>>>                         *mode |= r->mask;
>>>> -
>>>> +next:
>>>>                         if (!p)
>>>>                                 break;
>>>>
>>>> @@ -75,7 +216,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>         ret = 0;
>>>>
>>>>         /* default to all possible regs */
>>>> -       if (*mode == 0)
>>>> +       if (*mode == 0 && !has_simd_regs)
>>>>                 *mode = mask;
>>>>  error:
>>>>         free(os);
>>>> diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
>>>> index 66b666d9ce64..fb0366d050cf 100644
>>>> --- a/tools/perf/util/perf_event_attr_fprintf.c
>>>> +++ b/tools/perf/util/perf_event_attr_fprintf.c
>>>> @@ -360,6 +360,12 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
>>>>         PRINT_ATTRf(aux_start_paused, p_unsigned);
>>>>         PRINT_ATTRf(aux_pause, p_unsigned);
>>>>         PRINT_ATTRf(aux_resume, p_unsigned);
>>>> +       PRINT_ATTRf(sample_simd_pred_reg_qwords, p_unsigned);
>>>> +       PRINT_ATTRf(sample_simd_pred_reg_intr, p_hex);
>>>> +       PRINT_ATTRf(sample_simd_pred_reg_user, p_hex);
>>>> +       PRINT_ATTRf(sample_simd_vec_reg_qwords, p_unsigned);
>>>> +       PRINT_ATTRf(sample_simd_vec_reg_intr, p_hex);
>>>> +       PRINT_ATTRf(sample_simd_vec_reg_user, p_hex);
>>>>
>>>>         return ret;
>>>>  }
>>>> diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
>>>> index 44b90bbf2d07..e8a9fabc92e6 100644
>>>> --- a/tools/perf/util/perf_regs.c
>>>> +++ b/tools/perf/util/perf_regs.c
>>>> @@ -11,6 +11,11 @@ int __weak arch_sdt_arg_parse_op(char *old_op __maybe_unused,
>>>>         return SDT_ARG_SKIP;
>>>>  }
>>>>
>>>> +bool __weak arch_has_simd_regs(u64 mask __maybe_unused)
>>>> +{
>>>> +       return false;
>>>> +}
>>>> +
>>>>  uint64_t __weak arch__intr_reg_mask(void)
>>>>  {
>>>>         return 0;
>>>> @@ -21,6 +26,50 @@ uint64_t __weak arch__user_reg_mask(void)
>>>>         return 0;
>>>>  }
>>>>
>>>> +uint64_t __weak arch__intr_simd_reg_mask(void)
>>>> +{
>>>> +       return 0;
>>>> +}
>>>> +
>>>> +uint64_t __weak arch__user_simd_reg_mask(void)
>>>> +{
>>>> +       return 0;
>>>> +}
>>>> +
>>>> +uint64_t __weak arch__intr_pred_reg_mask(void)
>>>> +{
>>>> +       return 0;
>>>> +}
>>>> +
>>>> +uint64_t __weak arch__user_pred_reg_mask(void)
>>>> +{
>>>> +       return 0;
>>>> +}
>>>> +
>>>> +uint64_t __weak arch__intr_simd_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
>>>> +{
>>>> +       *qwords = 0;
>>>> +       return 0;
>>>> +}
>>>> +
>>>> +uint64_t __weak arch__user_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>>>> +{
>>>> +       *qwords = 0;
>>>> +       return 0;
>>>> +}
>>>> +
>>>> +uint64_t __weak arch__intr_pred_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
>>>> +{
>>>> +       *qwords = 0;
>>>> +       return 0;
>>>> +}
>>>> +
>>>> +uint64_t __weak arch__user_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>>>> +{
>>>> +       *qwords = 0;
>>>> +       return 0;
>>>> +}
>>>> +
>>>>  static const struct sample_reg sample_reg_masks[] = {
>>>>         SMPL_REG_END
>>>>  };
>>>> @@ -30,6 +79,16 @@ const struct sample_reg * __weak arch__sample_reg_masks(void)
>>>>         return sample_reg_masks;
>>>>  }
>>>>
>>>> +const struct sample_reg * __weak arch__sample_simd_reg_masks(void)
>>>> +{
>>>> +       return sample_reg_masks;
>>>> +}
>>>> +
>>>> +const struct sample_reg * __weak arch__sample_pred_reg_masks(void)
>>>> +{
>>>> +       return sample_reg_masks;
>>>> +}
>>> Thinking out loud. I wonder if there is a way to hide the weak
>>> functions. It seems the support is tied to PMUs, particularly core
>>> PMUs, perhaps we can push things into pmu and arch pmu code. Then we
>>> ask the PMU to parse the register strings, set up the perf_event_attr,
>>> etc. I'm somewhat scared these functions will be used on the report
>>> rather than record side of things, thereby breaking perf.data support
>>> when the host kernel does or doesn't have the SIMD support.
>> Ian, I don't quite follow.
>>
>> I'm not sure what "push things into pmu and arch pmu code" would look
>> like here. The current SIMD registers support follows the same pattern
>> as the general registers support. If we intend to change that approach
>> entirely, we'd better do it in an independent patch-set.
>>
>> Why would these functions break perf.data reporting? perf-report checks
>> whether the PERF_SAMPLE_REGS_ABI_SIMD flag is set for each record; only
>> when the flag is set (indicating SIMD register data is appended to the
>> record) does perf-report try to parse the SIMD register data.
> Thanks Dapeng, sorry I wasn't clear. So, I've landed clean ups to
> remove weak symbols like:
> https://lore.kernel.org/lkml/20250724163302.596743-21-irogers@google.com/#t
>
> For these patches what I'm imagining is that there is a Nova Lake
> generated perf.data file. Using perf report, script, etc. on the Nova
> Lake should expose all of the same mask, qword, etc. values as when
> the perf.data was generated and so things will work. If the perf.data
> file was taken to say my Alderlake then what will happen? Generally
> using the arch directory and weak symbols is a code smell that cross
> platform things are going to break - there should be sufficient data
> in the event and the perf_event_attr to fully decode what's going on.
> Sometimes tying things to a PMU name can avoid the use of the arch
> directory. We were able to avoid the arch directory to a good extent
> for the TPEBS code, even though it is a very modern Intel feature.

I see. 

But the sampling support for SIMD registers differs from the sample-weight
processing in
https://lore.kernel.org/lkml/20250724163302.596743-21-irogers@google.com/#t.
Each arch may support different kinds of SIMD registers, and each kind may
have a different register count and width. It's quite hard to define common
functions or fields that describe the names and attributes of these
arch-specific SIMD registers; that information can only come from
arch-specific code. So the __weak functions still look like the easiest way
to implement this.

I don't think the perf.data parsing would be broken when moving from one
platform to a different platform of the same arch, e.g., from Nova Lake to
Alder Lake. To indicate the presence of SIMD registers in the record data,
a new ABI flag "PERF_SAMPLE_REGS_ABI_SIMD" is introduced. If the perf tool
on the 2nd platform is new enough to recognize this new flag, the SIMD
register data would be parsed correctly. Even if the perf tool is old and
has no SIMD register support, the SIMD register data would just be
silently ignored and should not break the parsing.
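To illustrate the forward/backward-compatibility argument above, here is a minimal sketch of flag-gated parsing. The toy record layout and the flag value are invented for illustration only; the real PERF_SAMPLE_REGS_ABI_SIMD value and record layout come from the perf UAPI headers.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical flag value for illustration only -- the real
 * PERF_SAMPLE_REGS_ABI_SIMD value is defined in the UAPI headers. */
#define TOY_ABI_SIMD_FLAG (1ULL << 2)

/* Toy record: an abi word followed by an (optional) SIMD payload. */
struct toy_record {
	uint64_t abi;
	uint64_t simd_payload_qwords;
};

/* Number of SIMD qwords a parser should consume: an old tool (or a
 * record without the flag) consumes none, so the appended data is
 * skipped rather than misparsed. */
size_t simd_qwords_to_parse(const struct toy_record *r, int tool_knows_flag)
{
	if (!tool_knows_flag)
		return 0;	/* old tool: silently ignore */
	if (!(r->abi & TOY_ABI_SIMD_FLAG))
		return 0;	/* flag clear: no SIMD data appended */
	return (size_t)r->simd_payload_qwords;
}
```

The point of the sketch is that the flag, not the producing platform, gates parsing, which is why the same perf.data should decode on any machine of the same arch.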


>
> Thanks,
> Ian
>
>
>
>>> Thanks,
>>> Ian
>>>
>>>> +
>>>>  const char *perf_reg_name(int id, const char *arch)
>>>>  {
>>>>         const char *reg_name = NULL;
>>>> diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
>>>> index f2d0736d65cc..bce9c4cfd1bf 100644
>>>> --- a/tools/perf/util/perf_regs.h
>>>> +++ b/tools/perf/util/perf_regs.h
>>>> @@ -24,9 +24,20 @@ enum {
>>>>  };
>>>>
>>>>  int arch_sdt_arg_parse_op(char *old_op, char **new_op);
>>>> +bool arch_has_simd_regs(u64 mask);
>>>>  uint64_t arch__intr_reg_mask(void);
>>>>  uint64_t arch__user_reg_mask(void);
>>>>  const struct sample_reg *arch__sample_reg_masks(void);
>>>> +const struct sample_reg *arch__sample_simd_reg_masks(void);
>>>> +const struct sample_reg *arch__sample_pred_reg_masks(void);
>>>> +uint64_t arch__intr_simd_reg_mask(void);
>>>> +uint64_t arch__user_simd_reg_mask(void);
>>>> +uint64_t arch__intr_pred_reg_mask(void);
>>>> +uint64_t arch__user_pred_reg_mask(void);
>>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>
>>>>  const char *perf_reg_name(int id, const char *arch);
>>>>  int perf_reg_value(u64 *valp, struct regs_dump *regs, int id);
>>>> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
>>>> index ea3a6c4657ee..825ffb4cc53f 100644
>>>> --- a/tools/perf/util/record.h
>>>> +++ b/tools/perf/util/record.h
>>>> @@ -59,7 +59,13 @@ struct record_opts {
>>>>         unsigned int  user_freq;
>>>>         u64           branch_stack;
>>>>         u64           sample_intr_regs;
>>>> +       u64           sample_intr_vec_regs;
>>>>         u64           sample_user_regs;
>>>> +       u64           sample_user_vec_regs;
>>>> +       u16           sample_pred_regs_qwords;
>>>> +       u16           sample_vec_regs_qwords;
>>>> +       u16           sample_intr_pred_regs;
>>>> +       u16           sample_user_pred_regs;
>>>>         u64           default_interval;
>>>>         u64           user_interval;
>>>>         size_t        auxtrace_snapshot_size;
>>>> --
>>>> 2.34.1
>>>>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 06/19] perf/x86: Add support for XMM registers in non-PEBS and REGS_USER
  2025-12-03  6:54 ` [Patch v5 06/19] perf/x86: Add support for XMM registers in non-PEBS and REGS_USER Dapeng Mi
@ 2025-12-04 15:17   ` Peter Zijlstra
  2025-12-04 15:47     ` Peter Zijlstra
  2025-12-04 18:59     ` Dave Hansen
  0 siblings, 2 replies; 86+ messages in thread
From: Peter Zijlstra @ 2025-12-04 15:17 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang

On Wed, Dec 03, 2025 at 02:54:47PM +0800, Dapeng Mi wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
> 
> While collecting XMM registers in a PEBS record has been supported since
> Icelake, non-PEBS events have lacked this feature. By leveraging the
> xsaves instruction, it is now possible to snapshot XMM registers for
> non-PEBS events, completing the feature set.
> 
> To utilize the xsaves instruction, a 64-byte aligned buffer is required.
> A per-CPU ext_regs_buf is added to store SIMD and other registers, with
> the buffer size being approximately 2K. The buffer is allocated using
> kzalloc_node(); since kmalloc() allocations with power-of-2 sizes are
> naturally aligned, this guarantees the required 64-byte alignment.
> 
> The XMM sampling support is extended for both REGS_USER and REGS_INTR.
> For REGS_USER, perf_get_regs_user() returns the registers from
> task_pt_regs(current), which is a pt_regs structure. It needs to be
> copied to the user-space-specific x86_user_regs structure since the
> kernel may modify the pt_regs structure later.
> 
> For PEBS, XMM registers are retrieved from PEBS records.
> 
> In cases where userspace tasks are trapped within kernel mode (e.g.,
> during a syscall) when an NMI arrives, pt_regs information can still be
> retrieved from task_pt_regs(). However, capturing SIMD and other
> xsave-based registers in this scenario is challenging. Therefore,
> snapshots for these registers are omitted in such cases.
> 
> The reasons are:
> - Profiling a userspace task that requires SIMD/eGPR registers typically
>   involves NMIs hitting userspace, not kernel mode.
> - Although it is possible to retrieve values when the TIF_NEED_FPU_LOAD
>   flag is set, the complexity introduced to handle this uncommon case in
>   the critical path is not justified.
> - Additionally, checking the TIF_NEED_FPU_LOAD flag alone is insufficient.
>   Some corner cases, such as an NMI occurring just after the flag switches
>   but still in kernel mode, cannot be handled.

Urgh.. Dave, Thomas, is there any reason we could not set
TIF_NEED_FPU_LOAD *after* doing the XSAVE (clearing is already done
after restore).

That way, when an NMI sees TIF_NEED_FPU_LOAD it knows the task copy is
consistent.

I'm not at all sure this is complex, it just needs a little care.

And then there is the deferred thing, just like unwind, we can defer
REGS_USER/STACK_USER much the same, except someone went and built all
that deferred stuff with unwind all tangled into it :/


* Re: [Patch v5 06/19] perf/x86: Add support for XMM registers in non-PEBS and REGS_USER
  2025-12-04 15:17   ` Peter Zijlstra
@ 2025-12-04 15:47     ` Peter Zijlstra
  2025-12-05  6:37       ` Mi, Dapeng
  2025-12-04 18:59     ` Dave Hansen
  1 sibling, 1 reply; 86+ messages in thread
From: Peter Zijlstra @ 2025-12-04 15:47 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang

On Thu, Dec 04, 2025 at 04:17:35PM +0100, Peter Zijlstra wrote:
> On Wed, Dec 03, 2025 at 02:54:47PM +0800, Dapeng Mi wrote:
> > From: Kan Liang <kan.liang@linux.intel.com>
> > 
> > While collecting XMM registers in a PEBS record has been supported since
> > Icelake, non-PEBS events have lacked this feature. By leveraging the
> > xsaves instruction, it is now possible to snapshot XMM registers for
> > non-PEBS events, completing the feature set.
> > 
> > To utilize the xsaves instruction, a 64-byte aligned buffer is required.
> > A per-CPU ext_regs_buf is added to store SIMD and other registers, with
> > the buffer size being approximately 2K. The buffer is allocated using
> > kzalloc_node(); since kmalloc() allocations with power-of-2 sizes are
> > naturally aligned, this guarantees the required 64-byte alignment.
> > 
> > The XMM sampling support is extended for both REGS_USER and REGS_INTR.
> > For REGS_USER, perf_get_regs_user() returns the registers from
> > task_pt_regs(current), which is a pt_regs structure. It needs to be
> > copied to the user-space-specific x86_user_regs structure since the
> > kernel may modify the pt_regs structure later.
> > 
> > For PEBS, XMM registers are retrieved from PEBS records.
> > 
> > In cases where userspace tasks are trapped within kernel mode (e.g.,
> > during a syscall) when an NMI arrives, pt_regs information can still be
> > retrieved from task_pt_regs(). However, capturing SIMD and other
> > xsave-based registers in this scenario is challenging. Therefore,
> > snapshots for these registers are omitted in such cases.
> > 
> > The reasons are:
> > - Profiling a userspace task that requires SIMD/eGPR registers typically
> >   involves NMIs hitting userspace, not kernel mode.
> > - Although it is possible to retrieve values when the TIF_NEED_FPU_LOAD
> >   flag is set, the complexity introduced to handle this uncommon case in
> >   the critical path is not justified.
> > - Additionally, checking the TIF_NEED_FPU_LOAD flag alone is insufficient.
> >   Some corner cases, such as an NMI occurring just after the flag switches
> >   but still in kernel mode, cannot be handled.
> 
> Urgh.. Dave, Thomas, is there any reason we could not set
> TIF_NEED_FPU_LOAD *after* doing the XSAVE (clearing is already done
> after restore).
> 
> That way, when an NMI sees TIF_NEED_FPU_LOAD it knows the task copy is
> consistent.
> 
> I'm not at all sure this is complex, it just needs a little care.
> 
> And then there is the deferred thing, just like unwind, we can defer
> REGS_USER/STACK_USER much the same, except someone went and built all
> that deferred stuff with unwind all tangled into it :/

With something like the below, the NMI could do something like:

	struct xregs_state *xr = NULL;

	/*
	 * fpu code does:
	 *  XSAVE
	 *  set_thread_flag(TIF_NEED_FPU_LOAD)
	 *  ...
	 *  XRSTOR
	 *  clear_thread_flag(TIF_NEED_FPU_LOAD)
	 * therefore, when TIF_NEED_FPU_LOAD, the task fpu state holds a
	 * whole copy.
	 */
	if (test_thread_flag(TIF_NEED_FPU_LOAD)) {
		struct fpu *fpu = x86_task_fpu(current);
		/*
		 * If __task_fpstate is set, it holds the right pointer,
		 * otherwise fpstate will.
		 */
		struct fpstate *fps = READ_ONCE(fpu->__task_fpstate);
		if (!fps)
			fps = fpu->fpstate;
		xr = &fps->regs.xregs_state;
	} else {
		/* like fpu_sync_fpstate(), except NMI local */
		xsave_nmi(xr, mask);
	}

	// frob xr into perf data

Or did I miss something? I've not looked at this very long and the above
was very vague on the actual issues.


diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index da233f20ae6f..0f91a0d7e799 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -359,18 +359,22 @@ int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest)
 	struct fpstate *cur_fps = fpu->fpstate;
 
 	fpregs_lock();
-	if (!cur_fps->is_confidential && !test_thread_flag(TIF_NEED_FPU_LOAD))
+	if (!cur_fps->is_confidential && !test_thread_flag(TIF_NEED_FPU_LOAD)) {
 		save_fpregs_to_fpstate(fpu);
+		set_thread_flag(TIF_NEED_FPU_LOAD);
+	}
 
 	/* Swap fpstate */
 	if (enter_guest) {
-		fpu->__task_fpstate = cur_fps;
+		WRITE_ONCE(fpu->__task_fpstate, cur_fps);
+		barrier();
 		fpu->fpstate = guest_fps;
 		guest_fps->in_use = true;
 	} else {
 		guest_fps->in_use = false;
 		fpu->fpstate = fpu->__task_fpstate;
-		fpu->__task_fpstate = NULL;
+		barrier();
+		WRITE_ONCE(fpu->__task_fpstate, NULL);
 	}
 
 	cur_fps = fpu->fpstate;
@@ -456,8 +460,8 @@ void kernel_fpu_begin_mask(unsigned int kfpu_mask)
 
 	if (!(current->flags & (PF_KTHREAD | PF_USER_WORKER)) &&
 	    !test_thread_flag(TIF_NEED_FPU_LOAD)) {
-		set_thread_flag(TIF_NEED_FPU_LOAD);
 		save_fpregs_to_fpstate(x86_task_fpu(current));
+		set_thread_flag(TIF_NEED_FPU_LOAD);
 	}
 	__cpu_invalidate_fpregs_state();
 

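The ordering in the diff above (save the registers first, publish TIF_NEED_FPU_LOAD second) can be modelled with a toy, single-threaded sketch. All names here are hypothetical stand-ins, not the kernel FPU API; the point is only the publish-after-complete invariant an NMI-time observer relies on.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: the saved copy is published only *after* it is complete,
 * so an observer that sees the flag set can trust the copy. */
struct toy_task {
	uint64_t live_regs;	/* stands in for the CPU register state  */
	uint64_t saved_copy;	/* stands in for the task fpstate buffer */
	int need_fpu_load;	/* stands in for TIF_NEED_FPU_LOAD       */
};

void toy_kernel_fpu_begin(struct toy_task *t)
{
	if (!t->need_fpu_load) {
		t->saved_copy = t->live_regs;	/* "XSAVE" first ...         */
		t->need_fpu_load = 1;		/* ... then publish the flag */
	}
}

/* What an NMI-time sampler would do: trust the saved copy only when
 * the flag says it is complete. Returns 1 when regs were captured. */
int toy_nmi_sample(const struct toy_task *t, uint64_t *out)
{
	if (t->need_fpu_load) {
		*out = t->saved_copy;
		return 1;
	}
	return 0;	/* would fall back to an NMI-local XSAVE instead */
}
```

With the old ordering (flag set before the save), an NMI landing between the two steps would read a stale copy while the flag already claims it is valid; the swapped ordering removes that window.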

* Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
  2025-12-04  9:20         ` Mi, Dapeng
@ 2025-12-04 16:16           ` Ian Rogers
  2025-12-05  4:00             ` Mi, Dapeng
  0 siblings, 1 reply; 86+ messages in thread
From: Ian Rogers @ 2025-12-04 16:16 UTC (permalink / raw)
  To: Mi, Dapeng
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang

On Thu, Dec 4, 2025 at 1:20 AM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>
>
> On 12/4/2025 3:49 PM, Ian Rogers wrote:
> > On Wed, Dec 3, 2025 at 6:58 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> >>
> >> On 12/4/2025 8:17 AM, Ian Rogers wrote:
> >>> On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
> >>>> From: Kan Liang <kan.liang@linux.intel.com>
> >>>>
> >>>> This patch adds support for the newly introduced SIMD register sampling
> >>>> format by adding the following functions:
> >>>>
> >>>> uint64_t arch__intr_simd_reg_mask(void);
> >>>> uint64_t arch__user_simd_reg_mask(void);
> >>>> uint64_t arch__intr_pred_reg_mask(void);
> >>>> uint64_t arch__user_pred_reg_mask(void);
> >>>> uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>> uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>> uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>> uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>
> >>>> The arch__{intr|user}_simd_reg_mask() functions retrieve the bitmap of
> >>>> supported SIMD registers, such as XMM/YMM/ZMM on x86 platforms.
> >>>>
> >>>> The arch__{intr|user}_pred_reg_mask() functions retrieve the bitmap of
> >>>> supported PRED registers, such as OPMASK on x86 platforms.
> >>>>
> >>>> The arch__{intr|user}_simd_reg_bitmap_qwords() functions provide the
> >>>> exact bitmap and number of qwords for a specific type of SIMD register.
> >>>> For example, for XMM registers on x86 platforms, the returned bitmap is
> >>>> 0xffff (XMM0 ~ XMM15) and the qwords number is 2 (128 bits for each XMM).
> >>>>
> >>>> The arch__{intr|user}_pred_reg_bitmap_qwords() functions provide the
> >>>> exact bitmap and number of qwords for a specific type of PRED register.
> >>>> For example, for OPMASK registers on x86 platforms, the returned bitmap
> >>>> is 0xff (OPMASK0 ~ OPMASK7) and the qwords number is 1 (64 bits for each
> >>>> OPMASK).
> >>>>
> >>>> Additionally, the function __parse_regs() is enhanced to support parsing
> >>>> these newly introduced SIMD registers. Currently, each type of register
> >>>> can only be sampled collectively; sampling a specific SIMD register is
> >>>> not supported. For example, all XMM registers are sampled together rather
> >>>> than sampling only XMM0.
> >>>>
> >>>> When multiple overlapping register types, such as XMM and YMM, are
> >>>> sampled simultaneously, only the superset (YMM registers) is sampled.
> >>>>
> >>>> With this patch, all supported sampling registers on x86 platforms are
> >>>> displayed as follows.
> >>>>
> >>>>  $perf record -I?
> >>>>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
> >>>>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
> >>>>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
> >>>>
> >>>>  $perf record --user-regs=?
> >>>>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
> >>>>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
> >>>>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
> >>>>
> >>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> >>>> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >>>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >>>> ---
> >>>>  tools/perf/arch/x86/util/perf_regs.c      | 470 +++++++++++++++++++++-
> >>>>  tools/perf/util/evsel.c                   |  27 ++
> >>>>  tools/perf/util/parse-regs-options.c      | 151 ++++++-
> >>>>  tools/perf/util/perf_event_attr_fprintf.c |   6 +
> >>>>  tools/perf/util/perf_regs.c               |  59 +++
> >>>>  tools/perf/util/perf_regs.h               |  11 +
> >>>>  tools/perf/util/record.h                  |   6 +
> >>>>  7 files changed, 714 insertions(+), 16 deletions(-)
> >>>>
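As an aside, the bitmap/qwords pairs described in the commit message above determine each register group's sample payload size directly (one qword is 8 bytes, each selected register contributes `qwords` qwords). A minimal sketch, with constants taken from the message and `simd_payload_bytes`/`popcount64` as invented helper names:

```c
#include <assert.h>
#include <stdint.h>

/* Count set bits (portable popcount, no compiler builtin needed). */
static unsigned popcount64(uint64_t x)
{
	unsigned n = 0;

	for (; x; x &= x - 1)
		n++;
	return n;
}

/* Bytes of one register group's sample payload: each register selected
 * in the bitmap contributes `qwords` 8-byte qwords. */
static uint64_t simd_payload_bytes(uint64_t bitmap, uint16_t qwords)
{
	return (uint64_t)popcount64(bitmap) * qwords * 8;
}
```

For the examples in the commit message: XMM with bitmap 0xffff and 2 qwords gives 16 * 2 * 8 = 256 bytes, and OPMASK with bitmap 0xff and 1 qword gives 8 * 1 * 8 = 64 bytes.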
> >>>> diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
> >>>> index 12fd93f04802..db41430f3b07 100644
> >>>> --- a/tools/perf/arch/x86/util/perf_regs.c
> >>>> +++ b/tools/perf/arch/x86/util/perf_regs.c
> >>>> @@ -13,6 +13,49 @@
> >>>>  #include "../../../util/pmu.h"
> >>>>  #include "../../../util/pmus.h"
> >>>>
> >>>> +static const struct sample_reg sample_reg_masks_ext[] = {
> >>>> +       SMPL_REG(AX, PERF_REG_X86_AX),
> >>>> +       SMPL_REG(BX, PERF_REG_X86_BX),
> >>>> +       SMPL_REG(CX, PERF_REG_X86_CX),
> >>>> +       SMPL_REG(DX, PERF_REG_X86_DX),
> >>>> +       SMPL_REG(SI, PERF_REG_X86_SI),
> >>>> +       SMPL_REG(DI, PERF_REG_X86_DI),
> >>>> +       SMPL_REG(BP, PERF_REG_X86_BP),
> >>>> +       SMPL_REG(SP, PERF_REG_X86_SP),
> >>>> +       SMPL_REG(IP, PERF_REG_X86_IP),
> >>>> +       SMPL_REG(FLAGS, PERF_REG_X86_FLAGS),
> >>>> +       SMPL_REG(CS, PERF_REG_X86_CS),
> >>>> +       SMPL_REG(SS, PERF_REG_X86_SS),
> >>>> +#ifdef HAVE_ARCH_X86_64_SUPPORT
> >>>> +       SMPL_REG(R8, PERF_REG_X86_R8),
> >>>> +       SMPL_REG(R9, PERF_REG_X86_R9),
> >>>> +       SMPL_REG(R10, PERF_REG_X86_R10),
> >>>> +       SMPL_REG(R11, PERF_REG_X86_R11),
> >>>> +       SMPL_REG(R12, PERF_REG_X86_R12),
> >>>> +       SMPL_REG(R13, PERF_REG_X86_R13),
> >>>> +       SMPL_REG(R14, PERF_REG_X86_R14),
> >>>> +       SMPL_REG(R15, PERF_REG_X86_R15),
> >>>> +       SMPL_REG(R16, PERF_REG_X86_R16),
> >>>> +       SMPL_REG(R17, PERF_REG_X86_R17),
> >>>> +       SMPL_REG(R18, PERF_REG_X86_R18),
> >>>> +       SMPL_REG(R19, PERF_REG_X86_R19),
> >>>> +       SMPL_REG(R20, PERF_REG_X86_R20),
> >>>> +       SMPL_REG(R21, PERF_REG_X86_R21),
> >>>> +       SMPL_REG(R22, PERF_REG_X86_R22),
> >>>> +       SMPL_REG(R23, PERF_REG_X86_R23),
> >>>> +       SMPL_REG(R24, PERF_REG_X86_R24),
> >>>> +       SMPL_REG(R25, PERF_REG_X86_R25),
> >>>> +       SMPL_REG(R26, PERF_REG_X86_R26),
> >>>> +       SMPL_REG(R27, PERF_REG_X86_R27),
> >>>> +       SMPL_REG(R28, PERF_REG_X86_R28),
> >>>> +       SMPL_REG(R29, PERF_REG_X86_R29),
> >>>> +       SMPL_REG(R30, PERF_REG_X86_R30),
> >>>> +       SMPL_REG(R31, PERF_REG_X86_R31),
> >>>> +       SMPL_REG(SSP, PERF_REG_X86_SSP),
> >>>> +#endif
> >>>> +       SMPL_REG_END
> >>>> +};
> >>>> +
> >>>>  static const struct sample_reg sample_reg_masks[] = {
> >>>>         SMPL_REG(AX, PERF_REG_X86_AX),
> >>>>         SMPL_REG(BX, PERF_REG_X86_BX),
> >>>> @@ -276,27 +319,404 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
> >>>>         return SDT_ARG_VALID;
> >>>>  }
> >>>>
> >>>> +static bool support_simd_reg(u64 sample_type, u16 qwords, u64 mask, bool pred)
> >>> To make the code easier to read, it'd be nice to document sample_type,
> >>> qwords and mask here.
> >> Sure.
> >>
> >>
> >>>> +{
> >>>> +       struct perf_event_attr attr = {
> >>>> +               .type                           = PERF_TYPE_HARDWARE,
> >>>> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
> >>>> +               .sample_type                    = sample_type,
> >>>> +               .disabled                       = 1,
> >>>> +               .exclude_kernel                 = 1,
> >>>> +               .sample_simd_regs_enabled       = 1,
> >>>> +       };
> >>>> +       int fd;
> >>>> +
> >>>> +       attr.sample_period = 1;
> >>>> +
> >>>> +       if (!pred) {
> >>>> +               attr.sample_simd_vec_reg_qwords = qwords;
> >>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>> +                       attr.sample_simd_vec_reg_intr = mask;
> >>>> +               else
> >>>> +                       attr.sample_simd_vec_reg_user = mask;
> >>>> +       } else {
> >>>> +               attr.sample_simd_pred_reg_qwords = PERF_X86_OPMASK_QWORDS;
> >>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>> +                       attr.sample_simd_pred_reg_intr = PERF_X86_SIMD_PRED_MASK;
> >>>> +               else
> >>>> +                       attr.sample_simd_pred_reg_user = PERF_X86_SIMD_PRED_MASK;
> >>>> +       }
> >>>> +
> >>>> +       if (perf_pmus__num_core_pmus() > 1) {
> >>>> +               struct perf_pmu *pmu = NULL;
> >>>> +               __u64 type = PERF_TYPE_RAW;
> >>> It should be okay to do:
> >>> __u64 type = perf_pmus__find_core_pmu()->type
> >>> rather than have the whole loop below.
> >> Sure. Thanks.
> >>
> >>
> >>>> +
> >>>> +               /*
> >>>> +                * The same register set is supported among different hybrid PMUs.
> >>>> +                * Only check the first available one.
> >>>> +                */
> >>>> +               while ((pmu = perf_pmus__scan_core(pmu)) != NULL) {
> >>>> +                       type = pmu->type;
> >>>> +                       break;
> >>>> +               }
> >>>> +               attr.config |= type << PERF_PMU_TYPE_SHIFT;
> >>>> +       }
> >>>> +
> >>>> +       event_attr_init(&attr);
> >>>> +
> >>>> +       fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
> >>>> +       if (fd != -1) {
> >>>> +               close(fd);
> >>>> +               return true;
> >>>> +       }
> >>>> +
> >>>> +       return false;
> >>>> +}
> >>>> +
> >>>> +static bool __arch_simd_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
> >>>> +{
> >>>> +       bool supported = false;
> >>>> +       u64 bits;
> >>>> +
> >>>> +       *mask = 0;
> >>>> +       *qwords = 0;
> >>>> +
> >>>> +       switch (reg) {
> >>>> +       case PERF_REG_X86_XMM:
> >>>> +               bits = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
> >>>> +               supported = support_simd_reg(sample_type, PERF_X86_XMM_QWORDS, bits, false);
> >>>> +               if (supported) {
> >>>> +                       *mask = bits;
> >>>> +                       *qwords = PERF_X86_XMM_QWORDS;
> >>>> +               }
> >>>> +               break;
> >>>> +       case PERF_REG_X86_YMM:
> >>>> +               bits = BIT_ULL(PERF_X86_SIMD_YMM_REGS) - 1;
> >>>> +               supported = support_simd_reg(sample_type, PERF_X86_YMM_QWORDS, bits, false);
> >>>> +               if (supported) {
> >>>> +                       *mask = bits;
> >>>> +                       *qwords = PERF_X86_YMM_QWORDS;
> >>>> +               }
> >>>> +               break;
> >>>> +       case PERF_REG_X86_ZMM:
> >>>> +               bits = BIT_ULL(PERF_X86_SIMD_ZMM_REGS) - 1;
> >>>> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
> >>>> +               if (supported) {
> >>>> +                       *mask = bits;
> >>>> +                       *qwords = PERF_X86_ZMM_QWORDS;
> >>>> +                       break;
> >>>> +               }
> >>>> +
> >>>> +               bits = BIT_ULL(PERF_X86_SIMD_ZMMH_REGS) - 1;
> >>>> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
> >>>> +               if (supported) {
> >>>> +                       *mask = bits;
> >>>> +                       *qwords = PERF_X86_ZMMH_QWORDS;
> >>>> +               }
> >>>> +               break;
> >>>> +       default:
> >>>> +               break;
> >>>> +       }
> >>>> +
> >>>> +       return supported;
> >>>> +}
> >>>> +
> >>>> +static bool __arch_pred_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
> >>>> +{
> >>>> +       bool supported = false;
> >>>> +       u64 bits;
> >>>> +
> >>>> +       *mask = 0;
> >>>> +       *qwords = 0;
> >>>> +
> >>>> +       switch (reg) {
> >>>> +       case PERF_REG_X86_OPMASK:
> >>>> +               bits = BIT_ULL(PERF_X86_SIMD_OPMASK_REGS) - 1;
> >>>> +               supported = support_simd_reg(sample_type, PERF_X86_OPMASK_QWORDS, bits, true);
> >>>> +               if (supported) {
> >>>> +                       *mask = bits;
> >>>> +                       *qwords = PERF_X86_OPMASK_QWORDS;
> >>>> +               }
> >>>> +               break;
> >>>> +       default:
> >>>> +               break;
> >>>> +       }
> >>>> +
> >>>> +       return supported;
> >>>> +}
> >>>> +
> >>>> +static bool has_cap_simd_regs(void)
> >>>> +{
> >>>> +       uint64_t mask = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
> >>>> +       u16 qwords = PERF_X86_XMM_QWORDS;
> >>>> +       static bool has_cap_simd_regs;
> >>>> +       static bool cached;
> >>>> +
> >>>> +       if (cached)
> >>>> +               return has_cap_simd_regs;
> >>>> +
> >>>> +       has_cap_simd_regs = __arch_simd_reg_mask(PERF_SAMPLE_REGS_INTR,
> >>>> +                                                PERF_REG_X86_XMM, &mask, &qwords);
> >>>> +       has_cap_simd_regs |= __arch_simd_reg_mask(PERF_SAMPLE_REGS_USER,
> >>>> +                                                PERF_REG_X86_XMM, &mask, &qwords);
> >>>> +       cached = true;
> >>>> +
> >>>> +       return has_cap_simd_regs;
> >>>> +}
> >>>> +
> >>>> +bool arch_has_simd_regs(u64 mask)
> >>>> +{
> >>>> +       return has_cap_simd_regs() &&
> >>>> +              mask & GENMASK_ULL(PERF_REG_X86_SSP, PERF_REG_X86_R16);
> >>>> +}
> >>>> +
> >>>> +static const struct sample_reg sample_simd_reg_masks[] = {
> >>>> +       SMPL_REG(XMM, PERF_REG_X86_XMM),
> >>>> +       SMPL_REG(YMM, PERF_REG_X86_YMM),
> >>>> +       SMPL_REG(ZMM, PERF_REG_X86_ZMM),
> >>>> +       SMPL_REG_END
> >>>> +};
> >>>> +
> >>>> +static const struct sample_reg sample_pred_reg_masks[] = {
> >>>> +       SMPL_REG(OPMASK, PERF_REG_X86_OPMASK),
> >>>> +       SMPL_REG_END
> >>>> +};
> >>>> +
> >>>> +const struct sample_reg *arch__sample_simd_reg_masks(void)
> >>>> +{
> >>>> +       return sample_simd_reg_masks;
> >>>> +}
> >>>> +
> >>>> +const struct sample_reg *arch__sample_pred_reg_masks(void)
> >>>> +{
> >>>> +       return sample_pred_reg_masks;
> >>>> +}
> >>>> +
> >>>> +static bool x86_intr_simd_updated;
> >>>> +static u64 x86_intr_simd_reg_mask;
> >>>> +static u64 x86_intr_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
> >>>> +static u16 x86_intr_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
> >>> Could we add some comments? I can kind of figure out the updated is a
> >>> check for lazy initialization and what masks are, qwords is an odd
> >>> one. The comment could also point out that SIMD doesn't mean the
> >>> machine supports SIMD, but SIMD registers are supported in perf
> >>> events.
> >> Sure.
> >>
> >>
> >>>> +static bool x86_user_simd_updated;
> >>>> +static u64 x86_user_simd_reg_mask;
> >>>> +static u64 x86_user_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
> >>>> +static u16 x86_user_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
> >>>> +
> >>>> +static bool x86_intr_pred_updated;
> >>>> +static u64 x86_intr_pred_reg_mask;
> >>>> +static u64 x86_intr_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
> >>>> +static u16 x86_intr_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
> >>>> +static bool x86_user_pred_updated;
> >>>> +static u64 x86_user_pred_reg_mask;
> >>>> +static u64 x86_user_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
> >>>> +static u16 x86_user_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
> >>>> +
> >>>> +static uint64_t __arch__simd_reg_mask(u64 sample_type)
> >>>> +{
> >>>> +       const struct sample_reg *r = NULL;
> >>>> +       bool supported;
> >>>> +       u64 mask = 0;
> >>>> +       int reg;
> >>>> +
> >>>> +       if (!has_cap_simd_regs())
> >>>> +               return 0;
> >>>> +
> >>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_simd_updated)
> >>>> +               return x86_intr_simd_reg_mask;
> >>>> +
> >>>> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_simd_updated)
> >>>> +               return x86_user_simd_reg_mask;
> >>>> +
> >>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> >>>> +               supported = false;
> >>>> +
> >>>> +               if (!r->mask)
> >>>> +                       continue;
> >>>> +               reg = fls64(r->mask) - 1;
> >>>> +
> >>>> +               if (reg >= PERF_REG_X86_MAX_SIMD_REGS)
> >>>> +                       break;
> >>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>> +                       supported = __arch_simd_reg_mask(sample_type, reg,
> >>>> +                                                        &x86_intr_simd_mask[reg],
> >>>> +                                                        &x86_intr_simd_qwords[reg]);
> >>>> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
> >>>> +                       supported = __arch_simd_reg_mask(sample_type, reg,
> >>>> +                                                        &x86_user_simd_mask[reg],
> >>>> +                                                        &x86_user_simd_qwords[reg]);
> >>>> +               if (supported)
> >>>> +                       mask |= BIT_ULL(reg);
> >>>> +       }
> >>>> +
> >>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
> >>>> +               x86_intr_simd_reg_mask = mask;
> >>>> +               x86_intr_simd_updated = true;
> >>>> +       } else {
> >>>> +               x86_user_simd_reg_mask = mask;
> >>>> +               x86_user_simd_updated = true;
> >>>> +       }
> >>>> +
> >>>> +       return mask;
> >>>> +}
> >>>> +
> >>>> +static uint64_t __arch__pred_reg_mask(u64 sample_type)
> >>>> +{
> >>>> +       const struct sample_reg *r = NULL;
> >>>> +       bool supported;
> >>>> +       u64 mask = 0;
> >>>> +       int reg;
> >>>> +
> >>>> +       if (!has_cap_simd_regs())
> >>>> +               return 0;
> >>>> +
> >>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_pred_updated)
> >>>> +               return x86_intr_pred_reg_mask;
> >>>> +
> >>>> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_pred_updated)
> >>>> +               return x86_user_pred_reg_mask;
> >>>> +
> >>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> >>>> +               supported = false;
> >>>> +
> >>>> +               if (!r->mask)
> >>>> +                       continue;
> >>>> +               reg = fls64(r->mask) - 1;
> >>>> +
> >>>> +               if (reg >= PERF_REG_X86_MAX_PRED_REGS)
> >>>> +                       break;
> >>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>> +                       supported = __arch_pred_reg_mask(sample_type, reg,
> >>>> +                                                        &x86_intr_pred_mask[reg],
> >>>> +                                                        &x86_intr_pred_qwords[reg]);
> >>>> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
> >>>> +                       supported = __arch_pred_reg_mask(sample_type, reg,
> >>>> +                                                        &x86_user_pred_mask[reg],
> >>>> +                                                        &x86_user_pred_qwords[reg]);
> >>>> +               if (supported)
> >>>> +                       mask |= BIT_ULL(reg);
> >>>> +       }
> >>>> +
> >>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
> >>>> +               x86_intr_pred_reg_mask = mask;
> >>>> +               x86_intr_pred_updated = true;
> >>>> +       } else {
> >>>> +               x86_user_pred_reg_mask = mask;
> >>>> +               x86_user_pred_updated = true;
> >>>> +       }
> >>>> +
> >>>> +       return mask;
> >>>> +}
> >>> This feels repetitive with __arch__simd_reg_mask, could they be
> >>> refactored together?
> >> hmm, it looks like we can extract the for loop into a common function.
> >> The other parts are hard to generalize since they manipulate different
> >> variables. If we wanted to generalize them, we would have to introduce
> >> lots of "if ... else" branches, which would make the code hard to read.
> >>
> >>
> >>>> +
> >>>> +uint64_t arch__intr_simd_reg_mask(void)
> >>>> +{
> >>>> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_INTR);
> >>>> +}
> >>>> +
> >>>> +uint64_t arch__user_simd_reg_mask(void)
> >>>> +{
> >>>> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_USER);
> >>>> +}
> >>>> +
> >>>> +uint64_t arch__intr_pred_reg_mask(void)
> >>>> +{
> >>>> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_INTR);
> >>>> +}
> >>>> +
> >>>> +uint64_t arch__user_pred_reg_mask(void)
> >>>> +{
> >>>> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_USER);
> >>>> +}
> >>>> +
> >>>> +static uint64_t arch__simd_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
> >>>> +{
> >>>> +       uint64_t mask = 0;
> >>>> +
> >>>> +       *qwords = 0;
> >>>> +       if (reg < PERF_REG_X86_MAX_SIMD_REGS) {
> >>>> +               if (intr) {
> >>>> +                       *qwords = x86_intr_simd_qwords[reg];
> >>>> +                       mask = x86_intr_simd_mask[reg];
> >>>> +               } else {
> >>>> +                       *qwords = x86_user_simd_qwords[reg];
> >>>> +                       mask = x86_user_simd_mask[reg];
> >>>> +               }
> >>>> +       }
> >>>> +
> >>>> +       return mask;
> >>>> +}
> >>>> +
> >>>> +static uint64_t arch__pred_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
> >>>> +{
> >>>> +       uint64_t mask = 0;
> >>>> +
> >>>> +       *qwords = 0;
> >>>> +       if (reg < PERF_REG_X86_MAX_PRED_REGS) {
> >>>> +               if (intr) {
> >>>> +                       *qwords = x86_intr_pred_qwords[reg];
> >>>> +                       mask = x86_intr_pred_mask[reg];
> >>>> +               } else {
> >>>> +                       *qwords = x86_user_pred_qwords[reg];
> >>>> +                       mask = x86_user_pred_mask[reg];
> >>>> +               }
> >>>> +       }
> >>>> +
> >>>> +       return mask;
> >>>> +}
> >>>> +
> >>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords)
> >>>> +{
> >>>> +       if (!x86_intr_simd_updated)
> >>>> +               arch__intr_simd_reg_mask();
> >>>> +       return arch__simd_reg_bitmap_qwords(reg, qwords, true);
> >>>> +}
> >>>> +
> >>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords)
> >>>> +{
> >>>> +       if (!x86_user_simd_updated)
> >>>> +               arch__user_simd_reg_mask();
> >>>> +       return arch__simd_reg_bitmap_qwords(reg, qwords, false);
> >>>> +}
> >>>> +
> >>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords)
> >>>> +{
> >>>> +       if (!x86_intr_pred_updated)
> >>>> +               arch__intr_pred_reg_mask();
> >>>> +       return arch__pred_reg_bitmap_qwords(reg, qwords, true);
> >>>> +}
> >>>> +
> >>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords)
> >>>> +{
> >>>> +       if (!x86_user_pred_updated)
> >>>> +               arch__user_pred_reg_mask();
> >>>> +       return arch__pred_reg_bitmap_qwords(reg, qwords, false);
> >>>> +}
> >>>> +
> >>>>  const struct sample_reg *arch__sample_reg_masks(void)
> >>>>  {
> >>>> +       if (has_cap_simd_regs())
> >>>> +               return sample_reg_masks_ext;
> >>>>         return sample_reg_masks;
> >>>>  }
> >>>>
> >>>> -uint64_t arch__intr_reg_mask(void)
> >>>> +static uint64_t __arch__reg_mask(u64 sample_type, u64 mask, bool has_simd_regs)
> >>>>  {
> >>>>         struct perf_event_attr attr = {
> >>>> -               .type                   = PERF_TYPE_HARDWARE,
> >>>> -               .config                 = PERF_COUNT_HW_CPU_CYCLES,
> >>>> -               .sample_type            = PERF_SAMPLE_REGS_INTR,
> >>>> -               .sample_regs_intr       = PERF_REG_EXTENDED_MASK,
> >>>> -               .precise_ip             = 1,
> >>>> -               .disabled               = 1,
> >>>> -               .exclude_kernel         = 1,
> >>>> +               .type                           = PERF_TYPE_HARDWARE,
> >>>> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
> >>>> +               .sample_type                    = sample_type,
> >>>> +               .precise_ip                     = 1,
> >>>> +               .disabled                       = 1,
> >>>> +               .exclude_kernel                 = 1,
> >>>> +               .sample_simd_regs_enabled       = has_simd_regs,
> >>>>         };
> >>>>         int fd;
> >>>>         /*
> >>>>          * In an unnamed union, init it here to build on older gcc versions
> >>>>          */
> >>>>         attr.sample_period = 1;
> >>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>> +               attr.sample_regs_intr = mask;
> >>>> +       else
> >>>> +               attr.sample_regs_user = mask;
> >>>>
> >>>>         if (perf_pmus__num_core_pmus() > 1) {
> >>>>                 struct perf_pmu *pmu = NULL;
> >>>> @@ -318,13 +738,41 @@ uint64_t arch__intr_reg_mask(void)
> >>>>         fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
> >>>>         if (fd != -1) {
> >>>>                 close(fd);
> >>>> -               return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
> >>>> +               return mask;
> >>>>         }
> >>>>
> >>>> -       return PERF_REGS_MASK;
> >>>> +       return 0;
> >>>> +}
> >>>> +
> >>>> +uint64_t arch__intr_reg_mask(void)
> >>>> +{
> >>>> +       uint64_t mask = PERF_REGS_MASK;
> >>>> +
> >>>> +       if (has_cap_simd_regs()) {
> >>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
> >>>> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
> >>>> +                                        true);
> >>> It's nice to label constant arguments like this something like:
> >>> /*has_simd_regs=*/true);
> >>>
> >>> Tools like clang-tidy even try to enforce the argument names match the comments.
> >> Sure.
> >>
> >>
> >>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
> >>>> +                                        BIT_ULL(PERF_REG_X86_SSP),
> >>>> +                                        true);
> >>>> +       } else
> >>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR, PERF_REG_EXTENDED_MASK, false);
> >>>> +
> >>>> +       return mask;
> >>>>  }
> >>>>
> >>>>  uint64_t arch__user_reg_mask(void)
> >>>>  {
> >>>> -       return PERF_REGS_MASK;
> >>>> +       uint64_t mask = PERF_REGS_MASK;
> >>>> +
> >>>> +       if (has_cap_simd_regs()) {
> >>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
> >>>> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
> >>>> +                                        true);
> >>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
> >>>> +                                        BIT_ULL(PERF_REG_X86_SSP),
> >>>> +                                        true);
> >>>> +       }
> >>>> +
> >>>> +       return mask;
> >>> The code is repetitive here, could we refactor into a single function
> >>> passing in a user or instr value?
> >> Sure. Would extract the common part.
> >>
> >>
> >>>>  }
> >>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> >>>> index 56ebefd075f2..5d1d90cf9488 100644
> >>>> --- a/tools/perf/util/evsel.c
> >>>> +++ b/tools/perf/util/evsel.c
> >>>> @@ -1461,12 +1461,39 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
> >>>>         if (opts->sample_intr_regs && !evsel->no_aux_samples &&
> >>>>             !evsel__is_dummy_event(evsel)) {
> >>>>                 attr->sample_regs_intr = opts->sample_intr_regs;
> >>>> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_intr);
> >>>> +               evsel__set_sample_bit(evsel, REGS_INTR);
> >>>> +       }
> >>>> +
> >>>> +       if ((opts->sample_intr_vec_regs || opts->sample_intr_pred_regs) &&
> >>>> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
> >>>> +               /* A non-zero pred qwords implies the set of SIMD registers being used */
> >>>> +               if (opts->sample_pred_regs_qwords)
> >>>> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
> >>>> +               else
> >>>> +                       attr->sample_simd_pred_reg_qwords = 1;
> >>>> +               attr->sample_simd_vec_reg_intr = opts->sample_intr_vec_regs;
> >>>> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
> >>>> +               attr->sample_simd_pred_reg_intr = opts->sample_intr_pred_regs;
> >>>>                 evsel__set_sample_bit(evsel, REGS_INTR);
> >>>>         }
> >>>>
> >>>>         if (opts->sample_user_regs && !evsel->no_aux_samples &&
> >>>>             !evsel__is_dummy_event(evsel)) {
> >>>>                 attr->sample_regs_user |= opts->sample_user_regs;
> >>>> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_user);
> >>>> +               evsel__set_sample_bit(evsel, REGS_USER);
> >>>> +       }
> >>>> +
> >>>> +       if ((opts->sample_user_vec_regs || opts->sample_user_pred_regs) &&
> >>>> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
> >>>> +               if (opts->sample_pred_regs_qwords)
> >>>> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
> >>>> +               else
> >>>> +                       attr->sample_simd_pred_reg_qwords = 1;
> >>>> +               attr->sample_simd_vec_reg_user = opts->sample_user_vec_regs;
> >>>> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
> >>>> +               attr->sample_simd_pred_reg_user = opts->sample_user_pred_regs;
> >>>>                 evsel__set_sample_bit(evsel, REGS_USER);
> >>>>         }
> >>>>
> >>>> diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
> >>>> index cda1c620968e..0bd100392889 100644
> >>>> --- a/tools/perf/util/parse-regs-options.c
> >>>> +++ b/tools/perf/util/parse-regs-options.c
> >>>> @@ -4,19 +4,139 @@
> >>>>  #include <stdint.h>
> >>>>  #include <string.h>
> >>>>  #include <stdio.h>
> >>>> +#include <linux/bitops.h>
> >>>>  #include "util/debug.h"
> >>>>  #include <subcmd/parse-options.h>
> >>>>  #include "util/perf_regs.h"
> >>>>  #include "util/parse-regs-options.h"
> >>>> +#include "record.h"
> >>>> +
> >>>> +static void __print_simd_regs(bool intr, uint64_t simd_mask)
> >>>> +{
> >>>> +       const struct sample_reg *r = NULL;
> >>>> +       uint64_t bitmap = 0;
> >>>> +       u16 qwords = 0;
> >>>> +       int reg_idx;
> >>>> +
> >>>> +       if (!simd_mask)
> >>>> +               return;
> >>>> +
> >>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> >>>> +               if (!(r->mask & simd_mask))
> >>>> +                       continue;
> >>>> +               reg_idx = fls64(r->mask) - 1;
> >>>> +               if (intr)
> >>>> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >>>> +               else
> >>>> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >>>> +               if (bitmap)
> >>>> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
> >>>> +       }
> >>>> +}
> >>>> +
> >>>> +static void __print_pred_regs(bool intr, uint64_t pred_mask)
> >>>> +{
> >>>> +       const struct sample_reg *r = NULL;
> >>>> +       uint64_t bitmap = 0;
> >>>> +       u16 qwords = 0;
> >>>> +       int reg_idx;
> >>>> +
> >>>> +       if (!pred_mask)
> >>>> +               return;
> >>>> +
> >>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> >>>> +               if (!(r->mask & pred_mask))
> >>>> +                       continue;
> >>>> +               reg_idx = fls64(r->mask) - 1;
> >>>> +               if (intr)
> >>>> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >>>> +               else
> >>>> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >>>> +               if (bitmap)
> >>>> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
> >>>> +       }
> >>>> +}
> >>>> +
> >>>> +static bool __parse_simd_regs(struct record_opts *opts, char *s, bool intr)
> >>>> +{
> >>>> +       const struct sample_reg *r = NULL;
> >>>> +       bool matched = false;
> >>>> +       uint64_t bitmap = 0;
> >>>> +       u16 qwords = 0;
> >>>> +       int reg_idx;
> >>>> +
> >>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> >>>> +               if (strcasecmp(s, r->name))
> >>>> +                       continue;
> >>>> +               if (!fls64(r->mask))
> >>>> +                       continue;
> >>>> +               reg_idx = fls64(r->mask) - 1;
> >>>> +               if (intr)
> >>>> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >>>> +               else
> >>>> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >>>> +               matched = true;
> >>>> +               break;
> >>>> +       }
> >>>> +
> >>>> +       /* Just need the highest qwords */
> >>> I'm not following here. Does the bitmap need to handle gaps?
> >> Currently no. In theory, the kernel allows user space to sample only a
> >> subset of SIMD registers, e.g., 0xff or 0xf0f for XMM registers (HW
> >> supports 16 XMM registers), but that isn't supported here, to avoid
> >> introducing too much complexity in the perf tools. Moreover, I don't
> >> think end users have such a requirement. In most cases, users only know
> >> which kinds of SIMD registers their programs use, but usually don't know
> >> or care about exactly which SIMD register is used.
> >>
> >>
> >>>> +       if (qwords > opts->sample_vec_regs_qwords) {
> >>>> +               opts->sample_vec_regs_qwords = qwords;
> >>>> +               if (intr)
> >>>> +                       opts->sample_intr_vec_regs = bitmap;
> >>>> +               else
> >>>> +                       opts->sample_user_vec_regs = bitmap;
> >>>> +       }
> >>>> +
> >>>> +       return matched;
> >>>> +}
> >>>> +
> >>>> +static bool __parse_pred_regs(struct record_opts *opts, char *s, bool intr)
> >>>> +{
> >>>> +       const struct sample_reg *r = NULL;
> >>>> +       bool matched = false;
> >>>> +       uint64_t bitmap = 0;
> >>>> +       u16 qwords = 0;
> >>>> +       int reg_idx;
> >>>> +
> >>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> >>>> +               if (strcasecmp(s, r->name))
> >>>> +                       continue;
> >>>> +               if (!fls64(r->mask))
> >>>> +                       continue;
> >>>> +               reg_idx = fls64(r->mask) - 1;
> >>>> +               if (intr)
> >>>> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >>>> +               else
> >>>> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >>>> +               matched = true;
> >>>> +               break;
> >>>> +       }
> >>>> +
> >>>> +       /* Just need the highest qwords */
> >>> Again repetitive, could we have a single function?
> >> Yes, I suppose the for loop at least can be extracted as a common function.
> >>
> >>
> >>>> +       if (qwords > opts->sample_pred_regs_qwords) {
> >>>> +               opts->sample_pred_regs_qwords = qwords;
> >>>> +               if (intr)
> >>>> +                       opts->sample_intr_pred_regs = bitmap;
> >>>> +               else
> >>>> +                       opts->sample_user_pred_regs = bitmap;
> >>>> +       }
> >>>> +
> >>>> +       return matched;
> >>>> +}
> >>>>
> >>>>  static int
> >>>>  __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>  {
> >>>>         uint64_t *mode = (uint64_t *)opt->value;
> >>>>         const struct sample_reg *r = NULL;
> >>>> +       struct record_opts *opts;
> >>>>         char *s, *os = NULL, *p;
> >>>> -       int ret = -1;
> >>>> +       bool has_simd_regs = false;
> >>>>         uint64_t mask;
> >>>> +       uint64_t simd_mask;
> >>>> +       uint64_t pred_mask;
> >>>> +       int ret = -1;
> >>>>
> >>>>         if (unset)
> >>>>                 return 0;
> >>>> @@ -27,10 +147,17 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>         if (*mode)
> >>>>                 return -1;
> >>>>
> >>>> -       if (intr)
> >>>> +       if (intr) {
> >>>> +               opts = container_of(opt->value, struct record_opts, sample_intr_regs);
> >>>>                 mask = arch__intr_reg_mask();
> >>>> -       else
> >>>> +               simd_mask = arch__intr_simd_reg_mask();
> >>>> +               pred_mask = arch__intr_pred_reg_mask();
> >>>> +       } else {
> >>>> +               opts = container_of(opt->value, struct record_opts, sample_user_regs);
> >>>>                 mask = arch__user_reg_mask();
> >>>> +               simd_mask = arch__user_simd_reg_mask();
> >>>> +               pred_mask = arch__user_pred_reg_mask();
> >>>> +       }
> >>>>
> >>>>         /* str may be NULL in case no arg is passed to -I */
> >>>>         if (str) {
> >>>> @@ -50,10 +177,24 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>                                         if (r->mask & mask)
> >>>>                                                 fprintf(stderr, "%s ", r->name);
> >>>>                                 }
> >>>> +                               __print_simd_regs(intr, simd_mask);
> >>>> +                               __print_pred_regs(intr, pred_mask);
> >>>>                                 fputc('\n', stderr);
> >>>>                                 /* just printing available regs */
> >>>>                                 goto error;
> >>>>                         }
> >>>> +
> >>>> +                       if (simd_mask) {
> >>>> +                               has_simd_regs = __parse_simd_regs(opts, s, intr);
> >>>> +                               if (has_simd_regs)
> >>>> +                                       goto next;
> >>>> +                       }
> >>>> +                       if (pred_mask) {
> >>>> +                               has_simd_regs = __parse_pred_regs(opts, s, intr);
> >>>> +                               if (has_simd_regs)
> >>>> +                                       goto next;
> >>>> +                       }
> >>>> +
> >>>>                         for (r = arch__sample_reg_masks(); r->name; r++) {
> >>>>                                 if ((r->mask & mask) && !strcasecmp(s, r->name))
> >>>>                                         break;
> >>>> @@ -65,7 +206,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>                         }
> >>>>
> >>>>                         *mode |= r->mask;
> >>>> -
> >>>> +next:
> >>>>                         if (!p)
> >>>>                                 break;
> >>>>
> >>>> @@ -75,7 +216,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>         ret = 0;
> >>>>
> >>>>         /* default to all possible regs */
> >>>> -       if (*mode == 0)
> >>>> +       if (*mode == 0 && !has_simd_regs)
> >>>>                 *mode = mask;
> >>>>  error:
> >>>>         free(os);
> >>>> diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
> >>>> index 66b666d9ce64..fb0366d050cf 100644
> >>>> --- a/tools/perf/util/perf_event_attr_fprintf.c
> >>>> +++ b/tools/perf/util/perf_event_attr_fprintf.c
> >>>> @@ -360,6 +360,12 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
> >>>>         PRINT_ATTRf(aux_start_paused, p_unsigned);
> >>>>         PRINT_ATTRf(aux_pause, p_unsigned);
> >>>>         PRINT_ATTRf(aux_resume, p_unsigned);
> >>>> +       PRINT_ATTRf(sample_simd_pred_reg_qwords, p_unsigned);
> >>>> +       PRINT_ATTRf(sample_simd_pred_reg_intr, p_hex);
> >>>> +       PRINT_ATTRf(sample_simd_pred_reg_user, p_hex);
> >>>> +       PRINT_ATTRf(sample_simd_vec_reg_qwords, p_unsigned);
> >>>> +       PRINT_ATTRf(sample_simd_vec_reg_intr, p_hex);
> >>>> +       PRINT_ATTRf(sample_simd_vec_reg_user, p_hex);
> >>>>
> >>>>         return ret;
> >>>>  }
> >>>> diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
> >>>> index 44b90bbf2d07..e8a9fabc92e6 100644
> >>>> --- a/tools/perf/util/perf_regs.c
> >>>> +++ b/tools/perf/util/perf_regs.c
> >>>> @@ -11,6 +11,11 @@ int __weak arch_sdt_arg_parse_op(char *old_op __maybe_unused,
> >>>>         return SDT_ARG_SKIP;
> >>>>  }
> >>>>
> >>>> +bool __weak arch_has_simd_regs(u64 mask __maybe_unused)
> >>>> +{
> >>>> +       return false;
> >>>> +}
> >>>> +
> >>>>  uint64_t __weak arch__intr_reg_mask(void)
> >>>>  {
> >>>>         return 0;
> >>>> @@ -21,6 +26,50 @@ uint64_t __weak arch__user_reg_mask(void)
> >>>>         return 0;
> >>>>  }
> >>>>
> >>>> +uint64_t __weak arch__intr_simd_reg_mask(void)
> >>>> +{
> >>>> +       return 0;
> >>>> +}
> >>>> +
> >>>> +uint64_t __weak arch__user_simd_reg_mask(void)
> >>>> +{
> >>>> +       return 0;
> >>>> +}
> >>>> +
> >>>> +uint64_t __weak arch__intr_pred_reg_mask(void)
> >>>> +{
> >>>> +       return 0;
> >>>> +}
> >>>> +
> >>>> +uint64_t __weak arch__user_pred_reg_mask(void)
> >>>> +{
> >>>> +       return 0;
> >>>> +}
> >>>> +
> >>>> +uint64_t __weak arch__intr_simd_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
> >>>> +{
> >>>> +       *qwords = 0;
> >>>> +       return 0;
> >>>> +}
> >>>> +
> >>>> +uint64_t __weak arch__user_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> >>>> +{
> >>>> +       *qwords = 0;
> >>>> +       return 0;
> >>>> +}
> >>>> +
> >>>> +uint64_t __weak arch__intr_pred_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
> >>>> +{
> >>>> +       *qwords = 0;
> >>>> +       return 0;
> >>>> +}
> >>>> +
> >>>> +uint64_t __weak arch__user_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> >>>> +{
> >>>> +       *qwords = 0;
> >>>> +       return 0;
> >>>> +}
> >>>> +
> >>>>  static const struct sample_reg sample_reg_masks[] = {
> >>>>         SMPL_REG_END
> >>>>  };
> >>>> @@ -30,6 +79,16 @@ const struct sample_reg * __weak arch__sample_reg_masks(void)
> >>>>         return sample_reg_masks;
> >>>>  }
> >>>>
> >>>> +const struct sample_reg * __weak arch__sample_simd_reg_masks(void)
> >>>> +{
> >>>> +       return sample_reg_masks;
> >>>> +}
> >>>> +
> >>>> +const struct sample_reg * __weak arch__sample_pred_reg_masks(void)
> >>>> +{
> >>>> +       return sample_reg_masks;
> >>>> +}
> >>> Thinking out loud. I wonder if there is a way to hide the weak
> >>> functions. It seems the support is tied to PMUs, particularly core
> >>> PMUs, perhaps we can push things into pmu and arch pmu code. Then we
> >>> ask the PMU to parse the register strings, set up the perf_event_attr,
> >>> etc. I'm somewhat scared these functions will be used on the report
> >>> rather than record side of things, thereby breaking perf.data support
> >>> when the host kernel does or doesn't have the SIMD support.
> >> Ian, I don't quite follow you.
> >>
> >> I don't quite understand what we should do to "push things into pmu and
> >> arch pmu code". The current SIMD register support follows the same
> >> approach as the general register support. If we intend to change the
> >> approach entirely, we'd better do that in an independent patch-set.
> >>
> >> Why would these functions break perf.data reporting? perf-report checks
> >> whether the PERF_SAMPLE_REGS_ABI_SIMD flag is set for each record; only
> >> when the flag is set (indicating SIMD register data is appended to the
> >> record) does perf-report try to parse the SIMD register data.
> > Thanks Dapeng, sorry I wasn't clear. So, I've landed clean ups to
> > remove weak symbols like:
> > https://lore.kernel.org/lkml/20250724163302.596743-21-irogers@google.com/#t
> >
> > For these patches what I'm imagining is that there is a Nova Lake
> > generated perf.data file. Using perf report, script, etc. on the Nova
> > Lake should expose all of the same mask, qword, etc. values as when
the perf.data was generated and so things will work. If the perf.data
file was taken to, say, my Alderlake, then what would happen? Generally
> > using the arch directory and weak symbols is a code smell that cross
> > platform things are going to break - there should be sufficient data
> > in the event and the perf_event_attr to fully decode what's going on.
> > Sometimes tying things to a PMU name can avoid the use of the arch
> > directory. We were able to avoid the arch directory to a good extent
> > for the TPEBS code, even though it is a very modern Intel feature.
>
> I see.
>
> But the sampling support for SIMD registers is different from the sample
> weight processing in the patch
> https://lore.kernel.org/lkml/20250724163302.596743-21-irogers@google.com/#t.
> Each arch may support different kinds of SIMD registers, and furthermore
> each kind of SIMD register may have a different register count and register
> width. It's quite hard to come up with common functions or fields to
> represent the names and attributes of these arch-specific SIMD registers.
> This arch-specific information can only be provided by the arch-specific
> code, so it looks like the __weak functions are still the easiest way to
> implement this.
>
> I don't think perf.data parsing would break when moving from one platform
> to another (same arch), e.g., from Nova Lake to Alder Lake. To indicate
> the presence of SIMD registers in the record data, a new ABI flag
> "PERF_SAMPLE_REGS_ABI_SIMD" is introduced. If the perf tool on the second
> platform is new enough to recognize this new flag, the SIMD register data
> is parsed correctly. Even if the perf tool is old and has no SIMD register
> support, the SIMD register data is just silently ignored and should not
> break the parsing.

That's good to know. I'm confused then why these functions can't just
be within the arch directory? For example, we don't expose the
intel-pt PMU code in the common code except for the parsing parts. A
lot of that is handled by the default perf_event_attr initialization
that every PMU can have its own variant of:
https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/pmu.h?h=perf-tools-next#n123

Perhaps this is all just evidence of tech debt in the perf_regs.c code
:-/ The bit that's relevant to the patch here is that I think this is
adding to the tech debt problem as 11 more functions are added to
perf_regs.h.

Thanks,
Ian

> >
> > Thanks,
> > Ian
> >
> >
> >
> >>> Thanks,
> >>> Ian
> >>>
> >>>> +
> >>>>  const char *perf_reg_name(int id, const char *arch)
> >>>>  {
> >>>>         const char *reg_name = NULL;
> >>>> diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
> >>>> index f2d0736d65cc..bce9c4cfd1bf 100644
> >>>> --- a/tools/perf/util/perf_regs.h
> >>>> +++ b/tools/perf/util/perf_regs.h
> >>>> @@ -24,9 +24,20 @@ enum {
> >>>>  };
> >>>>
> >>>>  int arch_sdt_arg_parse_op(char *old_op, char **new_op);
> >>>> +bool arch_has_simd_regs(u64 mask);
> >>>>  uint64_t arch__intr_reg_mask(void);
> >>>>  uint64_t arch__user_reg_mask(void);
> >>>>  const struct sample_reg *arch__sample_reg_masks(void);
> >>>> +const struct sample_reg *arch__sample_simd_reg_masks(void);
> >>>> +const struct sample_reg *arch__sample_pred_reg_masks(void);
> >>>> +uint64_t arch__intr_simd_reg_mask(void);
> >>>> +uint64_t arch__user_simd_reg_mask(void);
> >>>> +uint64_t arch__intr_pred_reg_mask(void);
> >>>> +uint64_t arch__user_pred_reg_mask(void);
> >>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>
> >>>>  const char *perf_reg_name(int id, const char *arch);
> >>>>  int perf_reg_value(u64 *valp, struct regs_dump *regs, int id);
> >>>> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
> >>>> index ea3a6c4657ee..825ffb4cc53f 100644
> >>>> --- a/tools/perf/util/record.h
> >>>> +++ b/tools/perf/util/record.h
> >>>> @@ -59,7 +59,13 @@ struct record_opts {
> >>>>         unsigned int  user_freq;
> >>>>         u64           branch_stack;
> >>>>         u64           sample_intr_regs;
> >>>> +       u64           sample_intr_vec_regs;
> >>>>         u64           sample_user_regs;
> >>>> +       u64           sample_user_vec_regs;
> >>>> +       u16           sample_pred_regs_qwords;
> >>>> +       u16           sample_vec_regs_qwords;
> >>>> +       u16           sample_intr_pred_regs;
> >>>> +       u16           sample_user_pred_regs;
> >>>>         u64           default_interval;
> >>>>         u64           user_interval;
> >>>>         size_t        auxtrace_snapshot_size;
> >>>> --
> >>>> 2.34.1
> >>>>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 06/19] perf/x86: Add support for XMM registers in non-PEBS and REGS_USER
  2025-12-04 15:17   ` Peter Zijlstra
  2025-12-04 15:47     ` Peter Zijlstra
@ 2025-12-04 18:59     ` Dave Hansen
  2025-12-05  8:42       ` Peter Zijlstra
  1 sibling, 1 reply; 86+ messages in thread
From: Dave Hansen @ 2025-12-04 18:59 UTC (permalink / raw)
  To: Peter Zijlstra, Dapeng Mi
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang

[-- Attachment #1: Type: text/plain, Size: 674 bytes --]

On 12/4/25 07:17, Peter Zijlstra wrote:
>> - Additionally, checking the TIF_NEED_FPU_LOAD flag alone is insufficient.
>>   Some corner cases, such as an NMI occurring just after the flag switches
>>   but still in kernel mode, cannot be handled.
> Urgh.. Dave, Thomas, is there any reason we could not set
> TIF_NEED_FPU_LOAD *after* doing the XSAVE (clearing is already done
> after restore).
> 
> That way, when an NMI sees TIF_NEED_FPU_LOAD it knows the task copy is
> consistent.

Something like the attached patch?

I think that would be just fine. save_fpregs_to_fpstate() doesn't
actually change the need for TIF_NEED_FPU_LOAD, so I don't think the
ordering matters.

[-- Attachment #2: tif-after-xsave.patch --]
[-- Type: text/x-patch, Size: 634 bytes --]

diff --git a/arch/x86/include/asm/fpu/sched.h b/arch/x86/include/asm/fpu/sched.h
index 89004f4ca208..2d57a7bf5406 100644
--- a/arch/x86/include/asm/fpu/sched.h
+++ b/arch/x86/include/asm/fpu/sched.h
@@ -36,8 +36,8 @@ static inline void switch_fpu(struct task_struct *old, int cpu)
 	    !(old->flags & (PF_KTHREAD | PF_USER_WORKER))) {
 		struct fpu *old_fpu = x86_task_fpu(old);
 
-		set_tsk_thread_flag(old, TIF_NEED_FPU_LOAD);
 		save_fpregs_to_fpstate(old_fpu);
+		set_tsk_thread_flag(old, TIF_NEED_FPU_LOAD);
 		/*
 		 * The save operation preserved register state, so the
 		 * fpu_fpregs_owner_ctx is still @old_fpu. Store the

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
  2025-12-04 16:16           ` Ian Rogers
@ 2025-12-05  4:00             ` Mi, Dapeng
  2025-12-05  6:38               ` Ian Rogers
  0 siblings, 1 reply; 86+ messages in thread
From: Mi, Dapeng @ 2025-12-05  4:00 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang


On 12/5/2025 12:16 AM, Ian Rogers wrote:
> On Thu, Dec 4, 2025 at 1:20 AM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>
>> On 12/4/2025 3:49 PM, Ian Rogers wrote:
>>> On Wed, Dec 3, 2025 at 6:58 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>>> On 12/4/2025 8:17 AM, Ian Rogers wrote:
>>>>> On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
>>>>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>>>>
>>>>>> This patch adds support for the newly introduced SIMD register sampling
>>>>>> format by adding the following functions:
>>>>>>
>>>>>> uint64_t arch__intr_simd_reg_mask(void);
>>>>>> uint64_t arch__user_simd_reg_mask(void);
>>>>>> uint64_t arch__intr_pred_reg_mask(void);
>>>>>> uint64_t arch__user_pred_reg_mask(void);
>>>>>> uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>> uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>> uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>> uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>
>>>>>> The arch__{intr|user}_simd_reg_mask() functions retrieve the bitmap of
>>>>>> supported SIMD registers, such as XMM/YMM/ZMM on x86 platforms.
>>>>>>
>>>>>> The arch__{intr|user}_pred_reg_mask() functions retrieve the bitmap of
>>>>>> supported PRED registers, such as OPMASK on x86 platforms.
>>>>>>
>>>>>> The arch__{intr|user}_simd_reg_bitmap_qwords() functions provide the
>>>>>> exact bitmap and number of qwords for a specific type of SIMD register.
>>>>>> For example, for XMM registers on x86 platforms, the returned bitmap is
>>>>>> 0xffff (XMM0 ~ XMM15) and the qwords number is 2 (128 bits for each XMM).
>>>>>>
>>>>>> The arch__{intr|user}_pred_reg_bitmap_qwords() functions provide the
>>>>>> exact bitmap and number of qwords for a specific type of PRED register.
>>>>>> For example, for OPMASK registers on x86 platforms, the returned bitmap
>>>>>> is 0xff (OPMASK0 ~ OPMASK7) and the qwords number is 1 (64 bits for each
>>>>>> OPMASK).
>>>>>>
>>>>>> Additionally, the function __parse_regs() is enhanced to support parsing
>>>>>> these newly introduced SIMD registers. Currently, each type of register
>>>>>> can only be sampled collectively; sampling a specific SIMD register is
>>>>>> not supported. For example, all XMM registers are sampled together rather
>>>>>> than sampling only XMM0.
>>>>>>
>>>>>> When multiple overlapping register types, such as XMM and YMM, are
>>>>>> sampled simultaneously, only the superset (YMM registers) is sampled.
>>>>>>
>>>>>> With this patch, all supported sampling registers on x86 platforms are
>>>>>> displayed as follows.
>>>>>>
>>>>>>  $perf record -I?
>>>>>>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>>>>>>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>>>>>>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>>>>>>
>>>>>>  $perf record --user-regs=?
>>>>>>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>>>>>>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>>>>>>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>>>>>>
>>>>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>>>>>> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>>>>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>>>>> ---
>>>>>>  tools/perf/arch/x86/util/perf_regs.c      | 470 +++++++++++++++++++++-
>>>>>>  tools/perf/util/evsel.c                   |  27 ++
>>>>>>  tools/perf/util/parse-regs-options.c      | 151 ++++++-
>>>>>>  tools/perf/util/perf_event_attr_fprintf.c |   6 +
>>>>>>  tools/perf/util/perf_regs.c               |  59 +++
>>>>>>  tools/perf/util/perf_regs.h               |  11 +
>>>>>>  tools/perf/util/record.h                  |   6 +
>>>>>>  7 files changed, 714 insertions(+), 16 deletions(-)
>>>>>>
>>>>>> diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
>>>>>> index 12fd93f04802..db41430f3b07 100644
>>>>>> --- a/tools/perf/arch/x86/util/perf_regs.c
>>>>>> +++ b/tools/perf/arch/x86/util/perf_regs.c
>>>>>> @@ -13,6 +13,49 @@
>>>>>>  #include "../../../util/pmu.h"
>>>>>>  #include "../../../util/pmus.h"
>>>>>>
>>>>>> +static const struct sample_reg sample_reg_masks_ext[] = {
>>>>>> +       SMPL_REG(AX, PERF_REG_X86_AX),
>>>>>> +       SMPL_REG(BX, PERF_REG_X86_BX),
>>>>>> +       SMPL_REG(CX, PERF_REG_X86_CX),
>>>>>> +       SMPL_REG(DX, PERF_REG_X86_DX),
>>>>>> +       SMPL_REG(SI, PERF_REG_X86_SI),
>>>>>> +       SMPL_REG(DI, PERF_REG_X86_DI),
>>>>>> +       SMPL_REG(BP, PERF_REG_X86_BP),
>>>>>> +       SMPL_REG(SP, PERF_REG_X86_SP),
>>>>>> +       SMPL_REG(IP, PERF_REG_X86_IP),
>>>>>> +       SMPL_REG(FLAGS, PERF_REG_X86_FLAGS),
>>>>>> +       SMPL_REG(CS, PERF_REG_X86_CS),
>>>>>> +       SMPL_REG(SS, PERF_REG_X86_SS),
>>>>>> +#ifdef HAVE_ARCH_X86_64_SUPPORT
>>>>>> +       SMPL_REG(R8, PERF_REG_X86_R8),
>>>>>> +       SMPL_REG(R9, PERF_REG_X86_R9),
>>>>>> +       SMPL_REG(R10, PERF_REG_X86_R10),
>>>>>> +       SMPL_REG(R11, PERF_REG_X86_R11),
>>>>>> +       SMPL_REG(R12, PERF_REG_X86_R12),
>>>>>> +       SMPL_REG(R13, PERF_REG_X86_R13),
>>>>>> +       SMPL_REG(R14, PERF_REG_X86_R14),
>>>>>> +       SMPL_REG(R15, PERF_REG_X86_R15),
>>>>>> +       SMPL_REG(R16, PERF_REG_X86_R16),
>>>>>> +       SMPL_REG(R17, PERF_REG_X86_R17),
>>>>>> +       SMPL_REG(R18, PERF_REG_X86_R18),
>>>>>> +       SMPL_REG(R19, PERF_REG_X86_R19),
>>>>>> +       SMPL_REG(R20, PERF_REG_X86_R20),
>>>>>> +       SMPL_REG(R21, PERF_REG_X86_R21),
>>>>>> +       SMPL_REG(R22, PERF_REG_X86_R22),
>>>>>> +       SMPL_REG(R23, PERF_REG_X86_R23),
>>>>>> +       SMPL_REG(R24, PERF_REG_X86_R24),
>>>>>> +       SMPL_REG(R25, PERF_REG_X86_R25),
>>>>>> +       SMPL_REG(R26, PERF_REG_X86_R26),
>>>>>> +       SMPL_REG(R27, PERF_REG_X86_R27),
>>>>>> +       SMPL_REG(R28, PERF_REG_X86_R28),
>>>>>> +       SMPL_REG(R29, PERF_REG_X86_R29),
>>>>>> +       SMPL_REG(R30, PERF_REG_X86_R30),
>>>>>> +       SMPL_REG(R31, PERF_REG_X86_R31),
>>>>>> +       SMPL_REG(SSP, PERF_REG_X86_SSP),
>>>>>> +#endif
>>>>>> +       SMPL_REG_END
>>>>>> +};
>>>>>> +
>>>>>>  static const struct sample_reg sample_reg_masks[] = {
>>>>>>         SMPL_REG(AX, PERF_REG_X86_AX),
>>>>>>         SMPL_REG(BX, PERF_REG_X86_BX),
>>>>>> @@ -276,27 +319,404 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
>>>>>>         return SDT_ARG_VALID;
>>>>>>  }
>>>>>>
>>>>>> +static bool support_simd_reg(u64 sample_type, u16 qwords, u64 mask, bool pred)
>>>>> To make the code easier to read, it'd be nice to document sample_type,
>>>>> qwords and mask here.
>>>> Sure.
>>>>
>>>>
>>>>>> +{
>>>>>> +       struct perf_event_attr attr = {
>>>>>> +               .type                           = PERF_TYPE_HARDWARE,
>>>>>> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
>>>>>> +               .sample_type                    = sample_type,
>>>>>> +               .disabled                       = 1,
>>>>>> +               .exclude_kernel                 = 1,
>>>>>> +               .sample_simd_regs_enabled       = 1,
>>>>>> +       };
>>>>>> +       int fd;
>>>>>> +
>>>>>> +       attr.sample_period = 1;
>>>>>> +
>>>>>> +       if (!pred) {
>>>>>> +               attr.sample_simd_vec_reg_qwords = qwords;
>>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>> +                       attr.sample_simd_vec_reg_intr = mask;
>>>>>> +               else
>>>>>> +                       attr.sample_simd_vec_reg_user = mask;
>>>>>> +       } else {
>>>>>> +               attr.sample_simd_pred_reg_qwords = PERF_X86_OPMASK_QWORDS;
>>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>> +                       attr.sample_simd_pred_reg_intr = PERF_X86_SIMD_PRED_MASK;
>>>>>> +               else
>>>>>> +                       attr.sample_simd_pred_reg_user = PERF_X86_SIMD_PRED_MASK;
>>>>>> +       }
>>>>>> +
>>>>>> +       if (perf_pmus__num_core_pmus() > 1) {
>>>>>> +               struct perf_pmu *pmu = NULL;
>>>>>> +               __u64 type = PERF_TYPE_RAW;
>>>>> It should be okay to do:
>>>>> __u64 type = perf_pmus__find_core_pmu()->type
>>>>> rather than have the whole loop below.
>>>> Sure. Thanks.
>>>>
>>>>
>>>>>> +
>>>>>> +               /*
>>>>>> +                * The same register set is supported among different hybrid PMUs.
>>>>>> +                * Only check the first available one.
>>>>>> +                */
>>>>>> +               while ((pmu = perf_pmus__scan_core(pmu)) != NULL) {
>>>>>> +                       type = pmu->type;
>>>>>> +                       break;
>>>>>> +               }
>>>>>> +               attr.config |= type << PERF_PMU_TYPE_SHIFT;
>>>>>> +       }
>>>>>> +
>>>>>> +       event_attr_init(&attr);
>>>>>> +
>>>>>> +       fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
>>>>>> +       if (fd != -1) {
>>>>>> +               close(fd);
>>>>>> +               return true;
>>>>>> +       }
>>>>>> +
>>>>>> +       return false;
>>>>>> +}
>>>>>> +
>>>>>> +static bool __arch_simd_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
>>>>>> +{
>>>>>> +       bool supported = false;
>>>>>> +       u64 bits;
>>>>>> +
>>>>>> +       *mask = 0;
>>>>>> +       *qwords = 0;
>>>>>> +
>>>>>> +       switch (reg) {
>>>>>> +       case PERF_REG_X86_XMM:
>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_XMM_QWORDS, bits, false);
>>>>>> +               if (supported) {
>>>>>> +                       *mask = bits;
>>>>>> +                       *qwords = PERF_X86_XMM_QWORDS;
>>>>>> +               }
>>>>>> +               break;
>>>>>> +       case PERF_REG_X86_YMM:
>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_YMM_REGS) - 1;
>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_YMM_QWORDS, bits, false);
>>>>>> +               if (supported) {
>>>>>> +                       *mask = bits;
>>>>>> +                       *qwords = PERF_X86_YMM_QWORDS;
>>>>>> +               }
>>>>>> +               break;
>>>>>> +       case PERF_REG_X86_ZMM:
>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_ZMM_REGS) - 1;
>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
>>>>>> +               if (supported) {
>>>>>> +                       *mask = bits;
>>>>>> +                       *qwords = PERF_X86_ZMM_QWORDS;
>>>>>> +                       break;
>>>>>> +               }
>>>>>> +
>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_ZMMH_REGS) - 1;
>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
>>>>>> +               if (supported) {
>>>>>> +                       *mask = bits;
>>>>>> +                       *qwords = PERF_X86_ZMMH_QWORDS;
>>>>>> +               }
>>>>>> +               break;
>>>>>> +       default:
>>>>>> +               break;
>>>>>> +       }
>>>>>> +
>>>>>> +       return supported;
>>>>>> +}
>>>>>> +
>>>>>> +static bool __arch_pred_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
>>>>>> +{
>>>>>> +       bool supported = false;
>>>>>> +       u64 bits;
>>>>>> +
>>>>>> +       *mask = 0;
>>>>>> +       *qwords = 0;
>>>>>> +
>>>>>> +       switch (reg) {
>>>>>> +       case PERF_REG_X86_OPMASK:
>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_OPMASK_REGS) - 1;
>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_OPMASK_QWORDS, bits, true);
>>>>>> +               if (supported) {
>>>>>> +                       *mask = bits;
>>>>>> +                       *qwords = PERF_X86_OPMASK_QWORDS;
>>>>>> +               }
>>>>>> +               break;
>>>>>> +       default:
>>>>>> +               break;
>>>>>> +       }
>>>>>> +
>>>>>> +       return supported;
>>>>>> +}
>>>>>> +
>>>>>> +static bool has_cap_simd_regs(void)
>>>>>> +{
>>>>>> +       uint64_t mask = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
>>>>>> +       u16 qwords = PERF_X86_XMM_QWORDS;
>>>>>> +       static bool has_cap_simd_regs;
>>>>>> +       static bool cached;
>>>>>> +
>>>>>> +       if (cached)
>>>>>> +               return has_cap_simd_regs;
>>>>>> +
>>>>>> +       has_cap_simd_regs = __arch_simd_reg_mask(PERF_SAMPLE_REGS_INTR,
>>>>>> +                                                PERF_REG_X86_XMM, &mask, &qwords);
>>>>>> +       has_cap_simd_regs |= __arch_simd_reg_mask(PERF_SAMPLE_REGS_USER,
>>>>>> +                                                PERF_REG_X86_XMM, &mask, &qwords);
>>>>>> +       cached = true;
>>>>>> +
>>>>>> +       return has_cap_simd_regs;
>>>>>> +}
>>>>>> +
>>>>>> +bool arch_has_simd_regs(u64 mask)
>>>>>> +{
>>>>>> +       return has_cap_simd_regs() &&
>>>>>> +              mask & GENMASK_ULL(PERF_REG_X86_SSP, PERF_REG_X86_R16);
>>>>>> +}
>>>>>> +
>>>>>> +static const struct sample_reg sample_simd_reg_masks[] = {
>>>>>> +       SMPL_REG(XMM, PERF_REG_X86_XMM),
>>>>>> +       SMPL_REG(YMM, PERF_REG_X86_YMM),
>>>>>> +       SMPL_REG(ZMM, PERF_REG_X86_ZMM),
>>>>>> +       SMPL_REG_END
>>>>>> +};
>>>>>> +
>>>>>> +static const struct sample_reg sample_pred_reg_masks[] = {
>>>>>> +       SMPL_REG(OPMASK, PERF_REG_X86_OPMASK),
>>>>>> +       SMPL_REG_END
>>>>>> +};
>>>>>> +
>>>>>> +const struct sample_reg *arch__sample_simd_reg_masks(void)
>>>>>> +{
>>>>>> +       return sample_simd_reg_masks;
>>>>>> +}
>>>>>> +
>>>>>> +const struct sample_reg *arch__sample_pred_reg_masks(void)
>>>>>> +{
>>>>>> +       return sample_pred_reg_masks;
>>>>>> +}
>>>>>> +
>>>>>> +static bool x86_intr_simd_updated;
>>>>>> +static u64 x86_intr_simd_reg_mask;
>>>>>> +static u64 x86_intr_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
>>>>>> +static u16 x86_intr_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
>>>>> Could we add some comments? I can kind of figure out the updated is a
>>>>> check for lazy initialization and what masks are, qwords is an odd
>>>>> one. The comment could also point out that SIMD doesn't mean the
>>>>> machine supports SIMD, but SIMD registers are supported in perf
>>>>> events.
>>>> Sure.
>>>>
>>>>
>>>>>> +static bool x86_user_simd_updated;
>>>>>> +static u64 x86_user_simd_reg_mask;
>>>>>> +static u64 x86_user_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
>>>>>> +static u16 x86_user_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
>>>>>> +
>>>>>> +static bool x86_intr_pred_updated;
>>>>>> +static u64 x86_intr_pred_reg_mask;
>>>>>> +static u64 x86_intr_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
>>>>>> +static u16 x86_intr_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
>>>>>> +static bool x86_user_pred_updated;
>>>>>> +static u64 x86_user_pred_reg_mask;
>>>>>> +static u64 x86_user_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
>>>>>> +static u16 x86_user_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
>>>>>> +
>>>>>> +static uint64_t __arch__simd_reg_mask(u64 sample_type)
>>>>>> +{
>>>>>> +       const struct sample_reg *r = NULL;
>>>>>> +       bool supported;
>>>>>> +       u64 mask = 0;
>>>>>> +       int reg;
>>>>>> +
>>>>>> +       if (!has_cap_simd_regs())
>>>>>> +               return 0;
>>>>>> +
>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_simd_updated)
>>>>>> +               return x86_intr_simd_reg_mask;
>>>>>> +
>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_simd_updated)
>>>>>> +               return x86_user_simd_reg_mask;
>>>>>> +
>>>>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>>>>>> +               supported = false;
>>>>>> +
>>>>>> +               if (!r->mask)
>>>>>> +                       continue;
>>>>>> +               reg = fls64(r->mask) - 1;
>>>>>> +
>>>>>> +               if (reg >= PERF_REG_X86_MAX_SIMD_REGS)
>>>>>> +                       break;
>>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>> +                       supported = __arch_simd_reg_mask(sample_type, reg,
>>>>>> +                                                        &x86_intr_simd_mask[reg],
>>>>>> +                                                        &x86_intr_simd_qwords[reg]);
>>>>>> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
>>>>>> +                       supported = __arch_simd_reg_mask(sample_type, reg,
>>>>>> +                                                        &x86_user_simd_mask[reg],
>>>>>> +                                                        &x86_user_simd_qwords[reg]);
>>>>>> +               if (supported)
>>>>>> +                       mask |= BIT_ULL(reg);
>>>>>> +       }
>>>>>> +
>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
>>>>>> +               x86_intr_simd_reg_mask = mask;
>>>>>> +               x86_intr_simd_updated = true;
>>>>>> +       } else {
>>>>>> +               x86_user_simd_reg_mask = mask;
>>>>>> +               x86_user_simd_updated = true;
>>>>>> +       }
>>>>>> +
>>>>>> +       return mask;
>>>>>> +}
>>>>>> +
>>>>>> +static uint64_t __arch__pred_reg_mask(u64 sample_type)
>>>>>> +{
>>>>>> +       const struct sample_reg *r = NULL;
>>>>>> +       bool supported;
>>>>>> +       u64 mask = 0;
>>>>>> +       int reg;
>>>>>> +
>>>>>> +       if (!has_cap_simd_regs())
>>>>>> +               return 0;
>>>>>> +
>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_pred_updated)
>>>>>> +               return x86_intr_pred_reg_mask;
>>>>>> +
>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_pred_updated)
>>>>>> +               return x86_user_pred_reg_mask;
>>>>>> +
>>>>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>>>>>> +               supported = false;
>>>>>> +
>>>>>> +               if (!r->mask)
>>>>>> +                       continue;
>>>>>> +               reg = fls64(r->mask) - 1;
>>>>>> +
>>>>>> +               if (reg >= PERF_REG_X86_MAX_PRED_REGS)
>>>>>> +                       break;
>>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>> +                       supported = __arch_pred_reg_mask(sample_type, reg,
>>>>>> +                                                        &x86_intr_pred_mask[reg],
>>>>>> +                                                        &x86_intr_pred_qwords[reg]);
>>>>>> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
>>>>>> +                       supported = __arch_pred_reg_mask(sample_type, reg,
>>>>>> +                                                        &x86_user_pred_mask[reg],
>>>>>> +                                                        &x86_user_pred_qwords[reg]);
>>>>>> +               if (supported)
>>>>>> +                       mask |= BIT_ULL(reg);
>>>>>> +       }
>>>>>> +
>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
>>>>>> +               x86_intr_pred_reg_mask = mask;
>>>>>> +               x86_intr_pred_updated = true;
>>>>>> +       } else {
>>>>>> +               x86_user_pred_reg_mask = mask;
>>>>>> +               x86_user_pred_updated = true;
>>>>>> +       }
>>>>>> +
>>>>>> +       return mask;
>>>>>> +}
>>>>> This feels repetitive with __arch__simd_reg_mask, could they be
>>>>> refactored together?
>>>> hmm, it looks like we can extract the for loop into a common function.
>>>> The other parts are hard to generalize since they manipulate different
>>>> variables. If we tried to generalize them, we would have to introduce
>>>> lots of "if ... else" branches, which would make the code hard to read.
>>>>
>>>>
>>>>>> +
>>>>>> +uint64_t arch__intr_simd_reg_mask(void)
>>>>>> +{
>>>>>> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_INTR);
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t arch__user_simd_reg_mask(void)
>>>>>> +{
>>>>>> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_USER);
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t arch__intr_pred_reg_mask(void)
>>>>>> +{
>>>>>> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_INTR);
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t arch__user_pred_reg_mask(void)
>>>>>> +{
>>>>>> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_USER);
>>>>>> +}
>>>>>> +
>>>>>> +static uint64_t arch__simd_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
>>>>>> +{
>>>>>> +       uint64_t mask = 0;
>>>>>> +
>>>>>> +       *qwords = 0;
>>>>>> +       if (reg < PERF_REG_X86_MAX_SIMD_REGS) {
>>>>>> +               if (intr) {
>>>>>> +                       *qwords = x86_intr_simd_qwords[reg];
>>>>>> +                       mask = x86_intr_simd_mask[reg];
>>>>>> +               } else {
>>>>>> +                       *qwords = x86_user_simd_qwords[reg];
>>>>>> +                       mask = x86_user_simd_mask[reg];
>>>>>> +               }
>>>>>> +       }
>>>>>> +
>>>>>> +       return mask;
>>>>>> +}
>>>>>> +
>>>>>> +static uint64_t arch__pred_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
>>>>>> +{
>>>>>> +       uint64_t mask = 0;
>>>>>> +
>>>>>> +       *qwords = 0;
>>>>>> +       if (reg < PERF_REG_X86_MAX_PRED_REGS) {
>>>>>> +               if (intr) {
>>>>>> +                       *qwords = x86_intr_pred_qwords[reg];
>>>>>> +                       mask = x86_intr_pred_mask[reg];
>>>>>> +               } else {
>>>>>> +                       *qwords = x86_user_pred_qwords[reg];
>>>>>> +                       mask = x86_user_pred_mask[reg];
>>>>>> +               }
>>>>>> +       }
>>>>>> +
>>>>>> +       return mask;
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>> +{
>>>>>> +       if (!x86_intr_simd_updated)
>>>>>> +               arch__intr_simd_reg_mask();
>>>>>> +       return arch__simd_reg_bitmap_qwords(reg, qwords, true);
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>> +{
>>>>>> +       if (!x86_user_simd_updated)
>>>>>> +               arch__user_simd_reg_mask();
>>>>>> +       return arch__simd_reg_bitmap_qwords(reg, qwords, false);
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>> +{
>>>>>> +       if (!x86_intr_pred_updated)
>>>>>> +               arch__intr_pred_reg_mask();
>>>>>> +       return arch__pred_reg_bitmap_qwords(reg, qwords, true);
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>> +{
>>>>>> +       if (!x86_user_pred_updated)
>>>>>> +               arch__user_pred_reg_mask();
>>>>>> +       return arch__pred_reg_bitmap_qwords(reg, qwords, false);
>>>>>> +}
>>>>>> +
>>>>>>  const struct sample_reg *arch__sample_reg_masks(void)
>>>>>>  {
>>>>>> +       if (has_cap_simd_regs())
>>>>>> +               return sample_reg_masks_ext;
>>>>>>         return sample_reg_masks;
>>>>>>  }
>>>>>>
>>>>>> -uint64_t arch__intr_reg_mask(void)
>>>>>> +static uint64_t __arch__reg_mask(u64 sample_type, u64 mask, bool has_simd_regs)
>>>>>>  {
>>>>>>         struct perf_event_attr attr = {
>>>>>> -               .type                   = PERF_TYPE_HARDWARE,
>>>>>> -               .config                 = PERF_COUNT_HW_CPU_CYCLES,
>>>>>> -               .sample_type            = PERF_SAMPLE_REGS_INTR,
>>>>>> -               .sample_regs_intr       = PERF_REG_EXTENDED_MASK,
>>>>>> -               .precise_ip             = 1,
>>>>>> -               .disabled               = 1,
>>>>>> -               .exclude_kernel         = 1,
>>>>>> +               .type                           = PERF_TYPE_HARDWARE,
>>>>>> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
>>>>>> +               .sample_type                    = sample_type,
>>>>>> +               .precise_ip                     = 1,
>>>>>> +               .disabled                       = 1,
>>>>>> +               .exclude_kernel                 = 1,
>>>>>> +               .sample_simd_regs_enabled       = has_simd_regs,
>>>>>>         };
>>>>>>         int fd;
>>>>>>         /*
>>>>>>          * In an unnamed union, init it here to build on older gcc versions
>>>>>>          */
>>>>>>         attr.sample_period = 1;
>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>> +               attr.sample_regs_intr = mask;
>>>>>> +       else
>>>>>> +               attr.sample_regs_user = mask;
>>>>>>
>>>>>>         if (perf_pmus__num_core_pmus() > 1) {
>>>>>>                 struct perf_pmu *pmu = NULL;
>>>>>> @@ -318,13 +738,41 @@ uint64_t arch__intr_reg_mask(void)
>>>>>>         fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
>>>>>>         if (fd != -1) {
>>>>>>                 close(fd);
>>>>>> -               return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
>>>>>> +               return mask;
>>>>>>         }
>>>>>>
>>>>>> -       return PERF_REGS_MASK;
>>>>>> +       return 0;
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t arch__intr_reg_mask(void)
>>>>>> +{
>>>>>> +       uint64_t mask = PERF_REGS_MASK;
>>>>>> +
>>>>>> +       if (has_cap_simd_regs()) {
>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
>>>>>> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
>>>>>> +                                        true);
>>>>> It's nice to label constant arguments like this something like:
>>>>> /*has_simd_regs=*/true);
>>>>>
>>>>> Tools like clang-tidy even try to enforce the argument names match the comments.
>>>> Sure.
>>>>
>>>>
>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
>>>>>> +                                        BIT_ULL(PERF_REG_X86_SSP),
>>>>>> +                                        true);
>>>>>> +       } else
>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR, PERF_REG_EXTENDED_MASK, false);
>>>>>> +
>>>>>> +       return mask;
>>>>>>  }
>>>>>>
>>>>>>  uint64_t arch__user_reg_mask(void)
>>>>>>  {
>>>>>> -       return PERF_REGS_MASK;
>>>>>> +       uint64_t mask = PERF_REGS_MASK;
>>>>>> +
>>>>>> +       if (has_cap_simd_regs()) {
>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
>>>>>> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
>>>>>> +                                        true);
>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
>>>>>> +                                        BIT_ULL(PERF_REG_X86_SSP),
>>>>>> +                                        true);
>>>>>> +       }
>>>>>> +
>>>>>> +       return mask;
>>>>> The code is repetitive here, could we refactor into a single function
>>>>> passing in a user or instr value?
>>>> Sure. Would extract the common part.
>>>>
>>>>
>>>>>>  }
>>>>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>>>>>> index 56ebefd075f2..5d1d90cf9488 100644
>>>>>> --- a/tools/perf/util/evsel.c
>>>>>> +++ b/tools/perf/util/evsel.c
>>>>>> @@ -1461,12 +1461,39 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
>>>>>>         if (opts->sample_intr_regs && !evsel->no_aux_samples &&
>>>>>>             !evsel__is_dummy_event(evsel)) {
>>>>>>                 attr->sample_regs_intr = opts->sample_intr_regs;
>>>>>> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_intr);
>>>>>> +               evsel__set_sample_bit(evsel, REGS_INTR);
>>>>>> +       }
>>>>>> +
>>>>>> +       if ((opts->sample_intr_vec_regs || opts->sample_intr_pred_regs) &&
>>>>>> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
>>>>>> +               /* A non-zero pred qwords implies the SIMD register set is used */
>>>>>> +               if (opts->sample_pred_regs_qwords)
>>>>>> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
>>>>>> +               else
>>>>>> +                       attr->sample_simd_pred_reg_qwords = 1;
>>>>>> +               attr->sample_simd_vec_reg_intr = opts->sample_intr_vec_regs;
>>>>>> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
>>>>>> +               attr->sample_simd_pred_reg_intr = opts->sample_intr_pred_regs;
>>>>>>                 evsel__set_sample_bit(evsel, REGS_INTR);
>>>>>>         }
>>>>>>
>>>>>>         if (opts->sample_user_regs && !evsel->no_aux_samples &&
>>>>>>             !evsel__is_dummy_event(evsel)) {
>>>>>>                 attr->sample_regs_user |= opts->sample_user_regs;
>>>>>> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_user);
>>>>>> +               evsel__set_sample_bit(evsel, REGS_USER);
>>>>>> +       }
>>>>>> +
>>>>>> +       if ((opts->sample_user_vec_regs || opts->sample_user_pred_regs) &&
>>>>>> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
>>>>>> +               if (opts->sample_pred_regs_qwords)
>>>>>> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
>>>>>> +               else
>>>>>> +                       attr->sample_simd_pred_reg_qwords = 1;
>>>>>> +               attr->sample_simd_vec_reg_user = opts->sample_user_vec_regs;
>>>>>> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
>>>>>> +               attr->sample_simd_pred_reg_user = opts->sample_user_pred_regs;
>>>>>>                 evsel__set_sample_bit(evsel, REGS_USER);
>>>>>>         }
>>>>>>
>>>>>> diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
>>>>>> index cda1c620968e..0bd100392889 100644
>>>>>> --- a/tools/perf/util/parse-regs-options.c
>>>>>> +++ b/tools/perf/util/parse-regs-options.c
>>>>>> @@ -4,19 +4,139 @@
>>>>>>  #include <stdint.h>
>>>>>>  #include <string.h>
>>>>>>  #include <stdio.h>
>>>>>> +#include <linux/bitops.h>
>>>>>>  #include "util/debug.h"
>>>>>>  #include <subcmd/parse-options.h>
>>>>>>  #include "util/perf_regs.h"
>>>>>>  #include "util/parse-regs-options.h"
>>>>>> +#include "record.h"
>>>>>> +
>>>>>> +static void __print_simd_regs(bool intr, uint64_t simd_mask)
>>>>>> +{
>>>>>> +       const struct sample_reg *r = NULL;
>>>>>> +       uint64_t bitmap = 0;
>>>>>> +       u16 qwords = 0;
>>>>>> +       int reg_idx;
>>>>>> +
>>>>>> +       if (!simd_mask)
>>>>>> +               return;
>>>>>> +
>>>>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>>>>>> +               if (!(r->mask & simd_mask))
>>>>>> +                       continue;
>>>>>> +               reg_idx = fls64(r->mask) - 1;
>>>>>> +               if (intr)
>>>>>> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>> +               else
>>>>>> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>> +               if (bitmap)
>>>>>> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
>>>>>> +       }
>>>>>> +}
>>>>>> +
>>>>>> +static void __print_pred_regs(bool intr, uint64_t pred_mask)
>>>>>> +{
>>>>>> +       const struct sample_reg *r = NULL;
>>>>>> +       uint64_t bitmap = 0;
>>>>>> +       u16 qwords = 0;
>>>>>> +       int reg_idx;
>>>>>> +
>>>>>> +       if (!pred_mask)
>>>>>> +               return;
>>>>>> +
>>>>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>>>>>> +               if (!(r->mask & pred_mask))
>>>>>> +                       continue;
>>>>>> +               reg_idx = fls64(r->mask) - 1;
>>>>>> +               if (intr)
>>>>>> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>> +               else
>>>>>> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>> +               if (bitmap)
>>>>>> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
>>>>>> +       }
>>>>>> +}
>>>>>> +
>>>>>> +static bool __parse_simd_regs(struct record_opts *opts, char *s, bool intr)
>>>>>> +{
>>>>>> +       const struct sample_reg *r = NULL;
>>>>>> +       bool matched = false;
>>>>>> +       uint64_t bitmap = 0;
>>>>>> +       u16 qwords = 0;
>>>>>> +       int reg_idx;
>>>>>> +
>>>>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>>>>>> +               if (strcasecmp(s, r->name))
>>>>>> +                       continue;
>>>>>> +               if (!fls64(r->mask))
>>>>>> +                       continue;
>>>>>> +               reg_idx = fls64(r->mask) - 1;
>>>>>> +               if (intr)
>>>>>> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>> +               else
>>>>>> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>> +               matched = true;
>>>>>> +               break;
>>>>>> +       }
>>>>>> +
>>>>>> +       /* Just need the highest qwords */
>>>>> I'm not following here. Does the bitmap need to handle gaps?
>>>> Currently no. In theory, the kernel allows user space to sample only a
>>>> subset of the SIMD registers, e.g., 0xff or 0xf0f for the XMM registers
>>>> (HW supports 16 XMM registers), but perf tools don't support that, to
>>>> avoid introducing too much complexity. Moreover, I don't think end
>>>> users have such a requirement. In most cases, users only know which
>>>> kinds of SIMD registers their programs use, but usually don't know or
>>>> care which exact SIMD register is used.
>>>>
>>>>
>>>>>> +       if (qwords > opts->sample_vec_regs_qwords) {
>>>>>> +               opts->sample_vec_regs_qwords = qwords;
>>>>>> +               if (intr)
>>>>>> +                       opts->sample_intr_vec_regs = bitmap;
>>>>>> +               else
>>>>>> +                       opts->sample_user_vec_regs = bitmap;
>>>>>> +       }
>>>>>> +
>>>>>> +       return matched;
>>>>>> +}
>>>>>> +
>>>>>> +static bool __parse_pred_regs(struct record_opts *opts, char *s, bool intr)
>>>>>> +{
>>>>>> +       const struct sample_reg *r = NULL;
>>>>>> +       bool matched = false;
>>>>>> +       uint64_t bitmap = 0;
>>>>>> +       u16 qwords = 0;
>>>>>> +       int reg_idx;
>>>>>> +
>>>>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>>>>>> +               if (strcasecmp(s, r->name))
>>>>>> +                       continue;
>>>>>> +               if (!fls64(r->mask))
>>>>>> +                       continue;
>>>>>> +               reg_idx = fls64(r->mask) - 1;
>>>>>> +               if (intr)
>>>>>> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>> +               else
>>>>>> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>> +               matched = true;
>>>>>> +               break;
>>>>>> +       }
>>>>>> +
>>>>>> +       /* Just need the highest qwords */
>>>>> Again repetitive, could we have a single function?
>>>> Yes, at least the for loop can be extracted into a common function.
>>>>
>>>>
>>>>>> +       if (qwords > opts->sample_pred_regs_qwords) {
>>>>>> +               opts->sample_pred_regs_qwords = qwords;
>>>>>> +               if (intr)
>>>>>> +                       opts->sample_intr_pred_regs = bitmap;
>>>>>> +               else
>>>>>> +                       opts->sample_user_pred_regs = bitmap;
>>>>>> +       }
>>>>>> +
>>>>>> +       return matched;
>>>>>> +}
>>>>>>
>>>>>>  static int
>>>>>>  __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>  {
>>>>>>         uint64_t *mode = (uint64_t *)opt->value;
>>>>>>         const struct sample_reg *r = NULL;
>>>>>> +       struct record_opts *opts;
>>>>>>         char *s, *os = NULL, *p;
>>>>>> -       int ret = -1;
>>>>>> +       bool has_simd_regs = false;
>>>>>>         uint64_t mask;
>>>>>> +       uint64_t simd_mask;
>>>>>> +       uint64_t pred_mask;
>>>>>> +       int ret = -1;
>>>>>>
>>>>>>         if (unset)
>>>>>>                 return 0;
>>>>>> @@ -27,10 +147,17 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>         if (*mode)
>>>>>>                 return -1;
>>>>>>
>>>>>> -       if (intr)
>>>>>> +       if (intr) {
>>>>>> +               opts = container_of(opt->value, struct record_opts, sample_intr_regs);
>>>>>>                 mask = arch__intr_reg_mask();
>>>>>> -       else
>>>>>> +               simd_mask = arch__intr_simd_reg_mask();
>>>>>> +               pred_mask = arch__intr_pred_reg_mask();
>>>>>> +       } else {
>>>>>> +               opts = container_of(opt->value, struct record_opts, sample_user_regs);
>>>>>>                 mask = arch__user_reg_mask();
>>>>>> +               simd_mask = arch__user_simd_reg_mask();
>>>>>> +               pred_mask = arch__user_pred_reg_mask();
>>>>>> +       }
>>>>>>
>>>>>>         /* str may be NULL in case no arg is passed to -I */
>>>>>>         if (str) {
>>>>>> @@ -50,10 +177,24 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>                                         if (r->mask & mask)
>>>>>>                                                 fprintf(stderr, "%s ", r->name);
>>>>>>                                 }
>>>>>> +                               __print_simd_regs(intr, simd_mask);
>>>>>> +                               __print_pred_regs(intr, pred_mask);
>>>>>>                                 fputc('\n', stderr);
>>>>>>                                 /* just printing available regs */
>>>>>>                                 goto error;
>>>>>>                         }
>>>>>> +
>>>>>> +                       if (simd_mask) {
>>>>>> +                               has_simd_regs = __parse_simd_regs(opts, s, intr);
>>>>>> +                               if (has_simd_regs)
>>>>>> +                                       goto next;
>>>>>> +                       }
>>>>>> +                       if (pred_mask) {
>>>>>> +                               has_simd_regs = __parse_pred_regs(opts, s, intr);
>>>>>> +                               if (has_simd_regs)
>>>>>> +                                       goto next;
>>>>>> +                       }
>>>>>> +
>>>>>>                         for (r = arch__sample_reg_masks(); r->name; r++) {
>>>>>>                                 if ((r->mask & mask) && !strcasecmp(s, r->name))
>>>>>>                                         break;
>>>>>> @@ -65,7 +206,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>                         }
>>>>>>
>>>>>>                         *mode |= r->mask;
>>>>>> -
>>>>>> +next:
>>>>>>                         if (!p)
>>>>>>                                 break;
>>>>>>
>>>>>> @@ -75,7 +216,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>         ret = 0;
>>>>>>
>>>>>>         /* default to all possible regs */
>>>>>> -       if (*mode == 0)
>>>>>> +       if (*mode == 0 && !has_simd_regs)
>>>>>>                 *mode = mask;
>>>>>>  error:
>>>>>>         free(os);
>>>>>> diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
>>>>>> index 66b666d9ce64..fb0366d050cf 100644
>>>>>> --- a/tools/perf/util/perf_event_attr_fprintf.c
>>>>>> +++ b/tools/perf/util/perf_event_attr_fprintf.c
>>>>>> @@ -360,6 +360,12 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
>>>>>>         PRINT_ATTRf(aux_start_paused, p_unsigned);
>>>>>>         PRINT_ATTRf(aux_pause, p_unsigned);
>>>>>>         PRINT_ATTRf(aux_resume, p_unsigned);
>>>>>> +       PRINT_ATTRf(sample_simd_pred_reg_qwords, p_unsigned);
>>>>>> +       PRINT_ATTRf(sample_simd_pred_reg_intr, p_hex);
>>>>>> +       PRINT_ATTRf(sample_simd_pred_reg_user, p_hex);
>>>>>> +       PRINT_ATTRf(sample_simd_vec_reg_qwords, p_unsigned);
>>>>>> +       PRINT_ATTRf(sample_simd_vec_reg_intr, p_hex);
>>>>>> +       PRINT_ATTRf(sample_simd_vec_reg_user, p_hex);
>>>>>>
>>>>>>         return ret;
>>>>>>  }
>>>>>> diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
>>>>>> index 44b90bbf2d07..e8a9fabc92e6 100644
>>>>>> --- a/tools/perf/util/perf_regs.c
>>>>>> +++ b/tools/perf/util/perf_regs.c
>>>>>> @@ -11,6 +11,11 @@ int __weak arch_sdt_arg_parse_op(char *old_op __maybe_unused,
>>>>>>         return SDT_ARG_SKIP;
>>>>>>  }
>>>>>>
>>>>>> +bool __weak arch_has_simd_regs(u64 mask __maybe_unused)
>>>>>> +{
>>>>>> +       return false;
>>>>>> +}
>>>>>> +
>>>>>>  uint64_t __weak arch__intr_reg_mask(void)
>>>>>>  {
>>>>>>         return 0;
>>>>>> @@ -21,6 +26,50 @@ uint64_t __weak arch__user_reg_mask(void)
>>>>>>         return 0;
>>>>>>  }
>>>>>>
>>>>>> +uint64_t __weak arch__intr_simd_reg_mask(void)
>>>>>> +{
>>>>>> +       return 0;
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t __weak arch__user_simd_reg_mask(void)
>>>>>> +{
>>>>>> +       return 0;
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t __weak arch__intr_pred_reg_mask(void)
>>>>>> +{
>>>>>> +       return 0;
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t __weak arch__user_pred_reg_mask(void)
>>>>>> +{
>>>>>> +       return 0;
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t __weak arch__intr_simd_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
>>>>>> +{
>>>>>> +       *qwords = 0;
>>>>>> +       return 0;
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t __weak arch__user_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>>>>>> +{
>>>>>> +       *qwords = 0;
>>>>>> +       return 0;
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t __weak arch__intr_pred_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
>>>>>> +{
>>>>>> +       *qwords = 0;
>>>>>> +       return 0;
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t __weak arch__user_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>>>>>> +{
>>>>>> +       *qwords = 0;
>>>>>> +       return 0;
>>>>>> +}
>>>>>> +
>>>>>>  static const struct sample_reg sample_reg_masks[] = {
>>>>>>         SMPL_REG_END
>>>>>>  };
>>>>>> @@ -30,6 +79,16 @@ const struct sample_reg * __weak arch__sample_reg_masks(void)
>>>>>>         return sample_reg_masks;
>>>>>>  }
>>>>>>
>>>>>> +const struct sample_reg * __weak arch__sample_simd_reg_masks(void)
>>>>>> +{
>>>>>> +       return sample_reg_masks;
>>>>>> +}
>>>>>> +
>>>>>> +const struct sample_reg * __weak arch__sample_pred_reg_masks(void)
>>>>>> +{
>>>>>> +       return sample_reg_masks;
>>>>>> +}
>>>>> Thinking out loud. I wonder if there is a way to hide the weak
>>>>> functions. It seems the support is tied to PMUs, particularly core
>>>>> PMUs, perhaps we can push things into pmu and arch pmu code. Then we
>>>>> ask the PMU to parse the register strings, set up the perf_event_attr,
>>>>> etc. I'm somewhat scared these functions will be used on the report
>>>>> rather than record side of things, thereby breaking perf.data support
>>>>> when the host kernel does or doesn't have the SIMD support.
>>>> Ian, I don't quite follow.
>>>>
>>>> I don't quite understand what "push things into pmu and arch pmu code"
>>>> would look like. The current SIMD registers support follows the same
>>>> approach as the general registers support. If we intend to change the
>>>> approach entirely, we'd better do it in an independent patch set.
>>>>
>>>> Why would these functions break perf.data reports? perf-report checks
>>>> whether the PERF_SAMPLE_REGS_ABI_SIMD flag is set for each record; only
>>>> when the flag is set (indicating SIMD registers data is appended to the
>>>> record) does perf-report try to parse the SIMD registers data.
>>> Thanks Dapeng, sorry I wasn't clear. So, I've landed clean ups to
>>> remove weak symbols like:
>>> https://lore.kernel.org/lkml/20250724163302.596743-21-irogers@google.com/#t
>>>
>>> For these patches what I'm imagining is that there is a Nova Lake
>>> generated perf.data file. Using perf report, script, etc. on the Nova
>>> Lake should expose all of the same mask, qword, etc. values as when
>>> the perf.data was generated and so things will work. If the perf.data
>>> file was taken to say my Alderlake then what will happen? Generally
>>> using the arch directory and weak symbols is a code smell that cross
>>> platform things are going to break - there should be sufficient data
>>> in the event and the perf_event_attr to fully decode what's going on.
>>> Sometimes tying things to a PMU name can avoid the use of the arch
>>> directory. We were able to avoid the arch directory to a good extent
>>> for the TPEBS code, even though it is a very modern Intel feature.
>> I see.
>>
>> But the sampling support for SIMD registers differs from the sample
>> weight processing in the patch
>> https://lore.kernel.org/lkml/20250724163302.596743-21-irogers@google.com/#t.
>> Each arch may support different kinds of SIMD registers, and furthermore
>> each kind of SIMD register may have a different register count and
>> register width. It's quite hard to come up with common functions or
>> fields to represent the names and attributes of these arch-specific SIMD
>> registers. This arch-specific information can only be provided by the
>> arch-specific code, so the __weak functions still look like the easiest
>> way to implement this.
>>
>> I don't think perf.data parsing would break when moving from one platform
>> to another of the same arch, e.g., from Nova Lake to Alder Lake. To
>> indicate the presence of SIMD registers in record data, a new ABI flag,
>> PERF_SAMPLE_REGS_ABI_SIMD, is introduced. If the perf tool on the second
>> platform is new enough to recognize this flag, the SIMD registers data is
>> parsed correctly. Even if the perf tool is old and has no SIMD register
>> support, the SIMD registers data is just silently ignored and should not
>> break parsing.
> That's good to know. I'm confused then why these functions can't just
> be within the arch directory? For example, we don't expose the
> intel-pt PMU code in the common code except for the parsing parts. A
> lot of that is handled by the default perf_event_attr initialization
> that every PMU can have its own variant of:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/pmu.h?h=perf-tools-next#n123

I see. From my point of view, there seems to be no essential difference
between a function pointer and a __weak function, and it looks hard to
find a common data structure to hold all these function pointers, which
need to be called in different places, such as register name parsing and
register data dumping.


>
> Perhaps this is all just evidence of tech debt in the perf_regs.c code
> :-/ The bit that's relevant to the patch here is that I think this is
> adding to the tech debt problem as 11 more functions are added to
> perf_regs.h.

Yeah, 11 new __weak functions seems like too many. We could merge
functions of the same kind, e.g., merge *_simd_reg_mask() and
*_pred_reg_mask() into a single function with a type argument; the number
of newly added __weak functions would then shrink by half.


>
> Thanks,
> Ian
>
>>> Thanks,
>>> Ian
>>>
>>>
>>>
>>>>> Thanks,
>>>>> Ian
>>>>>
>>>>>> +
>>>>>>  const char *perf_reg_name(int id, const char *arch)
>>>>>>  {
>>>>>>         const char *reg_name = NULL;
>>>>>> diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
>>>>>> index f2d0736d65cc..bce9c4cfd1bf 100644
>>>>>> --- a/tools/perf/util/perf_regs.h
>>>>>> +++ b/tools/perf/util/perf_regs.h
>>>>>> @@ -24,9 +24,20 @@ enum {
>>>>>>  };
>>>>>>
>>>>>>  int arch_sdt_arg_parse_op(char *old_op, char **new_op);
>>>>>> +bool arch_has_simd_regs(u64 mask);
>>>>>>  uint64_t arch__intr_reg_mask(void);
>>>>>>  uint64_t arch__user_reg_mask(void);
>>>>>>  const struct sample_reg *arch__sample_reg_masks(void);
>>>>>> +const struct sample_reg *arch__sample_simd_reg_masks(void);
>>>>>> +const struct sample_reg *arch__sample_pred_reg_masks(void);
>>>>>> +uint64_t arch__intr_simd_reg_mask(void);
>>>>>> +uint64_t arch__user_simd_reg_mask(void);
>>>>>> +uint64_t arch__intr_pred_reg_mask(void);
>>>>>> +uint64_t arch__user_pred_reg_mask(void);
>>>>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>
>>>>>>  const char *perf_reg_name(int id, const char *arch);
>>>>>>  int perf_reg_value(u64 *valp, struct regs_dump *regs, int id);
>>>>>> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
>>>>>> index ea3a6c4657ee..825ffb4cc53f 100644
>>>>>> --- a/tools/perf/util/record.h
>>>>>> +++ b/tools/perf/util/record.h
>>>>>> @@ -59,7 +59,13 @@ struct record_opts {
>>>>>>         unsigned int  user_freq;
>>>>>>         u64           branch_stack;
>>>>>>         u64           sample_intr_regs;
>>>>>> +       u64           sample_intr_vec_regs;
>>>>>>         u64           sample_user_regs;
>>>>>> +       u64           sample_user_vec_regs;
>>>>>> +       u16           sample_pred_regs_qwords;
>>>>>> +       u16           sample_vec_regs_qwords;
>>>>>> +       u16           sample_intr_pred_regs;
>>>>>> +       u16           sample_user_pred_regs;
>>>>>>         u64           default_interval;
>>>>>>         u64           user_interval;
>>>>>>         size_t        auxtrace_snapshot_size;
>>>>>> --
>>>>>> 2.34.1
>>>>>>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 06/19] perf/x86: Add support for XMM registers in non-PEBS and REGS_USER
  2025-12-04 15:47     ` Peter Zijlstra
@ 2025-12-05  6:37       ` Mi, Dapeng
  0 siblings, 0 replies; 86+ messages in thread
From: Mi, Dapeng @ 2025-12-05  6:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang


On 12/4/2025 11:47 PM, Peter Zijlstra wrote:
> On Thu, Dec 04, 2025 at 04:17:35PM +0100, Peter Zijlstra wrote:
>> On Wed, Dec 03, 2025 at 02:54:47PM +0800, Dapeng Mi wrote:
>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>
>>> While collecting XMM registers in a PEBS record has been supported since
>>> Icelake, non-PEBS events have lacked this feature. By leveraging the
>>> xsaves instruction, it is now possible to snapshot XMM registers for
>>> non-PEBS events, completing the feature set.
>>>
>>> To utilize the xsaves instruction, a 64-byte aligned buffer is required.
>>> A per-CPU ext_regs_buf is added to store SIMD and other registers, with
>>> the buffer size being approximately 2K. The buffer is allocated using
>>> kzalloc_node(), ensuring natural alignment and 64-byte alignment for all
>>> kmalloc() allocations with powers of 2.
>>>
>>> The XMM sampling support is extended for both REGS_USER and REGS_INTR.
>>> For REGS_USER, perf_get_regs_user() returns the registers from
>>> task_pt_regs(current), which is a pt_regs structure. It needs to be
>>> copied to a user-space-specific x86_user_regs structure since the
>>> kernel may modify the pt_regs structure later.
>>>
>>> For PEBS, XMM registers are retrieved from PEBS records.
>>>
>>> In cases where userspace tasks are trapped within kernel mode (e.g.,
>>> during a syscall) when an NMI arrives, pt_regs information can still be
>>> retrieved from task_pt_regs(). However, capturing SIMD and other
>>> xsave-based registers in this scenario is challenging. Therefore,
>>> snapshots for these registers are omitted in such cases.
>>>
>>> The reasons are:
>>> - Profiling a userspace task that requires SIMD/eGPR registers typically
>>>   involves NMIs hitting userspace, not kernel mode.
>>> - Although it is possible to retrieve values when the TIF_NEED_FPU_LOAD
>>>   flag is set, the complexity introduced to handle this uncommon case in
>>>   the critical path is not justified.
>>> - Additionally, checking the TIF_NEED_FPU_LOAD flag alone is insufficient.
>>>   Some corner cases, such as an NMI occurring just after the flag switches
>>>   but still in kernel mode, cannot be handled.
>> Urgh.. Dave, Thomas, is there any reason we could not set
>> TIF_NEED_FPU_LOAD *after* doing the XSAVE (clearing is already done
>> after restore).
>>
>> That way, when an NMI sees TIF_NEED_FPU_LOAD it knows the task copy is
>> consistent.
>>
>> I'm not at all sure this is complex, it just needs a little care.
>>
>> And then there is the deferred thing, just like unwind, we can defer
>> REGS_USER/STACK_USER much the same, except someone went and built all
>> that deferred stuff with unwind all tangled into it :/
> With something like the below, the NMI could do something like:
>
> 	struct xregs_state *xr = NULL;
>
> 	/*
> 	 * fpu code does:
> 	 *  XSAVE
> 	 *  set_thread_flag(TIF_NEED_FPU_LOAD)
> 	 *  ...
> 	 *  XRSTOR
> 	 *  clear_thread_flag(TIF_NEED_FPU_LOAD)
> 	 * therefore, when TIF_NEED_FPU_LOAD, the task fpu state holds a
> 	 * whole copy.
> 	 */
> 	if (test_thread_flag(TIF_NEED_FPU_LOAD)) {
> 		struct fpu *fpu = x86_task_fpu(current);
> 		/*
> 		 * If __task_fpstate is set, it holds the right pointer,
> 		 * otherwise fpstate will.
> 		 */
> 		struct fpstate *fps = READ_ONCE(fpu->__task_fpstate);
> 		if (!fps)
> 			fps = fpu->fpstate;
> 		xr = &fps->regs.xregs_state;
> 	} else {
> 		/* like fpu_sync_fpstate(), except NMI local */
> 		xsave_nmi(xr, mask);
> 	}
>
> 	// frob xr into perf data
>
> Or did I miss something? I've not looked at this very long and the above
> was very vague on the actual issues.
>
>
> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index da233f20ae6f..0f91a0d7e799 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -359,18 +359,22 @@ int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest)
>  	struct fpstate *cur_fps = fpu->fpstate;
>  
>  	fpregs_lock();
> -	if (!cur_fps->is_confidential && !test_thread_flag(TIF_NEED_FPU_LOAD))
> +	if (!cur_fps->is_confidential && !test_thread_flag(TIF_NEED_FPU_LOAD)) {
>  		save_fpregs_to_fpstate(fpu);
> +		set_thread_flag(TIF_NEED_FPU_LOAD);
> +	}
>  
>  	/* Swap fpstate */
>  	if (enter_guest) {
> -		fpu->__task_fpstate = cur_fps;
> +		WRITE_ONCE(fpu->__task_fpstate, cur_fps);
> +		barrier();
>  		fpu->fpstate = guest_fps;
>  		guest_fps->in_use = true;
>  	} else {
>  		guest_fps->in_use = false;
>  		fpu->fpstate = fpu->__task_fpstate;
> -		fpu->__task_fpstate = NULL;
> +		barrier();
> +		WRITE_ONCE(fpu->__task_fpstate, NULL);
>  	}
>  
>  	cur_fps = fpu->fpstate;
> @@ -456,8 +460,8 @@ void kernel_fpu_begin_mask(unsigned int kfpu_mask)
>  
>  	if (!(current->flags & (PF_KTHREAD | PF_USER_WORKER)) &&
>  	    !test_thread_flag(TIF_NEED_FPU_LOAD)) {
> -		set_thread_flag(TIF_NEED_FPU_LOAD);
>  		save_fpregs_to_fpstate(x86_task_fpu(current));
> +		set_thread_flag(TIF_NEED_FPU_LOAD);
>  	}
>  	__cpu_invalidate_fpregs_state();
>  

Ok, I will incorporate these changes into the next version and support
SIMD registers sampling for user-space registers sampling.



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
  2025-12-05  4:00             ` Mi, Dapeng
@ 2025-12-05  6:38               ` Ian Rogers
  2025-12-05  8:10                 ` Mi, Dapeng
  0 siblings, 1 reply; 86+ messages in thread
From: Ian Rogers @ 2025-12-05  6:38 UTC (permalink / raw)
  To: Mi, Dapeng
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang

On Thu, Dec 4, 2025 at 8:00 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>
>
> On 12/5/2025 12:16 AM, Ian Rogers wrote:
> > On Thu, Dec 4, 2025 at 1:20 AM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> >>
> >> On 12/4/2025 3:49 PM, Ian Rogers wrote:
> >>> On Wed, Dec 3, 2025 at 6:58 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> >>>> On 12/4/2025 8:17 AM, Ian Rogers wrote:
> >>>>> On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
> >>>>>> From: Kan Liang <kan.liang@linux.intel.com>
> >>>>>>
> >>>>>> This patch adds support for the newly introduced SIMD register sampling
> >>>>>> format by adding the following functions:
> >>>>>>
> >>>>>> uint64_t arch__intr_simd_reg_mask(void);
> >>>>>> uint64_t arch__user_simd_reg_mask(void);
> >>>>>> uint64_t arch__intr_pred_reg_mask(void);
> >>>>>> uint64_t arch__user_pred_reg_mask(void);
> >>>>>> uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>> uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>> uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>> uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>>
> >>>>>> The arch__{intr|user}_simd_reg_mask() functions retrieve the bitmap of
> >>>>>> supported SIMD registers, such as XMM/YMM/ZMM on x86 platforms.
> >>>>>>
> >>>>>> The arch__{intr|user}_pred_reg_mask() functions retrieve the bitmap of
> >>>>>> supported PRED registers, such as OPMASK on x86 platforms.
> >>>>>>
> >>>>>> The arch__{intr|user}_simd_reg_bitmap_qwords() functions provide the
> >>>>>> exact bitmap and number of qwords for a specific type of SIMD register.
> >>>>>> For example, for XMM registers on x86 platforms, the returned bitmap is
> >>>>>> 0xffff (XMM0 ~ XMM15) and the qwords number is 2 (128 bits for each XMM).
> >>>>>>
> >>>>>> The arch__{intr|user}_pred_reg_bitmap_qwords() functions provide the
> >>>>>> exact bitmap and number of qwords for a specific type of PRED register.
> >>>>>> For example, for OPMASK registers on x86 platforms, the returned bitmap
> >>>>>> is 0xff (OPMASK0 ~ OPMASK7) and the qwords number is 1 (64 bits for each
> >>>>>> OPMASK).
> >>>>>>
> >>>>>> Additionally, the function __parse_regs() is enhanced to support parsing
> >>>>>> these newly introduced SIMD registers. Currently, each type of register
> >>>>>> can only be sampled collectively; sampling a specific SIMD register is
> >>>>>> not supported. For example, all XMM registers are sampled together rather
> >>>>>> than sampling only XMM0.
> >>>>>>
> >>>>>> When multiple overlapping register types, such as XMM and YMM, are
> >>>>>> sampled simultaneously, only the superset (YMM registers) is sampled.
> >>>>>>
> >>>>>> With this patch, all supported sampling registers on x86 platforms are
> >>>>>> displayed as follows.
> >>>>>>
> >>>>>>  $perf record -I?
> >>>>>>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
> >>>>>>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
> >>>>>>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
> >>>>>>
> >>>>>>  $perf record --user-regs=?
> >>>>>>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
> >>>>>>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
> >>>>>>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
> >>>>>>
> >>>>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> >>>>>> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >>>>>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >>>>>> ---
> >>>>>>  tools/perf/arch/x86/util/perf_regs.c      | 470 +++++++++++++++++++++-
> >>>>>>  tools/perf/util/evsel.c                   |  27 ++
> >>>>>>  tools/perf/util/parse-regs-options.c      | 151 ++++++-
> >>>>>>  tools/perf/util/perf_event_attr_fprintf.c |   6 +
> >>>>>>  tools/perf/util/perf_regs.c               |  59 +++
> >>>>>>  tools/perf/util/perf_regs.h               |  11 +
> >>>>>>  tools/perf/util/record.h                  |   6 +
> >>>>>>  7 files changed, 714 insertions(+), 16 deletions(-)
> >>>>>>
> >>>>>> diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
> >>>>>> index 12fd93f04802..db41430f3b07 100644
> >>>>>> --- a/tools/perf/arch/x86/util/perf_regs.c
> >>>>>> +++ b/tools/perf/arch/x86/util/perf_regs.c
> >>>>>> @@ -13,6 +13,49 @@
> >>>>>>  #include "../../../util/pmu.h"
> >>>>>>  #include "../../../util/pmus.h"
> >>>>>>
> >>>>>> +static const struct sample_reg sample_reg_masks_ext[] = {
> >>>>>> +       SMPL_REG(AX, PERF_REG_X86_AX),
> >>>>>> +       SMPL_REG(BX, PERF_REG_X86_BX),
> >>>>>> +       SMPL_REG(CX, PERF_REG_X86_CX),
> >>>>>> +       SMPL_REG(DX, PERF_REG_X86_DX),
> >>>>>> +       SMPL_REG(SI, PERF_REG_X86_SI),
> >>>>>> +       SMPL_REG(DI, PERF_REG_X86_DI),
> >>>>>> +       SMPL_REG(BP, PERF_REG_X86_BP),
> >>>>>> +       SMPL_REG(SP, PERF_REG_X86_SP),
> >>>>>> +       SMPL_REG(IP, PERF_REG_X86_IP),
> >>>>>> +       SMPL_REG(FLAGS, PERF_REG_X86_FLAGS),
> >>>>>> +       SMPL_REG(CS, PERF_REG_X86_CS),
> >>>>>> +       SMPL_REG(SS, PERF_REG_X86_SS),
> >>>>>> +#ifdef HAVE_ARCH_X86_64_SUPPORT
> >>>>>> +       SMPL_REG(R8, PERF_REG_X86_R8),
> >>>>>> +       SMPL_REG(R9, PERF_REG_X86_R9),
> >>>>>> +       SMPL_REG(R10, PERF_REG_X86_R10),
> >>>>>> +       SMPL_REG(R11, PERF_REG_X86_R11),
> >>>>>> +       SMPL_REG(R12, PERF_REG_X86_R12),
> >>>>>> +       SMPL_REG(R13, PERF_REG_X86_R13),
> >>>>>> +       SMPL_REG(R14, PERF_REG_X86_R14),
> >>>>>> +       SMPL_REG(R15, PERF_REG_X86_R15),
> >>>>>> +       SMPL_REG(R16, PERF_REG_X86_R16),
> >>>>>> +       SMPL_REG(R17, PERF_REG_X86_R17),
> >>>>>> +       SMPL_REG(R18, PERF_REG_X86_R18),
> >>>>>> +       SMPL_REG(R19, PERF_REG_X86_R19),
> >>>>>> +       SMPL_REG(R20, PERF_REG_X86_R20),
> >>>>>> +       SMPL_REG(R21, PERF_REG_X86_R21),
> >>>>>> +       SMPL_REG(R22, PERF_REG_X86_R22),
> >>>>>> +       SMPL_REG(R23, PERF_REG_X86_R23),
> >>>>>> +       SMPL_REG(R24, PERF_REG_X86_R24),
> >>>>>> +       SMPL_REG(R25, PERF_REG_X86_R25),
> >>>>>> +       SMPL_REG(R26, PERF_REG_X86_R26),
> >>>>>> +       SMPL_REG(R27, PERF_REG_X86_R27),
> >>>>>> +       SMPL_REG(R28, PERF_REG_X86_R28),
> >>>>>> +       SMPL_REG(R29, PERF_REG_X86_R29),
> >>>>>> +       SMPL_REG(R30, PERF_REG_X86_R30),
> >>>>>> +       SMPL_REG(R31, PERF_REG_X86_R31),
> >>>>>> +       SMPL_REG(SSP, PERF_REG_X86_SSP),
> >>>>>> +#endif
> >>>>>> +       SMPL_REG_END
> >>>>>> +};
> >>>>>> +
> >>>>>>  static const struct sample_reg sample_reg_masks[] = {
> >>>>>>         SMPL_REG(AX, PERF_REG_X86_AX),
> >>>>>>         SMPL_REG(BX, PERF_REG_X86_BX),
> >>>>>> @@ -276,27 +319,404 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
> >>>>>>         return SDT_ARG_VALID;
> >>>>>>  }
> >>>>>>
> >>>>>> +static bool support_simd_reg(u64 sample_type, u16 qwords, u64 mask, bool pred)
> >>>>> To make the code easier to read, it'd be nice to document sample_type,
> >>>>> qwords and mask here.
> >>>> Sure.
> >>>>
> >>>>
> >>>>>> +{
> >>>>>> +       struct perf_event_attr attr = {
> >>>>>> +               .type                           = PERF_TYPE_HARDWARE,
> >>>>>> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
> >>>>>> +               .sample_type                    = sample_type,
> >>>>>> +               .disabled                       = 1,
> >>>>>> +               .exclude_kernel                 = 1,
> >>>>>> +               .sample_simd_regs_enabled       = 1,
> >>>>>> +       };
> >>>>>> +       int fd;
> >>>>>> +
> >>>>>> +       attr.sample_period = 1;
> >>>>>> +
> >>>>>> +       if (!pred) {
> >>>>>> +               attr.sample_simd_vec_reg_qwords = qwords;
> >>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>>>> +                       attr.sample_simd_vec_reg_intr = mask;
> >>>>>> +               else
> >>>>>> +                       attr.sample_simd_vec_reg_user = mask;
> >>>>>> +       } else {
> >>>>>> +               attr.sample_simd_pred_reg_qwords = PERF_X86_OPMASK_QWORDS;
> >>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>>>> +                       attr.sample_simd_pred_reg_intr = PERF_X86_SIMD_PRED_MASK;
> >>>>>> +               else
> >>>>>> +                       attr.sample_simd_pred_reg_user = PERF_X86_SIMD_PRED_MASK;
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       if (perf_pmus__num_core_pmus() > 1) {
> >>>>>> +               struct perf_pmu *pmu = NULL;
> >>>>>> +               __u64 type = PERF_TYPE_RAW;
> >>>>> It should be okay to do:
> >>>>> __u64 type = perf_pmus__find_core_pmu()->type
> >>>>> rather than have the whole loop below.
> >>>> Sure. Thanks.
> >>>>
> >>>>
> >>>>>> +
> >>>>>> +               /*
> >>>>>> +                * The same register set is supported among different hybrid PMUs.
> >>>>>> +                * Only check the first available one.
> >>>>>> +                */
> >>>>>> +               while ((pmu = perf_pmus__scan_core(pmu)) != NULL) {
> >>>>>> +                       type = pmu->type;
> >>>>>> +                       break;
> >>>>>> +               }
> >>>>>> +               attr.config |= type << PERF_PMU_TYPE_SHIFT;
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       event_attr_init(&attr);
> >>>>>> +
> >>>>>> +       fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
> >>>>>> +       if (fd != -1) {
> >>>>>> +               close(fd);
> >>>>>> +               return true;
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       return false;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static bool __arch_simd_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
> >>>>>> +{
> >>>>>> +       bool supported = false;
> >>>>>> +       u64 bits;
> >>>>>> +
> >>>>>> +       *mask = 0;
> >>>>>> +       *qwords = 0;
> >>>>>> +
> >>>>>> +       switch (reg) {
> >>>>>> +       case PERF_REG_X86_XMM:
> >>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
> >>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_XMM_QWORDS, bits, false);
> >>>>>> +               if (supported) {
> >>>>>> +                       *mask = bits;
> >>>>>> +                       *qwords = PERF_X86_XMM_QWORDS;
> >>>>>> +               }
> >>>>>> +               break;
> >>>>>> +       case PERF_REG_X86_YMM:
> >>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_YMM_REGS) - 1;
> >>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_YMM_QWORDS, bits, false);
> >>>>>> +               if (supported) {
> >>>>>> +                       *mask = bits;
> >>>>>> +                       *qwords = PERF_X86_YMM_QWORDS;
> >>>>>> +               }
> >>>>>> +               break;
> >>>>>> +       case PERF_REG_X86_ZMM:
> >>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_ZMM_REGS) - 1;
> >>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
> >>>>>> +               if (supported) {
> >>>>>> +                       *mask = bits;
> >>>>>> +                       *qwords = PERF_X86_ZMM_QWORDS;
> >>>>>> +                       break;
> >>>>>> +               }
> >>>>>> +
> >>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_ZMMH_REGS) - 1;
> >>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
> >>>>>> +               if (supported) {
> >>>>>> +                       *mask = bits;
> >>>>>> +                       *qwords = PERF_X86_ZMMH_QWORDS;
> >>>>>> +               }
> >>>>>> +               break;
> >>>>>> +       default:
> >>>>>> +               break;
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       return supported;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static bool __arch_pred_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
> >>>>>> +{
> >>>>>> +       bool supported = false;
> >>>>>> +       u64 bits;
> >>>>>> +
> >>>>>> +       *mask = 0;
> >>>>>> +       *qwords = 0;
> >>>>>> +
> >>>>>> +       switch (reg) {
> >>>>>> +       case PERF_REG_X86_OPMASK:
> >>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_OPMASK_REGS) - 1;
> >>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_OPMASK_QWORDS, bits, true);
> >>>>>> +               if (supported) {
> >>>>>> +                       *mask = bits;
> >>>>>> +                       *qwords = PERF_X86_OPMASK_QWORDS;
> >>>>>> +               }
> >>>>>> +               break;
> >>>>>> +       default:
> >>>>>> +               break;
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       return supported;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static bool has_cap_simd_regs(void)
> >>>>>> +{
> >>>>>> +       uint64_t mask = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
> >>>>>> +       u16 qwords = PERF_X86_XMM_QWORDS;
> >>>>>> +       static bool has_cap_simd_regs;
> >>>>>> +       static bool cached;
> >>>>>> +
> >>>>>> +       if (cached)
> >>>>>> +               return has_cap_simd_regs;
> >>>>>> +
> >>>>>> +       has_cap_simd_regs = __arch_simd_reg_mask(PERF_SAMPLE_REGS_INTR,
> >>>>>> +                                                PERF_REG_X86_XMM, &mask, &qwords);
> >>>>>> +       has_cap_simd_regs |= __arch_simd_reg_mask(PERF_SAMPLE_REGS_USER,
> >>>>>> +                                                PERF_REG_X86_XMM, &mask, &qwords);
> >>>>>> +       cached = true;
> >>>>>> +
> >>>>>> +       return has_cap_simd_regs;
> >>>>>> +}
> >>>>>> +
> >>>>>> +bool arch_has_simd_regs(u64 mask)
> >>>>>> +{
> >>>>>> +       return has_cap_simd_regs() &&
> >>>>>> +              mask & GENMASK_ULL(PERF_REG_X86_SSP, PERF_REG_X86_R16);
> >>>>>> +}
> >>>>>> +
> >>>>>> +static const struct sample_reg sample_simd_reg_masks[] = {
> >>>>>> +       SMPL_REG(XMM, PERF_REG_X86_XMM),
> >>>>>> +       SMPL_REG(YMM, PERF_REG_X86_YMM),
> >>>>>> +       SMPL_REG(ZMM, PERF_REG_X86_ZMM),
> >>>>>> +       SMPL_REG_END
> >>>>>> +};
> >>>>>> +
> >>>>>> +static const struct sample_reg sample_pred_reg_masks[] = {
> >>>>>> +       SMPL_REG(OPMASK, PERF_REG_X86_OPMASK),
> >>>>>> +       SMPL_REG_END
> >>>>>> +};
> >>>>>> +
> >>>>>> +const struct sample_reg *arch__sample_simd_reg_masks(void)
> >>>>>> +{
> >>>>>> +       return sample_simd_reg_masks;
> >>>>>> +}
> >>>>>> +
> >>>>>> +const struct sample_reg *arch__sample_pred_reg_masks(void)
> >>>>>> +{
> >>>>>> +       return sample_pred_reg_masks;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static bool x86_intr_simd_updated;
> >>>>>> +static u64 x86_intr_simd_reg_mask;
> >>>>>> +static u64 x86_intr_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
> >>>>>> +static u16 x86_intr_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
> >>>>> Could we add some comments? I can mostly work out that "updated" is a
> >>>>> lazy-initialization flag and what the masks are, but "qwords" is an
> >>>>> odd one. The comment could also point out that SIMD here doesn't mean
> >>>>> the machine supports SIMD, but that SIMD registers are supported in
> >>>>> perf events.
> >>>> Sure.
> >>>>
> >>>>
> >>>>>> +static bool x86_user_simd_updated;
> >>>>>> +static u64 x86_user_simd_reg_mask;
> >>>>>> +static u64 x86_user_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
> >>>>>> +static u16 x86_user_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
> >>>>>> +
> >>>>>> +static bool x86_intr_pred_updated;
> >>>>>> +static u64 x86_intr_pred_reg_mask;
> >>>>>> +static u64 x86_intr_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
> >>>>>> +static u16 x86_intr_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
> >>>>>> +static bool x86_user_pred_updated;
> >>>>>> +static u64 x86_user_pred_reg_mask;
> >>>>>> +static u64 x86_user_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
> >>>>>> +static u16 x86_user_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
> >>>>>> +
> >>>>>> +static uint64_t __arch__simd_reg_mask(u64 sample_type)
> >>>>>> +{
> >>>>>> +       const struct sample_reg *r = NULL;
> >>>>>> +       bool supported;
> >>>>>> +       u64 mask = 0;
> >>>>>> +       int reg;
> >>>>>> +
> >>>>>> +       if (!has_cap_simd_regs())
> >>>>>> +               return 0;
> >>>>>> +
> >>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_simd_updated)
> >>>>>> +               return x86_intr_simd_reg_mask;
> >>>>>> +
> >>>>>> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_simd_updated)
> >>>>>> +               return x86_user_simd_reg_mask;
> >>>>>> +
> >>>>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> >>>>>> +               supported = false;
> >>>>>> +
> >>>>>> +               if (!r->mask)
> >>>>>> +                       continue;
> >>>>>> +               reg = fls64(r->mask) - 1;
> >>>>>> +
> >>>>>> +               if (reg >= PERF_REG_X86_MAX_SIMD_REGS)
> >>>>>> +                       break;
> >>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>>>> +                       supported = __arch_simd_reg_mask(sample_type, reg,
> >>>>>> +                                                        &x86_intr_simd_mask[reg],
> >>>>>> +                                                        &x86_intr_simd_qwords[reg]);
> >>>>>> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
> >>>>>> +                       supported = __arch_simd_reg_mask(sample_type, reg,
> >>>>>> +                                                        &x86_user_simd_mask[reg],
> >>>>>> +                                                        &x86_user_simd_qwords[reg]);
> >>>>>> +               if (supported)
> >>>>>> +                       mask |= BIT_ULL(reg);
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
> >>>>>> +               x86_intr_simd_reg_mask = mask;
> >>>>>> +               x86_intr_simd_updated = true;
> >>>>>> +       } else {
> >>>>>> +               x86_user_simd_reg_mask = mask;
> >>>>>> +               x86_user_simd_updated = true;
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       return mask;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static uint64_t __arch__pred_reg_mask(u64 sample_type)
> >>>>>> +{
> >>>>>> +       const struct sample_reg *r = NULL;
> >>>>>> +       bool supported;
> >>>>>> +       u64 mask = 0;
> >>>>>> +       int reg;
> >>>>>> +
> >>>>>> +       if (!has_cap_simd_regs())
> >>>>>> +               return 0;
> >>>>>> +
> >>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_pred_updated)
> >>>>>> +               return x86_intr_pred_reg_mask;
> >>>>>> +
> >>>>>> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_pred_updated)
> >>>>>> +               return x86_user_pred_reg_mask;
> >>>>>> +
> >>>>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> >>>>>> +               supported = false;
> >>>>>> +
> >>>>>> +               if (!r->mask)
> >>>>>> +                       continue;
> >>>>>> +               reg = fls64(r->mask) - 1;
> >>>>>> +
> >>>>>> +               if (reg >= PERF_REG_X86_MAX_PRED_REGS)
> >>>>>> +                       break;
> >>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>>>> +                       supported = __arch_pred_reg_mask(sample_type, reg,
> >>>>>> +                                                        &x86_intr_pred_mask[reg],
> >>>>>> +                                                        &x86_intr_pred_qwords[reg]);
> >>>>>> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
> >>>>>> +                       supported = __arch_pred_reg_mask(sample_type, reg,
> >>>>>> +                                                        &x86_user_pred_mask[reg],
> >>>>>> +                                                        &x86_user_pred_qwords[reg]);
> >>>>>> +               if (supported)
> >>>>>> +                       mask |= BIT_ULL(reg);
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
> >>>>>> +               x86_intr_pred_reg_mask = mask;
> >>>>>> +               x86_intr_pred_updated = true;
> >>>>>> +       } else {
> >>>>>> +               x86_user_pred_reg_mask = mask;
> >>>>>> +               x86_user_pred_updated = true;
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       return mask;
> >>>>>> +}
> >>>>> This feels repetitive with __arch__simd_reg_mask, could they be
> >>>>> refactored together?
> >>>> hmm, it looks like we can at least extract the for loop into a common
> >>>> function. The other parts are hard to generalize since they manipulate
> >>>> different variables. Generalizing them would require lots of "if ...
> >>>> else" branches, which would make the code hard to read.
> >>>>
> >>>>
> >>>>>> +
> >>>>>> +uint64_t arch__intr_simd_reg_mask(void)
> >>>>>> +{
> >>>>>> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_INTR);
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t arch__user_simd_reg_mask(void)
> >>>>>> +{
> >>>>>> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_USER);
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t arch__intr_pred_reg_mask(void)
> >>>>>> +{
> >>>>>> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_INTR);
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t arch__user_pred_reg_mask(void)
> >>>>>> +{
> >>>>>> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_USER);
> >>>>>> +}
> >>>>>> +
> >>>>>> +static uint64_t arch__simd_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
> >>>>>> +{
> >>>>>> +       uint64_t mask = 0;
> >>>>>> +
> >>>>>> +       *qwords = 0;
> >>>>>> +       if (reg < PERF_REG_X86_MAX_SIMD_REGS) {
> >>>>>> +               if (intr) {
> >>>>>> +                       *qwords = x86_intr_simd_qwords[reg];
> >>>>>> +                       mask = x86_intr_simd_mask[reg];
> >>>>>> +               } else {
> >>>>>> +                       *qwords = x86_user_simd_qwords[reg];
> >>>>>> +                       mask = x86_user_simd_mask[reg];
> >>>>>> +               }
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       return mask;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static uint64_t arch__pred_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
> >>>>>> +{
> >>>>>> +       uint64_t mask = 0;
> >>>>>> +
> >>>>>> +       *qwords = 0;
> >>>>>> +       if (reg < PERF_REG_X86_MAX_PRED_REGS) {
> >>>>>> +               if (intr) {
> >>>>>> +                       *qwords = x86_intr_pred_qwords[reg];
> >>>>>> +                       mask = x86_intr_pred_mask[reg];
> >>>>>> +               } else {
> >>>>>> +                       *qwords = x86_user_pred_qwords[reg];
> >>>>>> +                       mask = x86_user_pred_mask[reg];
> >>>>>> +               }
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       return mask;
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords)
> >>>>>> +{
> >>>>>> +       if (!x86_intr_simd_updated)
> >>>>>> +               arch__intr_simd_reg_mask();
> >>>>>> +       return arch__simd_reg_bitmap_qwords(reg, qwords, true);
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords)
> >>>>>> +{
> >>>>>> +       if (!x86_user_simd_updated)
> >>>>>> +               arch__user_simd_reg_mask();
> >>>>>> +       return arch__simd_reg_bitmap_qwords(reg, qwords, false);
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords)
> >>>>>> +{
> >>>>>> +       if (!x86_intr_pred_updated)
> >>>>>> +               arch__intr_pred_reg_mask();
> >>>>>> +       return arch__pred_reg_bitmap_qwords(reg, qwords, true);
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords)
> >>>>>> +{
> >>>>>> +       if (!x86_user_pred_updated)
> >>>>>> +               arch__user_pred_reg_mask();
> >>>>>> +       return arch__pred_reg_bitmap_qwords(reg, qwords, false);
> >>>>>> +}
> >>>>>> +
> >>>>>>  const struct sample_reg *arch__sample_reg_masks(void)
> >>>>>>  {
> >>>>>> +       if (has_cap_simd_regs())
> >>>>>> +               return sample_reg_masks_ext;
> >>>>>>         return sample_reg_masks;
> >>>>>>  }
> >>>>>>
> >>>>>> -uint64_t arch__intr_reg_mask(void)
> >>>>>> +static uint64_t __arch__reg_mask(u64 sample_type, u64 mask, bool has_simd_regs)
> >>>>>>  {
> >>>>>>         struct perf_event_attr attr = {
> >>>>>> -               .type                   = PERF_TYPE_HARDWARE,
> >>>>>> -               .config                 = PERF_COUNT_HW_CPU_CYCLES,
> >>>>>> -               .sample_type            = PERF_SAMPLE_REGS_INTR,
> >>>>>> -               .sample_regs_intr       = PERF_REG_EXTENDED_MASK,
> >>>>>> -               .precise_ip             = 1,
> >>>>>> -               .disabled               = 1,
> >>>>>> -               .exclude_kernel         = 1,
> >>>>>> +               .type                           = PERF_TYPE_HARDWARE,
> >>>>>> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
> >>>>>> +               .sample_type                    = sample_type,
> >>>>>> +               .precise_ip                     = 1,
> >>>>>> +               .disabled                       = 1,
> >>>>>> +               .exclude_kernel                 = 1,
> >>>>>> +               .sample_simd_regs_enabled       = has_simd_regs,
> >>>>>>         };
> >>>>>>         int fd;
> >>>>>>         /*
> >>>>>>          * In an unnamed union, init it here to build on older gcc versions
> >>>>>>          */
> >>>>>>         attr.sample_period = 1;
> >>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>>>> +               attr.sample_regs_intr = mask;
> >>>>>> +       else
> >>>>>> +               attr.sample_regs_user = mask;
> >>>>>>
> >>>>>>         if (perf_pmus__num_core_pmus() > 1) {
> >>>>>>                 struct perf_pmu *pmu = NULL;
> >>>>>> @@ -318,13 +738,41 @@ uint64_t arch__intr_reg_mask(void)
> >>>>>>         fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
> >>>>>>         if (fd != -1) {
> >>>>>>                 close(fd);
> >>>>>> -               return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
> >>>>>> +               return mask;
> >>>>>>         }
> >>>>>>
> >>>>>> -       return PERF_REGS_MASK;
> >>>>>> +       return 0;
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t arch__intr_reg_mask(void)
> >>>>>> +{
> >>>>>> +       uint64_t mask = PERF_REGS_MASK;
> >>>>>> +
> >>>>>> +       if (has_cap_simd_regs()) {
> >>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
> >>>>>> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
> >>>>>> +                                        true);
> >>>>> It's nice to label constant arguments like this with something like:
> >>>>> /*has_simd_regs=*/true);
> >>>>>
> >>>>> Tools like clang-tidy even try to enforce that the argument names match the comments.
> >>>> Sure.
> >>>>
> >>>>
> >>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
> >>>>>> +                                        BIT_ULL(PERF_REG_X86_SSP),
> >>>>>> +                                        true);
> >>>>>> +       } else
> >>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR, PERF_REG_EXTENDED_MASK, false);
> >>>>>> +
> >>>>>> +       return mask;
> >>>>>>  }
> >>>>>>
> >>>>>>  uint64_t arch__user_reg_mask(void)
> >>>>>>  {
> >>>>>> -       return PERF_REGS_MASK;
> >>>>>> +       uint64_t mask = PERF_REGS_MASK;
> >>>>>> +
> >>>>>> +       if (has_cap_simd_regs()) {
> >>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
> >>>>>> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
> >>>>>> +                                        true);
> >>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
> >>>>>> +                                        BIT_ULL(PERF_REG_X86_SSP),
> >>>>>> +                                        true);
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       return mask;
> >>>>> The code is repetitive here, could we refactor it into a single
> >>>>> function passing in a user or intr value?
> >>>> Sure. I'll extract the common part.
> >>>>
> >>>>
> >>>>>>  }
> >>>>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> >>>>>> index 56ebefd075f2..5d1d90cf9488 100644
> >>>>>> --- a/tools/perf/util/evsel.c
> >>>>>> +++ b/tools/perf/util/evsel.c
> >>>>>> @@ -1461,12 +1461,39 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
> >>>>>>         if (opts->sample_intr_regs && !evsel->no_aux_samples &&
> >>>>>>             !evsel__is_dummy_event(evsel)) {
> >>>>>>                 attr->sample_regs_intr = opts->sample_intr_regs;
> >>>>>> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_intr);
> >>>>>> +               evsel__set_sample_bit(evsel, REGS_INTR);
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       if ((opts->sample_intr_vec_regs || opts->sample_intr_pred_regs) &&
> >>>>>> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
> >>>>>> +               /* A non-zero pred qwords implies the SIMD register set is used */
> >>>>>> +               if (opts->sample_pred_regs_qwords)
> >>>>>> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
> >>>>>> +               else
> >>>>>> +                       attr->sample_simd_pred_reg_qwords = 1;
> >>>>>> +               attr->sample_simd_vec_reg_intr = opts->sample_intr_vec_regs;
> >>>>>> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
> >>>>>> +               attr->sample_simd_pred_reg_intr = opts->sample_intr_pred_regs;
> >>>>>>                 evsel__set_sample_bit(evsel, REGS_INTR);
> >>>>>>         }
> >>>>>>
> >>>>>>         if (opts->sample_user_regs && !evsel->no_aux_samples &&
> >>>>>>             !evsel__is_dummy_event(evsel)) {
> >>>>>>                 attr->sample_regs_user |= opts->sample_user_regs;
> >>>>>> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_user);
> >>>>>> +               evsel__set_sample_bit(evsel, REGS_USER);
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       if ((opts->sample_user_vec_regs || opts->sample_user_pred_regs) &&
> >>>>>> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
> >>>>>> +               if (opts->sample_pred_regs_qwords)
> >>>>>> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
> >>>>>> +               else
> >>>>>> +                       attr->sample_simd_pred_reg_qwords = 1;
> >>>>>> +               attr->sample_simd_vec_reg_user = opts->sample_user_vec_regs;
> >>>>>> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
> >>>>>> +               attr->sample_simd_pred_reg_user = opts->sample_user_pred_regs;
> >>>>>>                 evsel__set_sample_bit(evsel, REGS_USER);
> >>>>>>         }
> >>>>>>
> >>>>>> diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
> >>>>>> index cda1c620968e..0bd100392889 100644
> >>>>>> --- a/tools/perf/util/parse-regs-options.c
> >>>>>> +++ b/tools/perf/util/parse-regs-options.c
> >>>>>> @@ -4,19 +4,139 @@
> >>>>>>  #include <stdint.h>
> >>>>>>  #include <string.h>
> >>>>>>  #include <stdio.h>
> >>>>>> +#include <linux/bitops.h>
> >>>>>>  #include "util/debug.h"
> >>>>>>  #include <subcmd/parse-options.h>
> >>>>>>  #include "util/perf_regs.h"
> >>>>>>  #include "util/parse-regs-options.h"
> >>>>>> +#include "record.h"
> >>>>>> +
> >>>>>> +static void __print_simd_regs(bool intr, uint64_t simd_mask)
> >>>>>> +{
> >>>>>> +       const struct sample_reg *r = NULL;
> >>>>>> +       uint64_t bitmap = 0;
> >>>>>> +       u16 qwords = 0;
> >>>>>> +       int reg_idx;
> >>>>>> +
> >>>>>> +       if (!simd_mask)
> >>>>>> +               return;
> >>>>>> +
> >>>>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> >>>>>> +               if (!(r->mask & simd_mask))
> >>>>>> +                       continue;
> >>>>>> +               reg_idx = fls64(r->mask) - 1;
> >>>>>> +               if (intr)
> >>>>>> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>> +               else
> >>>>>> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>> +               if (bitmap)
> >>>>>> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
> >>>>>> +       }
> >>>>>> +}
> >>>>>> +
> >>>>>> +static void __print_pred_regs(bool intr, uint64_t pred_mask)
> >>>>>> +{
> >>>>>> +       const struct sample_reg *r = NULL;
> >>>>>> +       uint64_t bitmap = 0;
> >>>>>> +       u16 qwords = 0;
> >>>>>> +       int reg_idx;
> >>>>>> +
> >>>>>> +       if (!pred_mask)
> >>>>>> +               return;
> >>>>>> +
> >>>>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> >>>>>> +               if (!(r->mask & pred_mask))
> >>>>>> +                       continue;
> >>>>>> +               reg_idx = fls64(r->mask) - 1;
> >>>>>> +               if (intr)
> >>>>>> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>> +               else
> >>>>>> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>> +               if (bitmap)
> >>>>>> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
> >>>>>> +       }
> >>>>>> +}
> >>>>>> +
> >>>>>> +static bool __parse_simd_regs(struct record_opts *opts, char *s, bool intr)
> >>>>>> +{
> >>>>>> +       const struct sample_reg *r = NULL;
> >>>>>> +       bool matched = false;
> >>>>>> +       uint64_t bitmap = 0;
> >>>>>> +       u16 qwords = 0;
> >>>>>> +       int reg_idx;
> >>>>>> +
> >>>>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> >>>>>> +               if (strcasecmp(s, r->name))
> >>>>>> +                       continue;
> >>>>>> +               if (!fls64(r->mask))
> >>>>>> +                       continue;
> >>>>>> +               reg_idx = fls64(r->mask) - 1;
> >>>>>> +               if (intr)
> >>>>>> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>> +               else
> >>>>>> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>> +               matched = true;
> >>>>>> +               break;
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       /* Just need the highest qwords */
> >>>>> I'm not following here. Does the bitmap need to handle gaps?
> >>>> Currently no. In theory, the kernel allows user space to sample only a
> >>>> subset of the SIMD registers, e.g., 0xff or 0xf0f for the XMM registers
> >>>> (the HW supports 16 XMM registers), but the perf tool doesn't support
> >>>> that, to avoid introducing too much complexity. Moreover, I don't think
> >>>> end users have such a requirement. In most cases, users only know which
> >>>> kinds of SIMD registers their programs use, and usually don't know or
> >>>> care about which exact SIMD register is used.
> >>>>
> >>>>
> >>>>>> +       if (qwords > opts->sample_vec_regs_qwords) {
> >>>>>> +               opts->sample_vec_regs_qwords = qwords;
> >>>>>> +               if (intr)
> >>>>>> +                       opts->sample_intr_vec_regs = bitmap;
> >>>>>> +               else
> >>>>>> +                       opts->sample_user_vec_regs = bitmap;
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       return matched;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static bool __parse_pred_regs(struct record_opts *opts, char *s, bool intr)
> >>>>>> +{
> >>>>>> +       const struct sample_reg *r = NULL;
> >>>>>> +       bool matched = false;
> >>>>>> +       uint64_t bitmap = 0;
> >>>>>> +       u16 qwords = 0;
> >>>>>> +       int reg_idx;
> >>>>>> +
> >>>>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> >>>>>> +               if (strcasecmp(s, r->name))
> >>>>>> +                       continue;
> >>>>>> +               if (!fls64(r->mask))
> >>>>>> +                       continue;
> >>>>>> +               reg_idx = fls64(r->mask) - 1;
> >>>>>> +               if (intr)
> >>>>>> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>> +               else
> >>>>>> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>> +               matched = true;
> >>>>>> +               break;
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       /* Just need the highest qwords */
> >>>>> Again repetitive, could we have a single function?
> >>>> Yes, at least the for loop can be extracted into a common function.
> >>>>
> >>>>
> >>>>>> +       if (qwords > opts->sample_pred_regs_qwords) {
> >>>>>> +               opts->sample_pred_regs_qwords = qwords;
> >>>>>> +               if (intr)
> >>>>>> +                       opts->sample_intr_pred_regs = bitmap;
> >>>>>> +               else
> >>>>>> +                       opts->sample_user_pred_regs = bitmap;
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       return matched;
> >>>>>> +}
> >>>>>>
> >>>>>>  static int
> >>>>>>  __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>>>  {
> >>>>>>         uint64_t *mode = (uint64_t *)opt->value;
> >>>>>>         const struct sample_reg *r = NULL;
> >>>>>> +       struct record_opts *opts;
> >>>>>>         char *s, *os = NULL, *p;
> >>>>>> -       int ret = -1;
> >>>>>> +       bool has_simd_regs = false;
> >>>>>>         uint64_t mask;
> >>>>>> +       uint64_t simd_mask;
> >>>>>> +       uint64_t pred_mask;
> >>>>>> +       int ret = -1;
> >>>>>>
> >>>>>>         if (unset)
> >>>>>>                 return 0;
> >>>>>> @@ -27,10 +147,17 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>>>         if (*mode)
> >>>>>>                 return -1;
> >>>>>>
> >>>>>> -       if (intr)
> >>>>>> +       if (intr) {
> >>>>>> +               opts = container_of(opt->value, struct record_opts, sample_intr_regs);
> >>>>>>                 mask = arch__intr_reg_mask();
> >>>>>> -       else
> >>>>>> +               simd_mask = arch__intr_simd_reg_mask();
> >>>>>> +               pred_mask = arch__intr_pred_reg_mask();
> >>>>>> +       } else {
> >>>>>> +               opts = container_of(opt->value, struct record_opts, sample_user_regs);
> >>>>>>                 mask = arch__user_reg_mask();
> >>>>>> +               simd_mask = arch__user_simd_reg_mask();
> >>>>>> +               pred_mask = arch__user_pred_reg_mask();
> >>>>>> +       }
> >>>>>>
> >>>>>>         /* str may be NULL in case no arg is passed to -I */
> >>>>>>         if (str) {
> >>>>>> @@ -50,10 +177,24 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>>>                                         if (r->mask & mask)
> >>>>>>                                                 fprintf(stderr, "%s ", r->name);
> >>>>>>                                 }
> >>>>>> +                               __print_simd_regs(intr, simd_mask);
> >>>>>> +                               __print_pred_regs(intr, pred_mask);
> >>>>>>                                 fputc('\n', stderr);
> >>>>>>                                 /* just printing available regs */
> >>>>>>                                 goto error;
> >>>>>>                         }
> >>>>>> +
> >>>>>> +                       if (simd_mask) {
> >>>>>> +                               has_simd_regs = __parse_simd_regs(opts, s, intr);
> >>>>>> +                               if (has_simd_regs)
> >>>>>> +                                       goto next;
> >>>>>> +                       }
> >>>>>> +                       if (pred_mask) {
> >>>>>> +                               has_simd_regs = __parse_pred_regs(opts, s, intr);
> >>>>>> +                               if (has_simd_regs)
> >>>>>> +                                       goto next;
> >>>>>> +                       }
> >>>>>> +
> >>>>>>                         for (r = arch__sample_reg_masks(); r->name; r++) {
> >>>>>>                                 if ((r->mask & mask) && !strcasecmp(s, r->name))
> >>>>>>                                         break;
> >>>>>> @@ -65,7 +206,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>>>                         }
> >>>>>>
> >>>>>>                         *mode |= r->mask;
> >>>>>> -
> >>>>>> +next:
> >>>>>>                         if (!p)
> >>>>>>                                 break;
> >>>>>>
> >>>>>> @@ -75,7 +216,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>>>         ret = 0;
> >>>>>>
> >>>>>>         /* default to all possible regs */
> >>>>>> -       if (*mode == 0)
> >>>>>> +       if (*mode == 0 && !has_simd_regs)
> >>>>>>                 *mode = mask;
> >>>>>>  error:
> >>>>>>         free(os);
> >>>>>> diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
> >>>>>> index 66b666d9ce64..fb0366d050cf 100644
> >>>>>> --- a/tools/perf/util/perf_event_attr_fprintf.c
> >>>>>> +++ b/tools/perf/util/perf_event_attr_fprintf.c
> >>>>>> @@ -360,6 +360,12 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
> >>>>>>         PRINT_ATTRf(aux_start_paused, p_unsigned);
> >>>>>>         PRINT_ATTRf(aux_pause, p_unsigned);
> >>>>>>         PRINT_ATTRf(aux_resume, p_unsigned);
> >>>>>> +       PRINT_ATTRf(sample_simd_pred_reg_qwords, p_unsigned);
> >>>>>> +       PRINT_ATTRf(sample_simd_pred_reg_intr, p_hex);
> >>>>>> +       PRINT_ATTRf(sample_simd_pred_reg_user, p_hex);
> >>>>>> +       PRINT_ATTRf(sample_simd_vec_reg_qwords, p_unsigned);
> >>>>>> +       PRINT_ATTRf(sample_simd_vec_reg_intr, p_hex);
> >>>>>> +       PRINT_ATTRf(sample_simd_vec_reg_user, p_hex);
> >>>>>>
> >>>>>>         return ret;
> >>>>>>  }
> >>>>>> diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
> >>>>>> index 44b90bbf2d07..e8a9fabc92e6 100644
> >>>>>> --- a/tools/perf/util/perf_regs.c
> >>>>>> +++ b/tools/perf/util/perf_regs.c
> >>>>>> @@ -11,6 +11,11 @@ int __weak arch_sdt_arg_parse_op(char *old_op __maybe_unused,
> >>>>>>         return SDT_ARG_SKIP;
> >>>>>>  }
> >>>>>>
> >>>>>> +bool __weak arch_has_simd_regs(u64 mask __maybe_unused)
> >>>>>> +{
> >>>>>> +       return false;
> >>>>>> +}
> >>>>>> +
> >>>>>>  uint64_t __weak arch__intr_reg_mask(void)
> >>>>>>  {
> >>>>>>         return 0;
> >>>>>> @@ -21,6 +26,50 @@ uint64_t __weak arch__user_reg_mask(void)
> >>>>>>         return 0;
> >>>>>>  }
> >>>>>>
> >>>>>> +uint64_t __weak arch__intr_simd_reg_mask(void)
> >>>>>> +{
> >>>>>> +       return 0;
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t __weak arch__user_simd_reg_mask(void)
> >>>>>> +{
> >>>>>> +       return 0;
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t __weak arch__intr_pred_reg_mask(void)
> >>>>>> +{
> >>>>>> +       return 0;
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t __weak arch__user_pred_reg_mask(void)
> >>>>>> +{
> >>>>>> +       return 0;
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t __weak arch__intr_simd_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
> >>>>>> +{
> >>>>>> +       *qwords = 0;
> >>>>>> +       return 0;
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t __weak arch__user_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> >>>>>> +{
> >>>>>> +       *qwords = 0;
> >>>>>> +       return 0;
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t __weak arch__intr_pred_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
> >>>>>> +{
> >>>>>> +       *qwords = 0;
> >>>>>> +       return 0;
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t __weak arch__user_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> >>>>>> +{
> >>>>>> +       *qwords = 0;
> >>>>>> +       return 0;
> >>>>>> +}
> >>>>>> +
> >>>>>>  static const struct sample_reg sample_reg_masks[] = {
> >>>>>>         SMPL_REG_END
> >>>>>>  };
> >>>>>> @@ -30,6 +79,16 @@ const struct sample_reg * __weak arch__sample_reg_masks(void)
> >>>>>>         return sample_reg_masks;
> >>>>>>  }
> >>>>>>
> >>>>>> +const struct sample_reg * __weak arch__sample_simd_reg_masks(void)
> >>>>>> +{
> >>>>>> +       return sample_reg_masks;
> >>>>>> +}
> >>>>>> +
> >>>>>> +const struct sample_reg * __weak arch__sample_pred_reg_masks(void)
> >>>>>> +{
> >>>>>> +       return sample_reg_masks;
> >>>>>> +}
> >>>>> Thinking out loud. I wonder if there is a way to hide the weak
> >>>>> functions. It seems the support is tied to PMUs, particularly core
> >>>>> PMUs, perhaps we can push things into pmu and arch pmu code. Then we
> >>>>> ask the PMU to parse the register strings, set up the perf_event_attr,
> >>>>> etc. I'm somewhat scared these functions will be used on the report
> >>>>> rather than record side of things, thereby breaking perf.data support
> >>>>> when the host kernel does or doesn't have the SIMD support.
> >>>> Ian, I don't quite follow you.
> >>>>
> >>>> I don't quite understand what "push things into pmu and arch pmu code"
> >>>> would look like. The current SIMD registers support follows the same
> >>>> approach as the general registers support. If we intend to change the
> >>>> approach entirely, we'd better do that in an independent patch set.
> >>>>
> >>>> Why would these functions break perf.data reporting? perf-report checks
> >>>> whether the PERF_SAMPLE_REGS_ABI_SIMD flag is set for each record; only
> >>>> when the flag is set (indicating SIMD register data is appended to the
> >>>> record) does perf-report try to parse the SIMD register data.
> >>> Thanks Dapeng, sorry I wasn't clear. So, I've landed clean ups to
> >>> remove weak symbols like:
> >>> https://lore.kernel.org/lkml/20250724163302.596743-21-irogers@google.com/#t
> >>>
> >>> For these patches what I'm imagining is that there is a Nova Lake
> >>> generated perf.data file. Using perf report, script, etc. on the Nova
> >>> Lake should expose all of the same mask, qword, etc. values as when
> >>> the perf.data was generated, and so things will work. If the perf.data
> >>> file is taken to, say, my Alder Lake, then what will happen? Generally,
> >>> using the arch directory and weak symbols is a code smell that
> >>> cross-platform things are going to break - there should be sufficient
> >>> data in the event and the perf_event_attr to fully decode what's going
> >>> on. Sometimes tying things to a PMU name can avoid the use of the arch
> >>> directory. We were able to avoid the arch directory to a good extent
> >>> for the TPEBS code, even though it is a very modern Intel feature.
> >> I see.
> >>
> >> But the sampling support for SIMD registers differs from the sample
> >> weight processing in the patch
> >> https://lore.kernel.org/lkml/20250724163302.596743-21-irogers@google.com/#t.
> >> Each arch may support different kinds of SIMD registers, and furthermore
> >> each kind of SIMD register may have a different register count and
> >> register width. It's quite hard to come up with common functions or
> >> fields to represent the names and attributes of these arch-specific
> >> SIMD registers. This arch-specific information can only be provided by
> >> the arch-specific code, so the __weak functions still look like the
> >> easiest way to implement this.
> >>
> >> I don't think perf.data parsing would break when moving from one
> >> platform to another of the same arch, e.g., from Nova Lake to Alder
> >> Lake. To indicate the presence of SIMD registers in record data, a new
> >> ABI flag "PERF_SAMPLE_REGS_ABI_SIMD" is introduced. If the perf tool on
> >> the 2nd platform is new enough to recognize this new flag, the SIMD
> >> register data will be parsed correctly. Even if the perf tool is old
> >> and has no SIMD register support, the SIMD register data is just
> >> silently ignored and should not break the parsing.
> > That's good to know. I'm confused then why these functions can't just
> > be within the arch directory? For example, we don't expose the
> > intel-pt PMU code in the common code except for the parsing parts. A
> > lot of that is handled by the default perf_event_attr initialization
> > that every PMU can have its own variant of:
> > https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/pmu.h?h=perf-tools-next#n123
>
> I see. From my point of view, there seems to be no essential difference
> between a function pointer and a __weak function, and it looks hard to
> find a common data structure to hold all these function pointers, which
> need to be called in different places, like register name parsing,
> register data dumping, ...
>
>
> >
> > Perhaps this is all just evidence of tech debt in the perf_regs.c code
> > :-/ The bit that's relevant to the patch here is that I think this is
> > adding to the tech debt problem as 11 more functions are added to
> > perf_regs.h.
>
> Yeah, 11 new __weak functions does seem like too many. We could merge
> functions of the same kind, like merging *_simd_reg_mask() and
> *_pred_reg_mask() into a single function with a type argument; that
> would cut the newly added __weak functions in half.

There could be a good reason for 11 weak functions :-) In perf_event.h
you've added this to the sample event:
```
+        *        u64                   regs[weight(mask)];
+        *        struct {
+        *              u16 nr_vectors;
+        *              u16 vector_qwords;
+        *              u16 nr_pred;
+        *              u16 pred_qwords;
+        *              u64 data[nr_vectors * vector_qwords +
+        *                       nr_pred * pred_qwords];
+        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
+        *      } && PERF_SAMPLE_REGS_USER
```
so these things are readable/writable outside of builds with arch/x86
compiled in, which is why it seems odd that there needs to be arch
code in the common code to handle them. Similar to how I needed to get
the retirement latency parsing out of the arch/x86 directory as
potentially you could be looking at a perf.data file with retirement
latencies in it on a non-x86 platform.
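
To make that concrete: the size of the appended data is computable from
the four u16 header fields alone, with no arch knowledge at all. A
minimal sketch (the helper name is mine, not something from the patch):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical decoder helper, not part of the patch: given the SIMD
 * header fields of a sample with PERF_SAMPLE_REGS_ABI_SIMD set, return
 * how many u64 words of register data follow.  Nothing here depends on
 * arch/x86, which is the point - the record is self-describing.
 */
static size_t simd_regs_data_qwords(uint16_t nr_vectors,
				    uint16_t vector_qwords,
				    uint16_t nr_pred,
				    uint16_t pred_qwords)
{
	/* Vector registers first, then predicate registers. */
	return (size_t)nr_vectors * vector_qwords +
	       (size_t)nr_pred * pred_qwords;
}
```

So e.g. 16 XMM registers (2 qwords each) plus 8 OPMASK registers
(1 qword each) would give 40 data qwords, regardless of the platform
doing the decoding.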

Thanks,
Ian

>
> >
> > Thanks,
> > Ian
> >
> >>> Thanks,
> >>> Ian
> >>>
> >>>
> >>>
> >>>>> Thanks,
> >>>>> Ian
> >>>>>
> >>>>>> +
> >>>>>>  const char *perf_reg_name(int id, const char *arch)
> >>>>>>  {
> >>>>>>         const char *reg_name = NULL;
> >>>>>> diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
> >>>>>> index f2d0736d65cc..bce9c4cfd1bf 100644
> >>>>>> --- a/tools/perf/util/perf_regs.h
> >>>>>> +++ b/tools/perf/util/perf_regs.h
> >>>>>> @@ -24,9 +24,20 @@ enum {
> >>>>>>  };
> >>>>>>
> >>>>>>  int arch_sdt_arg_parse_op(char *old_op, char **new_op);
> >>>>>> +bool arch_has_simd_regs(u64 mask);
> >>>>>>  uint64_t arch__intr_reg_mask(void);
> >>>>>>  uint64_t arch__user_reg_mask(void);
> >>>>>>  const struct sample_reg *arch__sample_reg_masks(void);
> >>>>>> +const struct sample_reg *arch__sample_simd_reg_masks(void);
> >>>>>> +const struct sample_reg *arch__sample_pred_reg_masks(void);
> >>>>>> +uint64_t arch__intr_simd_reg_mask(void);
> >>>>>> +uint64_t arch__user_simd_reg_mask(void);
> >>>>>> +uint64_t arch__intr_pred_reg_mask(void);
> >>>>>> +uint64_t arch__user_pred_reg_mask(void);
> >>>>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>>
> >>>>>>  const char *perf_reg_name(int id, const char *arch);
> >>>>>>  int perf_reg_value(u64 *valp, struct regs_dump *regs, int id);
> >>>>>> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
> >>>>>> index ea3a6c4657ee..825ffb4cc53f 100644
> >>>>>> --- a/tools/perf/util/record.h
> >>>>>> +++ b/tools/perf/util/record.h
> >>>>>> @@ -59,7 +59,13 @@ struct record_opts {
> >>>>>>         unsigned int  user_freq;
> >>>>>>         u64           branch_stack;
> >>>>>>         u64           sample_intr_regs;
> >>>>>> +       u64           sample_intr_vec_regs;
> >>>>>>         u64           sample_user_regs;
> >>>>>> +       u64           sample_user_vec_regs;
> >>>>>> +       u16           sample_pred_regs_qwords;
> >>>>>> +       u16           sample_vec_regs_qwords;
> >>>>>> +       u16           sample_intr_pred_regs;
> >>>>>> +       u16           sample_user_pred_regs;
> >>>>>>         u64           default_interval;
> >>>>>>         u64           user_interval;
> >>>>>>         size_t        auxtrace_snapshot_size;
> >>>>>> --
> >>>>>> 2.34.1
> >>>>>>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
  2025-12-05  6:38               ` Ian Rogers
@ 2025-12-05  8:10                 ` Mi, Dapeng
  2025-12-05 16:35                   ` Ian Rogers
  0 siblings, 1 reply; 86+ messages in thread
From: Mi, Dapeng @ 2025-12-05  8:10 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang


On 12/5/2025 2:38 PM, Ian Rogers wrote:
> On Thu, Dec 4, 2025 at 8:00 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>
>> On 12/5/2025 12:16 AM, Ian Rogers wrote:
>>> On Thu, Dec 4, 2025 at 1:20 AM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>>> On 12/4/2025 3:49 PM, Ian Rogers wrote:
>>>>> On Wed, Dec 3, 2025 at 6:58 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>>>>> On 12/4/2025 8:17 AM, Ian Rogers wrote:
>>>>>>> On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
>>>>>>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>>>>>>
>>>>>>>> This patch adds support for the newly introduced SIMD register sampling
>>>>>>>> format by adding the following functions:
>>>>>>>>
>>>>>>>> uint64_t arch__intr_simd_reg_mask(void);
>>>>>>>> uint64_t arch__user_simd_reg_mask(void);
>>>>>>>> uint64_t arch__intr_pred_reg_mask(void);
>>>>>>>> uint64_t arch__user_pred_reg_mask(void);
>>>>>>>> uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>> uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>> uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>> uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>>
>>>>>>>> The arch__{intr|user}_simd_reg_mask() functions retrieve the bitmap of
>>>>>>>> supported SIMD registers, such as XMM/YMM/ZMM on x86 platforms.
>>>>>>>>
>>>>>>>> The arch__{intr|user}_pred_reg_mask() functions retrieve the bitmap of
>>>>>>>> supported PRED registers, such as OPMASK on x86 platforms.
>>>>>>>>
>>>>>>>> The arch__{intr|user}_simd_reg_bitmap_qwords() functions provide the
>>>>>>>> exact bitmap and number of qwords for a specific type of SIMD register.
>>>>>>>> For example, for XMM registers on x86 platforms, the returned bitmap is
>>>>>>>> 0xffff (XMM0 ~ XMM15) and the qwords number is 2 (128 bits for each XMM).
>>>>>>>>
>>>>>>>> The arch__{intr|user}_pred_reg_bitmap_qwords() functions provide the
>>>>>>>> exact bitmap and number of qwords for a specific type of PRED register.
>>>>>>>> For example, for OPMASK registers on x86 platforms, the returned bitmap
>>>>>>>> is 0xff (OPMASK0 ~ OPMASK7) and the qwords number is 1 (64 bits for each
>>>>>>>> OPMASK).
>>>>>>>>
>>>>>>>> Additionally, the function __parse_regs() is enhanced to support parsing
>>>>>>>> these newly introduced SIMD registers. Currently, each type of register
>>>>>>>> can only be sampled collectively; sampling a specific SIMD register is
>>>>>>>> not supported. For example, all XMM registers are sampled together rather
>>>>>>>> than sampling only XMM0.
>>>>>>>>
>>>>>>>> When multiple overlapping register types, such as XMM and YMM, are
>>>>>>>> sampled simultaneously, only the superset (YMM registers) is sampled.
>>>>>>>>
>>>>>>>> With this patch, all supported sampling registers on x86 platforms are
>>>>>>>> displayed as follows.
>>>>>>>>
>>>>>>>>  $perf record -I?
>>>>>>>>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>>>>>>>>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>>>>>>>>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>>>>>>>>
>>>>>>>>  $perf record --user-regs=?
>>>>>>>>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>>>>>>>>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>>>>>>>>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>>>>>>>>
>>>>>>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>>>>>>>> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>>>>>>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>>>>>>> ---
>>>>>>>>  tools/perf/arch/x86/util/perf_regs.c      | 470 +++++++++++++++++++++-
>>>>>>>>  tools/perf/util/evsel.c                   |  27 ++
>>>>>>>>  tools/perf/util/parse-regs-options.c      | 151 ++++++-
>>>>>>>>  tools/perf/util/perf_event_attr_fprintf.c |   6 +
>>>>>>>>  tools/perf/util/perf_regs.c               |  59 +++
>>>>>>>>  tools/perf/util/perf_regs.h               |  11 +
>>>>>>>>  tools/perf/util/record.h                  |   6 +
>>>>>>>>  7 files changed, 714 insertions(+), 16 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
>>>>>>>> index 12fd93f04802..db41430f3b07 100644
>>>>>>>> --- a/tools/perf/arch/x86/util/perf_regs.c
>>>>>>>> +++ b/tools/perf/arch/x86/util/perf_regs.c
>>>>>>>> @@ -13,6 +13,49 @@
>>>>>>>>  #include "../../../util/pmu.h"
>>>>>>>>  #include "../../../util/pmus.h"
>>>>>>>>
>>>>>>>> +static const struct sample_reg sample_reg_masks_ext[] = {
>>>>>>>> +       SMPL_REG(AX, PERF_REG_X86_AX),
>>>>>>>> +       SMPL_REG(BX, PERF_REG_X86_BX),
>>>>>>>> +       SMPL_REG(CX, PERF_REG_X86_CX),
>>>>>>>> +       SMPL_REG(DX, PERF_REG_X86_DX),
>>>>>>>> +       SMPL_REG(SI, PERF_REG_X86_SI),
>>>>>>>> +       SMPL_REG(DI, PERF_REG_X86_DI),
>>>>>>>> +       SMPL_REG(BP, PERF_REG_X86_BP),
>>>>>>>> +       SMPL_REG(SP, PERF_REG_X86_SP),
>>>>>>>> +       SMPL_REG(IP, PERF_REG_X86_IP),
>>>>>>>> +       SMPL_REG(FLAGS, PERF_REG_X86_FLAGS),
>>>>>>>> +       SMPL_REG(CS, PERF_REG_X86_CS),
>>>>>>>> +       SMPL_REG(SS, PERF_REG_X86_SS),
>>>>>>>> +#ifdef HAVE_ARCH_X86_64_SUPPORT
>>>>>>>> +       SMPL_REG(R8, PERF_REG_X86_R8),
>>>>>>>> +       SMPL_REG(R9, PERF_REG_X86_R9),
>>>>>>>> +       SMPL_REG(R10, PERF_REG_X86_R10),
>>>>>>>> +       SMPL_REG(R11, PERF_REG_X86_R11),
>>>>>>>> +       SMPL_REG(R12, PERF_REG_X86_R12),
>>>>>>>> +       SMPL_REG(R13, PERF_REG_X86_R13),
>>>>>>>> +       SMPL_REG(R14, PERF_REG_X86_R14),
>>>>>>>> +       SMPL_REG(R15, PERF_REG_X86_R15),
>>>>>>>> +       SMPL_REG(R16, PERF_REG_X86_R16),
>>>>>>>> +       SMPL_REG(R17, PERF_REG_X86_R17),
>>>>>>>> +       SMPL_REG(R18, PERF_REG_X86_R18),
>>>>>>>> +       SMPL_REG(R19, PERF_REG_X86_R19),
>>>>>>>> +       SMPL_REG(R20, PERF_REG_X86_R20),
>>>>>>>> +       SMPL_REG(R21, PERF_REG_X86_R21),
>>>>>>>> +       SMPL_REG(R22, PERF_REG_X86_R22),
>>>>>>>> +       SMPL_REG(R23, PERF_REG_X86_R23),
>>>>>>>> +       SMPL_REG(R24, PERF_REG_X86_R24),
>>>>>>>> +       SMPL_REG(R25, PERF_REG_X86_R25),
>>>>>>>> +       SMPL_REG(R26, PERF_REG_X86_R26),
>>>>>>>> +       SMPL_REG(R27, PERF_REG_X86_R27),
>>>>>>>> +       SMPL_REG(R28, PERF_REG_X86_R28),
>>>>>>>> +       SMPL_REG(R29, PERF_REG_X86_R29),
>>>>>>>> +       SMPL_REG(R30, PERF_REG_X86_R30),
>>>>>>>> +       SMPL_REG(R31, PERF_REG_X86_R31),
>>>>>>>> +       SMPL_REG(SSP, PERF_REG_X86_SSP),
>>>>>>>> +#endif
>>>>>>>> +       SMPL_REG_END
>>>>>>>> +};
>>>>>>>> +
>>>>>>>>  static const struct sample_reg sample_reg_masks[] = {
>>>>>>>>         SMPL_REG(AX, PERF_REG_X86_AX),
>>>>>>>>         SMPL_REG(BX, PERF_REG_X86_BX),
>>>>>>>> @@ -276,27 +319,404 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
>>>>>>>>         return SDT_ARG_VALID;
>>>>>>>>  }
>>>>>>>>
>>>>>>>> +static bool support_simd_reg(u64 sample_type, u16 qwords, u64 mask, bool pred)
>>>>>>> To make the code easier to read, it'd be nice to document sample_type,
>>>>>>> qwords and mask here.
>>>>>> Sure.
>>>>>>
>>>>>>
>>>>>>>> +{
>>>>>>>> +       struct perf_event_attr attr = {
>>>>>>>> +               .type                           = PERF_TYPE_HARDWARE,
>>>>>>>> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
>>>>>>>> +               .sample_type                    = sample_type,
>>>>>>>> +               .disabled                       = 1,
>>>>>>>> +               .exclude_kernel                 = 1,
>>>>>>>> +               .sample_simd_regs_enabled       = 1,
>>>>>>>> +       };
>>>>>>>> +       int fd;
>>>>>>>> +
>>>>>>>> +       attr.sample_period = 1;
>>>>>>>> +
>>>>>>>> +       if (!pred) {
>>>>>>>> +               attr.sample_simd_vec_reg_qwords = qwords;
>>>>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>>>> +                       attr.sample_simd_vec_reg_intr = mask;
>>>>>>>> +               else
>>>>>>>> +                       attr.sample_simd_vec_reg_user = mask;
>>>>>>>> +       } else {
>>>>>>>> +               attr.sample_simd_pred_reg_qwords = PERF_X86_OPMASK_QWORDS;
>>>>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>>>> +                       attr.sample_simd_pred_reg_intr = PERF_X86_SIMD_PRED_MASK;
>>>>>>>> +               else
>>>>>>>> +                       attr.sample_simd_pred_reg_user = PERF_X86_SIMD_PRED_MASK;
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       if (perf_pmus__num_core_pmus() > 1) {
>>>>>>>> +               struct perf_pmu *pmu = NULL;
>>>>>>>> +               __u64 type = PERF_TYPE_RAW;
>>>>>>> It should be okay to do:
>>>>>>> __u64 type = perf_pmus__find_core_pmu()->type
>>>>>>> rather than have the whole loop below.
>>>>>> Sure. Thanks.
>>>>>>
>>>>>>
>>>>>>>> +
>>>>>>>> +               /*
>>>>>>>> +                * The same register set is supported among different hybrid PMUs.
>>>>>>>> +                * Only check the first available one.
>>>>>>>> +                */
>>>>>>>> +               while ((pmu = perf_pmus__scan_core(pmu)) != NULL) {
>>>>>>>> +                       type = pmu->type;
>>>>>>>> +                       break;
>>>>>>>> +               }
>>>>>>>> +               attr.config |= type << PERF_PMU_TYPE_SHIFT;
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       event_attr_init(&attr);
>>>>>>>> +
>>>>>>>> +       fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
>>>>>>>> +       if (fd != -1) {
>>>>>>>> +               close(fd);
>>>>>>>> +               return true;
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       return false;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static bool __arch_simd_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
>>>>>>>> +{
>>>>>>>> +       bool supported = false;
>>>>>>>> +       u64 bits;
>>>>>>>> +
>>>>>>>> +       *mask = 0;
>>>>>>>> +       *qwords = 0;
>>>>>>>> +
>>>>>>>> +       switch (reg) {
>>>>>>>> +       case PERF_REG_X86_XMM:
>>>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
>>>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_XMM_QWORDS, bits, false);
>>>>>>>> +               if (supported) {
>>>>>>>> +                       *mask = bits;
>>>>>>>> +                       *qwords = PERF_X86_XMM_QWORDS;
>>>>>>>> +               }
>>>>>>>> +               break;
>>>>>>>> +       case PERF_REG_X86_YMM:
>>>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_YMM_REGS) - 1;
>>>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_YMM_QWORDS, bits, false);
>>>>>>>> +               if (supported) {
>>>>>>>> +                       *mask = bits;
>>>>>>>> +                       *qwords = PERF_X86_YMM_QWORDS;
>>>>>>>> +               }
>>>>>>>> +               break;
>>>>>>>> +       case PERF_REG_X86_ZMM:
>>>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_ZMM_REGS) - 1;
>>>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
>>>>>>>> +               if (supported) {
>>>>>>>> +                       *mask = bits;
>>>>>>>> +                       *qwords = PERF_X86_ZMM_QWORDS;
>>>>>>>> +                       break;
>>>>>>>> +               }
>>>>>>>> +
>>>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_ZMMH_REGS) - 1;
>>>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
>>>>>>>> +               if (supported) {
>>>>>>>> +                       *mask = bits;
>>>>>>>> +                       *qwords = PERF_X86_ZMMH_QWORDS;
>>>>>>>> +               }
>>>>>>>> +               break;
>>>>>>>> +       default:
>>>>>>>> +               break;
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       return supported;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static bool __arch_pred_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
>>>>>>>> +{
>>>>>>>> +       bool supported = false;
>>>>>>>> +       u64 bits;
>>>>>>>> +
>>>>>>>> +       *mask = 0;
>>>>>>>> +       *qwords = 0;
>>>>>>>> +
>>>>>>>> +       switch (reg) {
>>>>>>>> +       case PERF_REG_X86_OPMASK:
>>>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_OPMASK_REGS) - 1;
>>>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_OPMASK_QWORDS, bits, true);
>>>>>>>> +               if (supported) {
>>>>>>>> +                       *mask = bits;
>>>>>>>> +                       *qwords = PERF_X86_OPMASK_QWORDS;
>>>>>>>> +               }
>>>>>>>> +               break;
>>>>>>>> +       default:
>>>>>>>> +               break;
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       return supported;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static bool has_cap_simd_regs(void)
>>>>>>>> +{
>>>>>>>> +       uint64_t mask = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
>>>>>>>> +       u16 qwords = PERF_X86_XMM_QWORDS;
>>>>>>>> +       static bool has_cap_simd_regs;
>>>>>>>> +       static bool cached;
>>>>>>>> +
>>>>>>>> +       if (cached)
>>>>>>>> +               return has_cap_simd_regs;
>>>>>>>> +
>>>>>>>> +       has_cap_simd_regs = __arch_simd_reg_mask(PERF_SAMPLE_REGS_INTR,
>>>>>>>> +                                                PERF_REG_X86_XMM, &mask, &qwords);
>>>>>>>> +       has_cap_simd_regs |= __arch_simd_reg_mask(PERF_SAMPLE_REGS_USER,
>>>>>>>> +                                                PERF_REG_X86_XMM, &mask, &qwords);
>>>>>>>> +       cached = true;
>>>>>>>> +
>>>>>>>> +       return has_cap_simd_regs;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +bool arch_has_simd_regs(u64 mask)
>>>>>>>> +{
>>>>>>>> +       return has_cap_simd_regs() &&
>>>>>>>> +              mask & GENMASK_ULL(PERF_REG_X86_SSP, PERF_REG_X86_R16);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static const struct sample_reg sample_simd_reg_masks[] = {
>>>>>>>> +       SMPL_REG(XMM, PERF_REG_X86_XMM),
>>>>>>>> +       SMPL_REG(YMM, PERF_REG_X86_YMM),
>>>>>>>> +       SMPL_REG(ZMM, PERF_REG_X86_ZMM),
>>>>>>>> +       SMPL_REG_END
>>>>>>>> +};
>>>>>>>> +
>>>>>>>> +static const struct sample_reg sample_pred_reg_masks[] = {
>>>>>>>> +       SMPL_REG(OPMASK, PERF_REG_X86_OPMASK),
>>>>>>>> +       SMPL_REG_END
>>>>>>>> +};
>>>>>>>> +
>>>>>>>> +const struct sample_reg *arch__sample_simd_reg_masks(void)
>>>>>>>> +{
>>>>>>>> +       return sample_simd_reg_masks;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +const struct sample_reg *arch__sample_pred_reg_masks(void)
>>>>>>>> +{
>>>>>>>> +       return sample_pred_reg_masks;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static bool x86_intr_simd_updated;
>>>>>>>> +static u64 x86_intr_simd_reg_mask;
>>>>>>>> +static u64 x86_intr_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
>>>>>>>> +static u16 x86_intr_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
> >>>>>>> Could we add some comments? I can kind of figure out that "updated"
> >>>>>>> is a check for lazy initialization, and what the masks are; qwords
> >>>>>>> is an odd one. The comment could also point out that SIMD here
> >>>>>>> doesn't mean the machine supports SIMD, but that SIMD registers are
> >>>>>>> supported in perf events.
>>>>>> Sure.
>>>>>>
>>>>>>
>>>>>>>> +static bool x86_user_simd_updated;
>>>>>>>> +static u64 x86_user_simd_reg_mask;
>>>>>>>> +static u64 x86_user_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
>>>>>>>> +static u16 x86_user_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
>>>>>>>> +
>>>>>>>> +static bool x86_intr_pred_updated;
>>>>>>>> +static u64 x86_intr_pred_reg_mask;
>>>>>>>> +static u64 x86_intr_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
>>>>>>>> +static u16 x86_intr_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
>>>>>>>> +static bool x86_user_pred_updated;
>>>>>>>> +static u64 x86_user_pred_reg_mask;
>>>>>>>> +static u64 x86_user_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
>>>>>>>> +static u16 x86_user_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
>>>>>>>> +
>>>>>>>> +static uint64_t __arch__simd_reg_mask(u64 sample_type)
>>>>>>>> +{
>>>>>>>> +       const struct sample_reg *r = NULL;
>>>>>>>> +       bool supported;
>>>>>>>> +       u64 mask = 0;
>>>>>>>> +       int reg;
>>>>>>>> +
>>>>>>>> +       if (!has_cap_simd_regs())
>>>>>>>> +               return 0;
>>>>>>>> +
>>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_simd_updated)
>>>>>>>> +               return x86_intr_simd_reg_mask;
>>>>>>>> +
>>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_simd_updated)
>>>>>>>> +               return x86_user_simd_reg_mask;
>>>>>>>> +
>>>>>>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>>>>>>>> +               supported = false;
>>>>>>>> +
>>>>>>>> +               if (!r->mask)
>>>>>>>> +                       continue;
>>>>>>>> +               reg = fls64(r->mask) - 1;
>>>>>>>> +
>>>>>>>> +               if (reg >= PERF_REG_X86_MAX_SIMD_REGS)
>>>>>>>> +                       break;
>>>>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>>>> +                       supported = __arch_simd_reg_mask(sample_type, reg,
>>>>>>>> +                                                        &x86_intr_simd_mask[reg],
>>>>>>>> +                                                        &x86_intr_simd_qwords[reg]);
>>>>>>>> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
>>>>>>>> +                       supported = __arch_simd_reg_mask(sample_type, reg,
>>>>>>>> +                                                        &x86_user_simd_mask[reg],
>>>>>>>> +                                                        &x86_user_simd_qwords[reg]);
>>>>>>>> +               if (supported)
>>>>>>>> +                       mask |= BIT_ULL(reg);
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
>>>>>>>> +               x86_intr_simd_reg_mask = mask;
>>>>>>>> +               x86_intr_simd_updated = true;
>>>>>>>> +       } else {
>>>>>>>> +               x86_user_simd_reg_mask = mask;
>>>>>>>> +               x86_user_simd_updated = true;
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       return mask;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static uint64_t __arch__pred_reg_mask(u64 sample_type)
>>>>>>>> +{
>>>>>>>> +       const struct sample_reg *r = NULL;
>>>>>>>> +       bool supported;
>>>>>>>> +       u64 mask = 0;
>>>>>>>> +       int reg;
>>>>>>>> +
>>>>>>>> +       if (!has_cap_simd_regs())
>>>>>>>> +               return 0;
>>>>>>>> +
>>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_pred_updated)
>>>>>>>> +               return x86_intr_pred_reg_mask;
>>>>>>>> +
>>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_pred_updated)
>>>>>>>> +               return x86_user_pred_reg_mask;
>>>>>>>> +
>>>>>>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>>>>>>>> +               supported = false;
>>>>>>>> +
>>>>>>>> +               if (!r->mask)
>>>>>>>> +                       continue;
>>>>>>>> +               reg = fls64(r->mask) - 1;
>>>>>>>> +
>>>>>>>> +               if (reg >= PERF_REG_X86_MAX_PRED_REGS)
>>>>>>>> +                       break;
>>>>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>>>> +                       supported = __arch_pred_reg_mask(sample_type, reg,
>>>>>>>> +                                                        &x86_intr_pred_mask[reg],
>>>>>>>> +                                                        &x86_intr_pred_qwords[reg]);
>>>>>>>> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
>>>>>>>> +                       supported = __arch_pred_reg_mask(sample_type, reg,
>>>>>>>> +                                                        &x86_user_pred_mask[reg],
>>>>>>>> +                                                        &x86_user_pred_qwords[reg]);
>>>>>>>> +               if (supported)
>>>>>>>> +                       mask |= BIT_ULL(reg);
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
>>>>>>>> +               x86_intr_pred_reg_mask = mask;
>>>>>>>> +               x86_intr_pred_updated = true;
>>>>>>>> +       } else {
>>>>>>>> +               x86_user_pred_reg_mask = mask;
>>>>>>>> +               x86_user_pred_updated = true;
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       return mask;
>>>>>>>> +}
>>>>>>> This feels repetitive with __arch__simd_reg_mask, could they be
>>>>>>> refactored together?
> >>>>>> Hmm, it looks like we can extract the for loop into a common function.
> >>>>>> The other parts are hard to generalize since they manipulate different
> >>>>>> variables. If we want to generalize them, we'd have to introduce lots
> >>>>>> of "if ... else" branches, which would make the code hard to read.
>>>>>>
>>>>>>
>>>>>>>> +
>>>>>>>> +uint64_t arch__intr_simd_reg_mask(void)
>>>>>>>> +{
>>>>>>>> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_INTR);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t arch__user_simd_reg_mask(void)
>>>>>>>> +{
>>>>>>>> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_USER);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t arch__intr_pred_reg_mask(void)
>>>>>>>> +{
>>>>>>>> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_INTR);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t arch__user_pred_reg_mask(void)
>>>>>>>> +{
>>>>>>>> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_USER);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static uint64_t arch__simd_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
>>>>>>>> +{
>>>>>>>> +       uint64_t mask = 0;
>>>>>>>> +
>>>>>>>> +       *qwords = 0;
>>>>>>>> +       if (reg < PERF_REG_X86_MAX_SIMD_REGS) {
>>>>>>>> +               if (intr) {
>>>>>>>> +                       *qwords = x86_intr_simd_qwords[reg];
>>>>>>>> +                       mask = x86_intr_simd_mask[reg];
>>>>>>>> +               } else {
>>>>>>>> +                       *qwords = x86_user_simd_qwords[reg];
>>>>>>>> +                       mask = x86_user_simd_mask[reg];
>>>>>>>> +               }
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       return mask;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static uint64_t arch__pred_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
>>>>>>>> +{
>>>>>>>> +       uint64_t mask = 0;
>>>>>>>> +
>>>>>>>> +       *qwords = 0;
>>>>>>>> +       if (reg < PERF_REG_X86_MAX_PRED_REGS) {
>>>>>>>> +               if (intr) {
>>>>>>>> +                       *qwords = x86_intr_pred_qwords[reg];
>>>>>>>> +                       mask = x86_intr_pred_mask[reg];
>>>>>>>> +               } else {
>>>>>>>> +                       *qwords = x86_user_pred_qwords[reg];
>>>>>>>> +                       mask = x86_user_pred_mask[reg];
>>>>>>>> +               }
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       return mask;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>>>> +{
>>>>>>>> +       if (!x86_intr_simd_updated)
>>>>>>>> +               arch__intr_simd_reg_mask();
>>>>>>>> +       return arch__simd_reg_bitmap_qwords(reg, qwords, true);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>>>> +{
>>>>>>>> +       if (!x86_user_simd_updated)
>>>>>>>> +               arch__user_simd_reg_mask();
>>>>>>>> +       return arch__simd_reg_bitmap_qwords(reg, qwords, false);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>>>> +{
>>>>>>>> +       if (!x86_intr_pred_updated)
>>>>>>>> +               arch__intr_pred_reg_mask();
>>>>>>>> +       return arch__pred_reg_bitmap_qwords(reg, qwords, true);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>>>> +{
>>>>>>>> +       if (!x86_user_pred_updated)
>>>>>>>> +               arch__user_pred_reg_mask();
>>>>>>>> +       return arch__pred_reg_bitmap_qwords(reg, qwords, false);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>  const struct sample_reg *arch__sample_reg_masks(void)
>>>>>>>>  {
>>>>>>>> +       if (has_cap_simd_regs())
>>>>>>>> +               return sample_reg_masks_ext;
>>>>>>>>         return sample_reg_masks;
>>>>>>>>  }
>>>>>>>>
>>>>>>>> -uint64_t arch__intr_reg_mask(void)
>>>>>>>> +static uint64_t __arch__reg_mask(u64 sample_type, u64 mask, bool has_simd_regs)
>>>>>>>>  {
>>>>>>>>         struct perf_event_attr attr = {
>>>>>>>> -               .type                   = PERF_TYPE_HARDWARE,
>>>>>>>> -               .config                 = PERF_COUNT_HW_CPU_CYCLES,
>>>>>>>> -               .sample_type            = PERF_SAMPLE_REGS_INTR,
>>>>>>>> -               .sample_regs_intr       = PERF_REG_EXTENDED_MASK,
>>>>>>>> -               .precise_ip             = 1,
>>>>>>>> -               .disabled               = 1,
>>>>>>>> -               .exclude_kernel         = 1,
>>>>>>>> +               .type                           = PERF_TYPE_HARDWARE,
>>>>>>>> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
>>>>>>>> +               .sample_type                    = sample_type,
>>>>>>>> +               .precise_ip                     = 1,
>>>>>>>> +               .disabled                       = 1,
>>>>>>>> +               .exclude_kernel                 = 1,
>>>>>>>> +               .sample_simd_regs_enabled       = has_simd_regs,
>>>>>>>>         };
>>>>>>>>         int fd;
>>>>>>>>         /*
>>>>>>>>          * In an unnamed union, init it here to build on older gcc versions
>>>>>>>>          */
>>>>>>>>         attr.sample_period = 1;
>>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>>>> +               attr.sample_regs_intr = mask;
>>>>>>>> +       else
>>>>>>>> +               attr.sample_regs_user = mask;
>>>>>>>>
>>>>>>>>         if (perf_pmus__num_core_pmus() > 1) {
>>>>>>>>                 struct perf_pmu *pmu = NULL;
>>>>>>>> @@ -318,13 +738,41 @@ uint64_t arch__intr_reg_mask(void)
>>>>>>>>         fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
>>>>>>>>         if (fd != -1) {
>>>>>>>>                 close(fd);
>>>>>>>> -               return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
>>>>>>>> +               return mask;
>>>>>>>>         }
>>>>>>>>
>>>>>>>> -       return PERF_REGS_MASK;
>>>>>>>> +       return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t arch__intr_reg_mask(void)
>>>>>>>> +{
>>>>>>>> +       uint64_t mask = PERF_REGS_MASK;
>>>>>>>> +
>>>>>>>> +       if (has_cap_simd_regs()) {
>>>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
>>>>>>>> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
>>>>>>>> +                                        true);
>>>>>>> It's nice to label constant arguments like this something like:
>>>>>>> /*has_simd_regs=*/true);
>>>>>>>
>>>>>>> Tools like clang-tidy even try to enforce the argument names match the comments.
>>>>>> Sure.
>>>>>>
>>>>>>
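[The convention Ian suggests looks like this in practice — a hypothetical call, not the patch itself. clang-tidy's bugprone-argument-comment check can verify that the comment matches the parameter name.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for __arch__reg_mask() with a boolean parameter. */
static uint64_t reg_mask(uint64_t mask, bool has_simd_regs)
{
	return has_simd_regs ? mask : 0;
}

static uint64_t demo(void)
{
	/* The comment names the parameter, so the bare 'true' stays readable. */
	return reg_mask(0xf0, /*has_simd_regs=*/true);
}
```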
>>>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
>>>>>>>> +                                        BIT_ULL(PERF_REG_X86_SSP),
>>>>>>>> +                                        true);
>>>>>>>> +       } else
>>>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR, PERF_REG_EXTENDED_MASK, false);
>>>>>>>> +
>>>>>>>> +       return mask;
>>>>>>>>  }
>>>>>>>>
>>>>>>>>  uint64_t arch__user_reg_mask(void)
>>>>>>>>  {
>>>>>>>> -       return PERF_REGS_MASK;
>>>>>>>> +       uint64_t mask = PERF_REGS_MASK;
>>>>>>>> +
>>>>>>>> +       if (has_cap_simd_regs()) {
>>>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
>>>>>>>> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
>>>>>>>> +                                        true);
>>>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
>>>>>>>> +                                        BIT_ULL(PERF_REG_X86_SSP),
>>>>>>>> +                                        true);
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       return mask;
>>>>>>> The code is repetitive here, could we refactor into a single function
>>>>>>> passing in a user or instr value?
>>>>>> Sure. Would extract the common part.
>>>>>>
>>>>>>
>>>>>>>>  }
>>>>>>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>>>>>>>> index 56ebefd075f2..5d1d90cf9488 100644
>>>>>>>> --- a/tools/perf/util/evsel.c
>>>>>>>> +++ b/tools/perf/util/evsel.c
>>>>>>>> @@ -1461,12 +1461,39 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
>>>>>>>>         if (opts->sample_intr_regs && !evsel->no_aux_samples &&
>>>>>>>>             !evsel__is_dummy_event(evsel)) {
>>>>>>>>                 attr->sample_regs_intr = opts->sample_intr_regs;
>>>>>>>> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_intr);
>>>>>>>> +               evsel__set_sample_bit(evsel, REGS_INTR);
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       if ((opts->sample_intr_vec_regs || opts->sample_intr_pred_regs) &&
>>>>>>>> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
>>>>>>>> +               /* A non-zero pred reg qwords implies the SIMD register set is used */
>>>>>>>> +               if (opts->sample_pred_regs_qwords)
>>>>>>>> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
>>>>>>>> +               else
>>>>>>>> +                       attr->sample_simd_pred_reg_qwords = 1;
>>>>>>>> +               attr->sample_simd_vec_reg_intr = opts->sample_intr_vec_regs;
>>>>>>>> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
>>>>>>>> +               attr->sample_simd_pred_reg_intr = opts->sample_intr_pred_regs;
>>>>>>>>                 evsel__set_sample_bit(evsel, REGS_INTR);
>>>>>>>>         }
>>>>>>>>
>>>>>>>>         if (opts->sample_user_regs && !evsel->no_aux_samples &&
>>>>>>>>             !evsel__is_dummy_event(evsel)) {
>>>>>>>>                 attr->sample_regs_user |= opts->sample_user_regs;
>>>>>>>> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_user);
>>>>>>>> +               evsel__set_sample_bit(evsel, REGS_USER);
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       if ((opts->sample_user_vec_regs || opts->sample_user_pred_regs) &&
>>>>>>>> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
>>>>>>>> +               if (opts->sample_pred_regs_qwords)
>>>>>>>> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
>>>>>>>> +               else
>>>>>>>> +                       attr->sample_simd_pred_reg_qwords = 1;
>>>>>>>> +               attr->sample_simd_vec_reg_user = opts->sample_user_vec_regs;
>>>>>>>> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
>>>>>>>> +               attr->sample_simd_pred_reg_user = opts->sample_user_pred_regs;
>>>>>>>>                 evsel__set_sample_bit(evsel, REGS_USER);
>>>>>>>>         }
>>>>>>>>
>>>>>>>> diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
>>>>>>>> index cda1c620968e..0bd100392889 100644
>>>>>>>> --- a/tools/perf/util/parse-regs-options.c
>>>>>>>> +++ b/tools/perf/util/parse-regs-options.c
>>>>>>>> @@ -4,19 +4,139 @@
>>>>>>>>  #include <stdint.h>
>>>>>>>>  #include <string.h>
>>>>>>>>  #include <stdio.h>
>>>>>>>> +#include <linux/bitops.h>
>>>>>>>>  #include "util/debug.h"
>>>>>>>>  #include <subcmd/parse-options.h>
>>>>>>>>  #include "util/perf_regs.h"
>>>>>>>>  #include "util/parse-regs-options.h"
>>>>>>>> +#include "record.h"
>>>>>>>> +
>>>>>>>> +static void __print_simd_regs(bool intr, uint64_t simd_mask)
>>>>>>>> +{
>>>>>>>> +       const struct sample_reg *r = NULL;
>>>>>>>> +       uint64_t bitmap = 0;
>>>>>>>> +       u16 qwords = 0;
>>>>>>>> +       int reg_idx;
>>>>>>>> +
>>>>>>>> +       if (!simd_mask)
>>>>>>>> +               return;
>>>>>>>> +
>>>>>>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>>>>>>>> +               if (!(r->mask & simd_mask))
>>>>>>>> +                       continue;
>>>>>>>> +               reg_idx = fls64(r->mask) - 1;
>>>>>>>> +               if (intr)
>>>>>>>> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>> +               else
>>>>>>>> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>> +               if (bitmap)
>>>>>>>> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
>>>>>>>> +       }
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static void __print_pred_regs(bool intr, uint64_t pred_mask)
>>>>>>>> +{
>>>>>>>> +       const struct sample_reg *r = NULL;
>>>>>>>> +       uint64_t bitmap = 0;
>>>>>>>> +       u16 qwords = 0;
>>>>>>>> +       int reg_idx;
>>>>>>>> +
>>>>>>>> +       if (!pred_mask)
>>>>>>>> +               return;
>>>>>>>> +
>>>>>>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>>>>>>>> +               if (!(r->mask & pred_mask))
>>>>>>>> +                       continue;
>>>>>>>> +               reg_idx = fls64(r->mask) - 1;
>>>>>>>> +               if (intr)
>>>>>>>> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>> +               else
>>>>>>>> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>> +               if (bitmap)
>>>>>>>> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
>>>>>>>> +       }
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static bool __parse_simd_regs(struct record_opts *opts, char *s, bool intr)
>>>>>>>> +{
>>>>>>>> +       const struct sample_reg *r = NULL;
>>>>>>>> +       bool matched = false;
>>>>>>>> +       uint64_t bitmap = 0;
>>>>>>>> +       u16 qwords = 0;
>>>>>>>> +       int reg_idx;
>>>>>>>> +
>>>>>>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>>>>>>>> +               if (strcasecmp(s, r->name))
>>>>>>>> +                       continue;
>>>>>>>> +               if (!fls64(r->mask))
>>>>>>>> +                       continue;
>>>>>>>> +               reg_idx = fls64(r->mask) - 1;
>>>>>>>> +               if (intr)
>>>>>>>> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>> +               else
>>>>>>>> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>> +               matched = true;
>>>>>>>> +               break;
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       /* Just need the highest qwords */
>>>>>>> I'm not following here. Does the bitmap need to handle gaps?
>>>>>> Currently no. In theory, the kernel allows user space to sample only a
>>>>>> subset of SIMD registers, e.g., 0xff or 0xf0f for XMM registers (the HW
>>>>>> supports 16 XMM registers), but that isn't supported here, to avoid
>>>>>> introducing too much complexity into the perf tools. Moreover, I don't
>>>>>> think end users have such a requirement. In most cases, users only know
>>>>>> which kinds of SIMD registers their programs use, and usually don't know
>>>>>> or care about exactly which SIMD register is used.
>>>>>>
>>>>>>
>>>>>>>> +       if (qwords > opts->sample_vec_regs_qwords) {
>>>>>>>> +               opts->sample_vec_regs_qwords = qwords;
>>>>>>>> +               if (intr)
>>>>>>>> +                       opts->sample_intr_vec_regs = bitmap;
>>>>>>>> +               else
>>>>>>>> +                       opts->sample_user_vec_regs = bitmap;
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       return matched;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static bool __parse_pred_regs(struct record_opts *opts, char *s, bool intr)
>>>>>>>> +{
>>>>>>>> +       const struct sample_reg *r = NULL;
>>>>>>>> +       bool matched = false;
>>>>>>>> +       uint64_t bitmap = 0;
>>>>>>>> +       u16 qwords = 0;
>>>>>>>> +       int reg_idx;
>>>>>>>> +
>>>>>>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>>>>>>>> +               if (strcasecmp(s, r->name))
>>>>>>>> +                       continue;
>>>>>>>> +               if (!fls64(r->mask))
>>>>>>>> +                       continue;
>>>>>>>> +               reg_idx = fls64(r->mask) - 1;
>>>>>>>> +               if (intr)
>>>>>>>> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>> +               else
>>>>>>>> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>> +               matched = true;
>>>>>>>> +               break;
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       /* Just need the highest qwords */
>>>>>>> Again repetitive, could we have a single function?
>>>>>> Yes, I suppose at least the for loop can be extracted into a common function.
>>>>>>
>>>>>>
>>>>>>>> +       if (qwords > opts->sample_pred_regs_qwords) {
>>>>>>>> +               opts->sample_pred_regs_qwords = qwords;
>>>>>>>> +               if (intr)
>>>>>>>> +                       opts->sample_intr_pred_regs = bitmap;
>>>>>>>> +               else
>>>>>>>> +                       opts->sample_user_pred_regs = bitmap;
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       return matched;
>>>>>>>> +}
>>>>>>>>
>>>>>>>>  static int
>>>>>>>>  __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>>>  {
>>>>>>>>         uint64_t *mode = (uint64_t *)opt->value;
>>>>>>>>         const struct sample_reg *r = NULL;
>>>>>>>> +       struct record_opts *opts;
>>>>>>>>         char *s, *os = NULL, *p;
>>>>>>>> -       int ret = -1;
>>>>>>>> +       bool has_simd_regs = false;
>>>>>>>>         uint64_t mask;
>>>>>>>> +       uint64_t simd_mask;
>>>>>>>> +       uint64_t pred_mask;
>>>>>>>> +       int ret = -1;
>>>>>>>>
>>>>>>>>         if (unset)
>>>>>>>>                 return 0;
>>>>>>>> @@ -27,10 +147,17 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>>>         if (*mode)
>>>>>>>>                 return -1;
>>>>>>>>
>>>>>>>> -       if (intr)
>>>>>>>> +       if (intr) {
>>>>>>>> +               opts = container_of(opt->value, struct record_opts, sample_intr_regs);
>>>>>>>>                 mask = arch__intr_reg_mask();
>>>>>>>> -       else
>>>>>>>> +               simd_mask = arch__intr_simd_reg_mask();
>>>>>>>> +               pred_mask = arch__intr_pred_reg_mask();
>>>>>>>> +       } else {
>>>>>>>> +               opts = container_of(opt->value, struct record_opts, sample_user_regs);
>>>>>>>>                 mask = arch__user_reg_mask();
>>>>>>>> +               simd_mask = arch__user_simd_reg_mask();
>>>>>>>> +               pred_mask = arch__user_pred_reg_mask();
>>>>>>>> +       }
>>>>>>>>
>>>>>>>>         /* str may be NULL in case no arg is passed to -I */
>>>>>>>>         if (str) {
>>>>>>>> @@ -50,10 +177,24 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>>>                                         if (r->mask & mask)
>>>>>>>>                                                 fprintf(stderr, "%s ", r->name);
>>>>>>>>                                 }
>>>>>>>> +                               __print_simd_regs(intr, simd_mask);
>>>>>>>> +                               __print_pred_regs(intr, pred_mask);
>>>>>>>>                                 fputc('\n', stderr);
>>>>>>>>                                 /* just printing available regs */
>>>>>>>>                                 goto error;
>>>>>>>>                         }
>>>>>>>> +
>>>>>>>> +                       if (simd_mask) {
>>>>>>>> +                               has_simd_regs = __parse_simd_regs(opts, s, intr);
>>>>>>>> +                               if (has_simd_regs)
>>>>>>>> +                                       goto next;
>>>>>>>> +                       }
>>>>>>>> +                       if (pred_mask) {
>>>>>>>> +                               has_simd_regs = __parse_pred_regs(opts, s, intr);
>>>>>>>> +                               if (has_simd_regs)
>>>>>>>> +                                       goto next;
>>>>>>>> +                       }
>>>>>>>> +
>>>>>>>>                         for (r = arch__sample_reg_masks(); r->name; r++) {
>>>>>>>>                                 if ((r->mask & mask) && !strcasecmp(s, r->name))
>>>>>>>>                                         break;
>>>>>>>> @@ -65,7 +206,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>>>                         }
>>>>>>>>
>>>>>>>>                         *mode |= r->mask;
>>>>>>>> -
>>>>>>>> +next:
>>>>>>>>                         if (!p)
>>>>>>>>                                 break;
>>>>>>>>
>>>>>>>> @@ -75,7 +216,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>>>         ret = 0;
>>>>>>>>
>>>>>>>>         /* default to all possible regs */
>>>>>>>> -       if (*mode == 0)
>>>>>>>> +       if (*mode == 0 && !has_simd_regs)
>>>>>>>>                 *mode = mask;
>>>>>>>>  error:
>>>>>>>>         free(os);
>>>>>>>> diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
>>>>>>>> index 66b666d9ce64..fb0366d050cf 100644
>>>>>>>> --- a/tools/perf/util/perf_event_attr_fprintf.c
>>>>>>>> +++ b/tools/perf/util/perf_event_attr_fprintf.c
>>>>>>>> @@ -360,6 +360,12 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
>>>>>>>>         PRINT_ATTRf(aux_start_paused, p_unsigned);
>>>>>>>>         PRINT_ATTRf(aux_pause, p_unsigned);
>>>>>>>>         PRINT_ATTRf(aux_resume, p_unsigned);
>>>>>>>> +       PRINT_ATTRf(sample_simd_pred_reg_qwords, p_unsigned);
>>>>>>>> +       PRINT_ATTRf(sample_simd_pred_reg_intr, p_hex);
>>>>>>>> +       PRINT_ATTRf(sample_simd_pred_reg_user, p_hex);
>>>>>>>> +       PRINT_ATTRf(sample_simd_vec_reg_qwords, p_unsigned);
>>>>>>>> +       PRINT_ATTRf(sample_simd_vec_reg_intr, p_hex);
>>>>>>>> +       PRINT_ATTRf(sample_simd_vec_reg_user, p_hex);
>>>>>>>>
>>>>>>>>         return ret;
>>>>>>>>  }
>>>>>>>> diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
>>>>>>>> index 44b90bbf2d07..e8a9fabc92e6 100644
>>>>>>>> --- a/tools/perf/util/perf_regs.c
>>>>>>>> +++ b/tools/perf/util/perf_regs.c
>>>>>>>> @@ -11,6 +11,11 @@ int __weak arch_sdt_arg_parse_op(char *old_op __maybe_unused,
>>>>>>>>         return SDT_ARG_SKIP;
>>>>>>>>  }
>>>>>>>>
>>>>>>>> +bool __weak arch_has_simd_regs(u64 mask __maybe_unused)
>>>>>>>> +{
>>>>>>>> +       return false;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>  uint64_t __weak arch__intr_reg_mask(void)
>>>>>>>>  {
>>>>>>>>         return 0;
>>>>>>>> @@ -21,6 +26,50 @@ uint64_t __weak arch__user_reg_mask(void)
>>>>>>>>         return 0;
>>>>>>>>  }
>>>>>>>>
>>>>>>>> +uint64_t __weak arch__intr_simd_reg_mask(void)
>>>>>>>> +{
>>>>>>>> +       return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t __weak arch__user_simd_reg_mask(void)
>>>>>>>> +{
>>>>>>>> +       return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t __weak arch__intr_pred_reg_mask(void)
>>>>>>>> +{
>>>>>>>> +       return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t __weak arch__user_pred_reg_mask(void)
>>>>>>>> +{
>>>>>>>> +       return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t __weak arch__intr_simd_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
>>>>>>>> +{
>>>>>>>> +       *qwords = 0;
>>>>>>>> +       return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t __weak arch__user_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>>>>>>>> +{
>>>>>>>> +       *qwords = 0;
>>>>>>>> +       return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t __weak arch__intr_pred_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
>>>>>>>> +{
>>>>>>>> +       *qwords = 0;
>>>>>>>> +       return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t __weak arch__user_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>>>>>>>> +{
>>>>>>>> +       *qwords = 0;
>>>>>>>> +       return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>  static const struct sample_reg sample_reg_masks[] = {
>>>>>>>>         SMPL_REG_END
>>>>>>>>  };
>>>>>>>> @@ -30,6 +79,16 @@ const struct sample_reg * __weak arch__sample_reg_masks(void)
>>>>>>>>         return sample_reg_masks;
>>>>>>>>  }
>>>>>>>>
>>>>>>>> +const struct sample_reg * __weak arch__sample_simd_reg_masks(void)
>>>>>>>> +{
>>>>>>>> +       return sample_reg_masks;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +const struct sample_reg * __weak arch__sample_pred_reg_masks(void)
>>>>>>>> +{
>>>>>>>> +       return sample_reg_masks;
>>>>>>>> +}
>>>>>>> Thinking out loud. I wonder if there is a way to hide the weak
>>>>>>> functions. It seems the support is tied to PMUs, particularly core
>>>>>>> PMUs, perhaps we can push things into pmu and arch pmu code. Then we
>>>>>>> ask the PMU to parse the register strings, set up the perf_event_attr,
>>>>>>> etc. I'm somewhat scared these functions will be used on the report
>>>>>>> rather than record side of things, thereby breaking perf.data support
>>>>>>> when the host kernel does or doesn't have the SIMD support.
>>>>>> Ian, I don't quite follow you.
>>>>>>
>>>>>> I don't quite understand what we should do to "push things into pmu and
>>>>>> arch pmu code". The current SIMD registers support follows the same
>>>>>> approach as the general registers support. If we intend to change the
>>>>>> approach entirely, we'd better do it in an independent patch-set.
>>>>>>
>>>>>> Why would these functions break perf.data reporting? perf-report checks
>>>>>> whether the PERF_SAMPLE_REGS_ABI_SIMD flag is set for each record; only
>>>>>> when the flag is set (indicating that SIMD registers data is appended to
>>>>>> the record) does perf-report try to parse the SIMD registers data.
>>>>> Thanks Dapeng, sorry I wasn't clear. So, I've landed clean ups to
>>>>> remove weak symbols like:
>>>>> https://lore.kernel.org/lkml/20250724163302.596743-21-irogers@google.com/#t
>>>>>
>>>>> For these patches what I'm imagining is that there is a Nova Lake
>>>>> generated perf.data file. Using perf report, script, etc. on the Nova
>>>>> Lake should expose all of the same mask, qword, etc. values as when
>>>>> the perf.data was generated and so things will work. If the perf.data
>>>>> file was taken to say my Alderlake then what will happen? Generally
>>>>> using the arch directory and weak symbols is a code smell that cross
>>>>> platform things are going to break - there should be sufficient data
>>>>> in the event and the perf_event_attr to fully decode what's going on.
>>>>> Sometimes tying things to a PMU name can avoid the use of the arch
>>>>> directory. We were able to avoid the arch directory to a good extent
>>>>> for the TPEBS code, even though it is a very modern Intel feature.
>>>> I see.
>>>>
>>>> But the sampling support for SIMD registers is different from the sample
>>>> weight processing in the patch
>>>> https://lore.kernel.org/lkml/20250724163302.596743-21-irogers@google.com/#t.
>>>> Each arch may support different kinds of SIMD registers, and furthermore
>>>> each kind of SIMD register may have a different register count and register
>>>> width. It's quite hard to come up with common functions or fields to
>>>> represent the names and attributes of these arch-specific SIMD registers.
>>>> This arch-specific information can only be provided by the arch-specific
>>>> code, so it looks like the __weak functions are still the easiest way to
>>>> implement this.
>>>>
>>>> I don't think the perf.data parsing would break when moving from one
>>>> platform to a different platform of the same arch, e.g., from Nova Lake to
>>>> Alder Lake. To indicate the presence of SIMD registers in the record data,
>>>> a new ABI flag "PERF_SAMPLE_REGS_ABI_SIMD" is introduced. If the perf tool
>>>> on the 2nd platform is new enough to recognize this new flag, the SIMD
>>>> registers data will be parsed correctly. Even if the perf tool is old and
>>>> has no SIMD register support, the SIMD registers data will just be silently
>>>> ignored and should not break the parsing.
>>> That's good to know. I'm confused then why these functions can't just
>>> be within the arch directory? For example, we don't expose the
>>> intel-pt PMU code in the common code except for the parsing parts. A
>>> lot of that is handled by the default perf_event_attr initialization
>>> that every PMU can have its own variant of:
>>> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/pmu.h?h=perf-tools-next#n123
>> I see. From my point of view, there seems to be no essential difference
>> between a function pointer and a __weak function, and it looks hard to find
>> a common data structure to hold all these function pointers, which need to
>> be called in different places, like register name parsing, register data
>> dumping ...
>>
>>
>>> Perhaps this is all just evidence of tech debt in the perf_regs.c code
>>> :-/ The bit that's relevant to the patch here is that I think this is
>>> adding to the tech debt problem as 11 more functions are added to
>>> perf_regs.h.
>> Yeah, 11 new __weak functions does seem too much. We could merge functions
>> of the same kind, like merging *_simd_reg_mask() and *_pred_reg_mask() into
>> a single function with a type argument; then the newly added __weak
>> functions could shrink by half.
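[As a rough illustration of that merging (names here are hypothetical, not from the posted patches), the four arch__{intr,user}_{simd,pred}_reg_mask() entry points could collapse into one weak function keyed by a register class and a sample-type argument. The __weak attribute is omitted so the sketch stays self-contained.]

```c
#include <assert.h>
#include <stdint.h>

enum ext_reg_class { EXT_REG_SIMD, EXT_REG_PRED };
enum ext_reg_where { EXT_REG_INTR, EXT_REG_USER };

/*
 * Single default entry point (the real one would be marked __weak in
 * util/perf_regs.c and overridden by arch code): no extended registers
 * are supported unless the arch says otherwise.
 */
uint64_t arch__ext_reg_mask(enum ext_reg_class cls, enum ext_reg_where where)
{
	(void)cls;
	(void)where;
	return 0;
}

/* Callers keep thin wrappers so the existing call sites stay readable. */
static uint64_t intr_simd_reg_mask(void)
{
	return arch__ext_reg_mask(EXT_REG_SIMD, EXT_REG_INTR);
}
```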
> There could be a good reason for 11 weak functions :-) In the
> perf_event.h you've added to the sample event:
> ```
> +        *        u64                   regs[weight(mask)];
> +        *        struct {
> +        *              u16 nr_vectors;
> +        *              u16 vector_qwords;
> +        *              u16 nr_pred;
> +        *              u16 pred_qwords;
> +        *              u64 data[nr_vectors * vector_qwords + nr_pred
> * pred_qwords];
> +        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
> +        *      } && PERF_SAMPLE_REGS_USER
> ```
> so these things are readable/writable outside of builds with arch/x86
> compiled in, which is why it seems odd that there needs to be arch
> code in the common code to handle them. Similar to how I needed to get
> the retirement latency parsing out of the arch/x86 directory as
> potentially you could be looking at a perf.data file with retirement
> latencies in it on a non-x86 platform.

Ian, I'm not sure I fully get your point. If not, please correct me.

Although these newly introduced fields are generic and exist on all
architectures, they aren't enough to get all the information necessary to
dump or parse the SIMD registers, e.g., the SIMD register names.

Let's take dumping the sampled values of SIMD registers as an example.

We know there can be different kinds of SIMD registers on different archs,
like XMM/YMM/ZMM on x86 and V-registers/Z-registers on ARM.

Currently we only know the register count and width from the generic
fields; we have no way to directly know the exact name a SIMD register
corresponds to. We have to involve an arch-specific function to figure that
out and then print the names.

At least for now, it looks like we still need these arch-specific functions ...
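[To make the split concrete, here is a minimal sketch (illustrative names only, based just on the nr_vectors/vector_qwords/nr_pred/pred_qwords layout Ian quoted from perf_event.h): the sizes and register payloads can be decoded generically, and only the index-to-name mapping (xmm0 vs. v0, etc.) still needs arch knowledge.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Generic header preceding the SIMD payload in a sample. */
struct simd_hdr {
	uint16_t nr_vectors;
	uint16_t vector_qwords;
	uint16_t nr_pred;
	uint16_t pred_qwords;
	/* followed by: u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords] */
};

/* Total number of u64 payload words following the header. */
static size_t simd_payload_qwords(const struct simd_hdr *h)
{
	return (size_t)h->nr_vectors * h->vector_qwords +
	       (size_t)h->nr_pred * h->pred_qwords;
}

/* Pointer to vector register 'idx' inside the payload, or NULL if out of range. */
static const uint64_t *simd_vector(const struct simd_hdr *h,
				   const uint64_t *data, unsigned int idx)
{
	if (idx >= h->nr_vectors)
		return NULL;
	return data + (size_t)idx * h->vector_qwords;
}

/* Self-check: 2 vectors x 4 qwords + 1 pred x 1 qword = 9 payload words. */
static size_t demo_payload(void)
{
	struct simd_hdr h = { 2, 4, 1, 1 };

	return simd_payload_qwords(&h);
}

static uint64_t demo_second_vector_word0(void)
{
	struct simd_hdr h = { 2, 4, 1, 1 };
	uint64_t data[9] = { [4] = 42 };	/* word 0 of vector 1 */
	const uint64_t *v = simd_vector(&h, data, 1);

	return v ? v[0] : 0;
}
```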


>
> Thanks,
> Ian
>
>>> Thanks,
>>> Ian
>>>
>>>>> Thanks,
>>>>> Ian
>>>>>
>>>>>
>>>>>
>>>>>>> Thanks,
>>>>>>> Ian
>>>>>>>
>>>>>>>> +
>>>>>>>>  const char *perf_reg_name(int id, const char *arch)
>>>>>>>>  {
>>>>>>>>         const char *reg_name = NULL;
>>>>>>>> diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
>>>>>>>> index f2d0736d65cc..bce9c4cfd1bf 100644
>>>>>>>> --- a/tools/perf/util/perf_regs.h
>>>>>>>> +++ b/tools/perf/util/perf_regs.h
>>>>>>>> @@ -24,9 +24,20 @@ enum {
>>>>>>>>  };
>>>>>>>>
>>>>>>>>  int arch_sdt_arg_parse_op(char *old_op, char **new_op);
>>>>>>>> +bool arch_has_simd_regs(u64 mask);
>>>>>>>>  uint64_t arch__intr_reg_mask(void);
>>>>>>>>  uint64_t arch__user_reg_mask(void);
>>>>>>>>  const struct sample_reg *arch__sample_reg_masks(void);
>>>>>>>> +const struct sample_reg *arch__sample_simd_reg_masks(void);
>>>>>>>> +const struct sample_reg *arch__sample_pred_reg_masks(void);
>>>>>>>> +uint64_t arch__intr_simd_reg_mask(void);
>>>>>>>> +uint64_t arch__user_simd_reg_mask(void);
>>>>>>>> +uint64_t arch__intr_pred_reg_mask(void);
>>>>>>>> +uint64_t arch__user_pred_reg_mask(void);
>>>>>>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>>
>>>>>>>>  const char *perf_reg_name(int id, const char *arch);
>>>>>>>>  int perf_reg_value(u64 *valp, struct regs_dump *regs, int id);
>>>>>>>> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
>>>>>>>> index ea3a6c4657ee..825ffb4cc53f 100644
>>>>>>>> --- a/tools/perf/util/record.h
>>>>>>>> +++ b/tools/perf/util/record.h
>>>>>>>> @@ -59,7 +59,13 @@ struct record_opts {
>>>>>>>>         unsigned int  user_freq;
>>>>>>>>         u64           branch_stack;
>>>>>>>>         u64           sample_intr_regs;
>>>>>>>> +       u64           sample_intr_vec_regs;
>>>>>>>>         u64           sample_user_regs;
>>>>>>>> +       u64           sample_user_vec_regs;
>>>>>>>> +       u16           sample_pred_regs_qwords;
>>>>>>>> +       u16           sample_vec_regs_qwords;
>>>>>>>> +       u16           sample_intr_pred_regs;
>>>>>>>> +       u16           sample_user_pred_regs;
>>>>>>>>         u64           default_interval;
>>>>>>>>         u64           user_interval;
>>>>>>>>         size_t        auxtrace_snapshot_size;
>>>>>>>> --
>>>>>>>> 2.34.1
>>>>>>>>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 06/19] perf/x86: Add support for XMM registers in non-PEBS and REGS_USER
  2025-12-04 18:59     ` Dave Hansen
@ 2025-12-05  8:42       ` Peter Zijlstra
  0 siblings, 0 replies; 86+ messages in thread
From: Peter Zijlstra @ 2025-12-05  8:42 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dapeng Mi, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang

On Thu, Dec 04, 2025 at 10:59:15AM -0800, Dave Hansen wrote:
> On 12/4/25 07:17, Peter Zijlstra wrote:
> >> - Additionally, checking the TIF_NEED_FPU_LOAD flag alone is insufficient.
> >>   Some corner cases, such as an NMI occurring just after the flag switches
> >>   but still in kernel mode, cannot be handled.
> > Urgh.. Dave, Thomas, is there any reason we could not set
> > TIF_NEED_FPU_LOAD *after* doing the XSAVE (clearing is already done
> > after restore).
> > 
> > That way, when an NMI sees TIF_NEED_FPU_LOAD it knows the task copy is
> > consistent.
> 
> Something like the attached patch?
> 
> I think that would be just fine. save_fpregs_to_fpstate() doesn't
> actually change the need for TIF_NEED_FPU_LOAD, so I don't think the
> ordering matters.

Right, I missed this one. And yes, I couldn't find any site where this
ordering mattered either. It's all with interrupts disabled, so normally
it all goes together. Only the NMI could observe the difference.

> diff --git a/arch/x86/include/asm/fpu/sched.h b/arch/x86/include/asm/fpu/sched.h
> index 89004f4ca208..2d57a7bf5406 100644
> --- a/arch/x86/include/asm/fpu/sched.h
> +++ b/arch/x86/include/asm/fpu/sched.h
> @@ -36,8 +36,8 @@ static inline void switch_fpu(struct task_struct *old, int cpu)
>  	    !(old->flags & (PF_KTHREAD | PF_USER_WORKER))) {
>  		struct fpu *old_fpu = x86_task_fpu(old);
>  
> -		set_tsk_thread_flag(old, TIF_NEED_FPU_LOAD);
>  		save_fpregs_to_fpstate(old_fpu);
> +		set_tsk_thread_flag(old, TIF_NEED_FPU_LOAD);
>  		/*
>  		 * The save operation preserved register state, so the
>  		 * fpu_fpregs_owner_ctx is still @old_fpu. Store the


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 07/19] perf: Add sampling support for SIMD registers
  2025-12-03  6:54 ` [Patch v5 07/19] perf: Add sampling support for SIMD registers Dapeng Mi
@ 2025-12-05 11:07   ` Peter Zijlstra
  2025-12-08  5:24     ` Mi, Dapeng
  2025-12-05 11:40   ` Peter Zijlstra
  1 sibling, 1 reply; 86+ messages in thread
From: Peter Zijlstra @ 2025-12-05 11:07 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang

On Wed, Dec 03, 2025 at 02:54:48PM +0800, Dapeng Mi wrote:

> @@ -545,6 +547,25 @@ struct perf_event_attr {
>  	__u64	sig_data;
>  
>  	__u64	config3; /* extension of config2 */
> +
> +
> +	/*
> +	 * Defines set of SIMD registers to dump on samples.
> +	 * The sample_simd_regs_enabled !=0 implies the
> +	 * set of SIMD registers is used to config all SIMD registers.
> +	 * If !sample_simd_regs_enabled, sample_regs_XXX may be used to
> +	 * config some SIMD registers on X86.
> +	 */
> +	union {
> +		__u16 sample_simd_regs_enabled;
> +		__u16 sample_simd_pred_reg_qwords;
> +	};
> +	__u32 sample_simd_pred_reg_intr;
> +	__u32 sample_simd_pred_reg_user;
> +	__u16 sample_simd_vec_reg_qwords;
> +	__u64 sample_simd_vec_reg_intr;
> +	__u64 sample_simd_vec_reg_user;
> +	__u32 __reserved_4;
>  };

This is poorly aligned and causes holes.

This:

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index d292f96bc06f..2deb8dd0ca37 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -545,6 +545,14 @@ struct perf_event_attr {
 	__u64	sig_data;
 
 	__u64	config3; /* extension of config2 */
+
+	__u16	sample_simd_pred_reg_qwords;
+	__u32	sample_simd_pred_reg_intr;
+	__u32	sample_simd_pred_reg_user;
+	__u16	sample_simd_vec_reg_qwords;
+	__u64	sample_simd_vec_reg_intr;
+	__u64	sample_simd_vec_reg_user;
+	__u32	__reserved_4;
 };
 
 /*

results in:

        __u64                      config3;              /*   128     8 */
        __u16                      sample_simd_pred_reg_qwords; /*   136     2 */

        /* XXX 2 bytes hole, try to pack */

        __u32                      sample_simd_pred_reg_intr; /*   140     4 */
        __u32                      sample_simd_pred_reg_user; /*   144     4 */
        __u16                      sample_simd_vec_reg_qwords; /*   148     2 */

        /* XXX 2 bytes hole, try to pack */

        __u64                      sample_simd_vec_reg_intr; /*   152     8 */
        __u64                      sample_simd_vec_reg_user; /*   160     8 */
        __u32                      __reserved_4;         /*   168     4 */



A better layout might be:

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index d292f96bc06f..f72707e9df68 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -545,6 +545,15 @@ struct perf_event_attr {
 	__u64	sig_data;
 
 	__u64	config3; /* extension of config2 */
+
+	__u16	sample_simd_pred_reg_qwords;
+	__u16	sample_simd_vec_reg_qwords;
+	__u32	__reserved_4;
+
+	__u32	sample_simd_pred_reg_intr;
+	__u32	sample_simd_pred_reg_user;
+	__u64	sample_simd_vec_reg_intr;
+	__u64	sample_simd_vec_reg_user;
 };
 
 /*

such that:

        __u64                      config3;              /*   128     8 */
        __u16                      sample_simd_pred_reg_qwords; /*   136     2 */
        __u16                      sample_simd_vec_reg_qwords; /*   138     2 */
        __u32                      __reserved_4;         /*   140     4 */
        __u32                      sample_simd_pred_reg_intr; /*   144     4 */
        __u32                      sample_simd_pred_reg_user; /*   148     4 */
        __u64                      sample_simd_vec_reg_intr; /*   152     8 */
        __u64                      sample_simd_vec_reg_user; /*   160     8 */



^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [Patch v5 08/19] perf/x86: Enable XMM sampling using sample_simd_vec_reg_* fields
  2025-12-03  6:54 ` [Patch v5 08/19] perf/x86: Enable XMM sampling using sample_simd_vec_reg_* fields Dapeng Mi
@ 2025-12-05 11:25   ` Peter Zijlstra
  2025-12-08  6:10     ` Mi, Dapeng
  0 siblings, 1 reply; 86+ messages in thread
From: Peter Zijlstra @ 2025-12-05 11:25 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang

On Wed, Dec 03, 2025 at 02:54:49PM +0800, Dapeng Mi wrote:

> diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
> index 7c9d2bb3833b..c3862e5fdd6d 100644
> --- a/arch/x86/include/uapi/asm/perf_regs.h
> +++ b/arch/x86/include/uapi/asm/perf_regs.h
> @@ -55,4 +55,21 @@ enum perf_event_x86_regs {
>  
>  #define PERF_REG_EXTENDED_MASK	(~((1ULL << PERF_REG_X86_XMM0) - 1))
>  
> +enum {
> +	PERF_REG_X86_XMM,
> +	PERF_REG_X86_MAX_SIMD_REGS,
> +};
> +
> +enum {
> +	PERF_X86_SIMD_XMM_REGS      = 16,
> +	PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_XMM_REGS,
> +};
> +
> +#define PERF_X86_SIMD_VEC_MASK		GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
> +
> +enum {
> +	PERF_X86_XMM_QWORDS      = 2,
> +	PERF_X86_SIMD_QWORDS_MAX = PERF_X86_XMM_QWORDS,
> +};
> +
>  #endif /* _ASM_X86_PERF_REGS_H */

I don't understand this bit -- the next few patches add to it for YMM
and ZMM, but what's the point? I don't see why this is needed at all,
let alone why it needs to be UABI.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 07/19] perf: Add sampling support for SIMD registers
  2025-12-03  6:54 ` [Patch v5 07/19] perf: Add sampling support for SIMD registers Dapeng Mi
  2025-12-05 11:07   ` Peter Zijlstra
@ 2025-12-05 11:40   ` Peter Zijlstra
  2025-12-08  6:00     ` Mi, Dapeng
  1 sibling, 1 reply; 86+ messages in thread
From: Peter Zijlstra @ 2025-12-05 11:40 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang

On Wed, Dec 03, 2025 at 02:54:48PM +0800, Dapeng Mi wrote:

> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 3e9c48fa2202..b19de038979e 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -7469,6 +7469,50 @@ perf_output_sample_regs(struct perf_output_handle *handle,
>  	}
>  }
>  
> +static void
> +perf_output_sample_simd_regs(struct perf_output_handle *handle,
> +			     struct perf_event *event,
> +			     struct pt_regs *regs,
> +			     u64 mask, u16 pred_mask)
> +{
> +	u16 pred_qwords = event->attr.sample_simd_pred_reg_qwords;
> +	u16 vec_qwords = event->attr.sample_simd_vec_reg_qwords;
> +	u64 pred_bitmap = pred_mask;
> +	u64 bitmap = mask;
> +	u16 nr_vectors;
> +	u16 nr_pred;
> +	int bit;
> +	u64 val;
> +	u16 i;
> +
> +	nr_vectors = hweight64(bitmap);
> +	nr_pred = hweight64(pred_bitmap);
> +
> +	perf_output_put(handle, nr_vectors);
> +	perf_output_put(handle, vec_qwords);
> +	perf_output_put(handle, nr_pred);
> +	perf_output_put(handle, pred_qwords);
> +
> +	if (nr_vectors) {
> +		for_each_set_bit(bit, (unsigned long *)&bitmap,

This isn't right. Yes we do this all the time in the x86 code, but there
we can assume little-endian byte order. This is core code and is also
used on big-endian systems where this is very much broken.

> +				 sizeof(bitmap) * BITS_PER_BYTE) {
> +			for (i = 0; i < vec_qwords; i++) {
> +				val = perf_simd_reg_value(regs, bit, i, false);
> +				perf_output_put(handle, val);
> +			}
> +		}
> +	}
> +	if (nr_pred) {
> +		for_each_set_bit(bit, (unsigned long *)&pred_bitmap,
> +				 sizeof(pred_bitmap) * BITS_PER_BYTE) {
> +			for (i = 0; i < pred_qwords; i++) {
> +				val = perf_simd_reg_value(regs, bit, i, true);
> +				perf_output_put(handle, val);
> +			}
> +		}
> +	}
> +}

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 12/19] perf/x86: Enable eGPRs sampling using sample_regs_* fields
  2025-12-03  6:54 ` [Patch v5 12/19] perf/x86: Enable eGPRs sampling using sample_regs_* fields Dapeng Mi
@ 2025-12-05 12:16   ` Peter Zijlstra
  2025-12-08  6:11     ` Mi, Dapeng
  0 siblings, 1 reply; 86+ messages in thread
From: Peter Zijlstra @ 2025-12-05 12:16 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang

On Wed, Dec 03, 2025 at 02:54:53PM +0800, Dapeng Mi wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
> 
> This patch enables sampling of APX eGPRs (R16 ~ R31) via the
> sample_regs_* fields.
> 
> To sample eGPRs, the sample_simd_regs_enabled field must be set. This
> allows the spare space (reclaimed from the original XMM space) in the
> sample_regs_* fields to be used for representing eGPRs.
> 
> The perf_reg_value() function needs to check if the
> PERF_SAMPLE_REGS_ABI_SIMD flag is set first, and then determine whether
> to output eGPRs or legacy XMM registers to userspace.
> 
> The perf_reg_validate() function is enhanced to validate the eGPRs bitmap
> by adding a new argument, "simd_enabled".
> 
> Currently, eGPRs sampling is only supported on the x86_64 architecture, as
> APX is only available on x86_64 platforms.
> 
> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> ---
>  arch/arm/kernel/perf_regs.c           |  2 +-
>  arch/arm64/kernel/perf_regs.c         |  2 +-
>  arch/csky/kernel/perf_regs.c          |  2 +-
>  arch/loongarch/kernel/perf_regs.c     |  2 +-
>  arch/mips/kernel/perf_regs.c          |  2 +-
>  arch/parisc/kernel/perf_regs.c        |  2 +-
>  arch/powerpc/perf/perf_regs.c         |  2 +-
>  arch/riscv/kernel/perf_regs.c         |  2 +-
>  arch/s390/kernel/perf_regs.c          |  2 +-

Perhaps split out the part where you modify the arch function interface?

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 13/19] perf/x86: Enable SSP sampling using sample_regs_* fields
  2025-12-03  6:54 ` [Patch v5 13/19] perf/x86: Enable SSP " Dapeng Mi
@ 2025-12-05 12:20   ` Peter Zijlstra
  2025-12-08  6:21     ` Mi, Dapeng
  2025-12-24  5:45   ` Ravi Bangoria
  1 sibling, 1 reply; 86+ messages in thread
From: Peter Zijlstra @ 2025-12-05 12:20 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang

On Wed, Dec 03, 2025 at 02:54:54PM +0800, Dapeng Mi wrote:
> diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
> index ca242db3720f..c925af4160ad 100644
> --- a/arch/x86/include/asm/perf_event.h
> +++ b/arch/x86/include/asm/perf_event.h
> @@ -729,6 +729,10 @@ struct x86_perf_regs {
>  		u64	*egpr_regs;
>  		struct apx_state *egpr;
>  	};
> +	union {
> +		u64	*cet_regs;
> +		struct cet_user_state *cet;
> +	};
>  };

Are we envisioning more than just SSP?


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 16/19] perf/x86: Activate back-to-back NMI detection for arch-PEBS induced NMIs
  2025-12-03  6:54 ` [Patch v5 16/19] perf/x86: Activate back-to-back NMI detection for arch-PEBS induced NMIs Dapeng Mi
@ 2025-12-05 12:39   ` Peter Zijlstra
  2025-12-07 20:44     ` Andi Kleen
  2025-12-08  6:46     ` Mi, Dapeng
  0 siblings, 2 replies; 86+ messages in thread
From: Peter Zijlstra @ 2025-12-05 12:39 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao

On Wed, Dec 03, 2025 at 02:54:57PM +0800, Dapeng Mi wrote:
> When two or more identical PEBS events with the same sampling period are
> programmed on a mix of PDIST and non-PDIST counters, multiple
> back-to-back NMIs can be triggered.

This is a hardware defect -- albeit a fairly common one.


> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index da48bcde8fce..a130d3f14844 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -3351,8 +3351,7 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
>  	 */
>  	if (__test_and_clear_bit(GLOBAL_STATUS_ARCH_PEBS_THRESHOLD_BIT,
>  				 (unsigned long *)&status)) {
> -		handled++;
> -		static_call(x86_pmu_drain_pebs)(regs, &data);
> +		handled += static_call(x86_pmu_drain_pebs)(regs, &data);
>  
>  		if (cpuc->events[INTEL_PMC_IDX_FIXED_SLOTS] &&
>  		    is_pebs_counter_event_group(cpuc->events[INTEL_PMC_IDX_FIXED_SLOTS]))

Note that the old code would return handled++, while the new code:

> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> index a01c72c03bd6..c7cdcd585574 100644
> --- a/arch/x86/events/intel/ds.c
> +++ b/arch/x86/events/intel/ds.c
> @@ -2759,7 +2759,7 @@ __intel_pmu_pebs_events(struct perf_event *event,
>  	__intel_pmu_pebs_last_event(event, iregs, regs, data, at, count, setup_sample);
>  }
>  
> -static void intel_pmu_drain_pebs_core(struct pt_regs *iregs, struct perf_sample_data *data)
> +static int intel_pmu_drain_pebs_core(struct pt_regs *iregs, struct perf_sample_data *data)
>  {
>  	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
>  	struct debug_store *ds = cpuc->ds;
> @@ -2768,7 +2768,7 @@ static void intel_pmu_drain_pebs_core(struct pt_regs *iregs, struct perf_sample_
>  	int n;
>  
>  	if (!x86_pmu.pebs_active)
> -		return;
> +		return 0;
>  
>  	at  = (struct pebs_record_core *)(unsigned long)ds->pebs_buffer_base;
>  	top = (struct pebs_record_core *)(unsigned long)ds->pebs_index;
> @@ -2779,22 +2779,24 @@ static void intel_pmu_drain_pebs_core(struct pt_regs *iregs, struct perf_sample_
>  	ds->pebs_index = ds->pebs_buffer_base;
>  
>  	if (!test_bit(0, cpuc->active_mask))
> -		return;
> +		return 0;
>  
>  	WARN_ON_ONCE(!event);
>  
>  	if (!event->attr.precise_ip)
> -		return;
> +		return 0;
>  
>  	n = top - at;
>  	if (n <= 0) {
>  		if (event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD)
>  			intel_pmu_save_and_restart_reload(event, 0);
> -		return;
> +		return 0;
>  	}
>  
>  	__intel_pmu_pebs_events(event, iregs, data, at, top, 0, n,
>  				setup_pebs_fixed_sample_data);
> +
> +	return 0;
>  }
>  
>  static void intel_pmu_pebs_event_update_no_drain(struct cpu_hw_events *cpuc, u64 mask)
> @@ -2817,7 +2819,7 @@ static void intel_pmu_pebs_event_update_no_drain(struct cpu_hw_events *cpuc, u64
>  	}
>  }
>  
> -static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_data *data)
> +static int intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_data *data)
>  {
>  	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
>  	struct debug_store *ds = cpuc->ds;
> @@ -2830,7 +2832,7 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_d
>  	u64 mask;
>  
>  	if (!x86_pmu.pebs_active)
> -		return;
> +		return 0;
>  
>  	base = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
>  	top = (struct pebs_record_nhm *)(unsigned long)ds->pebs_index;
> @@ -2846,7 +2848,7 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_d
>  
>  	if (unlikely(base >= top)) {
>  		intel_pmu_pebs_event_update_no_drain(cpuc, mask);
> -		return;
> +		return 0;
>  	}
>  
>  	for (at = base; at < top; at += x86_pmu.pebs_record_size) {
> @@ -2931,6 +2933,8 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_d
>  						setup_pebs_fixed_sample_data);
>  		}
>  	}
> +
> +	return 0;
>  }
>  
>  static __always_inline void
> @@ -2984,7 +2988,7 @@ __intel_pmu_handle_last_pebs_record(struct pt_regs *iregs,
>  
>  }
>  
> -static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_data *data)
> +static int intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_data *data)
>  {
>  	short counts[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS] = {};
>  	void *last[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS];
> @@ -2997,7 +3001,7 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
>  	u64 mask;
>  
>  	if (!x86_pmu.pebs_active)
> -		return;
> +		return 0;
>  
>  	base = (struct pebs_basic *)(unsigned long)ds->pebs_buffer_base;
>  	top = (struct pebs_basic *)(unsigned long)ds->pebs_index;
> @@ -3010,7 +3014,7 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
>  
>  	if (unlikely(base >= top)) {
>  		intel_pmu_pebs_event_update_no_drain(cpuc, mask);
> -		return;
> +		return 0;
>  	}
>  
>  	if (!iregs)
> @@ -3032,9 +3036,11 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
>  
>  	__intel_pmu_handle_last_pebs_record(iregs, regs, data, mask, counts, last,
>  					    setup_pebs_adaptive_sample_data);
> +
> +	return 0;
>  }

will now return handled+=0 for all these. Which is a change in
behaviour. Also:

> -static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
> +static int intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
>  				      struct perf_sample_data *data)
>  {
>  	short counts[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS] = {};
> @@ -3044,13 +3050,14 @@ static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
>  	struct x86_perf_regs perf_regs;
>  	struct pt_regs *regs = &perf_regs.regs;
>  	void *base, *at, *top;
> +	u64 events_bitmap = 0;
>  	u64 mask;
>  
>  	rdmsrq(MSR_IA32_PEBS_INDEX, index.whole);
>  
>  	if (unlikely(!index.wr)) {
>  		intel_pmu_pebs_event_update_no_drain(cpuc, X86_PMC_IDX_MAX);
> -		return;
> +		return 0;
>  	}
>  
>  	base = cpuc->pebs_vaddr;
> @@ -3089,6 +3096,7 @@ static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
>  
>  		basic = at + sizeof(struct arch_pebs_header);
>  		pebs_status = mask & basic->applicable_counters;
> +		events_bitmap |= pebs_status;
>  		__intel_pmu_handle_pebs_record(iregs, regs, data, at,
>  					       pebs_status, counts, last,
>  					       setup_arch_pebs_sample_data);
> @@ -3108,6 +3116,8 @@ static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
>  	__intel_pmu_handle_last_pebs_record(iregs, regs, data, mask,
>  					    counts, last,
>  					    setup_arch_pebs_sample_data);
> +
	/*
	 * Comment that explains the arch pebs defect goes here.
	 */
> +	return hweight64(events_bitmap);
>  }

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
  2025-12-05  8:10                 ` Mi, Dapeng
@ 2025-12-05 16:35                   ` Ian Rogers
  2025-12-08  4:20                     ` Mi, Dapeng
  0 siblings, 1 reply; 86+ messages in thread
From: Ian Rogers @ 2025-12-05 16:35 UTC (permalink / raw)
  To: Mi, Dapeng
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang

On Fri, Dec 5, 2025 at 12:10 AM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>
>
> On 12/5/2025 2:38 PM, Ian Rogers wrote:
> > On Thu, Dec 4, 2025 at 8:00 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> >>
> >> On 12/5/2025 12:16 AM, Ian Rogers wrote:
> >>> On Thu, Dec 4, 2025 at 1:20 AM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> >>>> On 12/4/2025 3:49 PM, Ian Rogers wrote:
> >>>>> On Wed, Dec 3, 2025 at 6:58 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> >>>>>> On 12/4/2025 8:17 AM, Ian Rogers wrote:
> >>>>>>> On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
> >>>>>>>> From: Kan Liang <kan.liang@linux.intel.com>
> >>>>>>>>
> >>>>>>>> This patch adds support for the newly introduced SIMD register sampling
> >>>>>>>> format by adding the following functions:
> >>>>>>>>
> >>>>>>>> uint64_t arch__intr_simd_reg_mask(void);
> >>>>>>>> uint64_t arch__user_simd_reg_mask(void);
> >>>>>>>> uint64_t arch__intr_pred_reg_mask(void);
> >>>>>>>> uint64_t arch__user_pred_reg_mask(void);
> >>>>>>>> uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>>>> uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>>>> uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>>>> uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>>>>
> >>>>>>>> The arch__{intr|user}_simd_reg_mask() functions retrieve the bitmap of
> >>>>>>>> supported SIMD registers, such as XMM/YMM/ZMM on x86 platforms.
> >>>>>>>>
> >>>>>>>> The arch__{intr|user}_pred_reg_mask() functions retrieve the bitmap of
> >>>>>>>> supported PRED registers, such as OPMASK on x86 platforms.
> >>>>>>>>
> >>>>>>>> The arch__{intr|user}_simd_reg_bitmap_qwords() functions provide the
> >>>>>>>> exact bitmap and number of qwords for a specific type of SIMD register.
> >>>>>>>> For example, for XMM registers on x86 platforms, the returned bitmap is
> >>>>>>>> 0xffff (XMM0 ~ XMM15) and the qwords number is 2 (128 bits for each XMM).
> >>>>>>>>
> >>>>>>>> The arch__{intr|user}_pred_reg_bitmap_qwords() functions provide the
> >>>>>>>> exact bitmap and number of qwords for a specific type of PRED register.
> >>>>>>>> For example, for OPMASK registers on x86 platforms, the returned bitmap
> >>>>>>>> is 0xff (OPMASK0 ~ OPMASK7) and the qwords number is 1 (64 bits for each
> >>>>>>>> OPMASK).
> >>>>>>>>
> >>>>>>>> Additionally, the function __parse_regs() is enhanced to support parsing
> >>>>>>>> these newly introduced SIMD registers. Currently, each type of register
> >>>>>>>> can only be sampled collectively; sampling a specific SIMD register is
> >>>>>>>> not supported. For example, all XMM registers are sampled together rather
> >>>>>>>> than sampling only XMM0.
> >>>>>>>>
> >>>>>>>> When multiple overlapping register types, such as XMM and YMM, are
> >>>>>>>> sampled simultaneously, only the superset (YMM registers) is sampled.
> >>>>>>>>
> >>>>>>>> With this patch, all supported sampling registers on x86 platforms are
> >>>>>>>> displayed as follows.
> >>>>>>>>
> >>>>>>>>  $perf record -I?
> >>>>>>>>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
> >>>>>>>>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
> >>>>>>>>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
> >>>>>>>>
> >>>>>>>>  $perf record --user-regs=?
> >>>>>>>>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
> >>>>>>>>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
> >>>>>>>>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
> >>>>>>>>
> >>>>>>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> >>>>>>>> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >>>>>>>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >>>>>>>> ---
> >>>>>>>>  tools/perf/arch/x86/util/perf_regs.c      | 470 +++++++++++++++++++++-
> >>>>>>>>  tools/perf/util/evsel.c                   |  27 ++
> >>>>>>>>  tools/perf/util/parse-regs-options.c      | 151 ++++++-
> >>>>>>>>  tools/perf/util/perf_event_attr_fprintf.c |   6 +
> >>>>>>>>  tools/perf/util/perf_regs.c               |  59 +++
> >>>>>>>>  tools/perf/util/perf_regs.h               |  11 +
> >>>>>>>>  tools/perf/util/record.h                  |   6 +
> >>>>>>>>  7 files changed, 714 insertions(+), 16 deletions(-)
> >>>>>>>>
> >>>>>>>> diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
> >>>>>>>> index 12fd93f04802..db41430f3b07 100644
> >>>>>>>> --- a/tools/perf/arch/x86/util/perf_regs.c
> >>>>>>>> +++ b/tools/perf/arch/x86/util/perf_regs.c
> >>>>>>>> @@ -13,6 +13,49 @@
> >>>>>>>>  #include "../../../util/pmu.h"
> >>>>>>>>  #include "../../../util/pmus.h"
> >>>>>>>>
> >>>>>>>> +static const struct sample_reg sample_reg_masks_ext[] = {
> >>>>>>>> +       SMPL_REG(AX, PERF_REG_X86_AX),
> >>>>>>>> +       SMPL_REG(BX, PERF_REG_X86_BX),
> >>>>>>>> +       SMPL_REG(CX, PERF_REG_X86_CX),
> >>>>>>>> +       SMPL_REG(DX, PERF_REG_X86_DX),
> >>>>>>>> +       SMPL_REG(SI, PERF_REG_X86_SI),
> >>>>>>>> +       SMPL_REG(DI, PERF_REG_X86_DI),
> >>>>>>>> +       SMPL_REG(BP, PERF_REG_X86_BP),
> >>>>>>>> +       SMPL_REG(SP, PERF_REG_X86_SP),
> >>>>>>>> +       SMPL_REG(IP, PERF_REG_X86_IP),
> >>>>>>>> +       SMPL_REG(FLAGS, PERF_REG_X86_FLAGS),
> >>>>>>>> +       SMPL_REG(CS, PERF_REG_X86_CS),
> >>>>>>>> +       SMPL_REG(SS, PERF_REG_X86_SS),
> >>>>>>>> +#ifdef HAVE_ARCH_X86_64_SUPPORT
> >>>>>>>> +       SMPL_REG(R8, PERF_REG_X86_R8),
> >>>>>>>> +       SMPL_REG(R9, PERF_REG_X86_R9),
> >>>>>>>> +       SMPL_REG(R10, PERF_REG_X86_R10),
> >>>>>>>> +       SMPL_REG(R11, PERF_REG_X86_R11),
> >>>>>>>> +       SMPL_REG(R12, PERF_REG_X86_R12),
> >>>>>>>> +       SMPL_REG(R13, PERF_REG_X86_R13),
> >>>>>>>> +       SMPL_REG(R14, PERF_REG_X86_R14),
> >>>>>>>> +       SMPL_REG(R15, PERF_REG_X86_R15),
> >>>>>>>> +       SMPL_REG(R16, PERF_REG_X86_R16),
> >>>>>>>> +       SMPL_REG(R17, PERF_REG_X86_R17),
> >>>>>>>> +       SMPL_REG(R18, PERF_REG_X86_R18),
> >>>>>>>> +       SMPL_REG(R19, PERF_REG_X86_R19),
> >>>>>>>> +       SMPL_REG(R20, PERF_REG_X86_R20),
> >>>>>>>> +       SMPL_REG(R21, PERF_REG_X86_R21),
> >>>>>>>> +       SMPL_REG(R22, PERF_REG_X86_R22),
> >>>>>>>> +       SMPL_REG(R23, PERF_REG_X86_R23),
> >>>>>>>> +       SMPL_REG(R24, PERF_REG_X86_R24),
> >>>>>>>> +       SMPL_REG(R25, PERF_REG_X86_R25),
> >>>>>>>> +       SMPL_REG(R26, PERF_REG_X86_R26),
> >>>>>>>> +       SMPL_REG(R27, PERF_REG_X86_R27),
> >>>>>>>> +       SMPL_REG(R28, PERF_REG_X86_R28),
> >>>>>>>> +       SMPL_REG(R29, PERF_REG_X86_R29),
> >>>>>>>> +       SMPL_REG(R30, PERF_REG_X86_R30),
> >>>>>>>> +       SMPL_REG(R31, PERF_REG_X86_R31),
> >>>>>>>> +       SMPL_REG(SSP, PERF_REG_X86_SSP),
> >>>>>>>> +#endif
> >>>>>>>> +       SMPL_REG_END
> >>>>>>>> +};
> >>>>>>>> +
> >>>>>>>>  static const struct sample_reg sample_reg_masks[] = {
> >>>>>>>>         SMPL_REG(AX, PERF_REG_X86_AX),
> >>>>>>>>         SMPL_REG(BX, PERF_REG_X86_BX),
> >>>>>>>> @@ -276,27 +319,404 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
> >>>>>>>>         return SDT_ARG_VALID;
> >>>>>>>>  }
> >>>>>>>>
> >>>>>>>> +static bool support_simd_reg(u64 sample_type, u16 qwords, u64 mask, bool pred)
> >>>>>>> To make the code easier to read, it'd be nice to document sample_type,
> >>>>>>> qwords and mask here.
> >>>>>> Sure.
> >>>>>>
> >>>>>>
> >>>>>>>> +{
> >>>>>>>> +       struct perf_event_attr attr = {
> >>>>>>>> +               .type                           = PERF_TYPE_HARDWARE,
> >>>>>>>> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
> >>>>>>>> +               .sample_type                    = sample_type,
> >>>>>>>> +               .disabled                       = 1,
> >>>>>>>> +               .exclude_kernel                 = 1,
> >>>>>>>> +               .sample_simd_regs_enabled       = 1,
> >>>>>>>> +       };
> >>>>>>>> +       int fd;
> >>>>>>>> +
> >>>>>>>> +       attr.sample_period = 1;
> >>>>>>>> +
> >>>>>>>> +       if (!pred) {
> >>>>>>>> +               attr.sample_simd_vec_reg_qwords = qwords;
> >>>>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>>>>>> +                       attr.sample_simd_vec_reg_intr = mask;
> >>>>>>>> +               else
> >>>>>>>> +                       attr.sample_simd_vec_reg_user = mask;
> >>>>>>>> +       } else {
> >>>>>>>> +               attr.sample_simd_pred_reg_qwords = PERF_X86_OPMASK_QWORDS;
> >>>>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>>>>>> +                       attr.sample_simd_pred_reg_intr = PERF_X86_SIMD_PRED_MASK;
> >>>>>>>> +               else
> >>>>>>>> +                       attr.sample_simd_pred_reg_user = PERF_X86_SIMD_PRED_MASK;
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       if (perf_pmus__num_core_pmus() > 1) {
> >>>>>>>> +               struct perf_pmu *pmu = NULL;
> >>>>>>>> +               __u64 type = PERF_TYPE_RAW;
> >>>>>>> It should be okay to do:
> >>>>>>> __u64 type = perf_pmus__find_core_pmu()->type
> >>>>>>> rather than have the whole loop below.
> >>>>>> Sure. Thanks.
> >>>>>>
> >>>>>>
> >>>>>>>> +
> >>>>>>>> +               /*
> >>>>>>>> +                * The same register set is supported among different hybrid PMUs.
> >>>>>>>> +                * Only check the first available one.
> >>>>>>>> +                */
> >>>>>>>> +               while ((pmu = perf_pmus__scan_core(pmu)) != NULL) {
> >>>>>>>> +                       type = pmu->type;
> >>>>>>>> +                       break;
> >>>>>>>> +               }
> >>>>>>>> +               attr.config |= type << PERF_PMU_TYPE_SHIFT;
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       event_attr_init(&attr);
> >>>>>>>> +
> >>>>>>>> +       fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
> >>>>>>>> +       if (fd != -1) {
> >>>>>>>> +               close(fd);
> >>>>>>>> +               return true;
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       return false;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static bool __arch_simd_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
> >>>>>>>> +{
> >>>>>>>> +       bool supported = false;
> >>>>>>>> +       u64 bits;
> >>>>>>>> +
> >>>>>>>> +       *mask = 0;
> >>>>>>>> +       *qwords = 0;
> >>>>>>>> +
> >>>>>>>> +       switch (reg) {
> >>>>>>>> +       case PERF_REG_X86_XMM:
> >>>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
> >>>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_XMM_QWORDS, bits, false);
> >>>>>>>> +               if (supported) {
> >>>>>>>> +                       *mask = bits;
> >>>>>>>> +                       *qwords = PERF_X86_XMM_QWORDS;
> >>>>>>>> +               }
> >>>>>>>> +               break;
> >>>>>>>> +       case PERF_REG_X86_YMM:
> >>>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_YMM_REGS) - 1;
> >>>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_YMM_QWORDS, bits, false);
> >>>>>>>> +               if (supported) {
> >>>>>>>> +                       *mask = bits;
> >>>>>>>> +                       *qwords = PERF_X86_YMM_QWORDS;
> >>>>>>>> +               }
> >>>>>>>> +               break;
> >>>>>>>> +       case PERF_REG_X86_ZMM:
> >>>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_ZMM_REGS) - 1;
> >>>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
> >>>>>>>> +               if (supported) {
> >>>>>>>> +                       *mask = bits;
> >>>>>>>> +                       *qwords = PERF_X86_ZMM_QWORDS;
> >>>>>>>> +                       break;
> >>>>>>>> +               }
> >>>>>>>> +
> >>>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_ZMMH_REGS) - 1;
> >>>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
> >>>>>>>> +               if (supported) {
> >>>>>>>> +                       *mask = bits;
> >>>>>>>> +                       *qwords = PERF_X86_ZMMH_QWORDS;
> >>>>>>>> +               }
> >>>>>>>> +               break;
> >>>>>>>> +       default:
> >>>>>>>> +               break;
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       return supported;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static bool __arch_pred_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
> >>>>>>>> +{
> >>>>>>>> +       bool supported = false;
> >>>>>>>> +       u64 bits;
> >>>>>>>> +
> >>>>>>>> +       *mask = 0;
> >>>>>>>> +       *qwords = 0;
> >>>>>>>> +
> >>>>>>>> +       switch (reg) {
> >>>>>>>> +       case PERF_REG_X86_OPMASK:
> >>>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_OPMASK_REGS) - 1;
> >>>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_OPMASK_QWORDS, bits, true);
> >>>>>>>> +               if (supported) {
> >>>>>>>> +                       *mask = bits;
> >>>>>>>> +                       *qwords = PERF_X86_OPMASK_QWORDS;
> >>>>>>>> +               }
> >>>>>>>> +               break;
> >>>>>>>> +       default:
> >>>>>>>> +               break;
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       return supported;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static bool has_cap_simd_regs(void)
> >>>>>>>> +{
> >>>>>>>> +       uint64_t mask = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
> >>>>>>>> +       u16 qwords = PERF_X86_XMM_QWORDS;
> >>>>>>>> +       static bool has_cap_simd_regs;
> >>>>>>>> +       static bool cached;
> >>>>>>>> +
> >>>>>>>> +       if (cached)
> >>>>>>>> +               return has_cap_simd_regs;
> >>>>>>>> +
> >>>>>>>> +       has_cap_simd_regs = __arch_simd_reg_mask(PERF_SAMPLE_REGS_INTR,
> >>>>>>>> +                                                PERF_REG_X86_XMM, &mask, &qwords);
> >>>>>>>> +       has_cap_simd_regs |= __arch_simd_reg_mask(PERF_SAMPLE_REGS_USER,
> >>>>>>>> +                                                PERF_REG_X86_XMM, &mask, &qwords);
> >>>>>>>> +       cached = true;
> >>>>>>>> +
> >>>>>>>> +       return has_cap_simd_regs;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +bool arch_has_simd_regs(u64 mask)
> >>>>>>>> +{
> >>>>>>>> +       return has_cap_simd_regs() &&
> >>>>>>>> +              mask & GENMASK_ULL(PERF_REG_X86_SSP, PERF_REG_X86_R16);
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static const struct sample_reg sample_simd_reg_masks[] = {
> >>>>>>>> +       SMPL_REG(XMM, PERF_REG_X86_XMM),
> >>>>>>>> +       SMPL_REG(YMM, PERF_REG_X86_YMM),
> >>>>>>>> +       SMPL_REG(ZMM, PERF_REG_X86_ZMM),
> >>>>>>>> +       SMPL_REG_END
> >>>>>>>> +};
> >>>>>>>> +
> >>>>>>>> +static const struct sample_reg sample_pred_reg_masks[] = {
> >>>>>>>> +       SMPL_REG(OPMASK, PERF_REG_X86_OPMASK),
> >>>>>>>> +       SMPL_REG_END
> >>>>>>>> +};
> >>>>>>>> +
> >>>>>>>> +const struct sample_reg *arch__sample_simd_reg_masks(void)
> >>>>>>>> +{
> >>>>>>>> +       return sample_simd_reg_masks;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +const struct sample_reg *arch__sample_pred_reg_masks(void)
> >>>>>>>> +{
> >>>>>>>> +       return sample_pred_reg_masks;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static bool x86_intr_simd_updated;
> >>>>>>>> +static u64 x86_intr_simd_reg_mask;
> >>>>>>>> +static u64 x86_intr_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
> >>>>>>>> +static u16 x86_intr_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
> >>>>>>> Could we add some comments? I can kind of figure out that "updated" is a
> >>>>>>> flag for lazy initialization and what the masks are; "qwords" is an odd
> >>>>>>> one. The comment could also point out that SIMD here doesn't mean the
> >>>>>>> machine supports SIMD, but that SIMD registers are supported in perf
> >>>>>>> events.
> >>>>>> Sure.
> >>>>>>
> >>>>>>
> >>>>>>>> +static bool x86_user_simd_updated;
> >>>>>>>> +static u64 x86_user_simd_reg_mask;
> >>>>>>>> +static u64 x86_user_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
> >>>>>>>> +static u16 x86_user_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
> >>>>>>>> +
> >>>>>>>> +static bool x86_intr_pred_updated;
> >>>>>>>> +static u64 x86_intr_pred_reg_mask;
> >>>>>>>> +static u64 x86_intr_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
> >>>>>>>> +static u16 x86_intr_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
> >>>>>>>> +static bool x86_user_pred_updated;
> >>>>>>>> +static u64 x86_user_pred_reg_mask;
> >>>>>>>> +static u64 x86_user_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
> >>>>>>>> +static u16 x86_user_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
> >>>>>>>> +
> >>>>>>>> +static uint64_t __arch__simd_reg_mask(u64 sample_type)
> >>>>>>>> +{
> >>>>>>>> +       const struct sample_reg *r = NULL;
> >>>>>>>> +       bool supported;
> >>>>>>>> +       u64 mask = 0;
> >>>>>>>> +       int reg;
> >>>>>>>> +
> >>>>>>>> +       if (!has_cap_simd_regs())
> >>>>>>>> +               return 0;
> >>>>>>>> +
> >>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_simd_updated)
> >>>>>>>> +               return x86_intr_simd_reg_mask;
> >>>>>>>> +
> >>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_simd_updated)
> >>>>>>>> +               return x86_user_simd_reg_mask;
> >>>>>>>> +
> >>>>>>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> >>>>>>>> +               supported = false;
> >>>>>>>> +
> >>>>>>>> +               if (!r->mask)
> >>>>>>>> +                       continue;
> >>>>>>>> +               reg = fls64(r->mask) - 1;
> >>>>>>>> +
> >>>>>>>> +               if (reg >= PERF_REG_X86_MAX_SIMD_REGS)
> >>>>>>>> +                       break;
> >>>>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>>>>>> +                       supported = __arch_simd_reg_mask(sample_type, reg,
> >>>>>>>> +                                                        &x86_intr_simd_mask[reg],
> >>>>>>>> +                                                        &x86_intr_simd_qwords[reg]);
> >>>>>>>> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
> >>>>>>>> +                       supported = __arch_simd_reg_mask(sample_type, reg,
> >>>>>>>> +                                                        &x86_user_simd_mask[reg],
> >>>>>>>> +                                                        &x86_user_simd_qwords[reg]);
> >>>>>>>> +               if (supported)
> >>>>>>>> +                       mask |= BIT_ULL(reg);
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
> >>>>>>>> +               x86_intr_simd_reg_mask = mask;
> >>>>>>>> +               x86_intr_simd_updated = true;
> >>>>>>>> +       } else {
> >>>>>>>> +               x86_user_simd_reg_mask = mask;
> >>>>>>>> +               x86_user_simd_updated = true;
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       return mask;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static uint64_t __arch__pred_reg_mask(u64 sample_type)
> >>>>>>>> +{
> >>>>>>>> +       const struct sample_reg *r = NULL;
> >>>>>>>> +       bool supported;
> >>>>>>>> +       u64 mask = 0;
> >>>>>>>> +       int reg;
> >>>>>>>> +
> >>>>>>>> +       if (!has_cap_simd_regs())
> >>>>>>>> +               return 0;
> >>>>>>>> +
> >>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_pred_updated)
> >>>>>>>> +               return x86_intr_pred_reg_mask;
> >>>>>>>> +
> >>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_pred_updated)
> >>>>>>>> +               return x86_user_pred_reg_mask;
> >>>>>>>> +
> >>>>>>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> >>>>>>>> +               supported = false;
> >>>>>>>> +
> >>>>>>>> +               if (!r->mask)
> >>>>>>>> +                       continue;
> >>>>>>>> +               reg = fls64(r->mask) - 1;
> >>>>>>>> +
> >>>>>>>> +               if (reg >= PERF_REG_X86_MAX_PRED_REGS)
> >>>>>>>> +                       break;
> >>>>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>>>>>> +                       supported = __arch_pred_reg_mask(sample_type, reg,
> >>>>>>>> +                                                        &x86_intr_pred_mask[reg],
> >>>>>>>> +                                                        &x86_intr_pred_qwords[reg]);
> >>>>>>>> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
> >>>>>>>> +                       supported = __arch_pred_reg_mask(sample_type, reg,
> >>>>>>>> +                                                        &x86_user_pred_mask[reg],
> >>>>>>>> +                                                        &x86_user_pred_qwords[reg]);
> >>>>>>>> +               if (supported)
> >>>>>>>> +                       mask |= BIT_ULL(reg);
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
> >>>>>>>> +               x86_intr_pred_reg_mask = mask;
> >>>>>>>> +               x86_intr_pred_updated = true;
> >>>>>>>> +       } else {
> >>>>>>>> +               x86_user_pred_reg_mask = mask;
> >>>>>>>> +               x86_user_pred_updated = true;
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       return mask;
> >>>>>>>> +}
> >>>>>>> This feels repetitive with __arch__simd_reg_mask, could they be
> >>>>>>> refactored together?
> >>>>>> hmm, it looks like we can extract the for loop into a common function.
> >>>>>> The other parts are hard to generalize since they manipulate different
> >>>>>> variables. If we wanted to generalize them, we would have to introduce
> >>>>>> lots of "if ... else" branches, and that would make the code hard to read.
> >>>>>>
> >>>>>>
> >>>>>>>> +
> >>>>>>>> +uint64_t arch__intr_simd_reg_mask(void)
> >>>>>>>> +{
> >>>>>>>> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_INTR);
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t arch__user_simd_reg_mask(void)
> >>>>>>>> +{
> >>>>>>>> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_USER);
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t arch__intr_pred_reg_mask(void)
> >>>>>>>> +{
> >>>>>>>> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_INTR);
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t arch__user_pred_reg_mask(void)
> >>>>>>>> +{
> >>>>>>>> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_USER);
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static uint64_t arch__simd_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
> >>>>>>>> +{
> >>>>>>>> +       uint64_t mask = 0;
> >>>>>>>> +
> >>>>>>>> +       *qwords = 0;
> >>>>>>>> +       if (reg < PERF_REG_X86_MAX_SIMD_REGS) {
> >>>>>>>> +               if (intr) {
> >>>>>>>> +                       *qwords = x86_intr_simd_qwords[reg];
> >>>>>>>> +                       mask = x86_intr_simd_mask[reg];
> >>>>>>>> +               } else {
> >>>>>>>> +                       *qwords = x86_user_simd_qwords[reg];
> >>>>>>>> +                       mask = x86_user_simd_mask[reg];
> >>>>>>>> +               }
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       return mask;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static uint64_t arch__pred_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
> >>>>>>>> +{
> >>>>>>>> +       uint64_t mask = 0;
> >>>>>>>> +
> >>>>>>>> +       *qwords = 0;
> >>>>>>>> +       if (reg < PERF_REG_X86_MAX_PRED_REGS) {
> >>>>>>>> +               if (intr) {
> >>>>>>>> +                       *qwords = x86_intr_pred_qwords[reg];
> >>>>>>>> +                       mask = x86_intr_pred_mask[reg];
> >>>>>>>> +               } else {
> >>>>>>>> +                       *qwords = x86_user_pred_qwords[reg];
> >>>>>>>> +                       mask = x86_user_pred_mask[reg];
> >>>>>>>> +               }
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       return mask;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords)
> >>>>>>>> +{
> >>>>>>>> +       if (!x86_intr_simd_updated)
> >>>>>>>> +               arch__intr_simd_reg_mask();
> >>>>>>>> +       return arch__simd_reg_bitmap_qwords(reg, qwords, true);
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords)
> >>>>>>>> +{
> >>>>>>>> +       if (!x86_user_simd_updated)
> >>>>>>>> +               arch__user_simd_reg_mask();
> >>>>>>>> +       return arch__simd_reg_bitmap_qwords(reg, qwords, false);
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords)
> >>>>>>>> +{
> >>>>>>>> +       if (!x86_intr_pred_updated)
> >>>>>>>> +               arch__intr_pred_reg_mask();
> >>>>>>>> +       return arch__pred_reg_bitmap_qwords(reg, qwords, true);
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords)
> >>>>>>>> +{
> >>>>>>>> +       if (!x86_user_pred_updated)
> >>>>>>>> +               arch__user_pred_reg_mask();
> >>>>>>>> +       return arch__pred_reg_bitmap_qwords(reg, qwords, false);
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>>  const struct sample_reg *arch__sample_reg_masks(void)
> >>>>>>>>  {
> >>>>>>>> +       if (has_cap_simd_regs())
> >>>>>>>> +               return sample_reg_masks_ext;
> >>>>>>>>         return sample_reg_masks;
> >>>>>>>>  }
> >>>>>>>>
> >>>>>>>> -uint64_t arch__intr_reg_mask(void)
> >>>>>>>> +static uint64_t __arch__reg_mask(u64 sample_type, u64 mask, bool has_simd_regs)
> >>>>>>>>  {
> >>>>>>>>         struct perf_event_attr attr = {
> >>>>>>>> -               .type                   = PERF_TYPE_HARDWARE,
> >>>>>>>> -               .config                 = PERF_COUNT_HW_CPU_CYCLES,
> >>>>>>>> -               .sample_type            = PERF_SAMPLE_REGS_INTR,
> >>>>>>>> -               .sample_regs_intr       = PERF_REG_EXTENDED_MASK,
> >>>>>>>> -               .precise_ip             = 1,
> >>>>>>>> -               .disabled               = 1,
> >>>>>>>> -               .exclude_kernel         = 1,
> >>>>>>>> +               .type                           = PERF_TYPE_HARDWARE,
> >>>>>>>> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
> >>>>>>>> +               .sample_type                    = sample_type,
> >>>>>>>> +               .precise_ip                     = 1,
> >>>>>>>> +               .disabled                       = 1,
> >>>>>>>> +               .exclude_kernel                 = 1,
> >>>>>>>> +               .sample_simd_regs_enabled       = has_simd_regs,
> >>>>>>>>         };
> >>>>>>>>         int fd;
> >>>>>>>>         /*
> >>>>>>>>          * In an unnamed union, init it here to build on older gcc versions
> >>>>>>>>          */
> >>>>>>>>         attr.sample_period = 1;
> >>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>>>>>> +               attr.sample_regs_intr = mask;
> >>>>>>>> +       else
> >>>>>>>> +               attr.sample_regs_user = mask;
> >>>>>>>>
> >>>>>>>>         if (perf_pmus__num_core_pmus() > 1) {
> >>>>>>>>                 struct perf_pmu *pmu = NULL;
> >>>>>>>> @@ -318,13 +738,41 @@ uint64_t arch__intr_reg_mask(void)
> >>>>>>>>         fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
> >>>>>>>>         if (fd != -1) {
> >>>>>>>>                 close(fd);
> >>>>>>>> -               return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
> >>>>>>>> +               return mask;
> >>>>>>>>         }
> >>>>>>>>
> >>>>>>>> -       return PERF_REGS_MASK;
> >>>>>>>> +       return 0;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t arch__intr_reg_mask(void)
> >>>>>>>> +{
> >>>>>>>> +       uint64_t mask = PERF_REGS_MASK;
> >>>>>>>> +
> >>>>>>>> +       if (has_cap_simd_regs()) {
> >>>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
> >>>>>>>> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
> >>>>>>>> +                                        true);
> >>>>>>> It's nice to label constant arguments like this with something like:
> >>>>>>> /*has_simd_regs=*/true);
> >>>>>>>
> >>>>>>> Tools like clang-tidy even try to enforce the argument names match the comments.
> >>>>>> Sure.
> >>>>>>
> >>>>>>
> >>>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
> >>>>>>>> +                                        BIT_ULL(PERF_REG_X86_SSP),
> >>>>>>>> +                                        true);
> >>>>>>>> +       } else
> >>>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR, PERF_REG_EXTENDED_MASK, false);
> >>>>>>>> +
> >>>>>>>> +       return mask;
> >>>>>>>>  }
> >>>>>>>>
> >>>>>>>>  uint64_t arch__user_reg_mask(void)
> >>>>>>>>  {
> >>>>>>>> -       return PERF_REGS_MASK;
> >>>>>>>> +       uint64_t mask = PERF_REGS_MASK;
> >>>>>>>> +
> >>>>>>>> +       if (has_cap_simd_regs()) {
> >>>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
> >>>>>>>> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
> >>>>>>>> +                                        true);
> >>>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
> >>>>>>>> +                                        BIT_ULL(PERF_REG_X86_SSP),
> >>>>>>>> +                                        true);
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       return mask;
> >>>>>>> The code is repetitive here, could we refactor into a single function
> >>>>>>> passing in a user or intr value?
> >>>>>> Sure. Would extract the common part.
> >>>>>>
> >>>>>>
> >>>>>>>>  }
> >>>>>>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> >>>>>>>> index 56ebefd075f2..5d1d90cf9488 100644
> >>>>>>>> --- a/tools/perf/util/evsel.c
> >>>>>>>> +++ b/tools/perf/util/evsel.c
> >>>>>>>> @@ -1461,12 +1461,39 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
> >>>>>>>>         if (opts->sample_intr_regs && !evsel->no_aux_samples &&
> >>>>>>>>             !evsel__is_dummy_event(evsel)) {
> >>>>>>>>                 attr->sample_regs_intr = opts->sample_intr_regs;
> >>>>>>>> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_intr);
> >>>>>>>> +               evsel__set_sample_bit(evsel, REGS_INTR);
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       if ((opts->sample_intr_vec_regs || opts->sample_intr_pred_regs) &&
> >>>>>>>> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
> >>>>>>>> +               /* A non-zero pred qwords implies the SIMD register set is used */
> >>>>>>>> +               if (opts->sample_pred_regs_qwords)
> >>>>>>>> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
> >>>>>>>> +               else
> >>>>>>>> +                       attr->sample_simd_pred_reg_qwords = 1;
> >>>>>>>> +               attr->sample_simd_vec_reg_intr = opts->sample_intr_vec_regs;
> >>>>>>>> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
> >>>>>>>> +               attr->sample_simd_pred_reg_intr = opts->sample_intr_pred_regs;
> >>>>>>>>                 evsel__set_sample_bit(evsel, REGS_INTR);
> >>>>>>>>         }
> >>>>>>>>
> >>>>>>>>         if (opts->sample_user_regs && !evsel->no_aux_samples &&
> >>>>>>>>             !evsel__is_dummy_event(evsel)) {
> >>>>>>>>                 attr->sample_regs_user |= opts->sample_user_regs;
> >>>>>>>> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_user);
> >>>>>>>> +               evsel__set_sample_bit(evsel, REGS_USER);
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       if ((opts->sample_user_vec_regs || opts->sample_user_pred_regs) &&
> >>>>>>>> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
> >>>>>>>> +               if (opts->sample_pred_regs_qwords)
> >>>>>>>> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
> >>>>>>>> +               else
> >>>>>>>> +                       attr->sample_simd_pred_reg_qwords = 1;
> >>>>>>>> +               attr->sample_simd_vec_reg_user = opts->sample_user_vec_regs;
> >>>>>>>> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
> >>>>>>>> +               attr->sample_simd_pred_reg_user = opts->sample_user_pred_regs;
> >>>>>>>>                 evsel__set_sample_bit(evsel, REGS_USER);
> >>>>>>>>         }
> >>>>>>>>
> >>>>>>>> diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
> >>>>>>>> index cda1c620968e..0bd100392889 100644
> >>>>>>>> --- a/tools/perf/util/parse-regs-options.c
> >>>>>>>> +++ b/tools/perf/util/parse-regs-options.c
> >>>>>>>> @@ -4,19 +4,139 @@
> >>>>>>>>  #include <stdint.h>
> >>>>>>>>  #include <string.h>
> >>>>>>>>  #include <stdio.h>
> >>>>>>>> +#include <linux/bitops.h>
> >>>>>>>>  #include "util/debug.h"
> >>>>>>>>  #include <subcmd/parse-options.h>
> >>>>>>>>  #include "util/perf_regs.h"
> >>>>>>>>  #include "util/parse-regs-options.h"
> >>>>>>>> +#include "record.h"
> >>>>>>>> +
> >>>>>>>> +static void __print_simd_regs(bool intr, uint64_t simd_mask)
> >>>>>>>> +{
> >>>>>>>> +       const struct sample_reg *r = NULL;
> >>>>>>>> +       uint64_t bitmap = 0;
> >>>>>>>> +       u16 qwords = 0;
> >>>>>>>> +       int reg_idx;
> >>>>>>>> +
> >>>>>>>> +       if (!simd_mask)
> >>>>>>>> +               return;
> >>>>>>>> +
> >>>>>>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> >>>>>>>> +               if (!(r->mask & simd_mask))
> >>>>>>>> +                       continue;
> >>>>>>>> +               reg_idx = fls64(r->mask) - 1;
> >>>>>>>> +               if (intr)
> >>>>>>>> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>>>> +               else
> >>>>>>>> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>>>> +               if (bitmap)
> >>>>>>>> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
> >>>>>>>> +       }
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static void __print_pred_regs(bool intr, uint64_t pred_mask)
> >>>>>>>> +{
> >>>>>>>> +       const struct sample_reg *r = NULL;
> >>>>>>>> +       uint64_t bitmap = 0;
> >>>>>>>> +       u16 qwords = 0;
> >>>>>>>> +       int reg_idx;
> >>>>>>>> +
> >>>>>>>> +       if (!pred_mask)
> >>>>>>>> +               return;
> >>>>>>>> +
> >>>>>>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> >>>>>>>> +               if (!(r->mask & pred_mask))
> >>>>>>>> +                       continue;
> >>>>>>>> +               reg_idx = fls64(r->mask) - 1;
> >>>>>>>> +               if (intr)
> >>>>>>>> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>>>> +               else
> >>>>>>>> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>>>> +               if (bitmap)
> >>>>>>>> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
> >>>>>>>> +       }
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static bool __parse_simd_regs(struct record_opts *opts, char *s, bool intr)
> >>>>>>>> +{
> >>>>>>>> +       const struct sample_reg *r = NULL;
> >>>>>>>> +       bool matched = false;
> >>>>>>>> +       uint64_t bitmap = 0;
> >>>>>>>> +       u16 qwords = 0;
> >>>>>>>> +       int reg_idx;
> >>>>>>>> +
> >>>>>>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> >>>>>>>> +               if (strcasecmp(s, r->name))
> >>>>>>>> +                       continue;
> >>>>>>>> +               if (!fls64(r->mask))
> >>>>>>>> +                       continue;
> >>>>>>>> +               reg_idx = fls64(r->mask) - 1;
> >>>>>>>> +               if (intr)
> >>>>>>>> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>>>> +               else
> >>>>>>>> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>>>> +               matched = true;
> >>>>>>>> +               break;
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       /* Just need the highest qwords */
> >>>>>>> I'm not following here. Does the bitmap need to handle gaps?
> >>>>>> Currently no. In theory, the kernel allows user space to sample only a
> >>>>>> subset of the SIMD registers, e.g., 0xff or 0xf0f for the XMM registers
> >>>>>> (the HW supports 16 XMM registers), but perf tools don't support that,
> >>>>>> to avoid introducing too much complexity. Moreover, I don't think end
> >>>>>> users have such a requirement. In most cases, users only know which
> >>>>>> kinds of SIMD registers their programs use, but usually don't know or
> >>>>>> care which exact SIMD register is used.
> >>>>>>
> >>>>>>
> >>>>>>>> +       if (qwords > opts->sample_vec_regs_qwords) {
> >>>>>>>> +               opts->sample_vec_regs_qwords = qwords;
> >>>>>>>> +               if (intr)
> >>>>>>>> +                       opts->sample_intr_vec_regs = bitmap;
> >>>>>>>> +               else
> >>>>>>>> +                       opts->sample_user_vec_regs = bitmap;
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       return matched;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static bool __parse_pred_regs(struct record_opts *opts, char *s, bool intr)
> >>>>>>>> +{
> >>>>>>>> +       const struct sample_reg *r = NULL;
> >>>>>>>> +       bool matched = false;
> >>>>>>>> +       uint64_t bitmap = 0;
> >>>>>>>> +       u16 qwords = 0;
> >>>>>>>> +       int reg_idx;
> >>>>>>>> +
> >>>>>>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> >>>>>>>> +               if (strcasecmp(s, r->name))
> >>>>>>>> +                       continue;
> >>>>>>>> +               if (!fls64(r->mask))
> >>>>>>>> +                       continue;
> >>>>>>>> +               reg_idx = fls64(r->mask) - 1;
> >>>>>>>> +               if (intr)
> >>>>>>>> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>>>> +               else
> >>>>>>>> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>>>> +               matched = true;
> >>>>>>>> +               break;
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       /* Just need the highest qwords */
> >>>>>>> Again repetitive, could we have a single function?
> >>>>>> Yes, at least the for loop can be extracted into a common function.
> >>>>>>
> >>>>>>
> >>>>>>>> +       if (qwords > opts->sample_pred_regs_qwords) {
> >>>>>>>> +               opts->sample_pred_regs_qwords = qwords;
> >>>>>>>> +               if (intr)
> >>>>>>>> +                       opts->sample_intr_pred_regs = bitmap;
> >>>>>>>> +               else
> >>>>>>>> +                       opts->sample_user_pred_regs = bitmap;
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       return matched;
> >>>>>>>> +}
> >>>>>>>>
> >>>>>>>>  static int
> >>>>>>>>  __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>>>>>  {
> >>>>>>>>         uint64_t *mode = (uint64_t *)opt->value;
> >>>>>>>>         const struct sample_reg *r = NULL;
> >>>>>>>> +       struct record_opts *opts;
> >>>>>>>>         char *s, *os = NULL, *p;
> >>>>>>>> -       int ret = -1;
> >>>>>>>> +       bool has_simd_regs = false;
> >>>>>>>>         uint64_t mask;
> >>>>>>>> +       uint64_t simd_mask;
> >>>>>>>> +       uint64_t pred_mask;
> >>>>>>>> +       int ret = -1;
> >>>>>>>>
> >>>>>>>>         if (unset)
> >>>>>>>>                 return 0;
> >>>>>>>> @@ -27,10 +147,17 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>>>>>         if (*mode)
> >>>>>>>>                 return -1;
> >>>>>>>>
> >>>>>>>> -       if (intr)
> >>>>>>>> +       if (intr) {
> >>>>>>>> +               opts = container_of(opt->value, struct record_opts, sample_intr_regs);
> >>>>>>>>                 mask = arch__intr_reg_mask();
> >>>>>>>> -       else
> >>>>>>>> +               simd_mask = arch__intr_simd_reg_mask();
> >>>>>>>> +               pred_mask = arch__intr_pred_reg_mask();
> >>>>>>>> +       } else {
> >>>>>>>> +               opts = container_of(opt->value, struct record_opts, sample_user_regs);
> >>>>>>>>                 mask = arch__user_reg_mask();
> >>>>>>>> +               simd_mask = arch__user_simd_reg_mask();
> >>>>>>>> +               pred_mask = arch__user_pred_reg_mask();
> >>>>>>>> +       }
> >>>>>>>>
> >>>>>>>>         /* str may be NULL in case no arg is passed to -I */
> >>>>>>>>         if (str) {
> >>>>>>>> @@ -50,10 +177,24 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>>>>>                                         if (r->mask & mask)
> >>>>>>>>                                                 fprintf(stderr, "%s ", r->name);
> >>>>>>>>                                 }
> >>>>>>>> +                               __print_simd_regs(intr, simd_mask);
> >>>>>>>> +                               __print_pred_regs(intr, pred_mask);
> >>>>>>>>                                 fputc('\n', stderr);
> >>>>>>>>                                 /* just printing available regs */
> >>>>>>>>                                 goto error;
> >>>>>>>>                         }
> >>>>>>>> +
> >>>>>>>> +                       if (simd_mask) {
> >>>>>>>> +                               has_simd_regs = __parse_simd_regs(opts, s, intr);
> >>>>>>>> +                               if (has_simd_regs)
> >>>>>>>> +                                       goto next;
> >>>>>>>> +                       }
> >>>>>>>> +                       if (pred_mask) {
> >>>>>>>> +                               has_simd_regs = __parse_pred_regs(opts, s, intr);
> >>>>>>>> +                               if (has_simd_regs)
> >>>>>>>> +                                       goto next;
> >>>>>>>> +                       }
> >>>>>>>> +
> >>>>>>>>                         for (r = arch__sample_reg_masks(); r->name; r++) {
> >>>>>>>>                                 if ((r->mask & mask) && !strcasecmp(s, r->name))
> >>>>>>>>                                         break;
> >>>>>>>> @@ -65,7 +206,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>>>>>                         }
> >>>>>>>>
> >>>>>>>>                         *mode |= r->mask;
> >>>>>>>> -
> >>>>>>>> +next:
> >>>>>>>>                         if (!p)
> >>>>>>>>                                 break;
> >>>>>>>>
> >>>>>>>> @@ -75,7 +216,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>>>>>         ret = 0;
> >>>>>>>>
> >>>>>>>>         /* default to all possible regs */
> >>>>>>>> -       if (*mode == 0)
> >>>>>>>> +       if (*mode == 0 && !has_simd_regs)
> >>>>>>>>                 *mode = mask;
> >>>>>>>>  error:
> >>>>>>>>         free(os);
> >>>>>>>> diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
> >>>>>>>> index 66b666d9ce64..fb0366d050cf 100644
> >>>>>>>> --- a/tools/perf/util/perf_event_attr_fprintf.c
> >>>>>>>> +++ b/tools/perf/util/perf_event_attr_fprintf.c
> >>>>>>>> @@ -360,6 +360,12 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
> >>>>>>>>         PRINT_ATTRf(aux_start_paused, p_unsigned);
> >>>>>>>>         PRINT_ATTRf(aux_pause, p_unsigned);
> >>>>>>>>         PRINT_ATTRf(aux_resume, p_unsigned);
> >>>>>>>> +       PRINT_ATTRf(sample_simd_pred_reg_qwords, p_unsigned);
> >>>>>>>> +       PRINT_ATTRf(sample_simd_pred_reg_intr, p_hex);
> >>>>>>>> +       PRINT_ATTRf(sample_simd_pred_reg_user, p_hex);
> >>>>>>>> +       PRINT_ATTRf(sample_simd_vec_reg_qwords, p_unsigned);
> >>>>>>>> +       PRINT_ATTRf(sample_simd_vec_reg_intr, p_hex);
> >>>>>>>> +       PRINT_ATTRf(sample_simd_vec_reg_user, p_hex);
> >>>>>>>>
> >>>>>>>>         return ret;
> >>>>>>>>  }
> >>>>>>>> diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
> >>>>>>>> index 44b90bbf2d07..e8a9fabc92e6 100644
> >>>>>>>> --- a/tools/perf/util/perf_regs.c
> >>>>>>>> +++ b/tools/perf/util/perf_regs.c
> >>>>>>>> @@ -11,6 +11,11 @@ int __weak arch_sdt_arg_parse_op(char *old_op __maybe_unused,
> >>>>>>>>         return SDT_ARG_SKIP;
> >>>>>>>>  }
> >>>>>>>>
> >>>>>>>> +bool __weak arch_has_simd_regs(u64 mask __maybe_unused)
> >>>>>>>> +{
> >>>>>>>> +       return false;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>>  uint64_t __weak arch__intr_reg_mask(void)
> >>>>>>>>  {
> >>>>>>>>         return 0;
> >>>>>>>> @@ -21,6 +26,50 @@ uint64_t __weak arch__user_reg_mask(void)
> >>>>>>>>         return 0;
> >>>>>>>>  }
> >>>>>>>>
> >>>>>>>> +uint64_t __weak arch__intr_simd_reg_mask(void)
> >>>>>>>> +{
> >>>>>>>> +       return 0;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t __weak arch__user_simd_reg_mask(void)
> >>>>>>>> +{
> >>>>>>>> +       return 0;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t __weak arch__intr_pred_reg_mask(void)
> >>>>>>>> +{
> >>>>>>>> +       return 0;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t __weak arch__user_pred_reg_mask(void)
> >>>>>>>> +{
> >>>>>>>> +       return 0;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t __weak arch__intr_simd_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
> >>>>>>>> +{
> >>>>>>>> +       *qwords = 0;
> >>>>>>>> +       return 0;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t __weak arch__user_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> >>>>>>>> +{
> >>>>>>>> +       *qwords = 0;
> >>>>>>>> +       return 0;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t __weak arch__intr_pred_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
> >>>>>>>> +{
> >>>>>>>> +       *qwords = 0;
> >>>>>>>> +       return 0;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t __weak arch__user_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> >>>>>>>> +{
> >>>>>>>> +       *qwords = 0;
> >>>>>>>> +       return 0;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>>  static const struct sample_reg sample_reg_masks[] = {
> >>>>>>>>         SMPL_REG_END
> >>>>>>>>  };
> >>>>>>>> @@ -30,6 +79,16 @@ const struct sample_reg * __weak arch__sample_reg_masks(void)
> >>>>>>>>         return sample_reg_masks;
> >>>>>>>>  }
> >>>>>>>>
> >>>>>>>> +const struct sample_reg * __weak arch__sample_simd_reg_masks(void)
> >>>>>>>> +{
> >>>>>>>> +       return sample_reg_masks;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +const struct sample_reg * __weak arch__sample_pred_reg_masks(void)
> >>>>>>>> +{
> >>>>>>>> +       return sample_reg_masks;
> >>>>>>>> +}
> >>>>>>> Thinking out loud. I wonder if there is a way to hide the weak
> >>>>>>> functions. It seems the support is tied to PMUs, particularly core
> >>>>>>> PMUs, perhaps we can push things into pmu and arch pmu code. Then we
> >>>>>>> ask the PMU to parse the register strings, set up the perf_event_attr,
> >>>>>>> etc. I'm somewhat scared these functions will be used on the report
> >>>>>>> rather than record side of things, thereby breaking perf.data support
> >>>>>>> when the host kernel does or doesn't have the SIMD support.
> >>>>>> Ian, I don't quite follow you.
> >>>>>>
> >>>>>> I don't quite understand what we should do to "push things into pmu and
> >>>>>> arch pmu code". The current SIMD register support follows the same
> >>>>>> approach as the general register support. If we intend to change that
> >>>>>> approach entirely, we'd better do it in an independent patch set.
> >>>>>>
> >>>>>> Why would these functions break perf.data reporting? perf-report checks
> >>>>>> whether the PERF_SAMPLE_REGS_ABI_SIMD flag is set for each record; only
> >>>>>> when the flag is set (indicating SIMD register data is appended to the
> >>>>>> record) does perf-report try to parse the SIMD register data.
> >>>>> Thanks Dapeng, sorry I wasn't clear. So, I've landed clean ups to
> >>>>> remove weak symbols like:
> >>>>> https://lore.kernel.org/lkml/20250724163302.596743-21-irogers@google.com/#t
> >>>>>
> >>>>> For these patches what I'm imagining is that there is a Nova Lake
> >>>>> generated perf.data file. Using perf report, script, etc. on the Nova
> >>>>> Lake should expose all of the same mask, qword, etc. values as when
> >>>>> the perf.data was generated and so things will work. If the perf.data
> >>>>> file was taken to say my Alderlake then what will happen? Generally
> >>>>> using the arch directory and weak symbols is a code smell that cross
> >>>>> platform things are going to break - there should be sufficient data
> >>>>> in the event and the perf_event_attr to fully decode what's going on.
> >>>>> Sometimes tying things to a PMU name can avoid the use of the arch
> >>>>> directory. We were able to avoid the arch directory to a good extent
> >>>>> for the TPEBS code, even though it is a very modern Intel feature.
> >>>> I see.
> >>>>
> >>>> But the sampling support for SIMD registers differs from the sample
> >>>> weight processing in the patch
> >>>> https://lore.kernel.org/lkml/20250724163302.596743-21-irogers@google.com/#t.
> >>>> Each arch may support different kinds of SIMD registers, and furthermore
> >>>> each kind of SIMD register may have a different register count and
> >>>> register width. It's quite hard to define common functions or fields to
> >>>> represent the names and attributes of these arch-specific SIMD registers.
> >>>> This arch-specific information can only be provided by arch-specific code,
> >>>> so the __weak functions still look like the easiest way to implement this.
> >>>>
> >>>> I don't think perf.data parsing would break when moving from one platform
> >>>> to another of the same arch, e.g., from Nova Lake to Alder Lake. To
> >>>> indicate the presence of SIMD registers in the record data, a new ABI flag
> >>>> "PERF_SAMPLE_REGS_ABI_SIMD" is introduced. If the perf tool on the 2nd
> >>>> platform is new enough to recognize this new flag, the SIMD register data
> >>>> will be parsed correctly. Even if the perf tool is old and has no SIMD
> >>>> register support, the SIMD register data would just be silently ignored
> >>>> and should not break parsing.
> >>> That's good to know. I'm confused then why these functions can't just
> >>> be within the arch directory? For example, we don't expose the
> >>> intel-pt PMU code in the common code except for the parsing parts. A
> >>> lot of that is handled by the default perf_event_attr initialization
> >>> that every PMU can have its own variant of:
> >>> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/pmu.h?h=perf-tools-next#n123
> >> I see. From my point of view, there seems to be no essential difference
> >> between a function pointer and a __weak function, and it looks hard to
> >> find a common data structure to hold all these function pointers, which
> >> need to be called in different places, like register name parsing,
> >> register data dumping ...
> >>
> >>
> >>> Perhaps this is all just evidence of tech debt in the perf_regs.c code
> >>> :-/ The bit that's relevant to the patch here is that I think this is
> >>> adding to the tech debt problem as 11 more functions are added to
> >>> perf_regs.h.
> >> Yeah, 11 new __weak functions seem like too many. We could merge functions
> >> of the same kind, like merging *_simd_reg_mask() and *_pred_reg_mask()
> >> into a single function with a type argument; then the number of newly
> >> added __weak functions could shrink by half.
> > There could be a good reason for 11 weak functions :-) In the
> > perf_event.h you've added to the sample event:
> > ```
> > +        *        u64                   regs[weight(mask)];
> > +        *        struct {
> > +        *              u16 nr_vectors;
> > +        *              u16 vector_qwords;
> > +        *              u16 nr_pred;
> > +        *              u16 pred_qwords;
> > +        *              u64 data[nr_vectors * vector_qwords + nr_pred
> > * pred_qwords];
> > +        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
> > +        *      } && PERF_SAMPLE_REGS_USER
> > ```
> > so these things are readable/writable outside of builds with arch/x86
> > compiled in, which is why it seems odd that there needs to be arch
> > code in the common code to handle them. Similar to how I needed to get
> > the retirement latency parsing out of the arch/x86 directory as
> > potentially you could be looking at a perf.data file with retirement
> > latencies in it on a non-x86 platform.
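The sample layout quoted above is self-describing, which is what makes it readable on any platform. As a hedged illustration only (the struct and function names below are invented for this sketch and are not perf's actual parser), a reader could size the SIMD block like this:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: mirrors the four u16 header fields in the quoted layout. */
struct simd_regs_hdr {
	uint16_t nr_vectors;
	uint16_t vector_qwords;
	uint16_t nr_pred;
	uint16_t pred_qwords;
};

/* Total bytes of the SIMD block: the u16 header plus the u64 data[] payload. */
static inline uint64_t simd_block_bytes(const struct simd_regs_hdr *h)
{
	uint64_t qwords = (uint64_t)h->nr_vectors * h->vector_qwords +
			  (uint64_t)h->nr_pred * h->pred_qwords;

	return sizeof(*h) + qwords * sizeof(uint64_t);
}
```

For example, 16 XMM registers (2 qwords each) with no predicate registers would give 8 + 16 * 2 * 8 = 264 bytes, with no arch-specific knowledge involved.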
>
> Ian, I'm not sure if I fully get your point. If not, please correct me.
>
> Although these newly introduced fields are generic and exist on all
> architectures, they are not enough to provide all the information needed
> to dump or parse the SIMD registers, e.g., the SIMD register names.
>
> Let's take dumping the sampled values of SIMD registers as an example.

> We know there could be different kinds of SIMD registers on different
> archs, like XMM/YMM/ZMM on x86 and V-registers/Z-registers on ARM.
>
> Currently we only know the register count and width from the generic
> fields; we have no way to directly know the exact name each SIMD register
> corresponds to. We have to involve an arch-specific function to figure
> that out and then print the names.
>
> At least for now, it looks like we still need these arch-specific functions ...

Thanks Dapeng. I started by thinking out loud, so I'm not saying this
is necessarily something to fix in this patch series, but it probably
is something that needs to be fixed.

You mention that different archs have different registers and so we
need different routines for those archs, implying weak symbols, etc.
We do actually have generic register dumping code in get_dwarf_regstr:
https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/dwarf-regs.c?h=perf-tools-next#n33
It takes the dwarf register number, the ELF Ehdr e_machine and for the
purposes of csky the e_flags. If you want the e_machine for the perf
binary itself (such as in perf record when you don't yet have a
perf.data file) there is an EM_HOST value:
https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/include/dwarf-regs.h?h=perf-tools-next#n27
Perf has historically used a CPUID string, but I'd like to deprecate
that in favor of just using e_machine (and possibly e_flags) values.
We should probably have CPUID-string-to-e_machine conversion utility
functions and remove cpuid from the perf_env:
https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/env.h?h=perf-tools-next#n67
but anyway, my point isn't about the e_machine values.

What I'm trying to say is that weak symbols and code in the arch
directory inherently mean that cross-platform support will break. For
example, before:
https://lore.kernel.org/lkml/20250724163302.596743-21-irogers@google.com/
perf_parse_sample_weight just simply didn't exist outside of PowerPC
and x86. This meant that the part of the perf event in the perf.data
containing the sample weights couldn't be parsed on say an ARM64 build
of perf. This meant the values couldn't even be dumped in perf script.
The values are, however, described in the cross platform perf sample
event format, much as the SIMD registers are here.

It seems that, given we have from a perf.data file at least a CPUID
string from the header features, a perf_event_attr and the register
number, we should be able to do something like get_dwarf_regstr. Such a
function wouldn't live in the arch directory, as we wouldn't want to
interpret registers in events only on x86 platforms (as with the
retirement latency). If we're not able to do this, then there seems to
be something wrong with the SIMD change and perhaps we need to capture
more information in the perf.data file header.
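To make that idea concrete, a cross-platform lookup in the spirit of get_dwarf_regstr could be keyed on the ELF e_machine value. The sketch below is purely hypothetical: the `simd_reg_kind` indices and the `simd_reg_name` helper are invented for illustration and are not perf's real PERF_REG_X86_* / ARM64 register encodings.

```c
#include <elf.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical cross-platform SIMD register-name lookup, analogous to
 * get_dwarf_regstr(): dispatch on the ELF e_machine instead of relying on
 * code in tools/perf/arch/. The register indices below are invented for
 * this sketch, not perf's real register numbering.
 */
enum simd_reg_kind { SIMD_REG_0, SIMD_REG_1, SIMD_REG_2 };

static const char *simd_reg_name(unsigned int e_machine, enum simd_reg_kind reg)
{
	switch (e_machine) {
	case EM_X86_64:
		switch (reg) {
		case SIMD_REG_0: return "XMM";
		case SIMD_REG_1: return "YMM";
		case SIMD_REG_2: return "ZMM";
		}
		break;
	case EM_AARCH64:
		switch (reg) {
		case SIMD_REG_0: return "V";	/* Advanced SIMD */
		case SIMD_REG_1: return "Z";	/* SVE */
		default: break;
		}
		break;
	default:
		break;
	}
	return NULL;	/* unknown machine/register: caller must fall back */
}
```

With something of this shape, an ARM64 build of perf could still name the registers in an x86-generated perf.data file, which is exactly the cross-platform property being argued for here.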

Thanks,
Ian

> >
> > Thanks,
> > Ian
> >
> >>> Thanks,
> >>> Ian
> >>>
> >>>>> Thanks,
> >>>>> Ian
> >>>>>
> >>>>>
> >>>>>
> >>>>>>> Thanks,
> >>>>>>> Ian
> >>>>>>>
> >>>>>>>> +
> >>>>>>>>  const char *perf_reg_name(int id, const char *arch)
> >>>>>>>>  {
> >>>>>>>>         const char *reg_name = NULL;
> >>>>>>>> diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
> >>>>>>>> index f2d0736d65cc..bce9c4cfd1bf 100644
> >>>>>>>> --- a/tools/perf/util/perf_regs.h
> >>>>>>>> +++ b/tools/perf/util/perf_regs.h
> >>>>>>>> @@ -24,9 +24,20 @@ enum {
> >>>>>>>>  };
> >>>>>>>>
> >>>>>>>>  int arch_sdt_arg_parse_op(char *old_op, char **new_op);
> >>>>>>>> +bool arch_has_simd_regs(u64 mask);
> >>>>>>>>  uint64_t arch__intr_reg_mask(void);
> >>>>>>>>  uint64_t arch__user_reg_mask(void);
> >>>>>>>>  const struct sample_reg *arch__sample_reg_masks(void);
> >>>>>>>> +const struct sample_reg *arch__sample_simd_reg_masks(void);
> >>>>>>>> +const struct sample_reg *arch__sample_pred_reg_masks(void);
> >>>>>>>> +uint64_t arch__intr_simd_reg_mask(void);
> >>>>>>>> +uint64_t arch__user_simd_reg_mask(void);
> >>>>>>>> +uint64_t arch__intr_pred_reg_mask(void);
> >>>>>>>> +uint64_t arch__user_pred_reg_mask(void);
> >>>>>>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>>>>
> >>>>>>>>  const char *perf_reg_name(int id, const char *arch);
> >>>>>>>>  int perf_reg_value(u64 *valp, struct regs_dump *regs, int id);
> >>>>>>>> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
> >>>>>>>> index ea3a6c4657ee..825ffb4cc53f 100644
> >>>>>>>> --- a/tools/perf/util/record.h
> >>>>>>>> +++ b/tools/perf/util/record.h
> >>>>>>>> @@ -59,7 +59,13 @@ struct record_opts {
> >>>>>>>>         unsigned int  user_freq;
> >>>>>>>>         u64           branch_stack;
> >>>>>>>>         u64           sample_intr_regs;
> >>>>>>>> +       u64           sample_intr_vec_regs;
> >>>>>>>>         u64           sample_user_regs;
> >>>>>>>> +       u64           sample_user_vec_regs;
> >>>>>>>> +       u16           sample_pred_regs_qwords;
> >>>>>>>> +       u16           sample_vec_regs_qwords;
> >>>>>>>> +       u16           sample_intr_pred_regs;
> >>>>>>>> +       u16           sample_user_pred_regs;
> >>>>>>>>         u64           default_interval;
> >>>>>>>>         u64           user_interval;
> >>>>>>>>         size_t        auxtrace_snapshot_size;
> >>>>>>>> --
> >>>>>>>> 2.34.1
> >>>>>>>>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 16/19] perf/x86: Activate back-to-back NMI detection for arch-PEBS induced NMIs
  2025-12-05 12:39   ` Peter Zijlstra
@ 2025-12-07 20:44     ` Andi Kleen
  2025-12-08  6:46     ` Mi, Dapeng
  1 sibling, 0 replies; 86+ messages in thread
From: Andi Kleen @ 2025-12-07 20:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dapeng Mi, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Eranian Stephane, Mark Rutland,
	broonie, Ravi Bangoria, linux-kernel, linux-perf-users, Zide Chen,
	Falcon Thomas, Dapeng Mi, Xudong Hao

On Fri, Dec 05, 2025 at 01:39:40PM +0100, Peter Zijlstra wrote:
> On Wed, Dec 03, 2025 at 02:54:57PM +0800, Dapeng Mi wrote:
> > When two or more identical PEBS events with the same sampling period are
> > programmed on a mix of PDIST and non-PDIST counters, multiple
> > back-to-back NMIs can be triggered.
> 
> This is a hardware defect -- albeit a fairly common one.

Actually I disagree on that. PEBS is essentially a shared memory
protocol between two asynchronous agents. To prevent this you would need a
locking protocol somehow for the memory, otherwise the sender (PEBS) has
no way to know that the PMI handler is finished reading the memory
buffers.

So it cannot know that the second event was already parsed, and
has to send the second PMI just in case.

It didn't happen with the legacy PEBS because it always 
collapsed multiple counters into one, but that was really a race
too.

-Andi


* Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
  2025-12-05 16:35                   ` Ian Rogers
@ 2025-12-08  4:20                     ` Mi, Dapeng
  2026-01-06  7:27                       ` Mi, Dapeng
  0 siblings, 1 reply; 86+ messages in thread
From: Mi, Dapeng @ 2025-12-08  4:20 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang


On 12/6/2025 12:35 AM, Ian Rogers wrote:
> On Fri, Dec 5, 2025 at 12:10 AM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>
>> On 12/5/2025 2:38 PM, Ian Rogers wrote:
>>> On Thu, Dec 4, 2025 at 8:00 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>>> On 12/5/2025 12:16 AM, Ian Rogers wrote:
>>>>> On Thu, Dec 4, 2025 at 1:20 AM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>>>>> On 12/4/2025 3:49 PM, Ian Rogers wrote:
>>>>>>> On Wed, Dec 3, 2025 at 6:58 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>>>>>>> On 12/4/2025 8:17 AM, Ian Rogers wrote:
>>>>>>>>> On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
>>>>>>>>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>>>>>>>>
>>>>>>>>>> This patch adds support for the newly introduced SIMD register sampling
>>>>>>>>>> format by adding the following functions:
>>>>>>>>>>
>>>>>>>>>> uint64_t arch__intr_simd_reg_mask(void);
>>>>>>>>>> uint64_t arch__user_simd_reg_mask(void);
>>>>>>>>>> uint64_t arch__intr_pred_reg_mask(void);
>>>>>>>>>> uint64_t arch__user_pred_reg_mask(void);
>>>>>>>>>> uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>>>> uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>>>> uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>>>> uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>>>>
>>>>>>>>>> The arch__{intr|user}_simd_reg_mask() functions retrieve the bitmap of
>>>>>>>>>> supported SIMD registers, such as XMM/YMM/ZMM on x86 platforms.
>>>>>>>>>>
>>>>>>>>>> The arch__{intr|user}_pred_reg_mask() functions retrieve the bitmap of
>>>>>>>>>> supported PRED registers, such as OPMASK on x86 platforms.
>>>>>>>>>>
>>>>>>>>>> The arch__{intr|user}_simd_reg_bitmap_qwords() functions provide the
>>>>>>>>>> exact bitmap and number of qwords for a specific type of SIMD register.
>>>>>>>>>> For example, for XMM registers on x86 platforms, the returned bitmap is
>>>>>>>>>> 0xffff (XMM0 ~ XMM15) and the qwords number is 2 (128 bits for each XMM).
>>>>>>>>>>
>>>>>>>>>> The arch__{intr|user}_pred_reg_bitmap_qwords() functions provide the
>>>>>>>>>> exact bitmap and number of qwords for a specific type of PRED register.
>>>>>>>>>> For example, for OPMASK registers on x86 platforms, the returned bitmap
>>>>>>>>>> is 0xff (OPMASK0 ~ OPMASK7) and the qwords number is 1 (64 bits for each
>>>>>>>>>> OPMASK).
>>>>>>>>>>
>>>>>>>>>> Additionally, the function __parse_regs() is enhanced to support parsing
>>>>>>>>>> these newly introduced SIMD registers. Currently, each type of register
>>>>>>>>>> can only be sampled collectively; sampling a specific SIMD register is
>>>>>>>>>> not supported. For example, all XMM registers are sampled together rather
>>>>>>>>>> than sampling only XMM0.
>>>>>>>>>>
>>>>>>>>>> When multiple overlapping register types, such as XMM and YMM, are
>>>>>>>>>> sampled simultaneously, only the superset (YMM registers) is sampled.
>>>>>>>>>>
>>>>>>>>>> With this patch, all supported sampling registers on x86 platforms are
>>>>>>>>>> displayed as follows.
>>>>>>>>>>
>>>>>>>>>>  $perf record -I?
>>>>>>>>>>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>>>>>>>>>>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>>>>>>>>>>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>>>>>>>>>>
>>>>>>>>>>  $perf record --user-regs=?
>>>>>>>>>>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>>>>>>>>>>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>>>>>>>>>>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>>>>>>>>>> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>>>>>>>>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>>>>>>>>> ---
>>>>>>>>>>  tools/perf/arch/x86/util/perf_regs.c      | 470 +++++++++++++++++++++-
>>>>>>>>>>  tools/perf/util/evsel.c                   |  27 ++
>>>>>>>>>>  tools/perf/util/parse-regs-options.c      | 151 ++++++-
>>>>>>>>>>  tools/perf/util/perf_event_attr_fprintf.c |   6 +
>>>>>>>>>>  tools/perf/util/perf_regs.c               |  59 +++
>>>>>>>>>>  tools/perf/util/perf_regs.h               |  11 +
>>>>>>>>>>  tools/perf/util/record.h                  |   6 +
>>>>>>>>>>  7 files changed, 714 insertions(+), 16 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
>>>>>>>>>> index 12fd93f04802..db41430f3b07 100644
>>>>>>>>>> --- a/tools/perf/arch/x86/util/perf_regs.c
>>>>>>>>>> +++ b/tools/perf/arch/x86/util/perf_regs.c
>>>>>>>>>> @@ -13,6 +13,49 @@
>>>>>>>>>>  #include "../../../util/pmu.h"
>>>>>>>>>>  #include "../../../util/pmus.h"
>>>>>>>>>>
>>>>>>>>>> +static const struct sample_reg sample_reg_masks_ext[] = {
>>>>>>>>>> +       SMPL_REG(AX, PERF_REG_X86_AX),
>>>>>>>>>> +       SMPL_REG(BX, PERF_REG_X86_BX),
>>>>>>>>>> +       SMPL_REG(CX, PERF_REG_X86_CX),
>>>>>>>>>> +       SMPL_REG(DX, PERF_REG_X86_DX),
>>>>>>>>>> +       SMPL_REG(SI, PERF_REG_X86_SI),
>>>>>>>>>> +       SMPL_REG(DI, PERF_REG_X86_DI),
>>>>>>>>>> +       SMPL_REG(BP, PERF_REG_X86_BP),
>>>>>>>>>> +       SMPL_REG(SP, PERF_REG_X86_SP),
>>>>>>>>>> +       SMPL_REG(IP, PERF_REG_X86_IP),
>>>>>>>>>> +       SMPL_REG(FLAGS, PERF_REG_X86_FLAGS),
>>>>>>>>>> +       SMPL_REG(CS, PERF_REG_X86_CS),
>>>>>>>>>> +       SMPL_REG(SS, PERF_REG_X86_SS),
>>>>>>>>>> +#ifdef HAVE_ARCH_X86_64_SUPPORT
>>>>>>>>>> +       SMPL_REG(R8, PERF_REG_X86_R8),
>>>>>>>>>> +       SMPL_REG(R9, PERF_REG_X86_R9),
>>>>>>>>>> +       SMPL_REG(R10, PERF_REG_X86_R10),
>>>>>>>>>> +       SMPL_REG(R11, PERF_REG_X86_R11),
>>>>>>>>>> +       SMPL_REG(R12, PERF_REG_X86_R12),
>>>>>>>>>> +       SMPL_REG(R13, PERF_REG_X86_R13),
>>>>>>>>>> +       SMPL_REG(R14, PERF_REG_X86_R14),
>>>>>>>>>> +       SMPL_REG(R15, PERF_REG_X86_R15),
>>>>>>>>>> +       SMPL_REG(R16, PERF_REG_X86_R16),
>>>>>>>>>> +       SMPL_REG(R17, PERF_REG_X86_R17),
>>>>>>>>>> +       SMPL_REG(R18, PERF_REG_X86_R18),
>>>>>>>>>> +       SMPL_REG(R19, PERF_REG_X86_R19),
>>>>>>>>>> +       SMPL_REG(R20, PERF_REG_X86_R20),
>>>>>>>>>> +       SMPL_REG(R21, PERF_REG_X86_R21),
>>>>>>>>>> +       SMPL_REG(R22, PERF_REG_X86_R22),
>>>>>>>>>> +       SMPL_REG(R23, PERF_REG_X86_R23),
>>>>>>>>>> +       SMPL_REG(R24, PERF_REG_X86_R24),
>>>>>>>>>> +       SMPL_REG(R25, PERF_REG_X86_R25),
>>>>>>>>>> +       SMPL_REG(R26, PERF_REG_X86_R26),
>>>>>>>>>> +       SMPL_REG(R27, PERF_REG_X86_R27),
>>>>>>>>>> +       SMPL_REG(R28, PERF_REG_X86_R28),
>>>>>>>>>> +       SMPL_REG(R29, PERF_REG_X86_R29),
>>>>>>>>>> +       SMPL_REG(R30, PERF_REG_X86_R30),
>>>>>>>>>> +       SMPL_REG(R31, PERF_REG_X86_R31),
>>>>>>>>>> +       SMPL_REG(SSP, PERF_REG_X86_SSP),
>>>>>>>>>> +#endif
>>>>>>>>>> +       SMPL_REG_END
>>>>>>>>>> +};
>>>>>>>>>> +
>>>>>>>>>>  static const struct sample_reg sample_reg_masks[] = {
>>>>>>>>>>         SMPL_REG(AX, PERF_REG_X86_AX),
>>>>>>>>>>         SMPL_REG(BX, PERF_REG_X86_BX),
>>>>>>>>>> @@ -276,27 +319,404 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
>>>>>>>>>>         return SDT_ARG_VALID;
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>> +static bool support_simd_reg(u64 sample_type, u16 qwords, u64 mask, bool pred)
>>>>>>>>> To make the code easier to read, it'd be nice to document sample_type,
>>>>>>>>> qwords and mask here.
>>>>>>>> Sure.
>>>>>>>>
>>>>>>>>
>>>>>>>>>> +{
>>>>>>>>>> +       struct perf_event_attr attr = {
>>>>>>>>>> +               .type                           = PERF_TYPE_HARDWARE,
>>>>>>>>>> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
>>>>>>>>>> +               .sample_type                    = sample_type,
>>>>>>>>>> +               .disabled                       = 1,
>>>>>>>>>> +               .exclude_kernel                 = 1,
>>>>>>>>>> +               .sample_simd_regs_enabled       = 1,
>>>>>>>>>> +       };
>>>>>>>>>> +       int fd;
>>>>>>>>>> +
>>>>>>>>>> +       attr.sample_period = 1;
>>>>>>>>>> +
>>>>>>>>>> +       if (!pred) {
>>>>>>>>>> +               attr.sample_simd_vec_reg_qwords = qwords;
>>>>>>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>>>>>> +                       attr.sample_simd_vec_reg_intr = mask;
>>>>>>>>>> +               else
>>>>>>>>>> +                       attr.sample_simd_vec_reg_user = mask;
>>>>>>>>>> +       } else {
>>>>>>>>>> +               attr.sample_simd_pred_reg_qwords = PERF_X86_OPMASK_QWORDS;
>>>>>>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>>>>>> +                       attr.sample_simd_pred_reg_intr = PERF_X86_SIMD_PRED_MASK;
>>>>>>>>>> +               else
>>>>>>>>>> +                       attr.sample_simd_pred_reg_user = PERF_X86_SIMD_PRED_MASK;
>>>>>>>>>> +       }
>>>>>>>>>> +
>>>>>>>>>> +       if (perf_pmus__num_core_pmus() > 1) {
>>>>>>>>>> +               struct perf_pmu *pmu = NULL;
>>>>>>>>>> +               __u64 type = PERF_TYPE_RAW;
>>>>>>>>> It should be okay to do:
>>>>>>>>> __u64 type = perf_pmus__find_core_pmu()->type
>>>>>>>>> rather than have the whole loop below.
>>>>>>>> Sure. Thanks.
>>>>>>>>
>>>>>>>>
>>>>>>>>>> +
>>>>>>>>>> +               /*
>>>>>>>>>> +                * The same register set is supported among different hybrid PMUs.
>>>>>>>>>> +                * Only check the first available one.
>>>>>>>>>> +                */
>>>>>>>>>> +               while ((pmu = perf_pmus__scan_core(pmu)) != NULL) {
>>>>>>>>>> +                       type = pmu->type;
>>>>>>>>>> +                       break;
>>>>>>>>>> +               }
>>>>>>>>>> +               attr.config |= type << PERF_PMU_TYPE_SHIFT;
>>>>>>>>>> +       }
>>>>>>>>>> +
>>>>>>>>>> +       event_attr_init(&attr);
>>>>>>>>>> +
>>>>>>>>>> +       fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
>>>>>>>>>> +       if (fd != -1) {
>>>>>>>>>> +               close(fd);
>>>>>>>>>> +               return true;
>>>>>>>>>> +       }
>>>>>>>>>> +
>>>>>>>>>> +       return false;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static bool __arch_simd_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
>>>>>>>>>> +{
>>>>>>>>>> +       bool supported = false;
>>>>>>>>>> +       u64 bits;
>>>>>>>>>> +
>>>>>>>>>> +       *mask = 0;
>>>>>>>>>> +       *qwords = 0;
>>>>>>>>>> +
>>>>>>>>>> +       switch (reg) {
>>>>>>>>>> +       case PERF_REG_X86_XMM:
>>>>>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
>>>>>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_XMM_QWORDS, bits, false);
>>>>>>>>>> +               if (supported) {
>>>>>>>>>> +                       *mask = bits;
>>>>>>>>>> +                       *qwords = PERF_X86_XMM_QWORDS;
>>>>>>>>>> +               }
>>>>>>>>>> +               break;
>>>>>>>>>> +       case PERF_REG_X86_YMM:
>>>>>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_YMM_REGS) - 1;
>>>>>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_YMM_QWORDS, bits, false);
>>>>>>>>>> +               if (supported) {
>>>>>>>>>> +                       *mask = bits;
>>>>>>>>>> +                       *qwords = PERF_X86_YMM_QWORDS;
>>>>>>>>>> +               }
>>>>>>>>>> +               break;
>>>>>>>>>> +       case PERF_REG_X86_ZMM:
>>>>>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_ZMM_REGS) - 1;
>>>>>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
>>>>>>>>>> +               if (supported) {
>>>>>>>>>> +                       *mask = bits;
>>>>>>>>>> +                       *qwords = PERF_X86_ZMM_QWORDS;
>>>>>>>>>> +                       break;
>>>>>>>>>> +               }
>>>>>>>>>> +
>>>>>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_ZMMH_REGS) - 1;
>>>>>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
>>>>>>>>>> +               if (supported) {
>>>>>>>>>> +                       *mask = bits;
>>>>>>>>>> +                       *qwords = PERF_X86_ZMMH_QWORDS;
>>>>>>>>>> +               }
>>>>>>>>>> +               break;
>>>>>>>>>> +       default:
>>>>>>>>>> +               break;
>>>>>>>>>> +       }
>>>>>>>>>> +
>>>>>>>>>> +       return supported;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static bool __arch_pred_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
>>>>>>>>>> +{
>>>>>>>>>> +       bool supported = false;
>>>>>>>>>> +       u64 bits;
>>>>>>>>>> +
>>>>>>>>>> +       *mask = 0;
>>>>>>>>>> +       *qwords = 0;
>>>>>>>>>> +
>>>>>>>>>> +       switch (reg) {
>>>>>>>>>> +       case PERF_REG_X86_OPMASK:
>>>>>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_OPMASK_REGS) - 1;
>>>>>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_OPMASK_QWORDS, bits, true);
>>>>>>>>>> +               if (supported) {
>>>>>>>>>> +                       *mask = bits;
>>>>>>>>>> +                       *qwords = PERF_X86_OPMASK_QWORDS;
>>>>>>>>>> +               }
>>>>>>>>>> +               break;
>>>>>>>>>> +       default:
>>>>>>>>>> +               break;
>>>>>>>>>> +       }
>>>>>>>>>> +
>>>>>>>>>> +       return supported;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static bool has_cap_simd_regs(void)
>>>>>>>>>> +{
>>>>>>>>>> +       uint64_t mask = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
>>>>>>>>>> +       u16 qwords = PERF_X86_XMM_QWORDS;
>>>>>>>>>> +       static bool has_cap_simd_regs;
>>>>>>>>>> +       static bool cached;
>>>>>>>>>> +
>>>>>>>>>> +       if (cached)
>>>>>>>>>> +               return has_cap_simd_regs;
>>>>>>>>>> +
>>>>>>>>>> +       has_cap_simd_regs = __arch_simd_reg_mask(PERF_SAMPLE_REGS_INTR,
>>>>>>>>>> +                                                PERF_REG_X86_XMM, &mask, &qwords);
>>>>>>>>>> +       has_cap_simd_regs |= __arch_simd_reg_mask(PERF_SAMPLE_REGS_USER,
>>>>>>>>>> +                                                PERF_REG_X86_XMM, &mask, &qwords);
>>>>>>>>>> +       cached = true;
>>>>>>>>>> +
>>>>>>>>>> +       return has_cap_simd_regs;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +bool arch_has_simd_regs(u64 mask)
>>>>>>>>>> +{
>>>>>>>>>> +       return has_cap_simd_regs() &&
>>>>>>>>>> +              mask & GENMASK_ULL(PERF_REG_X86_SSP, PERF_REG_X86_R16);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static const struct sample_reg sample_simd_reg_masks[] = {
>>>>>>>>>> +       SMPL_REG(XMM, PERF_REG_X86_XMM),
>>>>>>>>>> +       SMPL_REG(YMM, PERF_REG_X86_YMM),
>>>>>>>>>> +       SMPL_REG(ZMM, PERF_REG_X86_ZMM),
>>>>>>>>>> +       SMPL_REG_END
>>>>>>>>>> +};
>>>>>>>>>> +
>>>>>>>>>> +static const struct sample_reg sample_pred_reg_masks[] = {
>>>>>>>>>> +       SMPL_REG(OPMASK, PERF_REG_X86_OPMASK),
>>>>>>>>>> +       SMPL_REG_END
>>>>>>>>>> +};
>>>>>>>>>> +
>>>>>>>>>> +const struct sample_reg *arch__sample_simd_reg_masks(void)
>>>>>>>>>> +{
>>>>>>>>>> +       return sample_simd_reg_masks;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +const struct sample_reg *arch__sample_pred_reg_masks(void)
>>>>>>>>>> +{
>>>>>>>>>> +       return sample_pred_reg_masks;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static bool x86_intr_simd_updated;
>>>>>>>>>> +static u64 x86_intr_simd_reg_mask;
>>>>>>>>>> +static u64 x86_intr_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
>>>>>>>>>> +static u16 x86_intr_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
>>>>>>>>> Could we add some comments? I can kind of figure out that "updated"
>>>>>>>>> is a flag for lazy initialization and what the masks are, but
>>>>>>>>> "qwords" is an odd one. The comment could also point out that SIMD
>>>>>>>>> here doesn't mean the machine supports SIMD, but that SIMD registers
>>>>>>>>> are supported in perf events.
>>>>>>>> Sure.
>>>>>>>>
>>>>>>>>
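[Editor's note: a minimal standalone sketch of the lazy-init caching pattern these statics implement, with comments along the lines the review asks for. All names ("intr_simd_*", "probe_simd_regs") mirror the patch but are illustrative stand-ins, not the real perf-tool code.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_SIMD_REGS 4

/*
 * Lazily initialized cache of the SIMD register state perf events can
 * sample. "SIMD" here means perf events support sampling the registers,
 * not that the machine supports SIMD at all.
 */
static bool intr_simd_updated;                   /* lazy-init guard */
static uint64_t intr_simd_reg_mask;              /* supported reg-set mask */
static uint64_t intr_simd_mask[MAX_SIMD_REGS];   /* per-set register bitmap */
static uint16_t intr_simd_qwords[MAX_SIMD_REGS]; /* qwords per register */

static int probe_calls; /* counts how often the expensive probe runs */

/* Stand-in for the real probe, which calls perf_event_open(). */
static uint64_t probe_simd_regs(void)
{
	probe_calls++;
	intr_simd_mask[0] = 0xffff; /* pretend 16 XMM registers work */
	intr_simd_qwords[0] = 2;    /* an XMM register is 2 qwords wide */
	return 0x1;
}

uint64_t intr_simd_mask_cached(void)
{
	/* Probe once, then serve every later caller from the cache. */
	if (!intr_simd_updated) {
		intr_simd_reg_mask = probe_simd_regs();
		intr_simd_updated = true;
	}
	return intr_simd_reg_mask;
}
```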
>>>>>>>>>> +static bool x86_user_simd_updated;
>>>>>>>>>> +static u64 x86_user_simd_reg_mask;
>>>>>>>>>> +static u64 x86_user_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
>>>>>>>>>> +static u16 x86_user_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
>>>>>>>>>> +
>>>>>>>>>> +static bool x86_intr_pred_updated;
>>>>>>>>>> +static u64 x86_intr_pred_reg_mask;
>>>>>>>>>> +static u64 x86_intr_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
>>>>>>>>>> +static u16 x86_intr_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
>>>>>>>>>> +static bool x86_user_pred_updated;
>>>>>>>>>> +static u64 x86_user_pred_reg_mask;
>>>>>>>>>> +static u64 x86_user_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
>>>>>>>>>> +static u16 x86_user_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
>>>>>>>>>> +
>>>>>>>>>> +static uint64_t __arch__simd_reg_mask(u64 sample_type)
>>>>>>>>>> +{
>>>>>>>>>> +       const struct sample_reg *r = NULL;
>>>>>>>>>> +       bool supported;
>>>>>>>>>> +       u64 mask = 0;
>>>>>>>>>> +       int reg;
>>>>>>>>>> +
>>>>>>>>>> +       if (!has_cap_simd_regs())
>>>>>>>>>> +               return 0;
>>>>>>>>>> +
>>>>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_simd_updated)
>>>>>>>>>> +               return x86_intr_simd_reg_mask;
>>>>>>>>>> +
>>>>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_simd_updated)
>>>>>>>>>> +               return x86_user_simd_reg_mask;
>>>>>>>>>> +
>>>>>>>>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>>>>>>>>>> +               supported = false;
>>>>>>>>>> +
>>>>>>>>>> +               if (!r->mask)
>>>>>>>>>> +                       continue;
>>>>>>>>>> +               reg = fls64(r->mask) - 1;
>>>>>>>>>> +
>>>>>>>>>> +               if (reg >= PERF_REG_X86_MAX_SIMD_REGS)
>>>>>>>>>> +                       break;
>>>>>>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>>>>>> +                       supported = __arch_simd_reg_mask(sample_type, reg,
>>>>>>>>>> +                                                        &x86_intr_simd_mask[reg],
>>>>>>>>>> +                                                        &x86_intr_simd_qwords[reg]);
>>>>>>>>>> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
>>>>>>>>>> +                       supported = __arch_simd_reg_mask(sample_type, reg,
>>>>>>>>>> +                                                        &x86_user_simd_mask[reg],
>>>>>>>>>> +                                                        &x86_user_simd_qwords[reg]);
>>>>>>>>>> +               if (supported)
>>>>>>>>>> +                       mask |= BIT_ULL(reg);
>>>>>>>>>> +       }
>>>>>>>>>> +
>>>>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
>>>>>>>>>> +               x86_intr_simd_reg_mask = mask;
>>>>>>>>>> +               x86_intr_simd_updated = true;
>>>>>>>>>> +       } else {
>>>>>>>>>> +               x86_user_simd_reg_mask = mask;
>>>>>>>>>> +               x86_user_simd_updated = true;
>>>>>>>>>> +       }
>>>>>>>>>> +
>>>>>>>>>> +       return mask;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static uint64_t __arch__pred_reg_mask(u64 sample_type)
>>>>>>>>>> +{
>>>>>>>>>> +       const struct sample_reg *r = NULL;
>>>>>>>>>> +       bool supported;
>>>>>>>>>> +       u64 mask = 0;
>>>>>>>>>> +       int reg;
>>>>>>>>>> +
>>>>>>>>>> +       if (!has_cap_simd_regs())
>>>>>>>>>> +               return 0;
>>>>>>>>>> +
>>>>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_pred_updated)
>>>>>>>>>> +               return x86_intr_pred_reg_mask;
>>>>>>>>>> +
>>>>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_pred_updated)
>>>>>>>>>> +               return x86_user_pred_reg_mask;
>>>>>>>>>> +
>>>>>>>>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>>>>>>>>>> +               supported = false;
>>>>>>>>>> +
>>>>>>>>>> +               if (!r->mask)
>>>>>>>>>> +                       continue;
>>>>>>>>>> +               reg = fls64(r->mask) - 1;
>>>>>>>>>> +
>>>>>>>>>> +               if (reg >= PERF_REG_X86_MAX_PRED_REGS)
>>>>>>>>>> +                       break;
>>>>>>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>>>>>> +                       supported = __arch_pred_reg_mask(sample_type, reg,
>>>>>>>>>> +                                                        &x86_intr_pred_mask[reg],
>>>>>>>>>> +                                                        &x86_intr_pred_qwords[reg]);
>>>>>>>>>> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
>>>>>>>>>> +                       supported = __arch_pred_reg_mask(sample_type, reg,
>>>>>>>>>> +                                                        &x86_user_pred_mask[reg],
>>>>>>>>>> +                                                        &x86_user_pred_qwords[reg]);
>>>>>>>>>> +               if (supported)
>>>>>>>>>> +                       mask |= BIT_ULL(reg);
>>>>>>>>>> +       }
>>>>>>>>>> +
>>>>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
>>>>>>>>>> +               x86_intr_pred_reg_mask = mask;
>>>>>>>>>> +               x86_intr_pred_updated = true;
>>>>>>>>>> +       } else {
>>>>>>>>>> +               x86_user_pred_reg_mask = mask;
>>>>>>>>>> +               x86_user_pred_updated = true;
>>>>>>>>>> +       }
>>>>>>>>>> +
>>>>>>>>>> +       return mask;
>>>>>>>>>> +}
>>>>>>>>> This feels repetitive with __arch__simd_reg_mask, could they be
>>>>>>>>> refactored together?
>>>>>>>> hmm, it looks like we can extract the for loop into a common function.
>>>>>>>> The other parts are hard to generalize since they manipulate different
>>>>>>>> variables. If we tried to generalize them, we would have to introduce
>>>>>>>> lots of "if ... else" branches and that would make the code hard to
>>>>>>>> read.
>>>>>>>>
>>>>>>>>
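[Editor's note: a sketch of what extracting the shared for loop might look like, parameterized by the register table, the per-register probe, and the per-type output arrays. Types and names ("sample_reg", "build_reg_mask", "probe_fn") are simplified stand-ins for the perf-tool ones, under the assumption that only the probed state differs between the SIMD and predicate variants.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for perf's struct sample_reg. */
struct sample_reg {
	const char *name;
	uint64_t mask;
};

/* Minimal fls64(): index of the highest set bit, 1-based; 0 for no bits. */
static int fls64_(uint64_t x)
{
	int n = 0;
	while (x) { n++; x >>= 1; }
	return n;
}

typedef bool (*probe_fn)(int reg, uint64_t *mask, uint16_t *qwords);

/*
 * The loop shared by the SIMD and predicate variants: walk the table,
 * probe each register set, and collect the supported ones into a mask.
 */
static uint64_t build_reg_mask(const struct sample_reg *tbl, int max_regs,
			       probe_fn probe, uint64_t *masks,
			       uint16_t *qwords)
{
	uint64_t out = 0;

	for (const struct sample_reg *r = tbl; r->name; r++) {
		if (!r->mask)
			continue;
		int reg = fls64_(r->mask) - 1;

		if (reg >= max_regs)
			break;
		if (probe(reg, &masks[reg], &qwords[reg]))
			out |= 1ULL << reg;
	}
	return out;
}

/* Toy probe: pretend only register set 0 is available. */
static bool fake_probe(int reg, uint64_t *mask, uint16_t *qwords)
{
	if (reg != 0)
		return false;
	*mask = 0xffff;
	*qwords = 2;
	return true;
}
```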
>>>>>>>>>> +
>>>>>>>>>> +uint64_t arch__intr_simd_reg_mask(void)
>>>>>>>>>> +{
>>>>>>>>>> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_INTR);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t arch__user_simd_reg_mask(void)
>>>>>>>>>> +{
>>>>>>>>>> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_USER);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t arch__intr_pred_reg_mask(void)
>>>>>>>>>> +{
>>>>>>>>>> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_INTR);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t arch__user_pred_reg_mask(void)
>>>>>>>>>> +{
>>>>>>>>>> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_USER);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static uint64_t arch__simd_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
>>>>>>>>>> +{
>>>>>>>>>> +       uint64_t mask = 0;
>>>>>>>>>> +
>>>>>>>>>> +       *qwords = 0;
>>>>>>>>>> +       if (reg < PERF_REG_X86_MAX_SIMD_REGS) {
>>>>>>>>>> +               if (intr) {
>>>>>>>>>> +                       *qwords = x86_intr_simd_qwords[reg];
>>>>>>>>>> +                       mask = x86_intr_simd_mask[reg];
>>>>>>>>>> +               } else {
>>>>>>>>>> +                       *qwords = x86_user_simd_qwords[reg];
>>>>>>>>>> +                       mask = x86_user_simd_mask[reg];
>>>>>>>>>> +               }
>>>>>>>>>> +       }
>>>>>>>>>> +
>>>>>>>>>> +       return mask;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static uint64_t arch__pred_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
>>>>>>>>>> +{
>>>>>>>>>> +       uint64_t mask = 0;
>>>>>>>>>> +
>>>>>>>>>> +       *qwords = 0;
>>>>>>>>>> +       if (reg < PERF_REG_X86_MAX_PRED_REGS) {
>>>>>>>>>> +               if (intr) {
>>>>>>>>>> +                       *qwords = x86_intr_pred_qwords[reg];
>>>>>>>>>> +                       mask = x86_intr_pred_mask[reg];
>>>>>>>>>> +               } else {
>>>>>>>>>> +                       *qwords = x86_user_pred_qwords[reg];
>>>>>>>>>> +                       mask = x86_user_pred_mask[reg];
>>>>>>>>>> +               }
>>>>>>>>>> +       }
>>>>>>>>>> +
>>>>>>>>>> +       return mask;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>>>>>> +{
>>>>>>>>>> +       if (!x86_intr_simd_updated)
>>>>>>>>>> +               arch__intr_simd_reg_mask();
>>>>>>>>>> +       return arch__simd_reg_bitmap_qwords(reg, qwords, true);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>>>>>> +{
>>>>>>>>>> +       if (!x86_user_simd_updated)
>>>>>>>>>> +               arch__user_simd_reg_mask();
>>>>>>>>>> +       return arch__simd_reg_bitmap_qwords(reg, qwords, false);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>>>>>> +{
>>>>>>>>>> +       if (!x86_intr_pred_updated)
>>>>>>>>>> +               arch__intr_pred_reg_mask();
>>>>>>>>>> +       return arch__pred_reg_bitmap_qwords(reg, qwords, true);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>>>>>> +{
>>>>>>>>>> +       if (!x86_user_pred_updated)
>>>>>>>>>> +               arch__user_pred_reg_mask();
>>>>>>>>>> +       return arch__pred_reg_bitmap_qwords(reg, qwords, false);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>>  const struct sample_reg *arch__sample_reg_masks(void)
>>>>>>>>>>  {
>>>>>>>>>> +       if (has_cap_simd_regs())
>>>>>>>>>> +               return sample_reg_masks_ext;
>>>>>>>>>>         return sample_reg_masks;
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>> -uint64_t arch__intr_reg_mask(void)
>>>>>>>>>> +static uint64_t __arch__reg_mask(u64 sample_type, u64 mask, bool has_simd_regs)
>>>>>>>>>>  {
>>>>>>>>>>         struct perf_event_attr attr = {
>>>>>>>>>> -               .type                   = PERF_TYPE_HARDWARE,
>>>>>>>>>> -               .config                 = PERF_COUNT_HW_CPU_CYCLES,
>>>>>>>>>> -               .sample_type            = PERF_SAMPLE_REGS_INTR,
>>>>>>>>>> -               .sample_regs_intr       = PERF_REG_EXTENDED_MASK,
>>>>>>>>>> -               .precise_ip             = 1,
>>>>>>>>>> -               .disabled               = 1,
>>>>>>>>>> -               .exclude_kernel         = 1,
>>>>>>>>>> +               .type                           = PERF_TYPE_HARDWARE,
>>>>>>>>>> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
>>>>>>>>>> +               .sample_type                    = sample_type,
>>>>>>>>>> +               .precise_ip                     = 1,
>>>>>>>>>> +               .disabled                       = 1,
>>>>>>>>>> +               .exclude_kernel                 = 1,
>>>>>>>>>> +               .sample_simd_regs_enabled       = has_simd_regs,
>>>>>>>>>>         };
>>>>>>>>>>         int fd;
>>>>>>>>>>         /*
>>>>>>>>>>          * In an unnamed union, init it here to build on older gcc versions
>>>>>>>>>>          */
>>>>>>>>>>         attr.sample_period = 1;
>>>>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>>>>>> +               attr.sample_regs_intr = mask;
>>>>>>>>>> +       else
>>>>>>>>>> +               attr.sample_regs_user = mask;
>>>>>>>>>>
>>>>>>>>>>         if (perf_pmus__num_core_pmus() > 1) {
>>>>>>>>>>                 struct perf_pmu *pmu = NULL;
>>>>>>>>>> @@ -318,13 +738,41 @@ uint64_t arch__intr_reg_mask(void)
>>>>>>>>>>         fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
>>>>>>>>>>         if (fd != -1) {
>>>>>>>>>>                 close(fd);
>>>>>>>>>> -               return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
>>>>>>>>>> +               return mask;
>>>>>>>>>>         }
>>>>>>>>>>
>>>>>>>>>> -       return PERF_REGS_MASK;
>>>>>>>>>> +       return 0;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t arch__intr_reg_mask(void)
>>>>>>>>>> +{
>>>>>>>>>> +       uint64_t mask = PERF_REGS_MASK;
>>>>>>>>>> +
>>>>>>>>>> +       if (has_cap_simd_regs()) {
>>>>>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
>>>>>>>>>> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
>>>>>>>>>> +                                        true);
>>>>>>>>> It's nice to label constant arguments like this with something like:
>>>>>>>>> /*has_simd_regs=*/true);
>>>>>>>>>
>>>>>>>>> Tools like clang-tidy even try to enforce the argument names match the comments.
>>>>>>>> Sure.
>>>>>>>>
>>>>>>>>
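[Editor's note: a tiny illustration of the argument-comment convention the reviewer suggests; the helper and values are hypothetical. clang-tidy's bugprone-argument-comment check can verify the comment matches the parameter name.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper; the parameter name is what the comment must match. */
static uint64_t reg_mask(uint64_t mask, bool has_simd_regs)
{
	return has_simd_regs ? mask : 0;
}

static uint64_t caller(void)
{
	/*
	 * Naming the literal at the call site documents intent; without the
	 * comment a reader must look up which parameter "true" binds to.
	 */
	return reg_mask(0xff, /*has_simd_regs=*/true);
}
```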
>>>>>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
>>>>>>>>>> +                                        BIT_ULL(PERF_REG_X86_SSP),
>>>>>>>>>> +                                        true);
>>>>>>>>>> +       } else
>>>>>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR, PERF_REG_EXTENDED_MASK, false);
>>>>>>>>>> +
>>>>>>>>>> +       return mask;
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>>  uint64_t arch__user_reg_mask(void)
>>>>>>>>>>  {
>>>>>>>>>> -       return PERF_REGS_MASK;
>>>>>>>>>> +       uint64_t mask = PERF_REGS_MASK;
>>>>>>>>>> +
>>>>>>>>>> +       if (has_cap_simd_regs()) {
>>>>>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
>>>>>>>>>> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
>>>>>>>>>> +                                        true);
>>>>>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
>>>>>>>>>> +                                        BIT_ULL(PERF_REG_X86_SSP),
>>>>>>>>>> +                                        true);
>>>>>>>>>> +       }
>>>>>>>>>> +
>>>>>>>>>> +       return mask;
>>>>>>>>> The code is repetitive here, could we refactor into a single function
>>>>>>>>> passing in a user or instr value?
>>>>>>>> Sure. Would extract the common part.
>>>>>>>>
>>>>>>>>
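[Editor's note: a sketch of the "extract the common part" the author agrees to, under the assumption that arch__intr_reg_mask() and arch__user_reg_mask() differ only in the sample type passed to the probe. The probe is stubbed out; the real tool issues perf_event_open(), and PERF_REGS_MASK here is an arbitrary illustrative value.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum sample_type { REGS_INTR, REGS_USER };

#define PERF_REGS_MASK 0xff0fffULL /* illustrative base GPR mask */

/* Stub for the perf_event_open() probe in the real tool. */
static bool probe_ok(enum sample_type st, uint64_t mask)
{
	(void)st;
	return mask != 0; /* pretend every non-empty mask is accepted */
}

/*
 * One helper for both arch__intr_reg_mask() and arch__user_reg_mask():
 * only the sample type differs, so pass it in.
 */
static uint64_t reg_mask(enum sample_type st, bool has_simd_regs,
			 uint64_t egpr_mask, uint64_t ssp_bit,
			 uint64_t ext_mask)
{
	uint64_t mask = PERF_REGS_MASK;

	if (has_simd_regs) {
		/* Probe eGPRs and SSP separately, as in the patch. */
		if (probe_ok(st, egpr_mask))
			mask |= egpr_mask;
		if (probe_ok(st, ssp_bit))
			mask |= ssp_bit;
	} else if (st == REGS_INTR && probe_ok(st, ext_mask)) {
		/* Legacy extended (XMM) mask is only probed for INTR. */
		mask |= ext_mask;
	}
	return mask;
}
```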
>>>>>>>>>>  }
>>>>>>>>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>>>>>>>>>> index 56ebefd075f2..5d1d90cf9488 100644
>>>>>>>>>> --- a/tools/perf/util/evsel.c
>>>>>>>>>> +++ b/tools/perf/util/evsel.c
>>>>>>>>>> @@ -1461,12 +1461,39 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
>>>>>>>>>>         if (opts->sample_intr_regs && !evsel->no_aux_samples &&
>>>>>>>>>>             !evsel__is_dummy_event(evsel)) {
>>>>>>>>>>                 attr->sample_regs_intr = opts->sample_intr_regs;
>>>>>>>>>> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_intr);
>>>>>>>>>> +               evsel__set_sample_bit(evsel, REGS_INTR);
>>>>>>>>>> +       }
>>>>>>>>>> +
>>>>>>>>>> +       if ((opts->sample_intr_vec_regs || opts->sample_intr_pred_regs) &&
>>>>>>>>>> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
>>>>>>>>>> +               /* A non-zero pred qwords implies the set of SIMD registers is in use */
>>>>>>>>>> +               if (opts->sample_pred_regs_qwords)
>>>>>>>>>> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
>>>>>>>>>> +               else
>>>>>>>>>> +                       attr->sample_simd_pred_reg_qwords = 1;
>>>>>>>>>> +               attr->sample_simd_vec_reg_intr = opts->sample_intr_vec_regs;
>>>>>>>>>> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
>>>>>>>>>> +               attr->sample_simd_pred_reg_intr = opts->sample_intr_pred_regs;
>>>>>>>>>>                 evsel__set_sample_bit(evsel, REGS_INTR);
>>>>>>>>>>         }
>>>>>>>>>>
>>>>>>>>>>         if (opts->sample_user_regs && !evsel->no_aux_samples &&
>>>>>>>>>>             !evsel__is_dummy_event(evsel)) {
>>>>>>>>>>                 attr->sample_regs_user |= opts->sample_user_regs;
>>>>>>>>>> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_user);
>>>>>>>>>> +               evsel__set_sample_bit(evsel, REGS_USER);
>>>>>>>>>> +       }
>>>>>>>>>> +
>>>>>>>>>> +       if ((opts->sample_user_vec_regs || opts->sample_user_pred_regs) &&
>>>>>>>>>> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
>>>>>>>>>> +               if (opts->sample_pred_regs_qwords)
>>>>>>>>>> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
>>>>>>>>>> +               else
>>>>>>>>>> +                       attr->sample_simd_pred_reg_qwords = 1;
>>>>>>>>>> +               attr->sample_simd_vec_reg_user = opts->sample_user_vec_regs;
>>>>>>>>>> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
>>>>>>>>>> +               attr->sample_simd_pred_reg_user = opts->sample_user_pred_regs;
>>>>>>>>>>                 evsel__set_sample_bit(evsel, REGS_USER);
>>>>>>>>>>         }
>>>>>>>>>>
>>>>>>>>>> diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
>>>>>>>>>> index cda1c620968e..0bd100392889 100644
>>>>>>>>>> --- a/tools/perf/util/parse-regs-options.c
>>>>>>>>>> +++ b/tools/perf/util/parse-regs-options.c
>>>>>>>>>> @@ -4,19 +4,139 @@
>>>>>>>>>>  #include <stdint.h>
>>>>>>>>>>  #include <string.h>
>>>>>>>>>>  #include <stdio.h>
>>>>>>>>>> +#include <linux/bitops.h>
>>>>>>>>>>  #include "util/debug.h"
>>>>>>>>>>  #include <subcmd/parse-options.h>
>>>>>>>>>>  #include "util/perf_regs.h"
>>>>>>>>>>  #include "util/parse-regs-options.h"
>>>>>>>>>> +#include "record.h"
>>>>>>>>>> +
>>>>>>>>>> +static void __print_simd_regs(bool intr, uint64_t simd_mask)
>>>>>>>>>> +{
>>>>>>>>>> +       const struct sample_reg *r = NULL;
>>>>>>>>>> +       uint64_t bitmap = 0;
>>>>>>>>>> +       u16 qwords = 0;
>>>>>>>>>> +       int reg_idx;
>>>>>>>>>> +
>>>>>>>>>> +       if (!simd_mask)
>>>>>>>>>> +               return;
>>>>>>>>>> +
>>>>>>>>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>>>>>>>>>> +               if (!(r->mask & simd_mask))
>>>>>>>>>> +                       continue;
>>>>>>>>>> +               reg_idx = fls64(r->mask) - 1;
>>>>>>>>>> +               if (intr)
>>>>>>>>>> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>>>> +               else
>>>>>>>>>> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>>>> +               if (bitmap)
>>>>>>>>>> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
>>>>>>>>>> +       }
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static void __print_pred_regs(bool intr, uint64_t pred_mask)
>>>>>>>>>> +{
>>>>>>>>>> +       const struct sample_reg *r = NULL;
>>>>>>>>>> +       uint64_t bitmap = 0;
>>>>>>>>>> +       u16 qwords = 0;
>>>>>>>>>> +       int reg_idx;
>>>>>>>>>> +
>>>>>>>>>> +       if (!pred_mask)
>>>>>>>>>> +               return;
>>>>>>>>>> +
>>>>>>>>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>>>>>>>>>> +               if (!(r->mask & pred_mask))
>>>>>>>>>> +                       continue;
>>>>>>>>>> +               reg_idx = fls64(r->mask) - 1;
>>>>>>>>>> +               if (intr)
>>>>>>>>>> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>>>> +               else
>>>>>>>>>> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>>>> +               if (bitmap)
>>>>>>>>>> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
>>>>>>>>>> +       }
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static bool __parse_simd_regs(struct record_opts *opts, char *s, bool intr)
>>>>>>>>>> +{
>>>>>>>>>> +       const struct sample_reg *r = NULL;
>>>>>>>>>> +       bool matched = false;
>>>>>>>>>> +       uint64_t bitmap = 0;
>>>>>>>>>> +       u16 qwords = 0;
>>>>>>>>>> +       int reg_idx;
>>>>>>>>>> +
>>>>>>>>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>>>>>>>>>> +               if (strcasecmp(s, r->name))
>>>>>>>>>> +                       continue;
>>>>>>>>>> +               if (!fls64(r->mask))
>>>>>>>>>> +                       continue;
>>>>>>>>>> +               reg_idx = fls64(r->mask) - 1;
>>>>>>>>>> +               if (intr)
>>>>>>>>>> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>>>> +               else
>>>>>>>>>> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>>>> +               matched = true;
>>>>>>>>>> +               break;
>>>>>>>>>> +       }
>>>>>>>>>> +
>>>>>>>>>> +       /* Only the highest qwords value is needed */
>>>>>>>>> I'm not following here. Does the bitmap need to handle gaps?
>>>>>>>> Currently no. In theory, the kernel supports user space sampling only
>>>>>>>> a subset of the SIMD registers, e.g., 0xff or 0xf0f for XMM registers
>>>>>>>> (HW supports 16 XMM registers in 64-bit mode), but the tool doesn't
>>>>>>>> support that, to avoid introducing too much complexity in perf tools.
>>>>>>>> Moreover, I don't think end users have such a requirement. In most
>>>>>>>> cases, users only know which kinds of SIMD registers their programs
>>>>>>>> use but usually don't know or care exactly which SIMD register is
>>>>>>>> used.
>>>>>>>>
>>>>>>>>
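[Editor's note: the reply above leans on two conventions worth making concrete: supported registers are always a contiguous low-bit mask (BIT_ULL(n) - 1), and the tool prints the range from the top bit, as __print_simd_regs() does with fls64(). The helpers below are a standalone sketch, not the perf-tool code.]

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Minimal fls64(): index of the highest set bit, 1-based; 0 for no bits. */
static int fls64_(uint64_t x)
{
	int n = 0;
	while (x) { n++; x >>= 1; }
	return n;
}

/* BIT_ULL(n) - 1: a contiguous mask covering registers 0..n-1, no gaps. */
static uint64_t low_regs_mask(int nregs)
{
	return (nregs >= 64) ? ~0ULL : (1ULL << nregs) - 1;
}

/* Mirrors the "%s0-%d" output: e.g. "XMM0-15" for a 16-register bitmap. */
static void format_reg_range(char *buf, size_t len, const char *name,
			     uint64_t bitmap)
{
	snprintf(buf, len, "%s0-%d", name, fls64_(bitmap) - 1);
}
```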
>>>>>>>>>> +       if (qwords > opts->sample_vec_regs_qwords) {
>>>>>>>>>> +               opts->sample_vec_regs_qwords = qwords;
>>>>>>>>>> +               if (intr)
>>>>>>>>>> +                       opts->sample_intr_vec_regs = bitmap;
>>>>>>>>>> +               else
>>>>>>>>>> +                       opts->sample_user_vec_regs = bitmap;
>>>>>>>>>> +       }
>>>>>>>>>> +
>>>>>>>>>> +       return matched;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static bool __parse_pred_regs(struct record_opts *opts, char *s, bool intr)
>>>>>>>>>> +{
>>>>>>>>>> +       const struct sample_reg *r = NULL;
>>>>>>>>>> +       bool matched = false;
>>>>>>>>>> +       uint64_t bitmap = 0;
>>>>>>>>>> +       u16 qwords = 0;
>>>>>>>>>> +       int reg_idx;
>>>>>>>>>> +
>>>>>>>>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>>>>>>>>>> +               if (strcasecmp(s, r->name))
>>>>>>>>>> +                       continue;
>>>>>>>>>> +               if (!fls64(r->mask))
>>>>>>>>>> +                       continue;
>>>>>>>>>> +               reg_idx = fls64(r->mask) - 1;
>>>>>>>>>> +               if (intr)
>>>>>>>>>> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>>>> +               else
>>>>>>>>>> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>>>> +               matched = true;
>>>>>>>>>> +               break;
>>>>>>>>>> +       }
>>>>>>>>>> +
>>>>>>>>>> +       /* Only the highest qwords value is needed */
>>>>>>>>> Again repetitive, could we have a single function?
>>>>>>>> Yes, I suppose the for loop at least can be extracted as a common function.
>>>>>>>>
>>>>>>>>
>>>>>>>>>> +       if (qwords > opts->sample_pred_regs_qwords) {
>>>>>>>>>> +               opts->sample_pred_regs_qwords = qwords;
>>>>>>>>>> +               if (intr)
>>>>>>>>>> +                       opts->sample_intr_pred_regs = bitmap;
>>>>>>>>>> +               else
>>>>>>>>>> +                       opts->sample_user_pred_regs = bitmap;
>>>>>>>>>> +       }
>>>>>>>>>> +
>>>>>>>>>> +       return matched;
>>>>>>>>>> +}
>>>>>>>>>>
>>>>>>>>>>  static int
>>>>>>>>>>  __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>>>>>  {
>>>>>>>>>>         uint64_t *mode = (uint64_t *)opt->value;
>>>>>>>>>>         const struct sample_reg *r = NULL;
>>>>>>>>>> +       struct record_opts *opts;
>>>>>>>>>>         char *s, *os = NULL, *p;
>>>>>>>>>> -       int ret = -1;
>>>>>>>>>> +       bool has_simd_regs = false;
>>>>>>>>>>         uint64_t mask;
>>>>>>>>>> +       uint64_t simd_mask;
>>>>>>>>>> +       uint64_t pred_mask;
>>>>>>>>>> +       int ret = -1;
>>>>>>>>>>
>>>>>>>>>>         if (unset)
>>>>>>>>>>                 return 0;
>>>>>>>>>> @@ -27,10 +147,17 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>>>>>         if (*mode)
>>>>>>>>>>                 return -1;
>>>>>>>>>>
>>>>>>>>>> -       if (intr)
>>>>>>>>>> +       if (intr) {
>>>>>>>>>> +               opts = container_of(opt->value, struct record_opts, sample_intr_regs);
>>>>>>>>>>                 mask = arch__intr_reg_mask();
>>>>>>>>>> -       else
>>>>>>>>>> +               simd_mask = arch__intr_simd_reg_mask();
>>>>>>>>>> +               pred_mask = arch__intr_pred_reg_mask();
>>>>>>>>>> +       } else {
>>>>>>>>>> +               opts = container_of(opt->value, struct record_opts, sample_user_regs);
>>>>>>>>>>                 mask = arch__user_reg_mask();
>>>>>>>>>> +               simd_mask = arch__user_simd_reg_mask();
>>>>>>>>>> +               pred_mask = arch__user_pred_reg_mask();
>>>>>>>>>> +       }
>>>>>>>>>>
>>>>>>>>>>         /* str may be NULL in case no arg is passed to -I */
>>>>>>>>>>         if (str) {
>>>>>>>>>> @@ -50,10 +177,24 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>>>>>                                         if (r->mask & mask)
>>>>>>>>>>                                                 fprintf(stderr, "%s ", r->name);
>>>>>>>>>>                                 }
>>>>>>>>>> +                               __print_simd_regs(intr, simd_mask);
>>>>>>>>>> +                               __print_pred_regs(intr, pred_mask);
>>>>>>>>>>                                 fputc('\n', stderr);
>>>>>>>>>>                                 /* just printing available regs */
>>>>>>>>>>                                 goto error;
>>>>>>>>>>                         }
>>>>>>>>>> +
>>>>>>>>>> +                       if (simd_mask) {
>>>>>>>>>> +                               has_simd_regs = __parse_simd_regs(opts, s, intr);
>>>>>>>>>> +                               if (has_simd_regs)
>>>>>>>>>> +                                       goto next;
>>>>>>>>>> +                       }
>>>>>>>>>> +                       if (pred_mask) {
>>>>>>>>>> +                               has_simd_regs = __parse_pred_regs(opts, s, intr);
>>>>>>>>>> +                               if (has_simd_regs)
>>>>>>>>>> +                                       goto next;
>>>>>>>>>> +                       }
>>>>>>>>>> +
>>>>>>>>>>                         for (r = arch__sample_reg_masks(); r->name; r++) {
>>>>>>>>>>                                 if ((r->mask & mask) && !strcasecmp(s, r->name))
>>>>>>>>>>                                         break;
>>>>>>>>>> @@ -65,7 +206,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>>>>>                         }
>>>>>>>>>>
>>>>>>>>>>                         *mode |= r->mask;
>>>>>>>>>> -
>>>>>>>>>> +next:
>>>>>>>>>>                         if (!p)
>>>>>>>>>>                                 break;
>>>>>>>>>>
>>>>>>>>>> @@ -75,7 +216,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>>>>>         ret = 0;
>>>>>>>>>>
>>>>>>>>>>         /* default to all possible regs */
>>>>>>>>>> -       if (*mode == 0)
>>>>>>>>>> +       if (*mode == 0 && !has_simd_regs)
>>>>>>>>>>                 *mode = mask;
>>>>>>>>>>  error:
>>>>>>>>>>         free(os);
>>>>>>>>>> diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
>>>>>>>>>> index 66b666d9ce64..fb0366d050cf 100644
>>>>>>>>>> --- a/tools/perf/util/perf_event_attr_fprintf.c
>>>>>>>>>> +++ b/tools/perf/util/perf_event_attr_fprintf.c
>>>>>>>>>> @@ -360,6 +360,12 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
>>>>>>>>>>         PRINT_ATTRf(aux_start_paused, p_unsigned);
>>>>>>>>>>         PRINT_ATTRf(aux_pause, p_unsigned);
>>>>>>>>>>         PRINT_ATTRf(aux_resume, p_unsigned);
>>>>>>>>>> +       PRINT_ATTRf(sample_simd_pred_reg_qwords, p_unsigned);
>>>>>>>>>> +       PRINT_ATTRf(sample_simd_pred_reg_intr, p_hex);
>>>>>>>>>> +       PRINT_ATTRf(sample_simd_pred_reg_user, p_hex);
>>>>>>>>>> +       PRINT_ATTRf(sample_simd_vec_reg_qwords, p_unsigned);
>>>>>>>>>> +       PRINT_ATTRf(sample_simd_vec_reg_intr, p_hex);
>>>>>>>>>> +       PRINT_ATTRf(sample_simd_vec_reg_user, p_hex);
>>>>>>>>>>
>>>>>>>>>>         return ret;
>>>>>>>>>>  }
>>>>>>>>>> diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
>>>>>>>>>> index 44b90bbf2d07..e8a9fabc92e6 100644
>>>>>>>>>> --- a/tools/perf/util/perf_regs.c
>>>>>>>>>> +++ b/tools/perf/util/perf_regs.c
>>>>>>>>>> @@ -11,6 +11,11 @@ int __weak arch_sdt_arg_parse_op(char *old_op __maybe_unused,
>>>>>>>>>>         return SDT_ARG_SKIP;
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>> +bool __weak arch_has_simd_regs(u64 mask __maybe_unused)
>>>>>>>>>> +{
>>>>>>>>>> +       return false;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>>  uint64_t __weak arch__intr_reg_mask(void)
>>>>>>>>>>  {
>>>>>>>>>>         return 0;
>>>>>>>>>> @@ -21,6 +26,50 @@ uint64_t __weak arch__user_reg_mask(void)
>>>>>>>>>>         return 0;
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>> +uint64_t __weak arch__intr_simd_reg_mask(void)
>>>>>>>>>> +{
>>>>>>>>>> +       return 0;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t __weak arch__user_simd_reg_mask(void)
>>>>>>>>>> +{
>>>>>>>>>> +       return 0;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t __weak arch__intr_pred_reg_mask(void)
>>>>>>>>>> +{
>>>>>>>>>> +       return 0;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t __weak arch__user_pred_reg_mask(void)
>>>>>>>>>> +{
>>>>>>>>>> +       return 0;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t __weak arch__intr_simd_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
>>>>>>>>>> +{
>>>>>>>>>> +       *qwords = 0;
>>>>>>>>>> +       return 0;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t __weak arch__user_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>>>>>>>>>> +{
>>>>>>>>>> +       *qwords = 0;
>>>>>>>>>> +       return 0;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t __weak arch__intr_pred_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
>>>>>>>>>> +{
>>>>>>>>>> +       *qwords = 0;
>>>>>>>>>> +       return 0;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t __weak arch__user_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>>>>>>>>>> +{
>>>>>>>>>> +       *qwords = 0;
>>>>>>>>>> +       return 0;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>>  static const struct sample_reg sample_reg_masks[] = {
>>>>>>>>>>         SMPL_REG_END
>>>>>>>>>>  };
>>>>>>>>>> @@ -30,6 +79,16 @@ const struct sample_reg * __weak arch__sample_reg_masks(void)
>>>>>>>>>>         return sample_reg_masks;
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>> +const struct sample_reg * __weak arch__sample_simd_reg_masks(void)
>>>>>>>>>> +{
>>>>>>>>>> +       return sample_reg_masks;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +const struct sample_reg * __weak arch__sample_pred_reg_masks(void)
>>>>>>>>>> +{
>>>>>>>>>> +       return sample_reg_masks;
>>>>>>>>>> +}
>>>>>>>>> Thinking out loud. I wonder if there is a way to hide the weak
>>>>>>>>> functions. It seems the support is tied to PMUs, particularly core
>>>>>>>>> PMUs, perhaps we can push things into pmu and arch pmu code. Then we
>>>>>>>>> ask the PMU to parse the register strings, set up the perf_event_attr,
>>>>>>>>> etc. I'm somewhat scared these functions will be used on the report
>>>>>>>>> rather than record side of things, thereby breaking perf.data support
>>>>>>>>> when the host kernel does or doesn't have the SIMD support.
>>>>>>>> Ian, I don't quite follow you.
>>>>>>>>
>>>>>>>> I don't quite understand what we should do to "push things into pmu and
>>>>>>>> arch pmu code". The current SIMD register support follows the same
>>>>>>>> approach as the general register support. If we intend to change the
>>>>>>>> approach entirely, we'd better have an independent patch-set.
>>>>>>>>
>>>>>>>> Why would these functions break the perf.data report? perf-report checks
>>>>>>>> whether the PERF_SAMPLE_REGS_ABI_SIMD flag is set for each record; only
>>>>>>>> if the flag is set (indicating SIMD register data is appended to the
>>>>>>>> record) does perf-report try to parse the SIMD register data.
>>>>>>> Thanks Dapeng, sorry I wasn't clear. So, I've landed clean ups to
>>>>>>> remove weak symbols like:
>>>>>>> https://lore.kernel.org/lkml/20250724163302.596743-21-irogers@google.com/#t
>>>>>>>
>>>>>>> For these patches what I'm imagining is that there is a Nova Lake
>>>>>>> generated perf.data file. Using perf report, script, etc. on the Nova
>>>>>>> Lake should expose all of the same mask, qword, etc. values as when
>>>>>>> the perf.data was generated and so things will work. If the perf.data
>>>>>>> file was taken to say my Alderlake then what will happen? Generally
>>>>>>> using the arch directory and weak symbols is a code smell that cross
>>>>>>> platform things are going to break - there should be sufficient data
>>>>>>> in the event and the perf_event_attr to fully decode what's going on.
>>>>>>> Sometimes tying things to a PMU name can avoid the use of the arch
>>>>>>> directory. We were able to avoid the arch directory to a good extent
>>>>>>> for the TPEBS code, even though it is a very modern Intel feature.
>>>>>> I see.
>>>>>>
>>>>>> But the sampling support for SIMD registers is different from the sample
>>>>>> weight processing in the patch
>>>>>> https://lore.kernel.org/lkml/20250724163302.596743-21-irogers@google.com/#t.
>>>>>> Each arch may support different kinds of SIMD registers, and furthermore
>>>>>> each kind of SIMD register may have a different register count and
>>>>>> register width. It's quite hard to come up with common functions or fields
>>>>>> to represent the names and attributes of these arch-specific SIMD
>>>>>> registers. This arch-specific information can only be provided by
>>>>>> arch-specific code. So it looks like the __weak functions are still the
>>>>>> easiest way to implement this.
>>>>>>
>>>>>> I don't think perf.data parsing would break when moving from one platform
>>>>>> to another platform of the same arch, e.g., from Nova Lake to Alder Lake.
>>>>>> To indicate the presence of SIMD registers in the record data, a new ABI
>>>>>> flag "PERF_SAMPLE_REGS_ABI_SIMD" is introduced. If the perf tool on the
>>>>>> 2nd platform is new enough to recognize this new flag, the SIMD register
>>>>>> data is parsed correctly. Even if the perf tool is old and has no SIMD
>>>>>> register support, the SIMD register data is just silently ignored and
>>>>>> should not break parsing.
>>>>> That's good to know. I'm confused then why these functions can't just
>>>>> be within the arch directory? For example, we don't expose the
>>>>> intel-pt PMU code in the common code except for the parsing parts. A
>>>>> lot of that is handled by the default perf_event_attr initialization
>>>>> that every PMU can have its own variant of:
>>>>> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/pmu.h?h=perf-tools-next#n123
>>>> I see. From my point of view, there seems to be no essential difference
>>>> between a function pointer and a __weak function, and it looks hard to find
>>>> a common data structure to hold all these function pointers, which need to
>>>> be called in different places, like register name parsing, register data
>>>> dumping, etc.
>>>>
>>>>
>>>>> Perhaps this is all just evidence of tech debt in the perf_regs.c code
>>>>> :-/ The bit that's relevant to the patch here is that I think this is
>>>>> adding to the tech debt problem as 11 more functions are added to
>>>>> perf_regs.h.
>>>> Yeah, 11 new __weak functions seem too many. We may merge functions of the
>>>> same kind, like merging *_simd_reg_mask() and *_pred_reg_mask() into a
>>>> single function with a type argument; then the number of newly added __weak
>>>> functions could shrink by half.
>>> There could be a good reason for 11 weak functions :-) In the
>>> perf_event.h you've added to the sample event:
>>> ```
>>> +        *        u64                   regs[weight(mask)];
>>> +        *        struct {
>>> +        *              u16 nr_vectors;
>>> +        *              u16 vector_qwords;
>>> +        *              u16 nr_pred;
>>> +        *              u16 pred_qwords;
>>> +        *              u64 data[nr_vectors * vector_qwords + nr_pred
>>> * pred_qwords];
>>> +        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
>>> +        *      } && PERF_SAMPLE_REGS_USER
>>> ```
>>> so these things are readable/writable outside of builds with arch/x86
>>> compiled in, which is why it seems odd that there needs to be arch
>>> code in the common code to handle them. Similar to how I needed to get
>>> the retirement latency parsing out of the arch/x86 directory as
>>> potentially you could be looking at a perf.data file with retirement
>>> latencies in it on a non-x86 platform.
>> Ian, I'm not sure I fully get your point. If not, please correct me.
>>
>> Although these newly introduced fields are generic and exist on all
>> architectures, they are not enough to get all the information necessary to
>> dump or parse the SIMD registers, e.g., the SIMD register name.
>>
>> Let's take dumping the sampled values of SIMD registers as an example.
>> There could be different kinds of SIMD registers on different archs,
>> like XMM/YMM/ZMM on x86 and V-registers/Z-registers on ARM.
>>
>> Currently we only know the register count and width from the generic
>> fields; we have no way to directly know the exact name a given SIMD
>> register corresponds to. We have to involve an arch-specific function to
>> figure that out and then print it.
>>
>> At least for now, it looks like we still need these arch-specific
>> functions ...
> Thanks Dapeng. I started by thinking out loud, so I'm not saying this
> is something to necessarily fix in the patch series but it probably is
> something that needs to be fixed.
>
> You mention that different archs have different registers and so we
> need different routines for those archs, implying weak symbols, etc.
> We do actually have generic register dumping code in get_dwarf_regstr:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/dwarf-regs.c?h=perf-tools-next#n33
> It takes the dwarf register number, the ELF Ehdr e_machine and for the
> purposes of csky the e_flags. If you want the e_machine for the perf
> binary itself (such as in perf record when you don't yet have a
> perf.data file) there is an EM_HOST value:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/include/dwarf-regs.h?h=perf-tools-next#n27
> Perf has historically used a CPUID string, but I'd like to deprecate
> that in favor of just using e_machine (and possibly e_flags) values.
> We should probably have CPUID string to e_machine conversion utility
> functions and remove cpuid from the perf_env:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/env.h?h=perf-tools-next#n67
> but anyway, my point isn't about the e_machine values.
>
> What I'm trying to say is that weak symbols and code in arch
> inherently means the cross platform development will break. For
> example, before:
> https://lore.kernel.org/lkml/20250724163302.596743-21-irogers@google.com/
> perf_parse_sample_weight just simply didn't exist outside of PowerPC
> and x86. This meant that the part of the perf event in the perf.data
> containing the sample weights couldn't be parsed on say an ARM64 build
> of perf. This meant the values couldn't even be dumped in perf script.
> The values are, however, described in the cross platform perf sample
> event format, much as the SIMD registers are here.
>
> It seems as we have from a perf.data file at least a CPUID string from
> the header features, a perf_event_attr and the register number, we
> should be able to do something like get_dwarf_regstr. Such a function
> wouldn't be in the arch directory as we wouldn't want to interpret
> registers in events just on x86 platforms (as with the retirement
> latency). If we're not able to do this then there seems to be
> something wrong with the SIMD change and perhaps we need to capture
> more information in the perf.data file header.

Thanks Ian for your detailed explanation. I understand your point now.

I originally thought there would be no requirement to parse a perf.data
file on a machine with a totally different arch. But it seems there is, as
you said.

Then I suppose we need to do the same thing for
perf_reg_value()/perf_simd_reg_value() as perf_reg_name() does, but
currently the "arch" string comes from the perf_env__arch() helper, which
gives the arch perf is running on instead of the arch that was sampled.

Anyway, I think we can make the removal of the __weak functions the 1st
step. As for replacing cpuid or env->arch with EM_HOST or something else
(I'm not sure how complex it would be, but I suppose it should not be
simple), we'd better have an independent patch-set to implement it since it
has no direct relationship with the current SIMD register sampling support.
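As a rough sketch of what a get_dwarf_regstr-style helper could look like
(all function names and naming rules here are hypothetical, keyed on the
perf.data header's e_machine rather than the arch perf was built for, purely
to illustrate the idea):

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* ELF machine values, as defined in <elf.h> */
#define EM_X86_64  62
#define EM_AARCH64 183

/*
 * Hypothetical cross-platform SIMD register naming helper. The register
 * width (in qwords) comes from the sample's vector_qwords field, so the
 * XMM/YMM/ZMM vs V/Z distinction becomes a pure data question and no
 * arch/ directory code is needed on the report side.
 */
static const char *simd_reg_name(unsigned int e_machine, int reg,
				 unsigned int qwords, char *buf, size_t len)
{
	switch (e_machine) {
	case EM_X86_64:
		/* 2 qwords = XMM, 4 = YMM, 8 = ZMM */
		snprintf(buf, len, "%s%d",
			 qwords <= 2 ? "XMM" : qwords <= 4 ? "YMM" : "ZMM",
			 reg);
		return buf;
	case EM_AARCH64:
		/* SVE Z registers are wider than 2 qwords, else NEON V */
		snprintf(buf, len, "%c%d", qwords > 2 ? 'Z' : 'V', reg);
		return buf;
	default:
		return NULL;	/* unknown machine: caller prints raw index */
	}
}
```

The point being that the lookup is table/data driven and compiles into every
perf build, so a perf.data file recorded on x86 stays decodable on ARM64.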


>
> Thanks,
> Ian
>
>>> Thanks,
>>> Ian
>>>
>>>>> Thanks,
>>>>> Ian
>>>>>
>>>>>>> Thanks,
>>>>>>> Ian
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Ian
>>>>>>>>>
>>>>>>>>>> +
>>>>>>>>>>  const char *perf_reg_name(int id, const char *arch)
>>>>>>>>>>  {
>>>>>>>>>>         const char *reg_name = NULL;
>>>>>>>>>> diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
>>>>>>>>>> index f2d0736d65cc..bce9c4cfd1bf 100644
>>>>>>>>>> --- a/tools/perf/util/perf_regs.h
>>>>>>>>>> +++ b/tools/perf/util/perf_regs.h
>>>>>>>>>> @@ -24,9 +24,20 @@ enum {
>>>>>>>>>>  };
>>>>>>>>>>
>>>>>>>>>>  int arch_sdt_arg_parse_op(char *old_op, char **new_op);
>>>>>>>>>> +bool arch_has_simd_regs(u64 mask);
>>>>>>>>>>  uint64_t arch__intr_reg_mask(void);
>>>>>>>>>>  uint64_t arch__user_reg_mask(void);
>>>>>>>>>>  const struct sample_reg *arch__sample_reg_masks(void);
>>>>>>>>>> +const struct sample_reg *arch__sample_simd_reg_masks(void);
>>>>>>>>>> +const struct sample_reg *arch__sample_pred_reg_masks(void);
>>>>>>>>>> +uint64_t arch__intr_simd_reg_mask(void);
>>>>>>>>>> +uint64_t arch__user_simd_reg_mask(void);
>>>>>>>>>> +uint64_t arch__intr_pred_reg_mask(void);
>>>>>>>>>> +uint64_t arch__user_pred_reg_mask(void);
>>>>>>>>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>>>>
>>>>>>>>>>  const char *perf_reg_name(int id, const char *arch);
>>>>>>>>>>  int perf_reg_value(u64 *valp, struct regs_dump *regs, int id);
>>>>>>>>>> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
>>>>>>>>>> index ea3a6c4657ee..825ffb4cc53f 100644
>>>>>>>>>> --- a/tools/perf/util/record.h
>>>>>>>>>> +++ b/tools/perf/util/record.h
>>>>>>>>>> @@ -59,7 +59,13 @@ struct record_opts {
>>>>>>>>>>         unsigned int  user_freq;
>>>>>>>>>>         u64           branch_stack;
>>>>>>>>>>         u64           sample_intr_regs;
>>>>>>>>>> +       u64           sample_intr_vec_regs;
>>>>>>>>>>         u64           sample_user_regs;
>>>>>>>>>> +       u64           sample_user_vec_regs;
>>>>>>>>>> +       u16           sample_pred_regs_qwords;
>>>>>>>>>> +       u16           sample_vec_regs_qwords;
>>>>>>>>>> +       u16           sample_intr_pred_regs;
>>>>>>>>>> +       u16           sample_user_pred_regs;
>>>>>>>>>>         u64           default_interval;
>>>>>>>>>>         u64           user_interval;
>>>>>>>>>>         size_t        auxtrace_snapshot_size;
>>>>>>>>>> --
>>>>>>>>>> 2.34.1
>>>>>>>>>>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 07/19] perf: Add sampling support for SIMD registers
  2025-12-05 11:07   ` Peter Zijlstra
@ 2025-12-08  5:24     ` Mi, Dapeng
  0 siblings, 0 replies; 86+ messages in thread
From: Mi, Dapeng @ 2025-12-08  5:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang


On 12/5/2025 7:07 PM, Peter Zijlstra wrote:
> On Wed, Dec 03, 2025 at 02:54:48PM +0800, Dapeng Mi wrote:
>
>> @@ -545,6 +547,25 @@ struct perf_event_attr {
>>  	__u64	sig_data;
>>  
>>  	__u64	config3; /* extension of config2 */
>> +
>> +
>> +	/*
>> +	 * Defines set of SIMD registers to dump on samples.
>> +	 * The sample_simd_regs_enabled !=0 implies the
>> +	 * set of SIMD registers is used to config all SIMD registers.
>> +	 * If !sample_simd_regs_enabled, sample_regs_XXX may be used to
>> +	 * config some SIMD registers on X86.
>> +	 */
>> +	union {
>> +		__u16 sample_simd_regs_enabled;
>> +		__u16 sample_simd_pred_reg_qwords;
>> +	};
>> +	__u32 sample_simd_pred_reg_intr;
>> +	__u32 sample_simd_pred_reg_user;
>> +	__u16 sample_simd_vec_reg_qwords;
>> +	__u64 sample_simd_vec_reg_intr;
>> +	__u64 sample_simd_vec_reg_user;
>> +	__u32 __reserved_4;
>>  };
> This is poorly aligned and causes holes.
>
> This:
>
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index d292f96bc06f..2deb8dd0ca37 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -545,6 +545,14 @@ struct perf_event_attr {
>  	__u64	sig_data;
>  
>  	__u64	config3; /* extension of config2 */
> +
> +	__u16	sample_simd_pred_reg_qwords;
> +	__u32	sample_simd_pred_reg_intr;
> +	__u32	sample_simd_pred_reg_user;
> +	__u16	sample_simd_vec_reg_qwords;
> +	__u64	sample_simd_vec_reg_intr;
> +	__u64	sample_simd_vec_reg_user;
> +	__u32	__reserved_4;
>  };
>  
>  /*
>
> results in:
>
>         __u64                      config3;              /*   128     8 */
>         __u16                      sample_simd_pred_reg_qwords; /*   136     2 */
>
>         /* XXX 2 bytes hole, try to pack */
>
>         __u32                      sample_simd_pred_reg_intr; /*   140     4 */
>         __u32                      sample_simd_pred_reg_user; /*   144     4 */
>         __u16                      sample_simd_vec_reg_qwords; /*   148     2 */
>
>         /* XXX 2 bytes hole, try to pack */
>
>         __u64                      sample_simd_vec_reg_intr; /*   152     8 */
>         __u64                      sample_simd_vec_reg_user; /*   160     8 */
>         __u32                      __reserved_4;         /*   168     4 */
>
>
>
> A better layout might be:
>
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index d292f96bc06f..f72707e9df68 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -545,6 +545,15 @@ struct perf_event_attr {
>  	__u64	sig_data;
>  
>  	__u64	config3; /* extension of config2 */
> +
> +	__u16	sample_simd_pred_reg_qwords;
> +	__u16	sample_simd_vec_reg_qwords;
> +	__u32	__reserved_4;
> +
> +	__u32	sample_simd_pred_reg_intr;
> +	__u32	sample_simd_pred_reg_user;
> +	__u64	sample_simd_vec_reg_intr;
> +	__u64	sample_simd_vec_reg_user;
>  };
>  
>  /*
>
> such that:
>
>         __u64                      config3;              /*   128     8 */
>         __u16                      sample_simd_pred_reg_qwords; /*   136     2 */
>         __u16                      sample_simd_vec_reg_qwords; /*   138     2 */
>         __u32                      __reserved_4;         /*   140     4 */
>         __u32                      sample_simd_pred_reg_intr; /*   144     4 */
>         __u32                      sample_simd_pred_reg_user; /*   148     4 */
>         __u64                      sample_simd_vec_reg_intr; /*   152     8 */
>         __u64                      sample_simd_vec_reg_user; /*   160     8 */
>
Sure. Thanks.
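For what it's worth, the hole-free property of the suggested layout can be
checked at compile time. A standalone sketch, using <stdint.h> typedefs in
place of the kernel's __u16/__u32/__u64 and modeling only the tail of the
struct (valid because config3 ends at offset 136, a multiple of 8, so the
relative offsets below match the real ones):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Just the new tail of perf_event_attr, starting right after config3 */
struct simd_attr_tail {
	uint16_t sample_simd_pred_reg_qwords;
	uint16_t sample_simd_vec_reg_qwords;
	uint32_t __reserved_4;
	uint32_t sample_simd_pred_reg_intr;
	uint32_t sample_simd_pred_reg_user;
	uint64_t sample_simd_vec_reg_intr;
	uint64_t sample_simd_vec_reg_user;
};

/* No padding anywhere: every offset is exactly the running sum of sizes */
_Static_assert(offsetof(struct simd_attr_tail, sample_simd_vec_reg_qwords) == 2, "");
_Static_assert(offsetof(struct simd_attr_tail, __reserved_4) == 4, "");
_Static_assert(offsetof(struct simd_attr_tail, sample_simd_pred_reg_intr) == 8, "");
_Static_assert(offsetof(struct simd_attr_tail, sample_simd_pred_reg_user) == 12, "");
_Static_assert(offsetof(struct simd_attr_tail, sample_simd_vec_reg_intr) == 16, "");
_Static_assert(offsetof(struct simd_attr_tail, sample_simd_vec_reg_user) == 24, "");
_Static_assert(sizeof(struct simd_attr_tail) == 32, "hole-free, 8-byte aligned");
```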


>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 07/19] perf: Add sampling support for SIMD registers
  2025-12-05 11:40   ` Peter Zijlstra
@ 2025-12-08  6:00     ` Mi, Dapeng
  0 siblings, 0 replies; 86+ messages in thread
From: Mi, Dapeng @ 2025-12-08  6:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang


On 12/5/2025 7:40 PM, Peter Zijlstra wrote:
> On Wed, Dec 03, 2025 at 02:54:48PM +0800, Dapeng Mi wrote:
>
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 3e9c48fa2202..b19de038979e 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -7469,6 +7469,50 @@ perf_output_sample_regs(struct perf_output_handle *handle,
>>  	}
>>  }
>>  
>> +static void
>> +perf_output_sample_simd_regs(struct perf_output_handle *handle,
>> +			     struct perf_event *event,
>> +			     struct pt_regs *regs,
>> +			     u64 mask, u16 pred_mask)
>> +{
>> +	u16 pred_qwords = event->attr.sample_simd_pred_reg_qwords;
>> +	u16 vec_qwords = event->attr.sample_simd_vec_reg_qwords;
>> +	u64 pred_bitmap = pred_mask;
>> +	u64 bitmap = mask;
>> +	u16 nr_vectors;
>> +	u16 nr_pred;
>> +	int bit;
>> +	u64 val;
>> +	u16 i;
>> +
>> +	nr_vectors = hweight64(bitmap);
>> +	nr_pred = hweight64(pred_bitmap);
>> +
>> +	perf_output_put(handle, nr_vectors);
>> +	perf_output_put(handle, vec_qwords);
>> +	perf_output_put(handle, nr_pred);
>> +	perf_output_put(handle, pred_qwords);
>> +
>> +	if (nr_vectors) {
>> +		for_each_set_bit(bit, (unsigned long *)&bitmap,
> This isn't right. Yes we do this all the time in the x86 code, but there
> we can assume little-endian byte order. This is core code and is also
> used on big-endian systems where this is very much broken.

Oh, yes. I overlooked endianness. Will fix it in the next version. Thanks.
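For reference, a minimal userspace sketch of an endian-safe way to walk the
set bits of a u64 without the (unsigned long *) cast (in the kernel the
equivalent would be a shift loop or a proper DECLARE_BITMAP; the helper name
here is made up):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Casting &u64 to (unsigned long *) is only safe when unsigned long is
 * 64 bits or the machine is little-endian: on a big-endian 32-bit system
 * the first unsigned long holds bits 63..32, so for_each_set_bit() visits
 * the bits in a scrambled order. Iterating the 64-bit value directly
 * sidesteps the problem entirely.
 */
static int collect_set_bits(uint64_t bitmap, int *bits, int max)
{
	int n = 0;

	while (bitmap && n < max) {
		/* __builtin_ctzll: index of lowest set bit, endian-independent */
		int bit = __builtin_ctzll(bitmap);

		bits[n++] = bit;
		bitmap &= bitmap - 1;	/* clear the lowest set bit */
	}
	return n;
}
```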


>
>> +				 sizeof(bitmap) * BITS_PER_BYTE) {
>> +			for (i = 0; i < vec_qwords; i++) {
>> +				val = perf_simd_reg_value(regs, bit, i, false);
>> +				perf_output_put(handle, val);
>> +			}
>> +		}
>> +	}
>> +	if (nr_pred) {
>> +		for_each_set_bit(bit, (unsigned long *)&pred_bitmap,
>> +				 sizeof(pred_bitmap) * BITS_PER_BYTE) {
>> +			for (i = 0; i < pred_qwords; i++) {
>> +				val = perf_simd_reg_value(regs, bit, i, true);
>> +				perf_output_put(handle, val);
>> +			}
>> +		}
>> +	}
>> +}

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 08/19] perf/x86: Enable XMM sampling using sample_simd_vec_reg_* fields
  2025-12-05 11:25   ` Peter Zijlstra
@ 2025-12-08  6:10     ` Mi, Dapeng
  0 siblings, 0 replies; 86+ messages in thread
From: Mi, Dapeng @ 2025-12-08  6:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang


On 12/5/2025 7:25 PM, Peter Zijlstra wrote:
> On Wed, Dec 03, 2025 at 02:54:49PM +0800, Dapeng Mi wrote:
>
>> diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
>> index 7c9d2bb3833b..c3862e5fdd6d 100644
>> --- a/arch/x86/include/uapi/asm/perf_regs.h
>> +++ b/arch/x86/include/uapi/asm/perf_regs.h
>> @@ -55,4 +55,21 @@ enum perf_event_x86_regs {
>>  
>>  #define PERF_REG_EXTENDED_MASK	(~((1ULL << PERF_REG_X86_XMM0) - 1))
>>  
>> +enum {
>> +	PERF_REG_X86_XMM,
>> +	PERF_REG_X86_MAX_SIMD_REGS,
>> +};
>> +
>> +enum {
>> +	PERF_X86_SIMD_XMM_REGS      = 16,
>> +	PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_XMM_REGS,
>> +};
>> +
>> +#define PERF_X86_SIMD_VEC_MASK		GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
>> +
>> +enum {
>> +	PERF_X86_XMM_QWORDS      = 2,
>> +	PERF_X86_SIMD_QWORDS_MAX = PERF_X86_XMM_QWORDS,
>> +};
>> +
>>  #endif /* _ASM_X86_PERF_REGS_H */
> I don't understand this bit -- the next few patches add to it for YMM
> and ZMM, but what's the point? I don't see why this is needed at all,
> let alone why it needs to be UABI.

Currently these definitions are only used by the user-space perf tool. Let
me remove them from the uapi header perf_regs.h.


>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 12/19] perf/x86: Enable eGPRs sampling using sample_regs_* fields
  2025-12-05 12:16   ` Peter Zijlstra
@ 2025-12-08  6:11     ` Mi, Dapeng
  0 siblings, 0 replies; 86+ messages in thread
From: Mi, Dapeng @ 2025-12-08  6:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang


On 12/5/2025 8:16 PM, Peter Zijlstra wrote:
> On Wed, Dec 03, 2025 at 02:54:53PM +0800, Dapeng Mi wrote:
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> This patch enables sampling of APX eGPRs (R16 ~ R31) via the
>> sample_regs_* fields.
>>
>> To sample eGPRs, the sample_simd_regs_enabled field must be set. This
>> allows the spare space (reclaimed from the original XMM space) in the
>> sample_regs_* fields to be used for representing eGPRs.
>>
>> The perf_reg_value() function needs to check if the
>> PERF_SAMPLE_REGS_ABI_SIMD flag is set first, and then determine whether
>> to output eGPRs or legacy XMM registers to userspace.
>>
>> The perf_reg_validate() function is enhanced to validate the eGPRs bitmap
>> by adding a new argument, "simd_enabled".
>>
>> Currently, eGPRs sampling is only supported on the x86_64 architecture, as
>> APX is only available on x86_64 platforms.
>>
>> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> ---
>>  arch/arm/kernel/perf_regs.c           |  2 +-
>>  arch/arm64/kernel/perf_regs.c         |  2 +-
>>  arch/csky/kernel/perf_regs.c          |  2 +-
>>  arch/loongarch/kernel/perf_regs.c     |  2 +-
>>  arch/mips/kernel/perf_regs.c          |  2 +-
>>  arch/parisc/kernel/perf_regs.c        |  2 +-
>>  arch/powerpc/perf/perf_regs.c         |  2 +-
>>  arch/riscv/kernel/perf_regs.c         |  2 +-
>>  arch/s390/kernel/perf_regs.c          |  2 +-
> Perhaps split out the part where you modify the arch function interface?

Sure.


>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 13/19] perf/x86: Enable SSP sampling using sample_regs_* fields
  2025-12-05 12:20   ` Peter Zijlstra
@ 2025-12-08  6:21     ` Mi, Dapeng
  0 siblings, 0 replies; 86+ messages in thread
From: Mi, Dapeng @ 2025-12-08  6:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang


On 12/5/2025 8:20 PM, Peter Zijlstra wrote:
> On Wed, Dec 03, 2025 at 02:54:54PM +0800, Dapeng Mi wrote:
>> diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
>> index ca242db3720f..c925af4160ad 100644
>> --- a/arch/x86/include/asm/perf_event.h
>> +++ b/arch/x86/include/asm/perf_event.h
>> @@ -729,6 +729,10 @@ struct x86_perf_regs {
>>  		u64	*egpr_regs;
>>  		struct apx_state *egpr;
>>  	};
>> +	union {
>> +		u64	*cet_regs;
>> +		struct cet_user_state *cet;
>> +	};
>>  };
> Are we envisioning more than just SSP?

Not that I know of; currently only SSP is supported.


>
>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 16/19] perf/x86: Activate back-to-back NMI detection for arch-PEBS induced NMIs
  2025-12-05 12:39   ` Peter Zijlstra
  2025-12-07 20:44     ` Andi Kleen
@ 2025-12-08  6:46     ` Mi, Dapeng
  2025-12-08  8:50       ` Peter Zijlstra
  1 sibling, 1 reply; 86+ messages in thread
From: Mi, Dapeng @ 2025-12-08  6:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao


On 12/5/2025 8:39 PM, Peter Zijlstra wrote:
> On Wed, Dec 03, 2025 at 02:54:57PM +0800, Dapeng Mi wrote:
>> When two or more identical PEBS events with the same sampling period are
>> programmed on a mix of PDIST and non-PDIST counters, multiple
>> back-to-back NMIs can be triggered.
> This is a hardware defect -- albeit a fairly common one.
>
>
>> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
>> index da48bcde8fce..a130d3f14844 100644
>> --- a/arch/x86/events/intel/core.c
>> +++ b/arch/x86/events/intel/core.c
>> @@ -3351,8 +3351,7 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
>>  	 */
>>  	if (__test_and_clear_bit(GLOBAL_STATUS_ARCH_PEBS_THRESHOLD_BIT,
>>  				 (unsigned long *)&status)) {
>> -		handled++;
>> -		static_call(x86_pmu_drain_pebs)(regs, &data);
>> +		handled += static_call(x86_pmu_drain_pebs)(regs, &data);
>>  
>>  		if (cpuc->events[INTEL_PMC_IDX_FIXED_SLOTS] &&
>>  		    is_pebs_counter_event_group(cpuc->events[INTEL_PMC_IDX_FIXED_SLOTS]))
> Note that the old code would return handled++, while the new code:
>
>> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
>> index a01c72c03bd6..c7cdcd585574 100644
>> --- a/arch/x86/events/intel/ds.c
>> +++ b/arch/x86/events/intel/ds.c
>> @@ -2759,7 +2759,7 @@ __intel_pmu_pebs_events(struct perf_event *event,
>>  	__intel_pmu_pebs_last_event(event, iregs, regs, data, at, count, setup_sample);
>>  }
>>  
>> -static void intel_pmu_drain_pebs_core(struct pt_regs *iregs, struct perf_sample_data *data)
>> +static int intel_pmu_drain_pebs_core(struct pt_regs *iregs, struct perf_sample_data *data)
>>  {
>>  	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
>>  	struct debug_store *ds = cpuc->ds;
>> @@ -2768,7 +2768,7 @@ static void intel_pmu_drain_pebs_core(struct pt_regs *iregs, struct perf_sample_
>>  	int n;
>>  
>>  	if (!x86_pmu.pebs_active)
>> -		return;
>> +		return 0;
>>  
>>  	at  = (struct pebs_record_core *)(unsigned long)ds->pebs_buffer_base;
>>  	top = (struct pebs_record_core *)(unsigned long)ds->pebs_index;
>> @@ -2779,22 +2779,24 @@ static void intel_pmu_drain_pebs_core(struct pt_regs *iregs, struct perf_sample_
>>  	ds->pebs_index = ds->pebs_buffer_base;
>>  
>>  	if (!test_bit(0, cpuc->active_mask))
>> -		return;
>> +		return 0;
>>  
>>  	WARN_ON_ONCE(!event);
>>  
>>  	if (!event->attr.precise_ip)
>> -		return;
>> +		return 0;
>>  
>>  	n = top - at;
>>  	if (n <= 0) {
>>  		if (event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD)
>>  			intel_pmu_save_and_restart_reload(event, 0);
>> -		return;
>> +		return 0;
>>  	}
>>  
>>  	__intel_pmu_pebs_events(event, iregs, data, at, top, 0, n,
>>  				setup_pebs_fixed_sample_data);
>> +
>> +	return 0;
>>  }
>>  
>>  static void intel_pmu_pebs_event_update_no_drain(struct cpu_hw_events *cpuc, u64 mask)
>> @@ -2817,7 +2819,7 @@ static void intel_pmu_pebs_event_update_no_drain(struct cpu_hw_events *cpuc, u64
>>  	}
>>  }
>>  
>> -static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_data *data)
>> +static int intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_data *data)
>>  {
>>  	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
>>  	struct debug_store *ds = cpuc->ds;
>> @@ -2830,7 +2832,7 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_d
>>  	u64 mask;
>>  
>>  	if (!x86_pmu.pebs_active)
>> -		return;
>> +		return 0;
>>  
>>  	base = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
>>  	top = (struct pebs_record_nhm *)(unsigned long)ds->pebs_index;
>> @@ -2846,7 +2848,7 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_d
>>  
>>  	if (unlikely(base >= top)) {
>>  		intel_pmu_pebs_event_update_no_drain(cpuc, mask);
>> -		return;
>> +		return 0;
>>  	}
>>  
>>  	for (at = base; at < top; at += x86_pmu.pebs_record_size) {
>> @@ -2931,6 +2933,8 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_d
>>  						setup_pebs_fixed_sample_data);
>>  		}
>>  	}
>> +
>> +	return 0;
>>  }
>>  
>>  static __always_inline void
>> @@ -2984,7 +2988,7 @@ __intel_pmu_handle_last_pebs_record(struct pt_regs *iregs,
>>  
>>  }
>>  
>> -static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_data *data)
>> +static int intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_data *data)
>>  {
>>  	short counts[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS] = {};
>>  	void *last[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS];
>> @@ -2997,7 +3001,7 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
>>  	u64 mask;
>>  
>>  	if (!x86_pmu.pebs_active)
>> -		return;
>> +		return 0;
>>  
>>  	base = (struct pebs_basic *)(unsigned long)ds->pebs_buffer_base;
>>  	top = (struct pebs_basic *)(unsigned long)ds->pebs_index;
>> @@ -3010,7 +3014,7 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
>>  
>>  	if (unlikely(base >= top)) {
>>  		intel_pmu_pebs_event_update_no_drain(cpuc, mask);
>> -		return;
>> +		return 0;
>>  	}
>>  
>>  	if (!iregs)
>> @@ -3032,9 +3036,11 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
>>  
>>  	__intel_pmu_handle_last_pebs_record(iregs, regs, data, mask, counts, last,
>>  					    setup_pebs_adaptive_sample_data);
>> +
>> +	return 0;
>>  }
> will now return handled+=0 for all these. Which is a change in
> behaviour. Also:

This change only takes effect for arch-PEBS. For the legacy PEBS, the
"handled" count would still be incremented by 1 unconditionally even if
the *_drain_pebs() helpers always return 0.

    /*
     * PEBS overflow sets bit 62 in the global status register
     */
    if (__test_and_clear_bit(GLOBAL_STATUS_BUFFER_OVF_BIT, (unsigned long
*)&status)) {
        u64 pebs_enabled = cpuc->pebs_enabled;

        handled++;
        x86_pmu_handle_guest_pebs(regs, &data);
        static_call(x86_pmu_drain_pebs)(regs, &data);


>
>> -static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
>> +static int intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
>>  				      struct perf_sample_data *data)
>>  {
>>  	short counts[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS] = {};
>> @@ -3044,13 +3050,14 @@ static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
>>  	struct x86_perf_regs perf_regs;
>>  	struct pt_regs *regs = &perf_regs.regs;
>>  	void *base, *at, *top;
>> +	u64 events_bitmap = 0;
>>  	u64 mask;
>>  
>>  	rdmsrq(MSR_IA32_PEBS_INDEX, index.whole);
>>  
>>  	if (unlikely(!index.wr)) {
>>  		intel_pmu_pebs_event_update_no_drain(cpuc, X86_PMC_IDX_MAX);
>> -		return;
>> +		return 0;

If index.wr is 0, it indicates that no PEBS record has been written to the
buffer since the last drain of the PEBS buffer. In this case, the PEBS PMI
should not be generated. If it is generated, then it implies something must
be wrong. The 0 return value would lead to a "suspicious NMI" warning,
which is useful to alert us that something is wrong.


>>  	}
>>  
>>  	base = cpuc->pebs_vaddr;
>> @@ -3089,6 +3096,7 @@ static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
>>  
>>  		basic = at + sizeof(struct arch_pebs_header);
>>  		pebs_status = mask & basic->applicable_counters;
>> +		events_bitmap |= pebs_status;
>>  		__intel_pmu_handle_pebs_record(iregs, regs, data, at,
>>  					       pebs_status, counts, last,
>>  					       setup_arch_pebs_sample_data);
>> @@ -3108,6 +3116,8 @@ static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
>>  	__intel_pmu_handle_last_pebs_record(iregs, regs, data, mask,
>>  					    counts, last,
>>  					    setup_arch_pebs_sample_data);
>> +
> 	/*
> 	 * Comment that explains the arch pebs defect goes here.
> 	 */
>> +	return hweight64(events_bitmap);
>>  }

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 16/19] perf/x86: Activate back-to-back NMI detection for arch-PEBS induced NMIs
  2025-12-08  6:46     ` Mi, Dapeng
@ 2025-12-08  8:50       ` Peter Zijlstra
  2025-12-08  8:53         ` Mi, Dapeng
  0 siblings, 1 reply; 86+ messages in thread
From: Peter Zijlstra @ 2025-12-08  8:50 UTC (permalink / raw)
  To: Mi, Dapeng
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao

On Mon, Dec 08, 2025 at 02:46:44PM +0800, Mi, Dapeng wrote:

> This change only takes effect for arch-PEBS. For the legacy PEBS, the
> "handled" count would still be incremented by 1 unconditionally even if
> the *_drain_pebs() helpers always return 0.
> 
>     /*
>      * PEBS overflow sets bit 62 in the global status register
>      */
>     if (__test_and_clear_bit(GLOBAL_STATUS_BUFFER_OVF_BIT, (unsigned long
> *)&status)) {
>         u64 pebs_enabled = cpuc->pebs_enabled;
> 
>         handled++;
>         x86_pmu_handle_guest_pebs(regs, &data);
>         static_call(x86_pmu_drain_pebs)(regs, &data);
> 

Oh gawd. Please don't do that. If you change the calling convention of
that function, please have it be used consistently.


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 16/19] perf/x86: Activate back-to-back NMI detection for arch-PEBS induced NMIs
  2025-12-08  8:50       ` Peter Zijlstra
@ 2025-12-08  8:53         ` Mi, Dapeng
  0 siblings, 0 replies; 86+ messages in thread
From: Mi, Dapeng @ 2025-12-08  8:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Dave Hansen, Ian Rogers, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao


On 12/8/2025 4:50 PM, Peter Zijlstra wrote:
> On Mon, Dec 08, 2025 at 02:46:44PM +0800, Mi, Dapeng wrote:
>
>> This change only takes effect for arch-PEBS. For the legacy PEBS, the
>> "handled" count would still be incremented by 1 unconditionally even if
>> the *_drain_pebs() helpers always return 0.
>>
>>     /*
>>      * PEBS overflow sets bit 62 in the global status register
>>      */
>>     if (__test_and_clear_bit(GLOBAL_STATUS_BUFFER_OVF_BIT, (unsigned long
>> *)&status)) {
>>         u64 pebs_enabled = cpuc->pebs_enabled;
>>
>>         handled++;
>>         x86_pmu_handle_guest_pebs(regs, &data);
>>         static_call(x86_pmu_drain_pebs)(regs, &data);
>>
> Oh gawd. Please don't do that. If you change the calling convention of
> that function, please have it be used consistently.

Sure. I would make the same change for the legacy PEBS path and keep the behavior consistent.


>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf
  2025-12-03  6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
                   ` (19 preceding siblings ...)
  2025-12-04  0:24 ` [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Ian Rogers
@ 2025-12-16  4:42 ` Ravi Bangoria
  2025-12-16  6:59   ` Mi, Dapeng
  20 siblings, 1 reply; 86+ messages in thread
From: Ravi Bangoria @ 2025-12-16  4:42 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane, Mark Rutland, broonie, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Ravi Bangoria

Hi Dapeng,

> While the hardware solution remains preferable due to its lower
> overhead and higher accuracy, this software approach provides a
> viable alternative.

Lower accuracy in the software approach is due to the delay in NMI
delivery, which will make the SIMD data a bit misaligned? Something like:

   insn1
   insn2  -> Overflow. RIP, GPRs captured by PEBS and NMI triggered
   insn3
   insn4
   insn5  -> NMI delivered here, so SIMD regs are captured here?
   insn6

Am I interpreting it correctly?

Thanks,
Ravi

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf
  2025-12-16  4:42 ` Ravi Bangoria
@ 2025-12-16  6:59   ` Mi, Dapeng
  0 siblings, 0 replies; 86+ messages in thread
From: Mi, Dapeng @ 2025-12-16  6:59 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane, Mark Rutland, broonie, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao


On 12/16/2025 12:42 PM, Ravi Bangoria wrote:
> Hi Dapeng,
>
>> While the hardware solution remains preferable due to its lower
>> overhead and higher accuracy, this software approach provides a
>> viable alternative.
> Lower accuracy in the software approach is due to the delay in NMI
> delivery, which will make the SIMD data a bit misaligned? Something like:
>
>    insn1
>    insn2  -> Overflow. RIP, GPRs captured by PEBS and NMI triggered
>    insn3
>    insn4
>    insn5  -> NMI delivered here, so SIMD regs are captured here?
>    insn6
>
> Am I interpreting it correctly?

Yes, there is always a delay with software-based (specifically PMI-based)
sampling. Hardware-based sampling like PEBS is preferable when available.


>
> Thanks,
> Ravi

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 13/19] perf/x86: Enable SSP sampling using sample_regs_* fields
  2025-12-03  6:54 ` [Patch v5 13/19] perf/x86: Enable SSP " Dapeng Mi
  2025-12-05 12:20   ` Peter Zijlstra
@ 2025-12-24  5:45   ` Ravi Bangoria
  2025-12-24  6:26     ` Mi, Dapeng
  1 sibling, 1 reply; 86+ messages in thread
From: Ravi Bangoria @ 2025-12-24  5:45 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane, Mark Rutland, broonie, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang, Ravi Bangoria

Hi Dapeng,

> This patch enables sampling of CET SSP register via the sample_regs_*
> fields.
> 
> To sample SSP, the sample_simd_regs_enabled field must be set. This
> allows the spare space (reclaimed from the original XMM space) in the
> sample_regs_* fields to be used for representing SSP.
> 
> Similar to eGPRs sampling, the perf_reg_value() function needs to
> check if the PERF_SAMPLE_REGS_ABI_SIMD flag is set first, and then
> determine whether to output SSP or legacy XMM registers to userspace.

1. The userspace SSP is saved in REGS_INTR even though interrupt regs
   are of kernel context. Would it be better to pass 0 instead (see the
   _untested_ patch below).

--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -71,7 +71,7 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 				return perf_regs->egpr_regs[idx - PERF_REG_X86_R16];
 			}
 			if (idx == PERF_REG_X86_SSP) {
-				if (!perf_regs->cet)
+				if (!perf_regs->cet || !user_mode(regs))
 					return 0;
 				return perf_regs->cet->user_ssp;
 			}

2. Could a simple "--user-regs=ssp / --intr-regs=ssp" (without SIMD/eGPR
   regs) fallback to an RDMSR instead of XSAVE? Possibly as a future
   enhancement if the current patches are already upstream ready.

Thanks,
Ravi

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 13/19] perf/x86: Enable SSP sampling using sample_regs_* fields
  2025-12-24  5:45   ` Ravi Bangoria
@ 2025-12-24  6:26     ` Mi, Dapeng
  2026-01-06  6:55       ` Mi, Dapeng
  0 siblings, 1 reply; 86+ messages in thread
From: Mi, Dapeng @ 2025-12-24  6:26 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane, Mark Rutland, broonie, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang


On 12/24/2025 1:45 PM, Ravi Bangoria wrote:
> Hi Dapeng,
>
>> This patch enables sampling of CET SSP register via the sample_regs_*
>> fields.
>>
>> To sample SSP, the sample_simd_regs_enabled field must be set. This
>> allows the spare space (reclaimed from the original XMM space) in the
>> sample_regs_* fields to be used for representing SSP.
>>
>> Similar to eGPRs sampling, the perf_reg_value() function needs to
>> check if the PERF_SAMPLE_REGS_ABI_SIMD flag is set first, and then
>> determine whether to output SSP or legacy XMM registers to userspace.
> 1. The userspace SSP is saved in REGS_INTR even though interrupt regs
>    are of kernel context. Would it be better to pass 0 instead (see the
>    _untested_ patch below).
>
> --- a/arch/x86/kernel/perf_regs.c
> +++ b/arch/x86/kernel/perf_regs.c
> @@ -71,7 +71,7 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
>  				return perf_regs->egpr_regs[idx - PERF_REG_X86_R16];
>  			}
>  			if (idx == PERF_REG_X86_SSP) {
> -				if (!perf_regs->cet)
> +				if (!perf_regs->cet || !user_mode(regs))

Hmm, I'm not sure if we should add the user_mode() check here. For the
non-PEBS case, the SSP value indeed comes from the user-space SSP MSR
(MSR_IA32_PL3_SSP) since SSP is not used in the kernel now. But for
arch-PEBS, I don't get a clear indication from the ISE doc (section 11.4.3
"General-Purpose Register Group") whether the SSP value comes from the
kernel-space SSP (MSR_IA32_PL0_SSP) or the user-space SSP
(MSR_IA32_PL3_SSP). Let me double-check with our HW experts. Thanks for
raising this.


>  					return 0;
>  				return perf_regs->cet->user_ssp;
>  			}
>
> 2. Could a simple "--user-regs=ssp / --intr-regs=ssp" (without SIMD/eGPR
>    regs) fallback to an RDMSR instead of XSAVE? Possibly as a future
>    enhancement if the current patches are already upstream ready.

Yeah, good suggestion. Dave previously raised an efficiency concern about
using xsaves to read SSP
(https://lore.kernel.org/all/3921d500-36ce-409c-8730-6be86a40e334@intel.com/).
I don't see any security risk in using rdmsr to read the SSP value (please
correct me if I'm wrong), so I would add an extra patch implementing this
optimization at the tail of the next version of the patch-set. Thanks.


>
> Thanks,
> Ravi

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 13/19] perf/x86: Enable SSP sampling using sample_regs_* fields
  2025-12-24  6:26     ` Mi, Dapeng
@ 2026-01-06  6:55       ` Mi, Dapeng
  0 siblings, 0 replies; 86+ messages in thread
From: Mi, Dapeng @ 2026-01-06  6:55 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Andi Kleen,
	Eranian Stephane, Mark Rutland, broonie, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang


On 12/24/2025 2:26 PM, Mi, Dapeng wrote:
> On 12/24/2025 1:45 PM, Ravi Bangoria wrote:
>> Hi Dapeng,
>>
>>> This patch enables sampling of CET SSP register via the sample_regs_*
>>> fields.
>>>
>>> To sample SSP, the sample_simd_regs_enabled field must be set. This
>>> allows the spare space (reclaimed from the original XMM space) in the
>>> sample_regs_* fields to be used for representing SSP.
>>>
>>> Similar to eGPRs sampling, the perf_reg_value() function needs to
>>> check if the PERF_SAMPLE_REGS_ABI_SIMD flag is set first, and then
>>> determine whether to output SSP or legacy XMM registers to userspace.
>> 1. The userspace SSP is saved in REGS_INTR even though interrupt regs
>>    are of kernel context. Would it be better to pass 0 instead (see the
>>    _untested_ patch below).
>>
>> --- a/arch/x86/kernel/perf_regs.c
>> +++ b/arch/x86/kernel/perf_regs.c
>> @@ -71,7 +71,7 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
>>  				return perf_regs->egpr_regs[idx - PERF_REG_X86_R16];
>>  			}
>>  			if (idx == PERF_REG_X86_SSP) {
>> -				if (!perf_regs->cet)
>> +				if (!perf_regs->cet || !user_mode(regs))
> Hmm, I'm not sure if we should add the user_mode() check here. For the
> non-PEBS case, the SSP value indeed comes from the user-space SSP MSR
> (MSR_IA32_PL3_SSP) since SSP is not used in the kernel now. But for
> arch-PEBS, I don't get a clear indication from the ISE doc (section 11.4.3
> "General-Purpose Register Group") whether the SSP value comes from the
> kernel-space SSP (MSR_IA32_PL0_SSP) or the user-space SSP
> (MSR_IA32_PL3_SSP). Let me double-check with our HW experts. Thanks for
> raising this.

Double-confirmed with the HW experts: the PEBS HW engine just snapshots the
active SSP value regardless of its privilege level. So when a counter
overflows in kernel space, PEBS could snapshot the kernel-space SSP.

Since SSP is not enabled in kernel space right now, the kernel-space SSP
that PEBS snapshots should be 0. But for the sake of safety, I would clear
the PEBS SSP data to 0 if it's not a user-space SSP snapshot.


>
>
>>  					return 0;
>>  				return perf_regs->cet->user_ssp;
>>  			}
>>
>> 2. Could a simple "--user-regs=ssp / --intr-regs=ssp" (without SIMD/eGPR
>>    regs) fallback to an RDMSR instead of XSAVE? Possibly as a future
>>    enhancement if the current patches are already upstream ready.
> Yeah, good suggestion. Dave previously raised an efficiency concern about
> using xsaves to read SSP
> (https://lore.kernel.org/all/3921d500-36ce-409c-8730-6be86a40e334@intel.com/).
> I don't see any security risk in using rdmsr to read the SSP value (please
> correct me if I'm wrong), so I would add an extra patch implementing this
> optimization at the tail of the next version of the patch-set. Thanks.
>
On second thought, I'm not sure rdmsr would always be more efficient than
the xsaves instruction. xsaves employs the "init" and "modified"
optimizations (although the "modified" optimization won't really apply to
this perf sampling case), and the user-space SSP value can be read directly
from the cached task FPU state if the kernel has cached it, instead of
reading SSP from hardware via xsaves.

Anyway, we may need some data. I would try to get some data after the
patch-set refactoring is done.


>> Thanks,
>> Ravi

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
  2025-12-08  4:20                     ` Mi, Dapeng
@ 2026-01-06  7:27                       ` Mi, Dapeng
  2026-01-17  5:50                         ` Ian Rogers
  0 siblings, 1 reply; 86+ messages in thread
From: Mi, Dapeng @ 2026-01-06  7:27 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang


On 12/8/2025 12:20 PM, Mi, Dapeng wrote:
> On 12/6/2025 12:35 AM, Ian Rogers wrote:
>> On Fri, Dec 5, 2025 at 12:10 AM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>> On 12/5/2025 2:38 PM, Ian Rogers wrote:
>>>> On Thu, Dec 4, 2025 at 8:00 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>>>> On 12/5/2025 12:16 AM, Ian Rogers wrote:
>>>>>> On Thu, Dec 4, 2025 at 1:20 AM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>>>>>> On 12/4/2025 3:49 PM, Ian Rogers wrote:
>>>>>>>> On Wed, Dec 3, 2025 at 6:58 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>>>>>>>> On 12/4/2025 8:17 AM, Ian Rogers wrote:
>>>>>>>>>> On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
>>>>>>>>>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>>>>>>>>>
>>>>>>>>>>> This patch adds support for the newly introduced SIMD register sampling
>>>>>>>>>>> format by adding the following functions:
>>>>>>>>>>>
>>>>>>>>>>> uint64_t arch__intr_simd_reg_mask(void);
>>>>>>>>>>> uint64_t arch__user_simd_reg_mask(void);
>>>>>>>>>>> uint64_t arch__intr_pred_reg_mask(void);
>>>>>>>>>>> uint64_t arch__user_pred_reg_mask(void);
>>>>>>>>>>> uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>>>>> uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>>>>> uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>>>>> uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>>>>>
>>>>>>>>>>> The arch__{intr|user}_simd_reg_mask() functions retrieve the bitmap of
>>>>>>>>>>> supported SIMD registers, such as XMM/YMM/ZMM on x86 platforms.
>>>>>>>>>>>
>>>>>>>>>>> The arch__{intr|user}_pred_reg_mask() functions retrieve the bitmap of
>>>>>>>>>>> supported PRED registers, such as OPMASK on x86 platforms.
>>>>>>>>>>>
>>>>>>>>>>> The arch__{intr|user}_simd_reg_bitmap_qwords() functions provide the
>>>>>>>>>>> exact bitmap and number of qwords for a specific type of SIMD register.
>>>>>>>>>>> For example, for XMM registers on x86 platforms, the returned bitmap is
>>>>>>>>>>> 0xffff (XMM0 ~ XMM15) and the qwords number is 2 (128 bits for each XMM).
>>>>>>>>>>>
>>>>>>>>>>> The arch__{intr|user}_pred_reg_bitmap_qwords() functions provide the
>>>>>>>>>>> exact bitmap and number of qwords for a specific type of PRED register.
>>>>>>>>>>> For example, for OPMASK registers on x86 platforms, the returned bitmap
>>>>>>>>>>> is 0xff (OPMASK0 ~ OPMASK7) and the qwords number is 1 (64 bits for each
>>>>>>>>>>> OPMASK).
>>>>>>>>>>>
>>>>>>>>>>> Additionally, the function __parse_regs() is enhanced to support parsing
>>>>>>>>>>> these newly introduced SIMD registers. Currently, each type of register
>>>>>>>>>>> can only be sampled collectively; sampling a specific SIMD register is
>>>>>>>>>>> not supported. For example, all XMM registers are sampled together rather
>>>>>>>>>>> than sampling only XMM0.
>>>>>>>>>>>
>>>>>>>>>>> When multiple overlapping register types, such as XMM and YMM, are
>>>>>>>>>>> sampled simultaneously, only the superset (YMM registers) is sampled.
>>>>>>>>>>>
>>>>>>>>>>> With this patch, all supported sampling registers on x86 platforms are
>>>>>>>>>>> displayed as follows.
>>>>>>>>>>>
>>>>>>>>>>>  $perf record -I?
>>>>>>>>>>>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>>>>>>>>>>>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>>>>>>>>>>>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>>>>>>>>>>>
>>>>>>>>>>>  $perf record --user-regs=?
>>>>>>>>>>>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>>>>>>>>>>>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>>>>>>>>>>>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>>>>>>>>>>> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>>>>>>>>>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>>>>>>>>>> ---
>>>>>>>>>>>  tools/perf/arch/x86/util/perf_regs.c      | 470 +++++++++++++++++++++-
>>>>>>>>>>>  tools/perf/util/evsel.c                   |  27 ++
>>>>>>>>>>>  tools/perf/util/parse-regs-options.c      | 151 ++++++-
>>>>>>>>>>>  tools/perf/util/perf_event_attr_fprintf.c |   6 +
>>>>>>>>>>>  tools/perf/util/perf_regs.c               |  59 +++
>>>>>>>>>>>  tools/perf/util/perf_regs.h               |  11 +
>>>>>>>>>>>  tools/perf/util/record.h                  |   6 +
>>>>>>>>>>>  7 files changed, 714 insertions(+), 16 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
>>>>>>>>>>> index 12fd93f04802..db41430f3b07 100644
>>>>>>>>>>> --- a/tools/perf/arch/x86/util/perf_regs.c
>>>>>>>>>>> +++ b/tools/perf/arch/x86/util/perf_regs.c
>>>>>>>>>>> @@ -13,6 +13,49 @@
>>>>>>>>>>>  #include "../../../util/pmu.h"
>>>>>>>>>>>  #include "../../../util/pmus.h"
>>>>>>>>>>>
>>>>>>>>>>> +static const struct sample_reg sample_reg_masks_ext[] = {
>>>>>>>>>>> +       SMPL_REG(AX, PERF_REG_X86_AX),
>>>>>>>>>>> +       SMPL_REG(BX, PERF_REG_X86_BX),
>>>>>>>>>>> +       SMPL_REG(CX, PERF_REG_X86_CX),
>>>>>>>>>>> +       SMPL_REG(DX, PERF_REG_X86_DX),
>>>>>>>>>>> +       SMPL_REG(SI, PERF_REG_X86_SI),
>>>>>>>>>>> +       SMPL_REG(DI, PERF_REG_X86_DI),
>>>>>>>>>>> +       SMPL_REG(BP, PERF_REG_X86_BP),
>>>>>>>>>>> +       SMPL_REG(SP, PERF_REG_X86_SP),
>>>>>>>>>>> +       SMPL_REG(IP, PERF_REG_X86_IP),
>>>>>>>>>>> +       SMPL_REG(FLAGS, PERF_REG_X86_FLAGS),
>>>>>>>>>>> +       SMPL_REG(CS, PERF_REG_X86_CS),
>>>>>>>>>>> +       SMPL_REG(SS, PERF_REG_X86_SS),
>>>>>>>>>>> +#ifdef HAVE_ARCH_X86_64_SUPPORT
>>>>>>>>>>> +       SMPL_REG(R8, PERF_REG_X86_R8),
>>>>>>>>>>> +       SMPL_REG(R9, PERF_REG_X86_R9),
>>>>>>>>>>> +       SMPL_REG(R10, PERF_REG_X86_R10),
>>>>>>>>>>> +       SMPL_REG(R11, PERF_REG_X86_R11),
>>>>>>>>>>> +       SMPL_REG(R12, PERF_REG_X86_R12),
>>>>>>>>>>> +       SMPL_REG(R13, PERF_REG_X86_R13),
>>>>>>>>>>> +       SMPL_REG(R14, PERF_REG_X86_R14),
>>>>>>>>>>> +       SMPL_REG(R15, PERF_REG_X86_R15),
>>>>>>>>>>> +       SMPL_REG(R16, PERF_REG_X86_R16),
>>>>>>>>>>> +       SMPL_REG(R17, PERF_REG_X86_R17),
>>>>>>>>>>> +       SMPL_REG(R18, PERF_REG_X86_R18),
>>>>>>>>>>> +       SMPL_REG(R19, PERF_REG_X86_R19),
>>>>>>>>>>> +       SMPL_REG(R20, PERF_REG_X86_R20),
>>>>>>>>>>> +       SMPL_REG(R21, PERF_REG_X86_R21),
>>>>>>>>>>> +       SMPL_REG(R22, PERF_REG_X86_R22),
>>>>>>>>>>> +       SMPL_REG(R23, PERF_REG_X86_R23),
>>>>>>>>>>> +       SMPL_REG(R24, PERF_REG_X86_R24),
>>>>>>>>>>> +       SMPL_REG(R25, PERF_REG_X86_R25),
>>>>>>>>>>> +       SMPL_REG(R26, PERF_REG_X86_R26),
>>>>>>>>>>> +       SMPL_REG(R27, PERF_REG_X86_R27),
>>>>>>>>>>> +       SMPL_REG(R28, PERF_REG_X86_R28),
>>>>>>>>>>> +       SMPL_REG(R29, PERF_REG_X86_R29),
>>>>>>>>>>> +       SMPL_REG(R30, PERF_REG_X86_R30),
>>>>>>>>>>> +       SMPL_REG(R31, PERF_REG_X86_R31),
>>>>>>>>>>> +       SMPL_REG(SSP, PERF_REG_X86_SSP),
>>>>>>>>>>> +#endif
>>>>>>>>>>> +       SMPL_REG_END
>>>>>>>>>>> +};
>>>>>>>>>>> +
>>>>>>>>>>>  static const struct sample_reg sample_reg_masks[] = {
>>>>>>>>>>>         SMPL_REG(AX, PERF_REG_X86_AX),
>>>>>>>>>>>         SMPL_REG(BX, PERF_REG_X86_BX),
>>>>>>>>>>> @@ -276,27 +319,404 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
>>>>>>>>>>>         return SDT_ARG_VALID;
>>>>>>>>>>>  }
>>>>>>>>>>>
>>>>>>>>>>> +static bool support_simd_reg(u64 sample_type, u16 qwords, u64 mask, bool pred)
>>>>>>>>>> To make the code easier to read, it'd be nice to document sample_type,
>>>>>>>>>> qwords and mask here.
>>>>>>>>> Sure.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> +{
>>>>>>>>>>> +       struct perf_event_attr attr = {
>>>>>>>>>>> +               .type                           = PERF_TYPE_HARDWARE,
>>>>>>>>>>> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
>>>>>>>>>>> +               .sample_type                    = sample_type,
>>>>>>>>>>> +               .disabled                       = 1,
>>>>>>>>>>> +               .exclude_kernel                 = 1,
>>>>>>>>>>> +               .sample_simd_regs_enabled       = 1,
>>>>>>>>>>> +       };
>>>>>>>>>>> +       int fd;
>>>>>>>>>>> +
>>>>>>>>>>> +       attr.sample_period = 1;
>>>>>>>>>>> +
>>>>>>>>>>> +       if (!pred) {
>>>>>>>>>>> +               attr.sample_simd_vec_reg_qwords = qwords;
>>>>>>>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>>>>>>> +                       attr.sample_simd_vec_reg_intr = mask;
>>>>>>>>>>> +               else
>>>>>>>>>>> +                       attr.sample_simd_vec_reg_user = mask;
>>>>>>>>>>> +       } else {
>>>>>>>>>>> +               attr.sample_simd_pred_reg_qwords = PERF_X86_OPMASK_QWORDS;
>>>>>>>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>>>>>>> +                       attr.sample_simd_pred_reg_intr = PERF_X86_SIMD_PRED_MASK;
>>>>>>>>>>> +               else
>>>>>>>>>>> +                       attr.sample_simd_pred_reg_user = PERF_X86_SIMD_PRED_MASK;
>>>>>>>>>>> +       }
>>>>>>>>>>> +
>>>>>>>>>>> +       if (perf_pmus__num_core_pmus() > 1) {
>>>>>>>>>>> +               struct perf_pmu *pmu = NULL;
>>>>>>>>>>> +               __u64 type = PERF_TYPE_RAW;
>>>>>>>>>> It should be okay to do:
>>>>>>>>>> __u64 type = perf_pmus__find_core_pmu()->type
>>>>>>>>>> rather than have the whole loop below.
>>>>>>>>> Sure. Thanks.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> +
>>>>>>>>>>> +               /*
>>>>>>>>>>> +                * The same register set is supported among different hybrid PMUs.
>>>>>>>>>>> +                * Only check the first available one.
>>>>>>>>>>> +                */
>>>>>>>>>>> +               while ((pmu = perf_pmus__scan_core(pmu)) != NULL) {
>>>>>>>>>>> +                       type = pmu->type;
>>>>>>>>>>> +                       break;
>>>>>>>>>>> +               }
>>>>>>>>>>> +               attr.config |= type << PERF_PMU_TYPE_SHIFT;
>>>>>>>>>>> +       }
>>>>>>>>>>> +
>>>>>>>>>>> +       event_attr_init(&attr);
>>>>>>>>>>> +
>>>>>>>>>>> +       fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
>>>>>>>>>>> +       if (fd != -1) {
>>>>>>>>>>> +               close(fd);
>>>>>>>>>>> +               return true;
>>>>>>>>>>> +       }
>>>>>>>>>>> +
>>>>>>>>>>> +       return false;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +static bool __arch_simd_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
>>>>>>>>>>> +{
>>>>>>>>>>> +       bool supported = false;
>>>>>>>>>>> +       u64 bits;
>>>>>>>>>>> +
>>>>>>>>>>> +       *mask = 0;
>>>>>>>>>>> +       *qwords = 0;
>>>>>>>>>>> +
>>>>>>>>>>> +       switch (reg) {
>>>>>>>>>>> +       case PERF_REG_X86_XMM:
>>>>>>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
>>>>>>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_XMM_QWORDS, bits, false);
>>>>>>>>>>> +               if (supported) {
>>>>>>>>>>> +                       *mask = bits;
>>>>>>>>>>> +                       *qwords = PERF_X86_XMM_QWORDS;
>>>>>>>>>>> +               }
>>>>>>>>>>> +               break;
>>>>>>>>>>> +       case PERF_REG_X86_YMM:
>>>>>>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_YMM_REGS) - 1;
>>>>>>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_YMM_QWORDS, bits, false);
>>>>>>>>>>> +               if (supported) {
>>>>>>>>>>> +                       *mask = bits;
>>>>>>>>>>> +                       *qwords = PERF_X86_YMM_QWORDS;
>>>>>>>>>>> +               }
>>>>>>>>>>> +               break;
>>>>>>>>>>> +       case PERF_REG_X86_ZMM:
>>>>>>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_ZMM_REGS) - 1;
>>>>>>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
>>>>>>>>>>> +               if (supported) {
>>>>>>>>>>> +                       *mask = bits;
>>>>>>>>>>> +                       *qwords = PERF_X86_ZMM_QWORDS;
>>>>>>>>>>> +                       break;
>>>>>>>>>>> +               }
>>>>>>>>>>> +
>>>>>>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_ZMMH_REGS) - 1;
>>>>>>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
>>>>>>>>>>> +               if (supported) {
>>>>>>>>>>> +                       *mask = bits;
>>>>>>>>>>> +                       *qwords = PERF_X86_ZMMH_QWORDS;
>>>>>>>>>>> +               }
>>>>>>>>>>> +               break;
>>>>>>>>>>> +       default:
>>>>>>>>>>> +               break;
>>>>>>>>>>> +       }
>>>>>>>>>>> +
>>>>>>>>>>> +       return supported;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +static bool __arch_pred_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
>>>>>>>>>>> +{
>>>>>>>>>>> +       bool supported = false;
>>>>>>>>>>> +       u64 bits;
>>>>>>>>>>> +
>>>>>>>>>>> +       *mask = 0;
>>>>>>>>>>> +       *qwords = 0;
>>>>>>>>>>> +
>>>>>>>>>>> +       switch (reg) {
>>>>>>>>>>> +       case PERF_REG_X86_OPMASK:
>>>>>>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_OPMASK_REGS) - 1;
>>>>>>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_OPMASK_QWORDS, bits, true);
>>>>>>>>>>> +               if (supported) {
>>>>>>>>>>> +                       *mask = bits;
>>>>>>>>>>> +                       *qwords = PERF_X86_OPMASK_QWORDS;
>>>>>>>>>>> +               }
>>>>>>>>>>> +               break;
>>>>>>>>>>> +       default:
>>>>>>>>>>> +               break;
>>>>>>>>>>> +       }
>>>>>>>>>>> +
>>>>>>>>>>> +       return supported;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +static bool has_cap_simd_regs(void)
>>>>>>>>>>> +{
>>>>>>>>>>> +       uint64_t mask = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
>>>>>>>>>>> +       u16 qwords = PERF_X86_XMM_QWORDS;
>>>>>>>>>>> +       static bool has_cap_simd_regs;
>>>>>>>>>>> +       static bool cached;
>>>>>>>>>>> +
>>>>>>>>>>> +       if (cached)
>>>>>>>>>>> +               return has_cap_simd_regs;
>>>>>>>>>>> +
>>>>>>>>>>> +       has_cap_simd_regs = __arch_simd_reg_mask(PERF_SAMPLE_REGS_INTR,
>>>>>>>>>>> +                                                PERF_REG_X86_XMM, &mask, &qwords);
>>>>>>>>>>> +       has_cap_simd_regs |= __arch_simd_reg_mask(PERF_SAMPLE_REGS_USER,
>>>>>>>>>>> +                                                PERF_REG_X86_XMM, &mask, &qwords);
>>>>>>>>>>> +       cached = true;
>>>>>>>>>>> +
>>>>>>>>>>> +       return has_cap_simd_regs;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +bool arch_has_simd_regs(u64 mask)
>>>>>>>>>>> +{
>>>>>>>>>>> +       return has_cap_simd_regs() &&
>>>>>>>>>>> +              mask & GENMASK_ULL(PERF_REG_X86_SSP, PERF_REG_X86_R16);
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +static const struct sample_reg sample_simd_reg_masks[] = {
>>>>>>>>>>> +       SMPL_REG(XMM, PERF_REG_X86_XMM),
>>>>>>>>>>> +       SMPL_REG(YMM, PERF_REG_X86_YMM),
>>>>>>>>>>> +       SMPL_REG(ZMM, PERF_REG_X86_ZMM),
>>>>>>>>>>> +       SMPL_REG_END
>>>>>>>>>>> +};
>>>>>>>>>>> +
>>>>>>>>>>> +static const struct sample_reg sample_pred_reg_masks[] = {
>>>>>>>>>>> +       SMPL_REG(OPMASK, PERF_REG_X86_OPMASK),
>>>>>>>>>>> +       SMPL_REG_END
>>>>>>>>>>> +};
>>>>>>>>>>> +
>>>>>>>>>>> +const struct sample_reg *arch__sample_simd_reg_masks(void)
>>>>>>>>>>> +{
>>>>>>>>>>> +       return sample_simd_reg_masks;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +const struct sample_reg *arch__sample_pred_reg_masks(void)
>>>>>>>>>>> +{
>>>>>>>>>>> +       return sample_pred_reg_masks;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +static bool x86_intr_simd_updated;
>>>>>>>>>>> +static u64 x86_intr_simd_reg_mask;
>>>>>>>>>>> +static u64 x86_intr_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
>>>>>>>>>>> +static u16 x86_intr_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
>>>>>>>>>> Could we add some comments? I can mostly figure out that "updated" is
>>>>>>>>>> a lazy-initialization check and what the masks are, but "qwords" is an
>>>>>>>>>> odd one. The comments could also point out that SIMD here doesn't mean
>>>>>>>>>> the machine supports SIMD, but that SIMD registers are supported in
>>>>>>>>>> perf events.
>>>>>>>>> Sure.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> +static bool x86_user_simd_updated;
>>>>>>>>>>> +static u64 x86_user_simd_reg_mask;
>>>>>>>>>>> +static u64 x86_user_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
>>>>>>>>>>> +static u16 x86_user_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
>>>>>>>>>>> +
>>>>>>>>>>> +static bool x86_intr_pred_updated;
>>>>>>>>>>> +static u64 x86_intr_pred_reg_mask;
>>>>>>>>>>> +static u64 x86_intr_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
>>>>>>>>>>> +static u16 x86_intr_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
>>>>>>>>>>> +static bool x86_user_pred_updated;
>>>>>>>>>>> +static u64 x86_user_pred_reg_mask;
>>>>>>>>>>> +static u64 x86_user_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
>>>>>>>>>>> +static u16 x86_user_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
>>>>>>>>>>> +
>>>>>>>>>>> +static uint64_t __arch__simd_reg_mask(u64 sample_type)
>>>>>>>>>>> +{
>>>>>>>>>>> +       const struct sample_reg *r = NULL;
>>>>>>>>>>> +       bool supported;
>>>>>>>>>>> +       u64 mask = 0;
>>>>>>>>>>> +       int reg;
>>>>>>>>>>> +
>>>>>>>>>>> +       if (!has_cap_simd_regs())
>>>>>>>>>>> +               return 0;
>>>>>>>>>>> +
>>>>>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_simd_updated)
>>>>>>>>>>> +               return x86_intr_simd_reg_mask;
>>>>>>>>>>> +
>>>>>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_simd_updated)
>>>>>>>>>>> +               return x86_user_simd_reg_mask;
>>>>>>>>>>> +
>>>>>>>>>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>>>>>>>>>>> +               supported = false;
>>>>>>>>>>> +
>>>>>>>>>>> +               if (!r->mask)
>>>>>>>>>>> +                       continue;
>>>>>>>>>>> +               reg = fls64(r->mask) - 1;
>>>>>>>>>>> +
>>>>>>>>>>> +               if (reg >= PERF_REG_X86_MAX_SIMD_REGS)
>>>>>>>>>>> +                       break;
>>>>>>>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>>>>>>> +                       supported = __arch_simd_reg_mask(sample_type, reg,
>>>>>>>>>>> +                                                        &x86_intr_simd_mask[reg],
>>>>>>>>>>> +                                                        &x86_intr_simd_qwords[reg]);
>>>>>>>>>>> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
>>>>>>>>>>> +                       supported = __arch_simd_reg_mask(sample_type, reg,
>>>>>>>>>>> +                                                        &x86_user_simd_mask[reg],
>>>>>>>>>>> +                                                        &x86_user_simd_qwords[reg]);
>>>>>>>>>>> +               if (supported)
>>>>>>>>>>> +                       mask |= BIT_ULL(reg);
>>>>>>>>>>> +       }
>>>>>>>>>>> +
>>>>>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
>>>>>>>>>>> +               x86_intr_simd_reg_mask = mask;
>>>>>>>>>>> +               x86_intr_simd_updated = true;
>>>>>>>>>>> +       } else {
>>>>>>>>>>> +               x86_user_simd_reg_mask = mask;
>>>>>>>>>>> +               x86_user_simd_updated = true;
>>>>>>>>>>> +       }
>>>>>>>>>>> +
>>>>>>>>>>> +       return mask;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +static uint64_t __arch__pred_reg_mask(u64 sample_type)
>>>>>>>>>>> +{
>>>>>>>>>>> +       const struct sample_reg *r = NULL;
>>>>>>>>>>> +       bool supported;
>>>>>>>>>>> +       u64 mask = 0;
>>>>>>>>>>> +       int reg;
>>>>>>>>>>> +
>>>>>>>>>>> +       if (!has_cap_simd_regs())
>>>>>>>>>>> +               return 0;
>>>>>>>>>>> +
>>>>>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_pred_updated)
>>>>>>>>>>> +               return x86_intr_pred_reg_mask;
>>>>>>>>>>> +
>>>>>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_pred_updated)
>>>>>>>>>>> +               return x86_user_pred_reg_mask;
>>>>>>>>>>> +
>>>>>>>>>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>>>>>>>>>>> +               supported = false;
>>>>>>>>>>> +
>>>>>>>>>>> +               if (!r->mask)
>>>>>>>>>>> +                       continue;
>>>>>>>>>>> +               reg = fls64(r->mask) - 1;
>>>>>>>>>>> +
>>>>>>>>>>> +               if (reg >= PERF_REG_X86_MAX_PRED_REGS)
>>>>>>>>>>> +                       break;
>>>>>>>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>>>>>>> +                       supported = __arch_pred_reg_mask(sample_type, reg,
>>>>>>>>>>> +                                                        &x86_intr_pred_mask[reg],
>>>>>>>>>>> +                                                        &x86_intr_pred_qwords[reg]);
>>>>>>>>>>> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
>>>>>>>>>>> +                       supported = __arch_pred_reg_mask(sample_type, reg,
>>>>>>>>>>> +                                                        &x86_user_pred_mask[reg],
>>>>>>>>>>> +                                                        &x86_user_pred_qwords[reg]);
>>>>>>>>>>> +               if (supported)
>>>>>>>>>>> +                       mask |= BIT_ULL(reg);
>>>>>>>>>>> +       }
>>>>>>>>>>> +
>>>>>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
>>>>>>>>>>> +               x86_intr_pred_reg_mask = mask;
>>>>>>>>>>> +               x86_intr_pred_updated = true;
>>>>>>>>>>> +       } else {
>>>>>>>>>>> +               x86_user_pred_reg_mask = mask;
>>>>>>>>>>> +               x86_user_pred_updated = true;
>>>>>>>>>>> +       }
>>>>>>>>>>> +
>>>>>>>>>>> +       return mask;
>>>>>>>>>>> +}
>>>>>>>>>> This feels repetitive with __arch__simd_reg_mask, could they be
>>>>>>>>>> refactored together?
>>>>>>>>> hmm, it looks like we can extract the for loop into a common function.
>>>>>>>>> The other parts are hard to generalize since they manipulate different
>>>>>>>>> variables. Generalizing them would require introducing lots of "if
>>>>>>>>> ... else" branches, which would make the code hard to read.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> +
>>>>>>>>>>> +uint64_t arch__intr_simd_reg_mask(void)
>>>>>>>>>>> +{
>>>>>>>>>>> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_INTR);
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +uint64_t arch__user_simd_reg_mask(void)
>>>>>>>>>>> +{
>>>>>>>>>>> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_USER);
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +uint64_t arch__intr_pred_reg_mask(void)
>>>>>>>>>>> +{
>>>>>>>>>>> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_INTR);
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +uint64_t arch__user_pred_reg_mask(void)
>>>>>>>>>>> +{
>>>>>>>>>>> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_USER);
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +static uint64_t arch__simd_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
>>>>>>>>>>> +{
>>>>>>>>>>> +       uint64_t mask = 0;
>>>>>>>>>>> +
>>>>>>>>>>> +       *qwords = 0;
>>>>>>>>>>> +       if (reg < PERF_REG_X86_MAX_SIMD_REGS) {
>>>>>>>>>>> +               if (intr) {
>>>>>>>>>>> +                       *qwords = x86_intr_simd_qwords[reg];
>>>>>>>>>>> +                       mask = x86_intr_simd_mask[reg];
>>>>>>>>>>> +               } else {
>>>>>>>>>>> +                       *qwords = x86_user_simd_qwords[reg];
>>>>>>>>>>> +                       mask = x86_user_simd_mask[reg];
>>>>>>>>>>> +               }
>>>>>>>>>>> +       }
>>>>>>>>>>> +
>>>>>>>>>>> +       return mask;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +static uint64_t arch__pred_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
>>>>>>>>>>> +{
>>>>>>>>>>> +       uint64_t mask = 0;
>>>>>>>>>>> +
>>>>>>>>>>> +       *qwords = 0;
>>>>>>>>>>> +       if (reg < PERF_REG_X86_MAX_PRED_REGS) {
>>>>>>>>>>> +               if (intr) {
>>>>>>>>>>> +                       *qwords = x86_intr_pred_qwords[reg];
>>>>>>>>>>> +                       mask = x86_intr_pred_mask[reg];
>>>>>>>>>>> +               } else {
>>>>>>>>>>> +                       *qwords = x86_user_pred_qwords[reg];
>>>>>>>>>>> +                       mask = x86_user_pred_mask[reg];
>>>>>>>>>>> +               }
>>>>>>>>>>> +       }
>>>>>>>>>>> +
>>>>>>>>>>> +       return mask;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>>>>>>> +{
>>>>>>>>>>> +       if (!x86_intr_simd_updated)
>>>>>>>>>>> +               arch__intr_simd_reg_mask();
>>>>>>>>>>> +       return arch__simd_reg_bitmap_qwords(reg, qwords, true);
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>>>>>>> +{
>>>>>>>>>>> +       if (!x86_user_simd_updated)
>>>>>>>>>>> +               arch__user_simd_reg_mask();
>>>>>>>>>>> +       return arch__simd_reg_bitmap_qwords(reg, qwords, false);
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>>>>>>> +{
>>>>>>>>>>> +       if (!x86_intr_pred_updated)
>>>>>>>>>>> +               arch__intr_pred_reg_mask();
>>>>>>>>>>> +       return arch__pred_reg_bitmap_qwords(reg, qwords, true);
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>>>>>>> +{
>>>>>>>>>>> +       if (!x86_user_pred_updated)
>>>>>>>>>>> +               arch__user_pred_reg_mask();
>>>>>>>>>>> +       return arch__pred_reg_bitmap_qwords(reg, qwords, false);
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>>  const struct sample_reg *arch__sample_reg_masks(void)
>>>>>>>>>>>  {
>>>>>>>>>>> +       if (has_cap_simd_regs())
>>>>>>>>>>> +               return sample_reg_masks_ext;
>>>>>>>>>>>         return sample_reg_masks;
>>>>>>>>>>>  }
>>>>>>>>>>>
>>>>>>>>>>> -uint64_t arch__intr_reg_mask(void)
>>>>>>>>>>> +static uint64_t __arch__reg_mask(u64 sample_type, u64 mask, bool has_simd_regs)
>>>>>>>>>>>  {
>>>>>>>>>>>         struct perf_event_attr attr = {
>>>>>>>>>>> -               .type                   = PERF_TYPE_HARDWARE,
>>>>>>>>>>> -               .config                 = PERF_COUNT_HW_CPU_CYCLES,
>>>>>>>>>>> -               .sample_type            = PERF_SAMPLE_REGS_INTR,
>>>>>>>>>>> -               .sample_regs_intr       = PERF_REG_EXTENDED_MASK,
>>>>>>>>>>> -               .precise_ip             = 1,
>>>>>>>>>>> -               .disabled               = 1,
>>>>>>>>>>> -               .exclude_kernel         = 1,
>>>>>>>>>>> +               .type                           = PERF_TYPE_HARDWARE,
>>>>>>>>>>> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
>>>>>>>>>>> +               .sample_type                    = sample_type,
>>>>>>>>>>> +               .precise_ip                     = 1,
>>>>>>>>>>> +               .disabled                       = 1,
>>>>>>>>>>> +               .exclude_kernel                 = 1,
>>>>>>>>>>> +               .sample_simd_regs_enabled       = has_simd_regs,
>>>>>>>>>>>         };
>>>>>>>>>>>         int fd;
>>>>>>>>>>>         /*
>>>>>>>>>>>          * In an unnamed union, init it here to build on older gcc versions
>>>>>>>>>>>          */
>>>>>>>>>>>         attr.sample_period = 1;
>>>>>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>>>>>>> +               attr.sample_regs_intr = mask;
>>>>>>>>>>> +       else
>>>>>>>>>>> +               attr.sample_regs_user = mask;
>>>>>>>>>>>
>>>>>>>>>>>         if (perf_pmus__num_core_pmus() > 1) {
>>>>>>>>>>>                 struct perf_pmu *pmu = NULL;
>>>>>>>>>>> @@ -318,13 +738,41 @@ uint64_t arch__intr_reg_mask(void)
>>>>>>>>>>>         fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
>>>>>>>>>>>         if (fd != -1) {
>>>>>>>>>>>                 close(fd);
>>>>>>>>>>> -               return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
>>>>>>>>>>> +               return mask;
>>>>>>>>>>>         }
>>>>>>>>>>>
>>>>>>>>>>> -       return PERF_REGS_MASK;
>>>>>>>>>>> +       return 0;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +uint64_t arch__intr_reg_mask(void)
>>>>>>>>>>> +{
>>>>>>>>>>> +       uint64_t mask = PERF_REGS_MASK;
>>>>>>>>>>> +
>>>>>>>>>>> +       if (has_cap_simd_regs()) {
>>>>>>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
>>>>>>>>>>> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
>>>>>>>>>>> +                                        true);
>>>>>>>>>> It's nice to label constant arguments like this with something like:
>>>>>>>>>> /*has_simd_regs=*/true);
>>>>>>>>>>
>>>>>>>>>> Tools like clang-tidy even try to enforce that the argument names
>>>>>>>>>> match the comments.
>>>>>>>>> Sure.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
>>>>>>>>>>> +                                        BIT_ULL(PERF_REG_X86_SSP),
>>>>>>>>>>> +                                        true);
>>>>>>>>>>> +       } else
>>>>>>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR, PERF_REG_EXTENDED_MASK, false);
>>>>>>>>>>> +
>>>>>>>>>>> +       return mask;
>>>>>>>>>>>  }
>>>>>>>>>>>
>>>>>>>>>>>  uint64_t arch__user_reg_mask(void)
>>>>>>>>>>>  {
>>>>>>>>>>> -       return PERF_REGS_MASK;
>>>>>>>>>>> +       uint64_t mask = PERF_REGS_MASK;
>>>>>>>>>>> +
>>>>>>>>>>> +       if (has_cap_simd_regs()) {
>>>>>>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
>>>>>>>>>>> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
>>>>>>>>>>> +                                        true);
>>>>>>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
>>>>>>>>>>> +                                        BIT_ULL(PERF_REG_X86_SSP),
>>>>>>>>>>> +                                        true);
>>>>>>>>>>> +       }
>>>>>>>>>>> +
>>>>>>>>>>> +       return mask;
>>>>>>>>>> The code is repetitive here, could we refactor into a single function
>>>>>>>>>> passing in a user or instr value?
>>>>>>>>> Sure. Would extract the common part.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>  }
>>>>>>>>>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>>>>>>>>>>> index 56ebefd075f2..5d1d90cf9488 100644
>>>>>>>>>>> --- a/tools/perf/util/evsel.c
>>>>>>>>>>> +++ b/tools/perf/util/evsel.c
>>>>>>>>>>> @@ -1461,12 +1461,39 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
>>>>>>>>>>>         if (opts->sample_intr_regs && !evsel->no_aux_samples &&
>>>>>>>>>>>             !evsel__is_dummy_event(evsel)) {
>>>>>>>>>>>                 attr->sample_regs_intr = opts->sample_intr_regs;
>>>>>>>>>>> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_intr);
>>>>>>>>>>> +               evsel__set_sample_bit(evsel, REGS_INTR);
>>>>>>>>>>> +       }
>>>>>>>>>>> +
>>>>>>>>>>> +       if ((opts->sample_intr_vec_regs || opts->sample_intr_pred_regs) &&
>>>>>>>>>>> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
>>>>>>>>>>> +               /* A non-zero pred qwords implies the set of SIMD registers is used */
>>>>>>>>>>> +               if (opts->sample_pred_regs_qwords)
>>>>>>>>>>> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
>>>>>>>>>>> +               else
>>>>>>>>>>> +                       attr->sample_simd_pred_reg_qwords = 1;
>>>>>>>>>>> +               attr->sample_simd_vec_reg_intr = opts->sample_intr_vec_regs;
>>>>>>>>>>> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
>>>>>>>>>>> +               attr->sample_simd_pred_reg_intr = opts->sample_intr_pred_regs;
>>>>>>>>>>>                 evsel__set_sample_bit(evsel, REGS_INTR);
>>>>>>>>>>>         }
>>>>>>>>>>>
>>>>>>>>>>>         if (opts->sample_user_regs && !evsel->no_aux_samples &&
>>>>>>>>>>>             !evsel__is_dummy_event(evsel)) {
>>>>>>>>>>>                 attr->sample_regs_user |= opts->sample_user_regs;
>>>>>>>>>>> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_user);
>>>>>>>>>>> +               evsel__set_sample_bit(evsel, REGS_USER);
>>>>>>>>>>> +       }
>>>>>>>>>>> +
>>>>>>>>>>> +       if ((opts->sample_user_vec_regs || opts->sample_user_pred_regs) &&
>>>>>>>>>>> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
>>>>>>>>>>> +               if (opts->sample_pred_regs_qwords)
>>>>>>>>>>> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
>>>>>>>>>>> +               else
>>>>>>>>>>> +                       attr->sample_simd_pred_reg_qwords = 1;
>>>>>>>>>>> +               attr->sample_simd_vec_reg_user = opts->sample_user_vec_regs;
>>>>>>>>>>> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
>>>>>>>>>>> +               attr->sample_simd_pred_reg_user = opts->sample_user_pred_regs;
>>>>>>>>>>>                 evsel__set_sample_bit(evsel, REGS_USER);
>>>>>>>>>>>         }
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
>>>>>>>>>>> index cda1c620968e..0bd100392889 100644
>>>>>>>>>>> --- a/tools/perf/util/parse-regs-options.c
>>>>>>>>>>> +++ b/tools/perf/util/parse-regs-options.c
>>>>>>>>>>> @@ -4,19 +4,139 @@
>>>>>>>>>>>  #include <stdint.h>
>>>>>>>>>>>  #include <string.h>
>>>>>>>>>>>  #include <stdio.h>
>>>>>>>>>>> +#include <linux/bitops.h>
>>>>>>>>>>>  #include "util/debug.h"
>>>>>>>>>>>  #include <subcmd/parse-options.h>
>>>>>>>>>>>  #include "util/perf_regs.h"
>>>>>>>>>>>  #include "util/parse-regs-options.h"
>>>>>>>>>>> +#include "record.h"
>>>>>>>>>>> +
>>>>>>>>>>> +static void __print_simd_regs(bool intr, uint64_t simd_mask)
>>>>>>>>>>> +{
>>>>>>>>>>> +       const struct sample_reg *r = NULL;
>>>>>>>>>>> +       uint64_t bitmap = 0;
>>>>>>>>>>> +       u16 qwords = 0;
>>>>>>>>>>> +       int reg_idx;
>>>>>>>>>>> +
>>>>>>>>>>> +       if (!simd_mask)
>>>>>>>>>>> +               return;
>>>>>>>>>>> +
>>>>>>>>>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>>>>>>>>>>> +               if (!(r->mask & simd_mask))
>>>>>>>>>>> +                       continue;
>>>>>>>>>>> +               reg_idx = fls64(r->mask) - 1;
>>>>>>>>>>> +               if (intr)
>>>>>>>>>>> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>>>>> +               else
>>>>>>>>>>> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>>>>> +               if (bitmap)
>>>>>>>>>>> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
>>>>>>>>>>> +       }
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +static void __print_pred_regs(bool intr, uint64_t pred_mask)
>>>>>>>>>>> +{
>>>>>>>>>>> +       const struct sample_reg *r = NULL;
>>>>>>>>>>> +       uint64_t bitmap = 0;
>>>>>>>>>>> +       u16 qwords = 0;
>>>>>>>>>>> +       int reg_idx;
>>>>>>>>>>> +
>>>>>>>>>>> +       if (!pred_mask)
>>>>>>>>>>> +               return;
>>>>>>>>>>> +
>>>>>>>>>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>>>>>>>>>>> +               if (!(r->mask & pred_mask))
>>>>>>>>>>> +                       continue;
>>>>>>>>>>> +               reg_idx = fls64(r->mask) - 1;
>>>>>>>>>>> +               if (intr)
>>>>>>>>>>> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>>>>> +               else
>>>>>>>>>>> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>>>>> +               if (bitmap)
>>>>>>>>>>> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
>>>>>>>>>>> +       }
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +static bool __parse_simd_regs(struct record_opts *opts, char *s, bool intr)
>>>>>>>>>>> +{
>>>>>>>>>>> +       const struct sample_reg *r = NULL;
>>>>>>>>>>> +       bool matched = false;
>>>>>>>>>>> +       uint64_t bitmap = 0;
>>>>>>>>>>> +       u16 qwords = 0;
>>>>>>>>>>> +       int reg_idx;
>>>>>>>>>>> +
>>>>>>>>>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>>>>>>>>>>> +               if (strcasecmp(s, r->name))
>>>>>>>>>>> +                       continue;
>>>>>>>>>>> +               if (!fls64(r->mask))
>>>>>>>>>>> +                       continue;
>>>>>>>>>>> +               reg_idx = fls64(r->mask) - 1;
>>>>>>>>>>> +               if (intr)
>>>>>>>>>>> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>>>>> +               else
>>>>>>>>>>> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>>>>> +               matched = true;
>>>>>>>>>>> +               break;
>>>>>>>>>>> +       }
>>>>>>>>>>> +
>>>>>>>>>>> +       /* Just need the highest qwords */
>>>>>>>>>> I'm not following here. Does the bitmap need to handle gaps?
>>>>>>>>> Currently no. In theory, the kernel could let user space sample only
>>>>>>>>> a subset of the SIMD registers, e.g., 0xff or 0xf0f for the XMM
>>>>>>>>> registers (the HW supports 16 XMM registers), but that isn't
>>>>>>>>> supported, to avoid introducing too much complexity in the perf
>>>>>>>>> tools. Moreover, I don't think end users have such a requirement. In
>>>>>>>>> most cases, users know which kinds of SIMD registers their programs
>>>>>>>>> use, but usually don't know or care exactly which SIMD register is
>>>>>>>>> used.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> +       if (qwords > opts->sample_vec_regs_qwords) {
>>>>>>>>>>> +               opts->sample_vec_regs_qwords = qwords;
>>>>>>>>>>> +               if (intr)
>>>>>>>>>>> +                       opts->sample_intr_vec_regs = bitmap;
>>>>>>>>>>> +               else
>>>>>>>>>>> +                       opts->sample_user_vec_regs = bitmap;
>>>>>>>>>>> +       }
>>>>>>>>>>> +
>>>>>>>>>>> +       return matched;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +static bool __parse_pred_regs(struct record_opts *opts, char *s, bool intr)
>>>>>>>>>>> +{
>>>>>>>>>>> +       const struct sample_reg *r = NULL;
>>>>>>>>>>> +       bool matched = false;
>>>>>>>>>>> +       uint64_t bitmap = 0;
>>>>>>>>>>> +       u16 qwords = 0;
>>>>>>>>>>> +       int reg_idx;
>>>>>>>>>>> +
>>>>>>>>>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>>>>>>>>>>> +               if (strcasecmp(s, r->name))
>>>>>>>>>>> +                       continue;
>>>>>>>>>>> +               if (!fls64(r->mask))
>>>>>>>>>>> +                       continue;
>>>>>>>>>>> +               reg_idx = fls64(r->mask) - 1;
>>>>>>>>>>> +               if (intr)
>>>>>>>>>>> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>>>>> +               else
>>>>>>>>>>> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>>>>> +               matched = true;
>>>>>>>>>>> +               break;
>>>>>>>>>>> +       }
>>>>>>>>>>> +
>>>>>>>>>>> +       /* Just need the highest qwords */
>>>>>>>>>> Again repetitive; could we have a single function?
>>>>>>>>> Yes, I suppose the for loop at least can be extracted as a common function.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> +       if (qwords > opts->sample_pred_regs_qwords) {
>>>>>>>>>>> +               opts->sample_pred_regs_qwords = qwords;
>>>>>>>>>>> +               if (intr)
>>>>>>>>>>> +                       opts->sample_intr_pred_regs = bitmap;
>>>>>>>>>>> +               else
>>>>>>>>>>> +                       opts->sample_user_pred_regs = bitmap;
>>>>>>>>>>> +       }
>>>>>>>>>>> +
>>>>>>>>>>> +       return matched;
>>>>>>>>>>> +}
>>>>>>>>>>>
>>>>>>>>>>>  static int
>>>>>>>>>>>  __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>>>>>>  {
>>>>>>>>>>>         uint64_t *mode = (uint64_t *)opt->value;
>>>>>>>>>>>         const struct sample_reg *r = NULL;
>>>>>>>>>>> +       struct record_opts *opts;
>>>>>>>>>>>         char *s, *os = NULL, *p;
>>>>>>>>>>> -       int ret = -1;
>>>>>>>>>>> +       bool has_simd_regs = false;
>>>>>>>>>>>         uint64_t mask;
>>>>>>>>>>> +       uint64_t simd_mask;
>>>>>>>>>>> +       uint64_t pred_mask;
>>>>>>>>>>> +       int ret = -1;
>>>>>>>>>>>
>>>>>>>>>>>         if (unset)
>>>>>>>>>>>                 return 0;
>>>>>>>>>>> @@ -27,10 +147,17 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>>>>>>         if (*mode)
>>>>>>>>>>>                 return -1;
>>>>>>>>>>>
>>>>>>>>>>> -       if (intr)
>>>>>>>>>>> +       if (intr) {
>>>>>>>>>>> +               opts = container_of(opt->value, struct record_opts, sample_intr_regs);
>>>>>>>>>>>                 mask = arch__intr_reg_mask();
>>>>>>>>>>> -       else
>>>>>>>>>>> +               simd_mask = arch__intr_simd_reg_mask();
>>>>>>>>>>> +               pred_mask = arch__intr_pred_reg_mask();
>>>>>>>>>>> +       } else {
>>>>>>>>>>> +               opts = container_of(opt->value, struct record_opts, sample_user_regs);
>>>>>>>>>>>                 mask = arch__user_reg_mask();
>>>>>>>>>>> +               simd_mask = arch__user_simd_reg_mask();
>>>>>>>>>>> +               pred_mask = arch__user_pred_reg_mask();
>>>>>>>>>>> +       }
>>>>>>>>>>>
>>>>>>>>>>>         /* str may be NULL in case no arg is passed to -I */
>>>>>>>>>>>         if (str) {
>>>>>>>>>>> @@ -50,10 +177,24 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>>>>>>                                         if (r->mask & mask)
>>>>>>>>>>>                                                 fprintf(stderr, "%s ", r->name);
>>>>>>>>>>>                                 }
>>>>>>>>>>> +                               __print_simd_regs(intr, simd_mask);
>>>>>>>>>>> +                               __print_pred_regs(intr, pred_mask);
>>>>>>>>>>>                                 fputc('\n', stderr);
>>>>>>>>>>>                                 /* just printing available regs */
>>>>>>>>>>>                                 goto error;
>>>>>>>>>>>                         }
>>>>>>>>>>> +
>>>>>>>>>>> +                       if (simd_mask) {
>>>>>>>>>>> +                               has_simd_regs = __parse_simd_regs(opts, s, intr);
>>>>>>>>>>> +                               if (has_simd_regs)
>>>>>>>>>>> +                                       goto next;
>>>>>>>>>>> +                       }
>>>>>>>>>>> +                       if (pred_mask) {
>>>>>>>>>>> +                               has_simd_regs = __parse_pred_regs(opts, s, intr);
>>>>>>>>>>> +                               if (has_simd_regs)
>>>>>>>>>>> +                                       goto next;
>>>>>>>>>>> +                       }
>>>>>>>>>>> +
>>>>>>>>>>>                         for (r = arch__sample_reg_masks(); r->name; r++) {
>>>>>>>>>>>                                 if ((r->mask & mask) && !strcasecmp(s, r->name))
>>>>>>>>>>>                                         break;
>>>>>>>>>>> @@ -65,7 +206,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>>>>>>                         }
>>>>>>>>>>>
>>>>>>>>>>>                         *mode |= r->mask;
>>>>>>>>>>> -
>>>>>>>>>>> +next:
>>>>>>>>>>>                         if (!p)
>>>>>>>>>>>                                 break;
>>>>>>>>>>>
>>>>>>>>>>> @@ -75,7 +216,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>>>>>>         ret = 0;
>>>>>>>>>>>
>>>>>>>>>>>         /* default to all possible regs */
>>>>>>>>>>> -       if (*mode == 0)
>>>>>>>>>>> +       if (*mode == 0 && !has_simd_regs)
>>>>>>>>>>>                 *mode = mask;
>>>>>>>>>>>  error:
>>>>>>>>>>>         free(os);
>>>>>>>>>>> diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
>>>>>>>>>>> index 66b666d9ce64..fb0366d050cf 100644
>>>>>>>>>>> --- a/tools/perf/util/perf_event_attr_fprintf.c
>>>>>>>>>>> +++ b/tools/perf/util/perf_event_attr_fprintf.c
>>>>>>>>>>> @@ -360,6 +360,12 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
>>>>>>>>>>>         PRINT_ATTRf(aux_start_paused, p_unsigned);
>>>>>>>>>>>         PRINT_ATTRf(aux_pause, p_unsigned);
>>>>>>>>>>>         PRINT_ATTRf(aux_resume, p_unsigned);
>>>>>>>>>>> +       PRINT_ATTRf(sample_simd_pred_reg_qwords, p_unsigned);
>>>>>>>>>>> +       PRINT_ATTRf(sample_simd_pred_reg_intr, p_hex);
>>>>>>>>>>> +       PRINT_ATTRf(sample_simd_pred_reg_user, p_hex);
>>>>>>>>>>> +       PRINT_ATTRf(sample_simd_vec_reg_qwords, p_unsigned);
>>>>>>>>>>> +       PRINT_ATTRf(sample_simd_vec_reg_intr, p_hex);
>>>>>>>>>>> +       PRINT_ATTRf(sample_simd_vec_reg_user, p_hex);
>>>>>>>>>>>
>>>>>>>>>>>         return ret;
>>>>>>>>>>>  }
>>>>>>>>>>> diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
>>>>>>>>>>> index 44b90bbf2d07..e8a9fabc92e6 100644
>>>>>>>>>>> --- a/tools/perf/util/perf_regs.c
>>>>>>>>>>> +++ b/tools/perf/util/perf_regs.c
>>>>>>>>>>> @@ -11,6 +11,11 @@ int __weak arch_sdt_arg_parse_op(char *old_op __maybe_unused,
>>>>>>>>>>>         return SDT_ARG_SKIP;
>>>>>>>>>>>  }
>>>>>>>>>>>
>>>>>>>>>>> +bool __weak arch_has_simd_regs(u64 mask __maybe_unused)
>>>>>>>>>>> +{
>>>>>>>>>>> +       return false;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>>  uint64_t __weak arch__intr_reg_mask(void)
>>>>>>>>>>>  {
>>>>>>>>>>>         return 0;
>>>>>>>>>>> @@ -21,6 +26,50 @@ uint64_t __weak arch__user_reg_mask(void)
>>>>>>>>>>>         return 0;
>>>>>>>>>>>  }
>>>>>>>>>>>
>>>>>>>>>>> +uint64_t __weak arch__intr_simd_reg_mask(void)
>>>>>>>>>>> +{
>>>>>>>>>>> +       return 0;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +uint64_t __weak arch__user_simd_reg_mask(void)
>>>>>>>>>>> +{
>>>>>>>>>>> +       return 0;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +uint64_t __weak arch__intr_pred_reg_mask(void)
>>>>>>>>>>> +{
>>>>>>>>>>> +       return 0;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +uint64_t __weak arch__user_pred_reg_mask(void)
>>>>>>>>>>> +{
>>>>>>>>>>> +       return 0;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +uint64_t __weak arch__intr_simd_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
>>>>>>>>>>> +{
>>>>>>>>>>> +       *qwords = 0;
>>>>>>>>>>> +       return 0;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +uint64_t __weak arch__user_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>>>>>>>>>>> +{
>>>>>>>>>>> +       *qwords = 0;
>>>>>>>>>>> +       return 0;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +uint64_t __weak arch__intr_pred_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
>>>>>>>>>>> +{
>>>>>>>>>>> +       *qwords = 0;
>>>>>>>>>>> +       return 0;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +uint64_t __weak arch__user_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>>>>>>>>>>> +{
>>>>>>>>>>> +       *qwords = 0;
>>>>>>>>>>> +       return 0;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>>  static const struct sample_reg sample_reg_masks[] = {
>>>>>>>>>>>         SMPL_REG_END
>>>>>>>>>>>  };
>>>>>>>>>>> @@ -30,6 +79,16 @@ const struct sample_reg * __weak arch__sample_reg_masks(void)
>>>>>>>>>>>         return sample_reg_masks;
>>>>>>>>>>>  }
>>>>>>>>>>>
>>>>>>>>>>> +const struct sample_reg * __weak arch__sample_simd_reg_masks(void)
>>>>>>>>>>> +{
>>>>>>>>>>> +       return sample_reg_masks;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +const struct sample_reg * __weak arch__sample_pred_reg_masks(void)
>>>>>>>>>>> +{
>>>>>>>>>>> +       return sample_reg_masks;
>>>>>>>>>>> +}
>>>>>>>>>> Thinking out loud. I wonder if there is a way to hide the weak
>>>>>>>>>> functions. It seems the support is tied to PMUs, particularly core
>>>>>>>>>> PMUs, perhaps we can push things into pmu and arch pmu code. Then we
>>>>>>>>>> ask the PMU to parse the register strings, set up the perf_event_attr,
>>>>>>>>>> etc. I'm somewhat scared these functions will be used on the report
>>>>>>>>>> rather than record side of things, thereby breaking perf.data support
>>>>>>>>>> when the host kernel does or doesn't have the SIMD support.
>>>>>>>>> Ian, I don't quite follow your words.
>>>>>>>>>
>>>>>>>>> I don't quite understand what we should do to "push things into pmu and
>>>>>>>>> arch pmu code". The current SIMD registers support follows the same
>>>>>>>>> approach as the general registers support. If we intend to change the
>>>>>>>>> approach entirely, we'd better have an independent patch-set.
>>>>>>>>>
>>>>>>>>> Why would these functions break the perf.data report? perf-report would
>>>>>>>>> check whether the PERF_SAMPLE_REGS_ABI_SIMD flag is set for each record;
>>>>>>>>> only when the flag is set (indicating there is SIMD registers data
>>>>>>>>> appended to the record) would perf-report try to parse that data.
>>>>>>>> Thanks Dapeng, sorry I wasn't clear. So, I've landed clean ups to
>>>>>>>> remove weak symbols like:
>>>>>>>> https://lore.kernel.org/lkml/20250724163302.596743-21-irogers@google.com/#t
>>>>>>>>
>>>>>>>> For these patches what I'm imagining is that there is a Nova Lake
>>>>>>>> generated perf.data file. Using perf report, script, etc. on the Nova
>>>>>>>> Lake should expose all of the same mask, qword, etc. values as when
>>>>>>>> the perf.data was generated and so things will work. If the perf.data
>>>>>>>> file was taken to say my Alderlake then what will happen? Generally
>>>>>>>> using the arch directory and weak symbols is a code smell that cross
>>>>>>>> platform things are going to break - there should be sufficient data
>>>>>>>> in the event and the perf_event_attr to fully decode what's going on.
>>>>>>>> Sometimes tying things to a PMU name can avoid the use of the arch
>>>>>>>> directory. We were able to avoid the arch directory to a good extent
>>>>>>>> for the TPEBS code, even though it is a very modern Intel feature.
>>>>>>> I see.
>>>>>>>
>>>>>>> But the sampling support for SIMD registers is different from the sample
>>>>>>> weight processing in the patch
>>>>>>> https://lore.kernel.org/lkml/20250724163302.596743-21-irogers@google.com/#t.
>>>>>>> Each arch may support different kinds of SIMD registers, and furthermore
>>>>>>> each kind of SIMD register may have a different register count and
>>>>>>> register width. It's quite hard to figure out common functions or fields
>>>>>>> to represent the names and attributes of these arch-specific SIMD
>>>>>>> registers. This arch-specific information can only be provided by the
>>>>>>> arch-specific code. So it looks like the __weak functions are still the
>>>>>>> easiest way to implement this.
>>>>>>>
>>>>>>> I don't think the perf.data parsing would be broken when moving from one
>>>>>>> platform to another of the same arch, e.g., from Nova Lake to Alder Lake.
>>>>>>> To indicate the presence of SIMD registers in record data, a new ABI flag
>>>>>>> "PERF_SAMPLE_REGS_ABI_SIMD" is introduced. If the perf tool on the 2nd
>>>>>>> platform is new enough and can recognize this new flag, then the SIMD
>>>>>>> registers data would be parsed correctly. Even if the perf tool is old
>>>>>>> and has no SIMD register support, the SIMD registers data would just be
>>>>>>> silently ignored and should not break the parsing.
>>>>>> That's good to know. I'm confused then why these functions can't just
>>>>>> be within the arch directory? For example, we don't expose the
>>>>>> intel-pt PMU code in the common code except for the parsing parts. A
>>>>>> lot of that is handled by the default perf_event_attr initialization
>>>>>> that every PMU can have its own variant of:
>>>>>> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/pmu.h?h=perf-tools-next#n123
>>>>> I see. From my point of view, there seems to be no essential difference
>>>>> between a function pointer and a __weak function, and it looks hard to
>>>>> find a common data structure to hold all these function pointers, which
>>>>> need to be called in different places, like register name parsing,
>>>>> register data dumping, ...
>>>>>
>>>>>
>>>>>> Perhaps this is all just evidence of tech debt in the perf_regs.c code
>>>>>> :-/ The bit that's relevant to the patch here is that I think this is
>>>>>> adding to the tech debt problem as 11 more functions are added to
>>>>>> perf_regs.h.
>>>>> Yeah, 11 new __weak functions seem like too many. We could merge the same
>>>>> kinds of functions, e.g. merging *_simd_reg_mask() and *_pred_reg_mask()
>>>>> into a single function with a type argument; then the number of newly
>>>>> added __weak functions could shrink by half.
>>>> There could be a good reason for 11 weak functions :-) In the
>>>> perf_event.h you've added to the sample event:
>>>> ```
>>>> +        *        u64                   regs[weight(mask)];
>>>> +        *        struct {
>>>> +        *              u16 nr_vectors;
>>>> +        *              u16 vector_qwords;
>>>> +        *              u16 nr_pred;
>>>> +        *              u16 pred_qwords;
>>>> +        *              u64 data[nr_vectors * vector_qwords + nr_pred
>>>> * pred_qwords];
>>>> +        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
>>>> +        *      } && PERF_SAMPLE_REGS_USER
>>>> ```
>>>> so these things are readable/writable outside of builds with arch/x86
>>>> compiled in, which is why it seems odd that there needs to be arch
>>>> code in the common code to handle them. Similar to how I needed to get
>>>> the retirement latency parsing out of the arch/x86 directory as
>>>> potentially you could be looking at a perf.data file with retirement
>>>> latencies in it on a non-x86 platform.
>>> Ian, I'm not sure if I fully get your point. If not, please correct me.
>>>
>>> Although these newly introduced fields are generic and exist on all
>>> architectures, they are not enough to get all the necessary information to
>>> dump or parse the SIMD registers, e.g., the SIMD register names.
>>>
>>> Let's take dumping the sampled values of SIMD registers as an example.
>>> We know there could be different kinds of SIMD register on different archs,
>>> like XMM/YMM/ZMM on x86 and V-registers/Z-registers on ARM.
>>>
>>> Currently we only know the register number and width from the generic
>>> fields; we have no way to directly know the exact name each SIMD register
>>> corresponds to. We have to involve an arch-specific function to figure
>>> that out and then print them.
>>>
>>> At least for now, it looks we still need these arch-specific functions ...
>> Thanks Dapeng. I started by thinking out loud, so I'm not saying this
>> is something to necessarily fix in the patch series but it probably is
>> something that needs to be fixed.
>>
>> You mention that different archs have different registers and so we
>> need different routines for those archs, implying weak symbols, etc.
>> We do actually have generic register dumping code in get_dwarf_regstr:
>> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/dwarf-regs.c?h=perf-tools-next#n33
>> It takes the dwarf register number, the ELF Ehdr e_machine and for the
>> purposes of csky the e_flags. If you want the e_machine for the perf
>> binary itself (such as in perf record when you don't yet have a
>> perf.data file) there is an EM_HOST value:
>> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/include/dwarf-regs.h?h=perf-tools-next#n27
>> Perf has historically used a CPUID string, but I'd like to deprecate
>> that in favor of just using e_machine (and possibly e_flags) values.
>> We should probably have CPUID string to e_machine conversion utility
>> functions and remove cpuid from the perf_env:
>> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/env.h?h=perf-tools-next#n67
>> but anyway, my point isn't about the e_machine values.
>>
>> What I'm trying to say is that weak symbols and code in arch
>> inherently means the cross platform development will break. For
>> example, before:
>> https://lore.kernel.org/lkml/20250724163302.596743-21-irogers@google.com/
>> perf_parse_sample_weight just simply didn't exist outside of PowerPC
>> and x86. This meant that the part of the perf event in the perf.data
>> containing the sample weights couldn't be parsed on say an ARM64 build
>> of perf. This meant the values couldn't even be dumped in perf script.
>> The values are, however, described in the cross platform perf sample
>> event format, much as the SIMD registers are here.
>>
>> It seems as we have from a perf.data file at least a CPUID string from
>> the header features, a perf_event_attr and the register number, we
>> should be able to do something like get_dwarf_regstr. Such a function
>> wouldn't be in the arch directory as we wouldn't want to interpret
>> registers in events just on x86 platforms (as with the retirement
>> latency). If we're not able to do this then there seems to be
>> something wrong with the SIMD change and perhaps we need to capture
>> more information in the perf.data file header.
> Thanks Ian for your detailed explanation. I understand your point now.
>
> I originally thought there would be no requirement to parse a perf.data
> file on a machine with a totally different arch, but it seems there is,
> as you said.
>
> Then I suppose we need to do the same thing for
> perf_reg_value()/perf_simd_reg_value(), just like perf_reg_name() does,
> but currently the "arch" string comes from the perf_env__arch() helper,
> which is the arch perf is running on instead of the arch that was sampled.
>
> Anyway, I think we can make retiring the __weak functions the 1st step. As
> for replacing cpuid or env->arch with EM_HOST or something else (I'm not
> sure how complex it would be, but I suppose it would not be simple), we'd
> better have an independent patch-set to implement it since it has no
> direct relationship with the current SIMD registers sampling support.

Ian,

I looked at these perf regs __weak helpers again, like
arch__intr_reg_mask()/arch__user_reg_mask(). It could be really hard to
eliminate these __weak helpers and convert them into a generic function
like perf_reg_name(). All these __weak helpers are arch-dependent and
usually need to call the perf_event_open syscall to get the supported
registers mask. So even if we convert them into a generic function, we
still have no way to get the registers mask of a different arch, like
getting the x86 registers mask on an arm machine. Another reason is that
these __weak helpers may contain some arch-specific instructions. If we
convert them into a general perf function like perf_reg_name(), it may
cause build errors since these arch-specific instructions may not exist
on the build machine.
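
For reference, the link-time behavior in question is easy to demonstrate
outside of perf. This is a minimal GCC/Clang sketch (not perf code; the
symbol name is illustrative) of a __weak default that an arch directory
could override with a strong definition:

```c
#include <stdint.h>

/* Minimal illustration (not the real perf helper): a __weak default in
 * common code.  If an arch directory linked in a strong definition of
 * the same symbol, the linker would pick that one instead -- the choice
 * is fixed at build time, which is why a weak helper resolved on the
 * build host cannot describe a foreign architecture's registers. */
uint64_t __attribute__((weak)) arch__intr_simd_reg_mask_demo(void)
{
	return 0;	/* default: no SIMD registers supported */
}
```

With no strong definition linked in, callers see the weak default's
return value of 0.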



>
>
>> Thanks,
>> Ian
>>
>>>> Thanks,
>>>> Ian
>>>>
>>>>>> Thanks,
>>>>>> Ian
>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Ian
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Ian
>>>>>>>>>>
>>>>>>>>>>> +
>>>>>>>>>>>  const char *perf_reg_name(int id, const char *arch)
>>>>>>>>>>>  {
>>>>>>>>>>>         const char *reg_name = NULL;
>>>>>>>>>>> diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
>>>>>>>>>>> index f2d0736d65cc..bce9c4cfd1bf 100644
>>>>>>>>>>> --- a/tools/perf/util/perf_regs.h
>>>>>>>>>>> +++ b/tools/perf/util/perf_regs.h
>>>>>>>>>>> @@ -24,9 +24,20 @@ enum {
>>>>>>>>>>>  };
>>>>>>>>>>>
>>>>>>>>>>>  int arch_sdt_arg_parse_op(char *old_op, char **new_op);
>>>>>>>>>>> +bool arch_has_simd_regs(u64 mask);
>>>>>>>>>>>  uint64_t arch__intr_reg_mask(void);
>>>>>>>>>>>  uint64_t arch__user_reg_mask(void);
>>>>>>>>>>>  const struct sample_reg *arch__sample_reg_masks(void);
>>>>>>>>>>> +const struct sample_reg *arch__sample_simd_reg_masks(void);
>>>>>>>>>>> +const struct sample_reg *arch__sample_pred_reg_masks(void);
>>>>>>>>>>> +uint64_t arch__intr_simd_reg_mask(void);
>>>>>>>>>>> +uint64_t arch__user_simd_reg_mask(void);
>>>>>>>>>>> +uint64_t arch__intr_pred_reg_mask(void);
>>>>>>>>>>> +uint64_t arch__user_pred_reg_mask(void);
>>>>>>>>>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>>>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>>>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>>>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>>>>>
>>>>>>>>>>>  const char *perf_reg_name(int id, const char *arch);
>>>>>>>>>>>  int perf_reg_value(u64 *valp, struct regs_dump *regs, int id);
>>>>>>>>>>> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
>>>>>>>>>>> index ea3a6c4657ee..825ffb4cc53f 100644
>>>>>>>>>>> --- a/tools/perf/util/record.h
>>>>>>>>>>> +++ b/tools/perf/util/record.h
>>>>>>>>>>> @@ -59,7 +59,13 @@ struct record_opts {
>>>>>>>>>>>         unsigned int  user_freq;
>>>>>>>>>>>         u64           branch_stack;
>>>>>>>>>>>         u64           sample_intr_regs;
>>>>>>>>>>> +       u64           sample_intr_vec_regs;
>>>>>>>>>>>         u64           sample_user_regs;
>>>>>>>>>>> +       u64           sample_user_vec_regs;
>>>>>>>>>>> +       u16           sample_pred_regs_qwords;
>>>>>>>>>>> +       u16           sample_vec_regs_qwords;
>>>>>>>>>>> +       u16           sample_intr_pred_regs;
>>>>>>>>>>> +       u16           sample_user_pred_regs;
>>>>>>>>>>>         u64           default_interval;
>>>>>>>>>>>         u64           user_interval;
>>>>>>>>>>>         size_t        auxtrace_snapshot_size;
>>>>>>>>>>> --
>>>>>>>>>>> 2.34.1
>>>>>>>>>>>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
  2026-01-06  7:27                       ` Mi, Dapeng
@ 2026-01-17  5:50                         ` Ian Rogers
  2026-01-19  6:55                           ` Mi, Dapeng
  0 siblings, 1 reply; 86+ messages in thread
From: Ian Rogers @ 2026-01-17  5:50 UTC (permalink / raw)
  To: Mi, Dapeng
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang

On Mon, Jan 5, 2026 at 11:27 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> Ian,
>
> I looked at these perf regs __weak helpers again, like
> arch__intr_reg_mask()/arch__user_reg_mask(). It could be really hard to
> eliminate these __weak helpers and convert them into a generic function
> like perf_reg_name(). All these __weak helpers are arch-dependent and
> usually need to call the perf_event_open syscall to get the supported
> registers mask. So even if we convert them into a generic function, we
> still have no way to get the registers mask of a different arch, like
> getting the x86 registers mask on an arm machine. Another reason is that
> these __weak helpers may contain some arch-specific instructions. If we
> convert them into a general perf function like perf_reg_name(), it may
> cause build errors since these arch-specific instructions may not exist
> on the build machine.

Hi Dapeng,

There was already a patch to better support cross architecture
libdw-unwind-ing and I've just sent out a series to clean this up so
that this is achieved by having mapping functions between perf and
dwarf register names. The functions use the e_machine of the binary to
determine how to map, etc. The series is here:
https://lore.kernel.org/lkml/20260117052849.2205545-1-irogers@google.com/
and I think it can be the foundation for avoiding the weak functions.

I also noticed that I think we're sampling the XMM registers for dwarf
unwinding, but it seems unlikely the XMM registers will hold stack
frame information - so this is probably an x86 inefficiency.

Thanks,
Ian

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
  2026-01-17  5:50                         ` Ian Rogers
@ 2026-01-19  6:55                           ` Mi, Dapeng
  2026-01-19 20:25                             ` Ian Rogers
  0 siblings, 1 reply; 86+ messages in thread
From: Mi, Dapeng @ 2026-01-19  6:55 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang


On 1/17/2026 1:50 PM, Ian Rogers wrote:
> On Mon, Jan 5, 2026 at 11:27 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>> Ian,
>>
>> I looked at these perf regs __weak helpers again, like
>> arch__intr_reg_mask()/arch__user_reg_mask(). It could be really hard to
>> eliminate these __weak helpers and convert them into a generic function
>> like perf_reg_name(). All these __weak helpers are arch-dependent and
>> usually need to call the perf_event_open syscall to get the supported
>> registers mask. So even if we convert them into a generic function, we
>> still have no way to get the registers mask of a different arch, like
>> getting the x86 registers mask on an arm machine. Another reason is that
>> these __weak helpers may contain some arch-specific instructions. If we
>> convert them into a general perf function like perf_reg_name(), it may
>> cause build errors since these arch-specific instructions may not exist
>> on the build machine.
> Hi Dapeng,
>
> There was already a patch to better support cross architecture
> libdw-unwind-ing and I've just sent out a series to clean this up so
> that this is achieved by having mapping functions between perf and
> dwarf register names. The functions use the e_machine of the binary to
> determine how to map, etc. The series is here:
> https://lore.kernel.org/lkml/20260117052849.2205545-1-irogers@google.com/
> and I think it can be the foundation for avoiding the weak functions.

Hi Ian,

Thanks for the reference patch, but they are different. The reference
patches mainly parse the regs from perf.data, and there the __weak
functions can be eliminated in the parsing phase since the registers
bitmap is fixed for a given arch. These __weak functions,
arch__intr_reg_mask()/arch__user_reg_mask(), are instead used to obtain
the supported sampling registers on a specific platform.

We know different platforms even of the same arch may support different
registers, e.g., some x86 platforms may only support XMM registers while
others may support XMM/YMM/ZMM registers, so all these arch-specific
arch__intr_reg_mask()/arch__user_reg_mask() functions have to depend on
the perf_event_open() syscall to retrieve the supported registers mask
from the kernel.

Thus, it is impossible to retrieve the supported registers mask of a
specific x86 platform while running on an arm platform.

Even if we don't consider this limitation and forcibly convert the __weak
arch__intr_reg_mask() function into something like the function below,
just as perf_reg_name() currently does:

uint64_t perf_intr_reg_mask(const char *arch)
{
    uint64_t mask = 0;

    if (!strcmp(arch, "csky"))
        mask = perf_intr_reg_mask_csky();
    else if (!strcmp(arch, "loongarch"))
        mask = perf_intr_reg_mask_loongarch();
    else if (!strcmp(arch, "mips"))
        mask = perf_intr_reg_mask_mips();
    else if (!strcmp(arch, "powerpc"))
        mask = perf_intr_reg_mask_powerpc();
    else if (!strcmp(arch, "riscv"))
        mask = perf_intr_reg_mask_riscv();
    else if (!strcmp(arch, "s390"))
        mask = perf_intr_reg_mask_s390();
    else if (!strcmp(arch, "x86"))
        mask = perf_intr_reg_mask_x86();
    else if (!strcmp(arch, "arm"))
        mask = perf_intr_reg_mask_arm();
    else if (!strcmp(arch, "arm64"))
        mask = perf_intr_reg_mask_arm64();

    return mask;
}

But currently there are some arch-dependent instructions in these
arch-specific functions, like the code below in the powerpc-specific
arch__intr_reg_mask():

    version = (((mfspr(SPRN_PVR)) >>  16) & 0xFFFF);

mfspr is a powerpc-specific instruction; building this converted
perf_intr_reg_mask() on a non-powerpc platform would lead to a build error.
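
One possible way around this, sketched below with made-up names
(probe_host_intr_reg_mask(), the mask values, and the DEMO_EM_* constants
are all illustrative, not the real perf code), is to let only the running
machine probe the kernel, and to take a mask recorded in the perf.data
header for any foreign e_machine, so nothing arch-specific needs to
execute on a cross build:

```c
#include <stdint.h>

#define DEMO_EM_HOST  0		/* stand-in; perf defines its own EM_HOST */
#define DEMO_EM_PPC64 21	/* any non-host machine value for the example */

/* Placeholder for a perf_event_open()-based probe of the running kernel. */
static uint64_t probe_host_intr_reg_mask(void)
{
	return 0x3f;	/* illustrative mask */
}

/* Only the host may execute host-specific probing (or instructions such
 * as powerpc's mfspr); for any other machine the mask must come from
 * data recorded in the perf.data file, so this compiles on every arch. */
static uint64_t intr_reg_mask(int e_machine, uint64_t recorded_mask)
{
	if (e_machine == DEMO_EM_HOST)
		return probe_host_intr_reg_mask();
	return recorded_mask;	/* cross-arch: trust the file, not the host */
}
```

Whether enough information is recorded in the perf.data header to make
the second path work is exactly the open question in this thread.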

-Dapeng Mi

>
> I also noticed that I think we're sampling the XMM registers for dwarf
> unwinding, but it seems unlikely the XMM registers will hold stack
> frame information - so this is probably an x86 inefficiency.
>
> Thanks,
> Ian
>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
  2026-01-19  6:55                           ` Mi, Dapeng
@ 2026-01-19 20:25                             ` Ian Rogers
  2026-01-20  3:04                               ` Mi, Dapeng
  0 siblings, 1 reply; 86+ messages in thread
From: Ian Rogers @ 2026-01-19 20:25 UTC (permalink / raw)
  To: Mi, Dapeng
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang

On Sun, Jan 18, 2026 at 10:55 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>
>
> On 1/17/2026 1:50 PM, Ian Rogers wrote:
> > On Mon, Jan 5, 2026 at 11:27 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> >> Ian,
> >>
> >> I looked at these perf regs __weak helpers again, like
> >> arch__intr_reg_mask()/arch__user_reg_mask(). It could be really hard to
> >> eliminate these __weak helpers and convert them into a generic function
> >> like perf_reg_name(). All these __weak helpers are arch-dependent and
> >> usually need to call the perf_event_open syscall to get the supported
> >> registers mask. So even if we convert them into a generic function, we
> >> still have no way to get the registers mask of a different arch, like
> >> getting the x86 registers mask on an arm machine. Another reason is that
> >> these __weak helpers may contain some arch-specific instructions. If we
> >> convert them into a general perf function like perf_reg_name(), it may
> >> cause build errors since these arch-specific instructions may not exist
> >> on the build machine.
> > Hi Dapeng,
> >
> > There was already a patch to better support cross architecture
> > libdw-unwind-ing and I've just sent out a series to clean this up so
> > that this is achieved by having mapping functions between perf and
> > dwarf register names. The functions use the e_machine of the binary to
> > determine how to map, etc. The series is here:
> > https://lore.kernel.org/lkml/20260117052849.2205545-1-irogers@google.com/
> > and I think it can be the foundation for avoiding the weak functions.
>
> Hi Ian,
>
> Thanks for the reference patch, but they are different. The reference
> patches mainly parse the regs from perf.data, and there the __weak
> functions can be eliminated in the parsing phase since the registers
> bitmap is fixed for a given arch. These __weak functions,
> arch__intr_reg_mask()/arch__user_reg_mask(), are instead used to obtain
> the supported sampling registers on a specific platform.
>
> We know different platforms even of the same arch may support different
> registers, e.g., some x86 platforms may only support XMM registers while
> others may support XMM/YMM/ZMM registers, so all these arch-specific
> arch__intr_reg_mask()/arch__user_reg_mask() functions have to depend on
> the perf_event_open() syscall to retrieve the supported registers mask
> from the kernel.
>
> Thus, it is impossible to retrieve the supported registers mask of a
> specific x86 platform while running on an arm platform.
>
> Even if we don't consider this limitation and forcibly convert the __weak
> arch__intr_reg_mask() function into something like the function below,
> just as perf_reg_name() currently does:
>
> uint64_t perf_intr_reg_mask(const char *arch)
> {
>     uint64_t mask = 0;
>
>     if (!strcmp(arch, "csky"))
>         mask = perf_intr_reg_mask_csky();
>     else if (!strcmp(arch, "loongarch"))
>         mask = perf_intr_reg_mask_loongarch();
>     else if (!strcmp(arch, "mips"))
>         mask = perf_intr_reg_mask_mips();
>     else if (!strcmp(arch, "powerpc"))
>         mask = perf_intr_reg_mask_powerpc();
>     else if (!strcmp(arch, "riscv"))
>         mask = perf_intr_reg_mask_riscv();
>     else if (!strcmp(arch, "s390"))
>         mask = perf_intr_reg_mask_s390();
>     else if (!strcmp(arch, "x86"))
>         mask = perf_intr_reg_mask_x86();
>     else if (!strcmp(arch, "arm"))
>         mask = perf_intr_reg_mask_arm();
>     else if (!strcmp(arch, "arm64"))
>         mask = perf_intr_reg_mask_arm64();
>
>     return mask;
> }
>
> But currently there are some arch-dependent instructions in these
> arch-specific functions, like the code below in the powerpc-specific
> arch__intr_reg_mask():
>
>     version = (((mfspr(SPRN_PVR)) >>  16) & 0xFFFF);
>
> mfspr is a powerpc-specific instruction; building this converted
> perf_intr_reg_mask() on a non-powerpc platform would lead to a build error.

Hi Dapeng,

So my main point is about the arch directory and ifdefs: how do they
differ from writing code that uses the ELF e_machine? For example, your
code uses the arch/x86 directory and has ifdefs on
HAVE_ARCH_X86_64_SUPPORT. How is that different from:
```
switch(e_machine) {
case EM_X86_64:
...
case EM_I386:
...
default:
return 0;
}
```
If we need to determine for the current running machine then e_machine
can equal EM_HOST that is set up for this purpose.

I agree that determining features needs calls that may not be
supported on other architectures. That should yield EOPNOTSUPP and we
can use information like that to populate generic information like the
PMU missing features:
https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/pmu.h?h=perf-tools-next#n190
we also probe API support with:
https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/perf_api_probe.h?h=perf-tools-next

The current code doing lots of string comparisons is unnecessary
overhead and imprecise (x86 is used for both 32-bit and 64-bit x86).
It is removed in the series I linked to; I think we can eventually get
rid of the whole arch string for similar reasons of trying to minimize
the use of the arch directory. I'm curious what happens with APX: will
the e_machine change? We may need to pass in the sample regs_dump's
abi field for cases like this.

My point on the unwinding is that the sample register mask appears to
be set up the same regardless, whereas for stack samples
(--call-graph=dwarf) maybe just sample IP and SP suffices. So perhaps
there should be additional registers to set up the sample mask.

By avoiding the arch functions we can avoid the problem of broken
cross-architecture support, and we can also lay the groundwork for
support on different architectures that may want to do similar things.
I agree that doesn't matter until >1 architecture is trying to have
more register masks; my concern is trying to keep the code generic and
to make sure cross-architecture use keeps working. New weak functions
go in the opposite direction to that.
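
To make the earlier switch sketch concrete, here is a compilable variant
(illustrative only: the mask values are placeholders, not real per-arch
register masks, and <elf.h> spells the 32-bit constant EM_386):

```c
#include <elf.h>
#include <stdint.h>

/* Illustrative only: dispatch on the perf.data file's ELF e_machine
 * rather than on the build host's arch directory.  The returned masks
 * are placeholders, not the real per-arch SIMD register masks. */
static uint64_t simd_reg_mask_for_machine(uint16_t e_machine)
{
	switch (e_machine) {
	case EM_X86_64:
		return 0xffffULL;	/* placeholder, e.g. XMM0-XMM15 */
	case EM_386:
		return 0x00ffULL;	/* placeholder */
	default:
		return 0;		/* unknown arch: no SIMD registers */
	}
}
```

A function like this builds and runs on any host, which is the point:
the foreign-arch cases simply return data rather than executing
arch-specific instructions.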

Thanks,
Ian

> -Dapeng Mi
>
> >
> > I also noticed that I think we're sampling the XMM registers for dwarf
> > unwinding, but it seems unlikely the XMM registers will hold stack
> > frame information - so this is probably an x86 inefficiency.
> >
> > Thanks,
> > Ian
> >

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
  2026-01-19 20:25                             ` Ian Rogers
@ 2026-01-20  3:04                               ` Mi, Dapeng
  2026-01-20  5:16                                 ` Ian Rogers
  0 siblings, 1 reply; 86+ messages in thread
From: Mi, Dapeng @ 2026-01-20  3:04 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang


On 1/20/2026 4:25 AM, Ian Rogers wrote:
> On Sun, Jan 18, 2026 at 10:55 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>
>> On 1/17/2026 1:50 PM, Ian Rogers wrote:
>>> On Mon, Jan 5, 2026 at 11:27 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>>> Ian,
>>>>
>>>> I looked at these perf regs __weak helpers again, like
>>>> arch__intr_reg_mask()/arch__user_reg_mask(). It could be really hard to
>>>> eliminate these __weak helpers and convert them into a generic function
>>>> like perf_reg_name(). All these __weak helpers are arch-dependent and
>>>> usually need to call the perf_event_open syscall to get the supported
>>>> registers mask. So even if we convert them into a generic function, we
>>>> still have no way to get the registers mask of a different arch, like
>>>> getting the x86 registers mask on an arm machine. Another reason is that
>>>> these __weak helpers may contain some arch-specific instructions. If we
>>>> convert them into a general perf function like perf_reg_name(), it may
>>>> cause build errors since these arch-specific instructions may not exist
>>>> on the build machine.
>>> Hi Dapeng,
>>>
>>> There was already a patch to better support cross architecture
>>> libdw-unwind-ing and I've just sent out a series to clean this up so
>>> that this is achieved by having mapping functions between perf and
>>> dwarf register names. The functions use the e_machine of the binary to
>>> determine how to map, etc. The series is here:
>>> https://lore.kernel.org/lkml/20260117052849.2205545-1-irogers@google.com/
>>> and I think it can be the foundation for avoiding the weak functions.
>> Hi Ian,
>>
>> Thanks for the reference patch. But they are different. The reference
>> patches mainly parse the regs from perf.data and the __weak functions can
>> be eliminated in the parsing phase since the registers bitmap is fixed for
>> a fixed arch. While these __weak functions
>> arch__intr_reg_mask()/arch__user_reg_mask() are used to obtain the support
>> sampling registers on a specific platform.
>>
>> We know different platforms even for same arch may support different
>> registers, e.g., some x86 platforms may only support XMM registers, but
>> some others may support XMM/YMM/ZMM registers, then all these arch-specific
>> arch__intr_reg_mask()/arch__user_reg_mask() functions have to depend on the
>> perf_event_open() syscall to retrieve the supported registers mask from kernel.
>>
>> Thus, it becomes impossible to retrieve the supported registers mask for a
>> x86 specific platform from running on a arm platform.
>>
>> Even we don't consider this limitation and forcibly convert the
>> __weak arch__intr_reg_mask() function to some kind of below function, just
>> like currently what perf_reg_name() does.
>>
>> uint64_t perf_intr_reg_mask(const char *arch)
>> {
>>     uint64_t mask = 0;
>>
>>     if (!strcmp(arch, "csky"))
>>         mask = perf_intr_reg_mask_csky(id);
>>     else if (!strcmp(arch, "loongarch"))
>>         mask = perf_intr_reg_mask_loongarch(id);
>>     else if (!strcmp(arch, "mips"))
>>         mask = perf_intr_reg_mask_mips(id);
>>     else if (!strcmp(arch, "powerpc"))
>>         mask = perf_intr_reg_mask_powerpc(id);
>>     else if (!strcmp(arch, "riscv"))
>>         mask = perf_intr_reg_mask_riscv(id);
>>     else if (!strcmp(arch, "s390"))
>>         mask = perf_intr_reg_mask_s390(id);
>>     else if (!strcmp(arch, "x86"))
>>         mask = perf_intr_reg_mask_x86(id);
>>     else if (!strcmp(arch, "arm"))
>>         mask = perf_intr_reg_mask_arm(id);
>>     else if (!strcmp(arch, "arm64"))
>>         mask = perf_intr_reg_mask_arm64(id);
>>
>>     return mask;
>> }
>>
>> But currently there are some arch-dependent instructions in these
>> arch-specific instructions, like the below code in powerpc specific
>> arch__intr_reg_mask().
>>
>>     version = (((mfspr(SPRN_PVR)) >>  16) & 0xFFFF);
>>
>> mfspr is a powerpc specific instruction, building this converted
>> perf_intr_reg_mask on non-powerpc platform would lead to building error.
> Hi Dapeng,
>
> So my main point is the arch directory and ifdefs, how do they differ
> from writing code that uses the ELF machine? For example, your code
> uses the arch/x86 directory and has ifdefs on
> HAVE_ARCH_X86_64_SUPPORT. How is that different from:
> ```
> switch(e_machine) {
> case EM_X86_64:
> ...
> case EM_I386:
> ...
> default:
> return 0;
> }
> ```
> If we need to determine for the current running machine then e_machine
> can equal EM_HOST that is set up for this purpose.

I think the key factor that determines whether we can convert the code into
the above e_machine switch ... case format is whether the code is
architecture-dependent in both the build and execution phases.

If the code is not architecture-dependent, it's fine to convert it into the
e_machine switch ... case, and that would give broader applicability.

Otherwise, the architecture-dependent code would either fail to build
(build phase) or produce incorrect results (execution phase).

Even if we introduce an EM_HOST case, it won't really solve the build
error; instead it may introduce a new build error, e.g.,

```
switch(e_machine) {
case EM_HOST:
...
case EM_X86_64:
...
case EM_I386:
...
default:
return 0;
}
```

Assume the code is built on an x86_64 machine; then EM_HOST equals
EM_X86_64, which would cause a "duplicate case value" build error.

If we want to ensure the architecture-dependent code is built only on the
correct architecture, then we still have to introduce the architecture
#ifdefs. That is actually no different from the current arch directory
__weak functions, and it makes things more complex.
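
The fallback #ifdef described above can be sketched as below. This is an
illustration only: the 0x3f mask is a placeholder, and the comment stands
in for the real mfspr(SPRN_PVR) read that only assembles on powerpc.

```c
#include <stdint.h>

/* Sketch: arch-specific code still needs an #ifdef so that the
 * powerpc-only instruction is compiled only on a powerpc host. */
static uint64_t intr_reg_mask_powerpc(void)
{
#ifdef __powerpc__
	/* version = (mfspr(SPRN_PVR) >> 16) & 0xFFFF; -- powerpc only */
	return 0x3f;	/* placeholder register mask */
#else
	return 0;	/* compiled on other hosts, never a valid mask */
#endif
}
```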


>
> I agree that determining features needs calls that may not be
> supported on other architectures. That should yield EOPNOTSUPP and we
> can use information like that to populate generic information like the
> PMU missing features:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/pmu.h?h=perf-tools-next#n190
> we also probe API support with:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/perf_api_probe.h?h=perf-tools-next

In general, I agree we can return EOPNOTSUPP or some generic information
for architecture-independent code. But that's not applicable to these two
specific arch__intr_reg_mask()/arch__user_reg_mask() functions: the
current perf code depends on them to return the supported registers mask
on the specific (running) platform.


>
> The current code doing lots of string comparisons is unnecessary
> overhead and imprecise (x86 is used for both 32-bit and 64-bit x86).
> It is removed in the series I linked to, I think we can eventually get
> rid of the whole arch string for similar reasons of trying to minimize
> the use of the arch directory. I'm curious what happens with APX, will
> the e_machine change? We may need to pass in the sample regs_dump's
> abi field for cases like this.

Yes, I agree we should get rid of the arch-string comparison and minimize
the use of the arch directory. It would improve efficiency.

I don't think APX support would change the e_machine; it should still be
EM_X86_64.

Yes, we need the abi field (specifically PERF_SAMPLE_REGS_ABI_SIMD) to
determine whether it's APX or legacy XMM.


>
> My point on the unwinding is that the sample register mask appears to
> be set up the same regardless, whereas for stack samples
> (--call-graph=dwarf) maybe just sample IP and SP suffices. So perhaps
> there should be additional registers to set up the sample mask.

Yes, that's true. It can be further optimized.


>
> By avoiding the arch functions we can avoid the problem of broken
> cross architecture support, we can also lay the groundwork for support
> on different architectures that may want to do similar things. I agree
> that doesn't matter until >1 architecture is trying to have more
> register masks, my concern is trying to keep the code generic and
> trying to make sure cross architecture is working. New weak functions
> is going in the opposite direction to that.

Yes, I agree we should get rid of these arch functions as much as possible.
But for architecture-dependent code like the above, it seems the __weak
functions are still the simplest and best way to handle it.

Thanks.

>
> Thanks,
> Ian
>
>> -Dapeng Mi
>>
>>> I also noticed that I think we're sampling the XMM registers for dwarf
>>> unwinding, but it seems unlikely the XMM registers will hold stack
>>> frame information - so this is probably an x86 inefficiency.
>>>
>>> Thanks,
>>> Ian
>>>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
  2026-01-20  3:04                               ` Mi, Dapeng
@ 2026-01-20  5:16                                 ` Ian Rogers
  2026-01-20  6:46                                   ` Mi, Dapeng
  0 siblings, 1 reply; 86+ messages in thread
From: Ian Rogers @ 2026-01-20  5:16 UTC (permalink / raw)
  To: Mi, Dapeng
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang

On Mon, Jan 19, 2026 at 7:05 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>
>
> On 1/20/2026 4:25 AM, Ian Rogers wrote:
> > On Sun, Jan 18, 2026 at 10:55 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> >>
> >> On 1/17/2026 1:50 PM, Ian Rogers wrote:
> >>> On Mon, Jan 5, 2026 at 11:27 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> >>>> Ian,
> >>>>
> >>>> I looked at these perf regs __weak helpers again, like
> >>>> arch__intr_reg_mask()/arch__user_reg_mask(). It could be really hard to
> >>>> eliminate these __weak helpers and convert them into a generic function
> >>>> like perf_reg_name(). All these __weak helpers are arch-dependent and
> >>>> usually need to call perf_event_open sysctrl to get the required registers
> >>>> mask. So even we convert them into a generic function, we still have no way
> >>>> to get the registers mask of a different arch, like get x86 registers mask
> >>>> on arm machine. Another reason is that these __weak helpers may contain
> >>>> some arch-specific instructions. If we want to convert them into a general
> >>>> perf function like perf_reg_name(). It may cause building error since these
> >>>> arch-specific instructions may not exist on the building machine.
> >>> Hi Dapeng,
> >>>
> >>> There was already a patch to better support cross architecture
> >>> libdw-unwind-ing and I've just sent out a series to clean this up so
> >>> that this is achieved by having mapping functions between perf and
> >>> dwarf register names. The functions use the e_machine of the binary to
> >>> determine how to map, etc. The series is here:
> >>> https://lore.kernel.org/lkml/20260117052849.2205545-1-irogers@google.com/
> >>> and I think it can be the foundation for avoiding the weak functions.
> >> Hi Ian,
> >>
> >> Thanks for the reference patch. But they are different. The reference
> >> patches mainly parse the regs from perf.data and the __weak functions can
> >> be eliminated in the parsing phase since the registers bitmap is fixed for
> >> a fixed arch. While these __weak functions
> >> arch__intr_reg_mask()/arch__user_reg_mask() are used to obtain the support
> >> sampling registers on a specific platform.
> >>
> >> We know different platforms even for same arch may support different
> >> registers, e.g., some x86 platforms may only support XMM registers, but
> >> some others may support XMM/YMM/ZMM registers, then all these arch-specific
> >> arch__intr_reg_mask()/arch__user_reg_mask() functions have to depend on the
> >> perf_event_open() syscall to retrieve the supported registers mask from kernel.
> >>
> >> Thus, it becomes impossible to retrieve the supported registers mask for a
> >> x86 specific platform from running on a arm platform.
> >>
> >> Even we don't consider this limitation and forcibly convert the
> >> __weak arch__intr_reg_mask() function to some kind of below function, just
> >> like currently what perf_reg_name() does.
> >>
> >> uint64_t perf_intr_reg_mask(const char *arch)
> >> {
> >>     uint64_t mask = 0;
> >>
> >>     if (!strcmp(arch, "csky"))
> >>         mask = perf_intr_reg_mask_csky(id);
> >>     else if (!strcmp(arch, "loongarch"))
> >>         mask = perf_intr_reg_mask_loongarch(id);
> >>     else if (!strcmp(arch, "mips"))
> >>         mask = perf_intr_reg_mask_mips(id);
> >>     else if (!strcmp(arch, "powerpc"))
> >>         mask = perf_intr_reg_mask_powerpc(id);
> >>     else if (!strcmp(arch, "riscv"))
> >>         mask = perf_intr_reg_mask_riscv(id);
> >>     else if (!strcmp(arch, "s390"))
> >>         mask = perf_intr_reg_mask_s390(id);
> >>     else if (!strcmp(arch, "x86"))
> >>         mask = perf_intr_reg_mask_x86(id);
> >>     else if (!strcmp(arch, "arm"))
> >>         mask = perf_intr_reg_mask_arm(id);
> >>     else if (!strcmp(arch, "arm64"))
> >>         mask = perf_intr_reg_mask_arm64(id);
> >>
> >>     return mask;
> >> }
> >>
> >> But currently there are some arch-dependent instructions in these
> >> arch-specific instructions, like the below code in powerpc specific
> >> arch__intr_reg_mask().
> >>
> >>     version = (((mfspr(SPRN_PVR)) >>  16) & 0xFFFF);
> >>
> >> mfspr is a powerpc specific instruction, building this converted
> >> perf_intr_reg_mask on non-powerpc platform would lead to building error.
> > Hi Dapeng,
> >
> > So my main point is the arch directory and ifdefs, how do they differ
> > from writing code that uses the ELF machine? For example, your code
> > uses the arch/x86 directory and has ifdefs on
> > HAVE_ARCH_X86_64_SUPPORT. How is that different from:
> > ```
> > switch(e_machine) {
> > case EM_X86_64:
> > ...
> > case EM_I386:
> > ...
> > default:
> > return 0;
> > }
> > ```
> > If we need to determine for the current running machine then e_machine
> > can equal EM_HOST that is set up for this purpose.
>
> I think the key factor that determines if we can convert the code into
> above e_machine switch ... case format is whether the code is
> architecture-dependent both in building and execution phases.
>
> If the code is not architecture-dependent, It's good to covert the code
> into the e_machine switch ... case and that would provide better applicability.
>
> Otherwise, the architecture-dependent code would lead to the building error
> (building phase) or get incorrect execution results (execution phase).
>
> Even if we introduce EM_HOST case, it won't really solve the building
> error,  instead it may introduce new building error, e.g.,
>
> ```
> switch(e_machine) {
> case EM_HOST:
> ...
> case EM_X86_64:
> ...
> case EM_I386:
> ...
> default:
> return 0;
> }
> ```

No, you wouldn't ever put EM_HOST as a case statement. It is the value
of the ELF machine you are compiling upon, so either EM_X86_64 or
EM_I386 here. You would write `e_machine = EM_HOST` were you to want
something that is specific to the host you are compiling upon, i.e.
`if (EM_HOST == EM_X86_64 || EM_HOST == EM_I386) { ... }` would be
equivalent to `#ifdef __x86_64__`, or equivalent to putting code into
the arch/x86 directory.
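
The EM_HOST pattern described above can be sketched as below. This is an
illustration: glibc's <elf.h> spells the 32-bit constant EM_386, and the
EM_HOST definition here is a stand-in for whatever the build system would
actually generate.

```c
#include <elf.h>	/* EM_X86_64 (62), EM_386 (3), EM_NONE (0) */

/* Stand-in for a generated EM_HOST: the ELF machine of the host the
 * binary is being compiled on. */
#ifdef __x86_64__
#define EM_HOST EM_X86_64
#elif defined(__i386__)
#define EM_HOST EM_386
#else
#define EM_HOST EM_NONE
#endif

/* Host-only behaviour written as an ordinary comparison the compiler
 * folds away -- equivalent to '#ifdef __x86_64__' or to placing the
 * code under arch/x86. */
static int host_is_x86(void)
{
	return EM_HOST == EM_X86_64 || EM_HOST == EM_386;
}
```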

> Assume the code is built on a x86_64 machine, then EM_HOST equals
> EM_X86_64, that would cause the "duplicate case value" building error.
>
> If we want to limit the architecture-dependent code is built only on the
> correct architecture, then we still have to introduce the architecture
> #ifdefs. This is actually no difference with current arch directory __weak
> functions and make it more complex.

If we have arch functions for arch__user_simd_reg_mask, then why would
code go through the usual means to determine what the mask is? The
normal means is to get a sample event, use evsel__parse_sample, and
then from the sample access struct regs_dump, which keeps within it a
copy of the mask from the perf_event_attr associated with the evsel.

In your next patch you do:
https://lore.kernel.org/lkml/20251203065500.2597594-20-dapeng1.mi@linux.intel.com/
```
...
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
...
+static void simd_regs_dump__printf(struct regs_dump *regs, bool intr)
+{
...
+ if (intr)
+ arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
+ else
+ arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
...
 ```
The session code is generic; it may be dealing with live machine data
or with a perf.data file from anywhere. The reason this patch is
exposing these weak functions is for the next patch; there's no use
here. But why isn't the next patch using struct regs_dump? The struct
regs_dump was set up with the sample event and perf_event_attr on hand.
The evsel__parse_sample logic should likely set up a qwords variable
so the generic code can just do:
```
qwords = regs->qwords;
```
The parsing logic should be able to do "nr_vectors * vector_qwords +
nr_pred * pred_qwords" from the perf_event_attr, no?
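
The attr-only derivation suggested above can be sketched roughly as below;
the struct and field names are illustrative stand-ins for the SIMD fields
this series adds to perf_event_attr.

```c
#include <stdint.h>

/* Hypothetical subset of the SIMD sampling fields; the names are
 * illustrative, not the actual perf_event_attr layout. */
struct simd_sample_attr {
	uint16_t nr_vectors;	/* sampled vector registers         */
	uint16_t vector_qwords;	/* 64-bit words per vector register */
	uint16_t nr_pred;	/* sampled predicate registers      */
	uint16_t pred_qwords;	/* 64-bit words per predicate reg   */
};

/* Total SIMD dump size in qwords, computed from the attr alone, so
 * generic code needs no arch__*() callback. */
static uint32_t simd_dump_qwords(const struct simd_sample_attr *a)
{
	return (uint32_t)a->nr_vectors * a->vector_qwords +
	       (uint32_t)a->nr_pred * a->pred_qwords;
}
```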

> >
> > I agree that determining features needs calls that may not be
> > supported on other architectures. That should yield EOPNOTSUPP and we
> > can use information like that to populate generic information like the
> > PMU missing features:
> > https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/pmu.h?h=perf-tools-next#n190
> > we also probe API support with:
> > https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/perf_api_probe.h?h=perf-tools-next
>
> In general, I agree we can return EOPNOTSUPP or some generic information
> for some architecture independent code. But it's not applicable for these 2
> specific arch__intr_reg_mask()/arch__user_reg_mask() functions, current
> perf code depends on these 2 functions to return the supported registers
> mask on a specific (running) platform.

You can't put code into generic code like session and assume the
perf.data is for the running machine, so the design is wrong.

> >
> > The current code doing lots of string comparisons is unnecessary
> > overhead and imprecise (x86 is used for both 32-bit and 64-bit x86).
> > It is removed in the series I linked to, I think we can eventually get
> > rid of the whole arch string for similar reasons of trying to minimize
> > the use of the arch directory. I'm curious what happens with APX, will
> > the e_machine change? We may need to pass in the sample regs_dump's
> > abi field for cases like this.
>
> Yes, I agree we should git rid of the arch-string comparison and minimize
> the use of arch directory. It would improve the efficiency.
>
> I don't think the support of APX would change the e_machine, it should
> still be EM_X86_64.
>
> Yes, we need the abi filed (exactly PERF_SAMPLE_REGS_ABI_SIMD) to determine
> it's APX or legacy XMM.

Right, in my (unmerged) code to map a perf register to a dwarf register:
https://lore.kernel.org/lkml/20260117052849.2205545-13-irogers@google.com/
we'll need the abi field.

> >
> > My point on the unwinding is that the sample register mask appears to
> > be set up the same regardless, whereas for stack samples
> > (--call-graph=dwarf) maybe just sample IP and SP suffices. So perhaps
> > there should be additional registers to set up the sample mask.
>
> Yes, that's true. It can be further optimized.
>
>
> >
> > By avoiding the arch functions we can avoid the problem of broken
> > cross architecture support, we can also lay the groundwork for support
> > on different architectures that may want to do similar things. I agree
> > that doesn't matter until >1 architecture is trying to have more
> > register masks, my concern is trying to keep the code generic and
> > trying to make sure cross architecture is working. New weak functions
> > is going in the opposite direction to that.
>
> Yes, I agree we should git rid of these arch functions as much as possible.
> But for these architecture dependent code (as above shows), it seems the
> __weak functions are still the simplest and best way to handle them.

So I don't think we should be putting functions that assume the
running machine into generic code like session. The arch functions
create a shortcut that avoids looking at the perf_event_attr,
differences between EM_I386 and EM_X86_64, etc. I'm not sure simpler
matters here; the code is just incorrect relative to how things are
being done around it. How do I grab registers on an APX-capable
machine and then dump them on my non-APX laptop? How do the arch
functions account for the differences between EM_I386 and EM_X86_64,
both of which process types may be running on the machine at the same
time, with samples showing up in system-wide mode? Having the arch
functions lets things be done wrong, and the patch series shows that
in the very next patch.

Thanks,
Ian

> Thanks.
>
> >
> > Thanks,
> > Ian
> >
> >> -Dapeng Mi
> >>
> >>> I also noticed that I think we're sampling the XMM registers for dwarf
> >>> unwinding, but it seems unlikely the XMM registers will hold stack
> >>> frame information - so this is probably an x86 inefficiency.
> >>>
> >>> Thanks,
> >>> Ian
> >>>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
  2026-01-20  5:16                                 ` Ian Rogers
@ 2026-01-20  6:46                                   ` Mi, Dapeng
  2026-01-20  6:56                                     ` Ian Rogers
  0 siblings, 1 reply; 86+ messages in thread
From: Mi, Dapeng @ 2026-01-20  6:46 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang


On 1/20/2026 1:16 PM, Ian Rogers wrote:
> On Mon, Jan 19, 2026 at 7:05 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>
>> On 1/20/2026 4:25 AM, Ian Rogers wrote:
>>> On Sun, Jan 18, 2026 at 10:55 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>>> On 1/17/2026 1:50 PM, Ian Rogers wrote:
>>>>> On Mon, Jan 5, 2026 at 11:27 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>>>>> Ian,
>>>>>>
>>>>>> I looked at these perf regs __weak helpers again, like
>>>>>> arch__intr_reg_mask()/arch__user_reg_mask(). It could be really hard to
>>>>>> eliminate these __weak helpers and convert them into a generic function
>>>>>> like perf_reg_name(). All these __weak helpers are arch-dependent and
>>>>>> usually need to call perf_event_open sysctrl to get the required registers
>>>>>> mask. So even we convert them into a generic function, we still have no way
>>>>>> to get the registers mask of a different arch, like get x86 registers mask
>>>>>> on arm machine. Another reason is that these __weak helpers may contain
>>>>>> some arch-specific instructions. If we want to convert them into a general
>>>>>> perf function like perf_reg_name(). It may cause building error since these
>>>>>> arch-specific instructions may not exist on the building machine.
>>>>> Hi Dapeng,
>>>>>
>>>>> There was already a patch to better support cross architecture
>>>>> libdw-unwind-ing and I've just sent out a series to clean this up so
>>>>> that this is achieved by having mapping functions between perf and
>>>>> dwarf register names. The functions use the e_machine of the binary to
>>>>> determine how to map, etc. The series is here:
>>>>> https://lore.kernel.org/lkml/20260117052849.2205545-1-irogers@google.com/
>>>>> and I think it can be the foundation for avoiding the weak functions.
>>>> Hi Ian,
>>>>
>>>> Thanks for the reference patch. But they are different. The reference
>>>> patches mainly parse the regs from perf.data and the __weak functions can
>>>> be eliminated in the parsing phase since the registers bitmap is fixed for
>>>> a fixed arch. While these __weak functions
>>>> arch__intr_reg_mask()/arch__user_reg_mask() are used to obtain the support
>>>> sampling registers on a specific platform.
>>>>
>>>> We know different platforms even for same arch may support different
>>>> registers, e.g., some x86 platforms may only support XMM registers, but
>>>> some others may support XMM/YMM/ZMM registers, then all these arch-specific
>>>> arch__intr_reg_mask()/arch__user_reg_mask() functions have to depend on the
>>>> perf_event_open() syscall to retrieve the supported registers mask from kernel.
>>>>
>>>> Thus, it becomes impossible to retrieve the supported registers mask for a
>>>> x86 specific platform from running on a arm platform.
>>>>
>>>> Even we don't consider this limitation and forcibly convert the
>>>> __weak arch__intr_reg_mask() function to some kind of below function, just
>>>> like currently what perf_reg_name() does.
>>>>
>>>> uint64_t perf_intr_reg_mask(const char *arch)
>>>> {
>>>>     uint64_t mask = 0;
>>>>
>>>>     if (!strcmp(arch, "csky"))
>>>>         mask = perf_intr_reg_mask_csky(id);
>>>>     else if (!strcmp(arch, "loongarch"))
>>>>         mask = perf_intr_reg_mask_loongarch(id);
>>>>     else if (!strcmp(arch, "mips"))
>>>>         mask = perf_intr_reg_mask_mips(id);
>>>>     else if (!strcmp(arch, "powerpc"))
>>>>         mask = perf_intr_reg_mask_powerpc(id);
>>>>     else if (!strcmp(arch, "riscv"))
>>>>         mask = perf_intr_reg_mask_riscv(id);
>>>>     else if (!strcmp(arch, "s390"))
>>>>         mask = perf_intr_reg_mask_s390(id);
>>>>     else if (!strcmp(arch, "x86"))
>>>>         mask = perf_intr_reg_mask_x86(id);
>>>>     else if (!strcmp(arch, "arm"))
>>>>         mask = perf_intr_reg_mask_arm(id);
>>>>     else if (!strcmp(arch, "arm64"))
>>>>         mask = perf_intr_reg_mask_arm64(id);
>>>>
>>>>     return mask;
>>>> }
>>>>
>>>> But currently there are some arch-dependent instructions in these
>>>> arch-specific instructions, like the below code in powerpc specific
>>>> arch__intr_reg_mask().
>>>>
>>>>     version = (((mfspr(SPRN_PVR)) >>  16) & 0xFFFF);
>>>>
>>>> mfspr is a powerpc specific instruction, building this converted
>>>> perf_intr_reg_mask on non-powerpc platform would lead to building error.
>>> Hi Dapeng,
>>>
>>> So my main point is the arch directory and ifdefs, how do they differ
>>> from writing code that uses the ELF machine? For example, your code
>>> uses the arch/x86 directory and has ifdefs on
>>> HAVE_ARCH_X86_64_SUPPORT. How is that different from:
>>> ```
>>> switch(e_machine) {
>>> case EM_X86_64:
>>> ...
>>> case EM_I386:
>>> ...
>>> default:
>>> return 0;
>>> }
>>> ```
>>> If we need to determine for the current running machine then e_machine
>>> can equal EM_HOST that is set up for this purpose.
>> I think the key factor that determines if we can convert the code into
>> above e_machine switch ... case format is whether the code is
>> architecture-dependent both in building and execution phases.
>>
>> If the code is not architecture-dependent, It's good to covert the code
>> into the e_machine switch ... case and that would provide better applicability.
>>
>> Otherwise, the architecture-dependent code would lead to the building error
>> (building phase) or get incorrect execution results (execution phase).
>>
>> Even if we introduce EM_HOST case, it won't really solve the building
>> error,  instead it may introduce new building error, e.g.,
>>
>> ```
>> switch(e_machine) {
>> case EM_HOST:
>> ...
>> case EM_X86_64:
>> ...
>> case EM_I386:
>> ...
>> default:
>> return 0;
>> }
>> ```
> No, you wouldn't put EM_HOST as a case statement ever. It is the value
> of the ELF machine you are compiling upon, so either EM_X86_64 or
> EM_I386 here. You would make `e_machine = EM_HOST` were you to want
> some which is specific to the host you are compiling upon. ie `if
> (EM_HOST == EM_X86_64 || EM_HOST == EM_I386) { ... }` would be
> equivalent to `#ifdef __x86_64__` or equivalent to putting code into
> the arch/x86 directory.

The difference between `if (EM_HOST == EM_X86_64 || EM_HOST == EM_I386) {
... }` and `#ifdef __x86_64__` or the equivalent __weak functions lies in
the build phase. All the "if (EM_HOST == EM_X86_64 || EM_HOST == EM_I386) {
... }" code would be built on every architecture, i.e., the x86-specific
code would need to be built on an ARM platform. If there is arch-specific
code, the build would fail, e.g., there is the below code in the
powerpc-specific arch__intr_reg_mask() function.

    version = (((mfspr(SPRN_PVR)) >>  16) & 0xFFFF);

mfspr is a powerpc-specific instruction, and it would cause a build error
when building on other architectures.

To avoid the build error, we have to introduce the #ifdef again.


I once tried to convert the __weak arch__intr_reg_mask() functions into
the below function.

```
uint64_t perf_intr_reg_mask(const char *arch)
{
    uint64_t mask = 0;

    if (!strcmp(arch, "csky"))
        mask = perf_intr_reg_mask_csky(arch);
    else if (!strcmp(arch, "loongarch"))
        mask = perf_intr_reg_mask_loongarch(arch);
    else if (!strcmp(arch, "mips"))
        mask = perf_intr_reg_mask_mips(arch);
    else if (!strcmp(arch, "powerpc"))
        mask = perf_intr_reg_mask_powerpc(arch);
    else if (!strcmp(arch, "riscv"))
        mask = perf_intr_reg_mask_riscv(arch);
    else if (!strcmp(arch, "s390"))
        mask = perf_intr_reg_mask_s390(arch);
    else if (!strcmp(arch, "x86"))
        mask = perf_intr_reg_mask_x86(arch);
    else if (!strcmp(arch, "arm"))
        mask = perf_intr_reg_mask_arm(arch);
    else if (!strcmp(arch, "arm64"))
        mask = perf_intr_reg_mask_arm64(arch);

    return mask;
}
```

But this causes a build error for perf_intr_reg_mask_powerpc() on x86
platforms because of the powerpc-specific mfspr instruction. A way to fix
the build error is to introduce "#ifdef __powerpc__", maybe like this:

#ifdef __powerpc__

uint64_t perf_intr_reg_mask(const char *arch)
{
	...
}

#else

uint64_t perf_intr_reg_mask(const char *arch)
{
	return 0;
}
#endif

Do you think it's a correct way to handle the issue?

>
>> Assume the code is built on a x86_64 machine, then EM_HOST equals
>> EM_X86_64, that would cause the "duplicate case value" building error.
>>
>> If we want to limit the architecture-dependent code is built only on the
>> correct architecture, then we still have to introduce the architecture
>> #ifdefs. This is actually no difference with current arch directory __weak
>> functions and make it more complex.
> If we have arch functions for  arch__user_simd_reg_mask then why would
> code go through the usual means to determine what the mask is? The
> normal means is to get a sample event, use evsel__parse_sample and
> then from the sample access struct regs_dump that has within it keeps
> a copy of the mask from perf_event_attr associated with the evsel.
>
> In your next patch you do:
> https://lore.kernel.org/lkml/20251203065500.2597594-20-dapeng1.mi@linux.intel.com/
> ```
> ...
> --- a/tools/perf/util/session.c
> +++ b/tools/perf/util/session.c
> ...
> +static void simd_regs_dump__printf(struct regs_dump *regs, bool intr)
> +{
> ...
> + if (intr)
> + arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
> + else
> + arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
> ...
>  ```
> The session code is generic, it may be dealing with live machine data
> or with a perf.data file from anywhere. The reason this patch is
> exposing these weak functions is for the next patch, there's not a use
> here. But why isn't the next patch using struct regs_dump? The struct
> regs_dump was set up with sample event and perf_event_attr on hand.
> The evsel__parse_sample logic should likely set up a qwords variable
> so the generic code can just do:
> ```
> qwords = regs->qwords;
> ```
> The parsing logic should be able to do "nr_vectors * vector_qwords +
> nr_pred * pred_qwords" from the perf_event_attr no?

The arch__intr_simd_reg_bitmap_qwords() helper is also used in this patch,
as in the below code:

```

static void __print_simd_regs(bool intr, uint64_t simd_mask)
{
    const struct sample_reg *r = NULL;
    uint64_t bitmap = 0;
    u16 qwords = 0;
    int reg_idx;

    if (!simd_mask)
        return;

    for (r = arch__sample_simd_reg_masks(); r->name; r++) {
        if (!(r->mask & simd_mask))
            continue;
        reg_idx = fls64(r->mask) - 1;
        if (intr)
            bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
        else
            bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
        if (bitmap)
            fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
    }
}

```

But I agree this can be converted into some kind of generic function. Let
me try to do that in the next version.

>
>>> I agree that determining features needs calls that may not be
>>> supported on other architectures. That should yield EOPNOTSUPP and we
>>> can use information like that to populate generic information like the
>>> PMU missing features:
>>> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/pmu.h?h=perf-tools-next#n190
>>> we also probe API support with:
>>> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/perf_api_probe.h?h=perf-tools-next
>> In general, I agree we can return EOPNOTSUPP or some generic information
>> for some architecture independent code. But it's not applicable for these 2
>> specific arch__intr_reg_mask()/arch__user_reg_mask() functions; the current
>> perf code depends on these 2 functions to return the supported register
>> mask on a specific (running) platform.
> You can't put code into generic code like session and assume the
> perf.data is for the running machine, so the design is wrong.

Ok, let me try to convert them into a generic function.


>
>>> The current code doing lots of string comparisons is unnecessary
>>> overhead and imprecise (x86 is used for both 32-bit and 64-bit x86).
>>> It is removed in the series I linked to, I think we can eventually get
>>> rid of the whole arch string for similar reasons of trying to minimize
>>> the use of the arch directory. I'm curious what happens with APX, will
>>> the e_machine change? We may need to pass in the sample regs_dump's
>>> abi field for cases like this.
>> Yes, I agree we should get rid of the arch-string comparison and minimize
>> the use of the arch directory. It would improve efficiency.
>>
>> I don't think the support of APX would change the e_machine, it should
>> still be EM_X86_64.
>>
>> Yes, we need the abi field (specifically PERF_SAMPLE_REGS_ABI_SIMD) to
>> determine whether it's APX or legacy XMM.
> Right, in my (unmerged) code to map a perf register to a dwarf register:
> https://lore.kernel.org/lkml/20260117052849.2205545-13-irogers@google.com/
> we'll need the abi field.

Let me try to add the change in the next version.


>
>>> My point on the unwinding is that the sample register mask appears to
>>> be set up the same regardless, whereas for stack samples
>>> (--call-graph=dwarf) maybe just sample IP and SP suffices. So perhaps
>>> there should be additional registers to set up the sample mask.
>> Yes, that's true. It can be further optimized.
>>
>>
>>> By avoiding the arch functions we can avoid the problem of broken
>>> cross architecture support, we can also lay the groundwork for support
>>> on different architectures that may want to do similar things. I agree
>>> that doesn't matter until >1 architecture is trying to have more
>>> register masks, my concern is trying to keep the code generic and
>>> trying to make sure cross architecture is working. New weak functions
>>> is going in the opposite direction to that.
>> Yes, I agree we should get rid of these arch functions as much as possible.
>> But for this architecture-dependent code (as shown above), it seems the
>> __weak functions are still the simplest and best way to handle them.
> So I don't think we should be putting functions that assume the
> running machine into generic code like session. The arch functions
> create a shortcut that avoids looking at the perf_event_attr,
> differences between EM_I386 and EM_X86_64, etc. I'm not sure simpler
> matters here, the code is just incorrect relative to how things are
> being done around it. How do I grab registers on an APX capable
> machine and then dump it on my non-APX laptop? How do the arch
> functions account for the differences between EM_I386 and EM_X86_64,
> both of which process types may be running on the machine at the same
> time and having samples show up in system-wide mode? Having the arch
> functions lets things be done wrong and the patch series shows that in
> the very next patch.
>
> Thanks,
> Ian
>
>> Thanks.
>>
>>> Thanks,
>>> Ian
>>>
>>>> -Dapeng Mi
>>>>
>>>>> I also noticed that I think we're sampling the XMM registers for dwarf
>>>>> unwinding, but it seems unlikely the XMM registers will hold stack
>>>>> frame information - so this is probably an x86 inefficiency.
>>>>>
>>>>> Thanks,
>>>>> Ian
>>>>>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
  2026-01-20  6:46                                   ` Mi, Dapeng
@ 2026-01-20  6:56                                     ` Ian Rogers
  0 siblings, 0 replies; 86+ messages in thread
From: Ian Rogers @ 2026-01-20  6:56 UTC (permalink / raw)
  To: Mi, Dapeng
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang

On Mon, Jan 19, 2026 at 10:46 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>
>
> On 1/20/2026 1:16 PM, Ian Rogers wrote:
> > On Mon, Jan 19, 2026 at 7:05 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> >>
> >> On 1/20/2026 4:25 AM, Ian Rogers wrote:
> >>> On Sun, Jan 18, 2026 at 10:55 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> >>>> On 1/17/2026 1:50 PM, Ian Rogers wrote:
> >>>>> On Mon, Jan 5, 2026 at 11:27 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> >>>>>> Ian,
> >>>>>>
> >>>>>> I looked at these perf regs __weak helpers again, like
> >>>>>> arch__intr_reg_mask()/arch__user_reg_mask(). It could be really hard to
> >>>>>> eliminate these __weak helpers and convert them into a generic function
> >>>>>> like perf_reg_name(). All these __weak helpers are arch-dependent and
> >>>>>> usually need to call the perf_event_open() syscall to get the required
> >>>>>> register mask. So even if we convert them into a generic function, we
> >>>>>> still have no way to get the register mask of a different arch, e.g., to
> >>>>>> get the x86 register mask on an arm machine. Another reason is that these
> >>>>>> __weak helpers may contain some arch-specific instructions. If we convert
> >>>>>> them into a general perf function like perf_reg_name(), it may cause a
> >>>>>> build error since these arch-specific instructions may not exist on the
> >>>>>> build machine.
> >>>>> Hi Dapeng,
> >>>>>
> >>>>> There was already a patch to better support cross architecture
> >>>>> libdw-unwind-ing and I've just sent out a series to clean this up so
> >>>>> that this is achieved by having mapping functions between perf and
> >>>>> dwarf register names. The functions use the e_machine of the binary to
> >>>>> determine how to map, etc. The series is here:
> >>>>> https://lore.kernel.org/lkml/20260117052849.2205545-1-irogers@google.com/
> >>>>> and I think it can be the foundation for avoiding the weak functions.
> >>>> Hi Ian,
> >>>>
> >>>> Thanks for the reference patch. But they are different. The reference
> >>>> patches mainly parse the regs from perf.data and the __weak functions can
> >>>> be eliminated in the parsing phase since the registers bitmap is fixed for
> >>>> a given arch, while these __weak functions
> >>>> arch__intr_reg_mask()/arch__user_reg_mask() are used to obtain the
> >>>> supported sampling registers on a specific platform.
> >>>>
> >>>> We know different platforms even for same arch may support different
> >>>> registers, e.g., some x86 platforms may only support XMM registers, but
> >>>> some others may support XMM/YMM/ZMM registers, then all these arch-specific
> >>>> arch__intr_reg_mask()/arch__user_reg_mask() functions have to depend on the
> >>>> perf_event_open() syscall to retrieve the supported registers mask from kernel.
> >>>>
> >>>> Thus, it becomes impossible to retrieve the supported register mask of an
> >>>> x86 platform while running on an arm platform.
> >>>>
> >>>> Even if we don't consider this limitation and forcibly convert the
> >>>> __weak arch__intr_reg_mask() function to some kind of below function, just
> >>>> like currently what perf_reg_name() does.
> >>>>
> >>>> uint64_t perf_intr_reg_mask(const char *arch)
> >>>> {
> >>>>     uint64_t mask = 0;
> >>>>
> >>>>     if (!strcmp(arch, "csky"))
> >>>>         mask = perf_intr_reg_mask_csky(id);
> >>>>     else if (!strcmp(arch, "loongarch"))
> >>>>         mask = perf_intr_reg_mask_loongarch(id);
> >>>>     else if (!strcmp(arch, "mips"))
> >>>>         mask = perf_intr_reg_mask_mips(id);
> >>>>     else if (!strcmp(arch, "powerpc"))
> >>>>         mask = perf_intr_reg_mask_powerpc(id);
> >>>>     else if (!strcmp(arch, "riscv"))
> >>>>         mask = perf_intr_reg_mask_riscv(id);
> >>>>     else if (!strcmp(arch, "s390"))
> >>>>         mask = perf_intr_reg_mask_s390(id);
> >>>>     else if (!strcmp(arch, "x86"))
> >>>>         mask = perf_intr_reg_mask_x86(id);
> >>>>     else if (!strcmp(arch, "arm"))
> >>>>         mask = perf_intr_reg_mask_arm(id);
> >>>>     else if (!strcmp(arch, "arm64"))
> >>>>         mask = perf_intr_reg_mask_arm64(id);
> >>>>
> >>>>     return mask;
> >>>> }
> >>>>
> >>>> But currently there are some arch-dependent instructions in these
> >>>> arch-specific functions, like the code below in the powerpc-specific
> >>>> arch__intr_reg_mask().
> >>>>
> >>>>     version = (((mfspr(SPRN_PVR)) >>  16) & 0xFFFF);
> >>>>
> >>>> mfspr is a powerpc-specific instruction; building this converted
> >>>> perf_intr_reg_mask() on a non-powerpc platform would lead to a build error.
> >>> Hi Dapeng,
> >>>
> >>> So my main point is the arch directory and ifdefs, how do they differ
> >>> from writing code that uses the ELF machine? For example, your code
> >>> uses the arch/x86 directory and has ifdefs on
> >>> HAVE_ARCH_X86_64_SUPPORT. How is that different from:
> >>> ```
> >>> switch(e_machine) {
> >>> case EM_X86_64:
> >>> ...
> >>> case EM_I386:
> >>> ...
> >>> default:
> >>> return 0;
> >>> }
> >>> ```
> >>> If we need to determine for the current running machine then e_machine
> >>> can equal EM_HOST that is set up for this purpose.
> >> I think the key factor that determines if we can convert the code into
> >> above e_machine switch ... case format is whether the code is
> >> architecture-dependent both in building and execution phases.
> >>
> >> If the code is not architecture-dependent, it's good to convert the code
> >> into the e_machine switch ... case, and that would provide better applicability.
> >>
> >> Otherwise, the architecture-dependent code would lead to a build error
> >> (build phase) or incorrect execution results (execution phase).
> >>
> >> Even if we introduce an EM_HOST case, it won't really solve the build
> >> error; instead it may introduce a new build error, e.g.,
> >>
> >> ```
> >> switch(e_machine) {
> >> case EM_HOST:
> >> ...
> >> case EM_X86_64:
> >> ...
> >> case EM_I386:
> >> ...
> >> default:
> >> return 0;
> >> }
> >> ```
> > No, you wouldn't put EM_HOST as a case statement ever. It is the value
> > of the ELF machine you are compiling upon, so either EM_X86_64 or
> > EM_I386 here. You would make `e_machine = EM_HOST` were you to want
> > some which is specific to the host you are compiling upon. ie `if
> > (EM_HOST == EM_X86_64 || EM_HOST == EM_I386) { ... }` would be
> > equivalent to `#ifdef __x86_64__` or equivalent to putting code into
> > the arch/x86 directory.
>
> The difference between `if (EM_HOST == EM_X86_64 || EM_HOST == EM_I386) {
> ... }` and `#ifdef __x86_64__` or the equivalent __weak functions lies in
> the build phase. All of the "if (EM_HOST == EM_X86_64 || EM_HOST ==
> EM_I386) { ... }" code would be built on every architecture, i.e., the
> x86-specific code may need to be built on an ARM platform. If there is
> arch-specific code, the build would fail; e.g., there is the code below
> in the powerpc-specific arch__intr_reg_mask() function.
>
>     version = (((mfspr(SPRN_PVR)) >>  16) & 0xFFFF);
>
> mfspr is a powerpc-specific instruction and it would cause a build error when building on other arch platforms.
>
> To avoid the build error, we have to introduce the #ifdef again.
>
>
> I previously tried to convert the __weak arch__intr_reg_mask() functions into the function below.
>
> ```
> uint64_t perf_intr_reg_mask(const char *arch)
> {
>     uint64_t mask = 0;
>
>     if (!strcmp(arch, "csky"))
>         mask = perf_intr_reg_mask_csky(arch);
>     else if (!strcmp(arch, "loongarch"))
>         mask = perf_intr_reg_mask_loongarch(arch);
>     else if (!strcmp(arch, "mips"))
>         mask = perf_intr_reg_mask_mips(arch);
>     else if (!strcmp(arch, "powerpc"))
>         mask = perf_intr_reg_mask_powerpc(arch);
>     else if (!strcmp(arch, "riscv"))
>         mask = perf_intr_reg_mask_riscv(arch);
>     else if (!strcmp(arch, "s390"))
>         mask = perf_intr_reg_mask_s390(arch);
>     else if (!strcmp(arch, "x86"))
>         mask = perf_intr_reg_mask_x86(arch);
>     else if (!strcmp(arch, "arm"))
>         mask = perf_intr_reg_mask_arm(arch);
>     else if (!strcmp(arch, "arm64"))
>         mask = perf_intr_reg_mask_arm64(arch);
>
>     return mask;
> }
> ```
>
> But this causes a build error for perf_intr_reg_mask_powerpc() on an x86 platform because of the powerpc-specific mfspr instruction. A way to fix the build error is to reintroduce "#ifdef __powerpc__", maybe like this,
>
> #ifdef __powerpc__
>
> uint64_t perf_intr_reg_mask(const char *arch)
> {
>         ...
> }
>
> #else
>
> uint64_t perf_intr_reg_mask(const char *arch)
> {
>         return 0;
> }
> #endif
>
> Do you think it's a correct way to handle the issue?
>
> >
> >> Assume the code is built on an x86_64 machine; then EM_HOST equals
> >> EM_X86_64, which would cause a "duplicate case value" build error.
> >>
> >> If we want to ensure the architecture-dependent code is built only on the
> >> correct architecture, then we still have to introduce the architecture
> >> #ifdefs. This is actually no different from the current arch-directory
> >> __weak functions, and it makes things more complex.
> > If we have arch functions for  arch__user_simd_reg_mask then why would
> > code go through the usual means to determine what the mask is? The
> > normal means is to get a sample event, use evsel__parse_sample and
> > then from the sample access struct regs_dump, which keeps a copy of
> > the mask from the perf_event_attr associated with the evsel.
> >
> > In your next patch you do:
> > https://lore.kernel.org/lkml/20251203065500.2597594-20-dapeng1.mi@linux.intel.com/
> > ```
> > ...
> > --- a/tools/perf/util/session.c
> > +++ b/tools/perf/util/session.c
> > ...
> > +static void simd_regs_dump__printf(struct regs_dump *regs, bool intr)
> > +{
> > ...
> > + if (intr)
> > + arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
> > + else
> > + arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
> > ...
> >  ```
> > The session code is generic, it may be dealing with live machine data
> > or with a perf.data file from anywhere. The reason this patch is
> > exposing these weak functions is for the next patch, there's not a use
> > here. But why isn't the next patch using struct regs_dump? The struct
> > regs_dump was set up with sample event and perf_event_attr on hand.
> > The evsel__parse_sample logic should likely set up a qwords variable
> > so the generic code can just do:
> > ```
> > qwords = regs->qwords;
> > ```
> > The parsing logic should be able to do "nr_vectors * vector_qwords +
> > nr_pred * pred_qwords" from the perf_event_attr no?
>
> arch__intr_simd_reg_bitmap_qwords() is also used in this patch, as in
> the code below:
>
> ```
>
> static void __print_simd_regs(bool intr, uint64_t simd_mask)
> {
>     const struct sample_reg *r = NULL;
>     uint64_t bitmap = 0;
>     u16 qwords = 0;
>     int reg_idx;
>
>     if (!simd_mask)
>         return;
>
>     for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>         if (!(r->mask & simd_mask))
>             continue;
>         reg_idx = fls64(r->mask) - 1;
>         if (intr)
>             bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
>         else
>             bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
>         if (bitmap)
>             fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
>     }
> }
>
> ```
>
> But I agree this can be converted into some kind of generic function. Let
> me try to do that in the next version.
>
> >
> >>> I agree that determining features needs calls that may not be
> >>> supported on other architectures. That should yield EOPNOTSUPP and we
> >>> can use information like that to populate generic information like the
> >>> PMU missing features:
> >>> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/pmu.h?h=perf-tools-next#n190
> >>> we also probe API support with:
> >>> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/perf_api_probe.h?h=perf-tools-next
> >> In general, I agree we can return EOPNOTSUPP or some generic information
> >> for some architecture independent code. But it's not applicable for these 2
> >> specific arch__intr_reg_mask()/arch__user_reg_mask() functions; the current
> >> perf code depends on these 2 functions to return the supported register
> >> mask on a specific (running) platform.
> > You can't put code into generic code like session and assume the
> > perf.data is for the running machine, so the design is wrong.
>
> Ok, let me try to convert them into a generic function.
>
>
> >
> >>> The current code doing lots of string comparisons is unnecessary
> >>> overhead and imprecise (x86 is used for both 32-bit and 64-bit x86).
> >>> It is removed in the series I linked to, I think we can eventually get
> >>> rid of the whole arch string for similar reasons of trying to minimize
> >>> the use of the arch directory. I'm curious what happens with APX, will
> >>> the e_machine change? We may need to pass in the sample regs_dump's
> >>> abi field for cases like this.
> >> Yes, I agree we should get rid of the arch-string comparison and minimize
> >> the use of the arch directory. It would improve efficiency.
> >>
> >> I don't think the support of APX would change the e_machine, it should
> >> still be EM_X86_64.
> >>
> >> Yes, we need the abi field (specifically PERF_SAMPLE_REGS_ABI_SIMD) to
> >> determine whether it's APX or legacy XMM.
> > Right, in my (unmerged) code to map a perf register to a dwarf register:
> > https://lore.kernel.org/lkml/20260117052849.2205545-13-irogers@google.com/
> > we'll need the abi field.
>
> Let me try to add the change in the next version.

I'm playing around with the patches to see what can be done. Let me
comment on the original patches. But I'll also try to send out what
I've done.

Thanks,
Ian

> >
> >>> My point on the unwinding is that the sample register mask appears to
> >>> be set up the same regardless, whereas for stack samples
> >>> (--call-graph=dwarf) maybe just sample IP and SP suffices. So perhaps
> >>> there should be additional registers to set up the sample mask.
> >> Yes, that's true. It can be further optimized.
> >>
> >>
> >>> By avoiding the arch functions we can avoid the problem of broken
> >>> cross architecture support, we can also lay the groundwork for support
> >>> on different architectures that may want to do similar things. I agree
> >>> that doesn't matter until >1 architecture is trying to have more
> >>> register masks, my concern is trying to keep the code generic and
> >>> trying to make sure cross architecture is working. New weak functions
> >>> is going in the opposite direction to that.
> >> Yes, I agree we should get rid of these arch functions as much as possible.
> >> But for this architecture-dependent code (as shown above), it seems the
> >> __weak functions are still the simplest and best way to handle them.
> > So I don't think we should be putting functions that assume the
> > running machine into generic code like session. The arch functions
> > create a shortcut that avoids looking at the perf_event_attr,
> > differences between EM_I386 and EM_X86_64, etc. I'm not sure simpler
> > matters here, the code is just incorrect relative to how things are
> > being done around it. How do I grab registers on an APX capable
> > machine and then dump it on my non-APX laptop? How do the arch
> > functions account for the differences between EM_I386 and EM_X86_64,
> > both of which process types may be running on the machine at the same
> > time and having samples show up in system-wide mode? Having the arch
> > functions lets things be done wrong and the patch series shows that in
> > the very next patch.
> >
> > Thanks,
> > Ian
> >
> >> Thanks.
> >>
> >>> Thanks,
> >>> Ian
> >>>
> >>>> -Dapeng Mi
> >>>>
> >>>>> I also noticed that I think we're sampling the XMM registers for dwarf
> >>>>> unwinding, but it seems unlikely the XMM registers will hold stack
> >>>>> frame information - so this is probably an x86 inefficiency.
> >>>>>
> >>>>> Thanks,
> >>>>> Ian
> >>>>>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 17/19] perf headers: Sync with the kernel headers
  2025-12-03  6:54 ` [Patch v5 17/19] perf headers: Sync with the kernel headers Dapeng Mi
  2025-12-03 23:43   ` Ian Rogers
@ 2026-01-20  7:01   ` Ian Rogers
  2026-01-20  7:25     ` Mi, Dapeng
  2026-01-20  7:16   ` Ian Rogers
  2 siblings, 1 reply; 86+ messages in thread
From: Ian Rogers @ 2026-01-20  7:01 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang

On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
>
> From: Kan Liang <kan.liang@linux.intel.com>
>
> Update include/uapi/linux/perf_event.h and
> arch/x86/include/uapi/asm/perf_regs.h to support extended regs.
>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> ---
>  tools/arch/x86/include/uapi/asm/perf_regs.h | 62 +++++++++++++++++++++
>  tools/include/uapi/linux/perf_event.h       | 45 +++++++++++++--
>  2 files changed, 103 insertions(+), 4 deletions(-)
>
> diff --git a/tools/arch/x86/include/uapi/asm/perf_regs.h b/tools/arch/x86/include/uapi/asm/perf_regs.h
> index 7c9d2bb3833b..f3561ed10041 100644
> --- a/tools/arch/x86/include/uapi/asm/perf_regs.h
> +++ b/tools/arch/x86/include/uapi/asm/perf_regs.h
> @@ -27,9 +27,34 @@ enum perf_event_x86_regs {
>         PERF_REG_X86_R13,
>         PERF_REG_X86_R14,
>         PERF_REG_X86_R15,
> +       /*
> +        * The EGPRs/SSP and XMM have overlaps. Only one can be used
> +        * at a time. For the ABI type PERF_SAMPLE_REGS_ABI_SIMD,
> +        * utilize EGPRs/SSP. For the other ABI type, XMM is used.
> +        *
> +        * Extended GPRs (EGPRs)
> +        */
> +       PERF_REG_X86_R16,
> +       PERF_REG_X86_R17,
> +       PERF_REG_X86_R18,
> +       PERF_REG_X86_R19,
> +       PERF_REG_X86_R20,
> +       PERF_REG_X86_R21,
> +       PERF_REG_X86_R22,
> +       PERF_REG_X86_R23,
> +       PERF_REG_X86_R24,
> +       PERF_REG_X86_R25,
> +       PERF_REG_X86_R26,
> +       PERF_REG_X86_R27,
> +       PERF_REG_X86_R28,
> +       PERF_REG_X86_R29,
> +       PERF_REG_X86_R30,
> +       PERF_REG_X86_R31,
> +       PERF_REG_X86_SSP,
>         /* These are the limits for the GPRs. */
>         PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
>         PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1,
> +       PERF_REG_MISC_MAX = PERF_REG_X86_SSP + 1,
>
>         /* These all need two bits set because they are 128bit */
>         PERF_REG_X86_XMM0  = 32,
> @@ -54,5 +79,42 @@ enum perf_event_x86_regs {
>  };
>
>  #define PERF_REG_EXTENDED_MASK (~((1ULL << PERF_REG_X86_XMM0) - 1))
> +#define PERF_X86_EGPRS_MASK    GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16)
> +
> +enum {
> +       PERF_REG_X86_XMM,
> +       PERF_REG_X86_YMM,
> +       PERF_REG_X86_ZMM,
> +       PERF_REG_X86_MAX_SIMD_REGS,
> +
> +       PERF_REG_X86_OPMASK = 0,
> +       PERF_REG_X86_MAX_PRED_REGS = 1,
> +};
> +
> +enum {
> +       PERF_X86_SIMD_XMM_REGS      = 16,
> +       PERF_X86_SIMD_YMM_REGS      = 16,
> +       PERF_X86_SIMD_ZMMH_REGS     = 16,
> +       PERF_X86_SIMD_ZMM_REGS      = 32,
> +       PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_ZMM_REGS,
> +
> +       PERF_X86_SIMD_OPMASK_REGS   = 8,
> +       PERF_X86_SIMD_PRED_REGS_MAX = PERF_X86_SIMD_OPMASK_REGS,
> +};
> +
> +#define PERF_X86_SIMD_PRED_MASK                GENMASK(PERF_X86_SIMD_PRED_REGS_MAX - 1, 0)
> +#define PERF_X86_SIMD_VEC_MASK         GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
> +
> +#define PERF_X86_H16ZMM_BASE           PERF_X86_SIMD_ZMMH_REGS
> +
> +enum {
> +       PERF_X86_OPMASK_QWORDS   = 1,
> +       PERF_X86_XMM_QWORDS      = 2,
> +       PERF_X86_YMMH_QWORDS     = 2,
> +       PERF_X86_YMM_QWORDS      = 4,
> +       PERF_X86_ZMMH_QWORDS     = 4,
> +       PERF_X86_ZMM_QWORDS      = 8,
> +       PERF_X86_SIMD_QWORDS_MAX = PERF_X86_ZMM_QWORDS,
> +};
>
>  #endif /* _ASM_X86_PERF_REGS_H */
> diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
> index d292f96bc06f..f1474da32622 100644
> --- a/tools/include/uapi/linux/perf_event.h
> +++ b/tools/include/uapi/linux/perf_event.h
> @@ -314,8 +314,9 @@ enum {
>   */
>  enum perf_sample_regs_abi {
>         PERF_SAMPLE_REGS_ABI_NONE               = 0,
> -       PERF_SAMPLE_REGS_ABI_32                 = 1,
> -       PERF_SAMPLE_REGS_ABI_64                 = 2,
> +       PERF_SAMPLE_REGS_ABI_32                 = (1 << 0),
> +       PERF_SAMPLE_REGS_ABI_64                 = (1 << 1),
> +       PERF_SAMPLE_REGS_ABI_SIMD               = (1 << 2),
>  };
>
>  /*
> @@ -382,6 +383,7 @@ enum perf_event_read_format {
>  #define PERF_ATTR_SIZE_VER6                    120     /* Add: aux_sample_size */
>  #define PERF_ATTR_SIZE_VER7                    128     /* Add: sig_data */
>  #define PERF_ATTR_SIZE_VER8                    136     /* Add: config3 */
> +#define PERF_ATTR_SIZE_VER9                    168     /* Add: sample_simd_{pred,vec}_reg_* */
>
>  /*
>   * 'struct perf_event_attr' contains various attributes that define
> @@ -545,6 +547,25 @@ struct perf_event_attr {
>         __u64   sig_data;
>
>         __u64   config3; /* extension of config2 */
> +
> +
> +       /*
> +        * Defines set of SIMD registers to dump on samples.
> +        * The sample_simd_regs_enabled !=0 implies the
> +        * set of SIMD registers is used to config all SIMD registers.
> +        * If !sample_simd_regs_enabled, sample_regs_XXX may be used to
> +        * config some SIMD registers on X86.
> +        */
> +       union {
> +               __u16 sample_simd_regs_enabled;
> +               __u16 sample_simd_pred_reg_qwords;
> +       };

Shouldn't there be 16-bits of reserved data here as the next __u32
will require it for alignment?

> +       __u32 sample_simd_pred_reg_intr;
> +       __u32 sample_simd_pred_reg_user;
> +       __u16 sample_simd_vec_reg_qwords;

Shouldn't there be 48-bits of reserved data here as the next __u64
will require it for alignment?

I wonder the order should be:
```
union {
__u16 sample_simd_regs_enabled;
__u16 sample_simd_pred_reg_qwords;
};
__u16 sample_simd_vec_reg_qwords;
__u32 sample_simd_pred_reg_intr;
__u32 sample_simd_pred_reg_user;
__u32 __reserved_4;
__u64 sample_simd_vec_reg_intr;
__u64 sample_simd_vec_reg_user;
```

Thanks,
Ian

> +       __u64 sample_simd_vec_reg_intr;
> +       __u64 sample_simd_vec_reg_user;
> +       __u32 __reserved_4;
>  };
>
>  /*
> @@ -1018,7 +1039,15 @@ enum perf_event_type {
>          *      } && PERF_SAMPLE_BRANCH_STACK
>          *
>          *      { u64                   abi; # enum perf_sample_regs_abi
> -        *        u64                   regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
> +        *        u64                   regs[weight(mask)];
> +        *        struct {
> +        *              u16 nr_vectors;
> +        *              u16 vector_qwords;
> +        *              u16 nr_pred;
> +        *              u16 pred_qwords;
> +        *              u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
> +        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
> +        *      } && PERF_SAMPLE_REGS_USER
>          *
>          *      { u64                   size;
>          *        char                  data[size];
> @@ -1045,7 +1074,15 @@ enum perf_event_type {
>          *      { u64                   data_src; } && PERF_SAMPLE_DATA_SRC
>          *      { u64                   transaction; } && PERF_SAMPLE_TRANSACTION
>          *      { u64                   abi; # enum perf_sample_regs_abi
> -        *        u64                   regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
> +        *        u64                   regs[weight(mask)];
> +        *        struct {
> +        *              u16 nr_vectors;
> +        *              u16 vector_qwords;
> +        *              u16 nr_pred;
> +        *              u16 pred_qwords;
> +        *              u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
> +        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
> +        *      } && PERF_SAMPLE_REGS_INTR
>          *      { u64                   phys_addr;} && PERF_SAMPLE_PHYS_ADDR
>          *      { u64                   cgroup;} && PERF_SAMPLE_CGROUP
>          *      { u64                   data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 17/19] perf headers: Sync with the kernel headers
  2025-12-03  6:54 ` [Patch v5 17/19] perf headers: Sync with the kernel headers Dapeng Mi
  2025-12-03 23:43   ` Ian Rogers
  2026-01-20  7:01   ` Ian Rogers
@ 2026-01-20  7:16   ` Ian Rogers
  2026-01-20  7:43     ` Mi, Dapeng
  2 siblings, 1 reply; 86+ messages in thread
From: Ian Rogers @ 2026-01-20  7:16 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang

On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
>
> From: Kan Liang <kan.liang@linux.intel.com>
>
> Update include/uapi/linux/perf_event.h and
> arch/x86/include/uapi/asm/perf_regs.h to support extended regs.
>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> ---
>  tools/arch/x86/include/uapi/asm/perf_regs.h | 62 +++++++++++++++++++++
>  tools/include/uapi/linux/perf_event.h       | 45 +++++++++++++--
>  2 files changed, 103 insertions(+), 4 deletions(-)
>
> diff --git a/tools/arch/x86/include/uapi/asm/perf_regs.h b/tools/arch/x86/include/uapi/asm/perf_regs.h
> index 7c9d2bb3833b..f3561ed10041 100644
> --- a/tools/arch/x86/include/uapi/asm/perf_regs.h
> +++ b/tools/arch/x86/include/uapi/asm/perf_regs.h
> @@ -27,9 +27,34 @@ enum perf_event_x86_regs {
>         PERF_REG_X86_R13,
>         PERF_REG_X86_R14,
>         PERF_REG_X86_R15,
> +       /*
> +        * The EGPRs/SSP and XMM have overlaps. Only one can be used
> +        * at a time. For the ABI type PERF_SAMPLE_REGS_ABI_SIMD,
> +        * utilize EGPRs/SSP. For the other ABI type, XMM is used.
> +        *
> +        * Extended GPRs (EGPRs)
> +        */
> +       PERF_REG_X86_R16,
> +       PERF_REG_X86_R17,
> +       PERF_REG_X86_R18,
> +       PERF_REG_X86_R19,
> +       PERF_REG_X86_R20,
> +       PERF_REG_X86_R21,
> +       PERF_REG_X86_R22,
> +       PERF_REG_X86_R23,
> +       PERF_REG_X86_R24,
> +       PERF_REG_X86_R25,
> +       PERF_REG_X86_R26,
> +       PERF_REG_X86_R27,
> +       PERF_REG_X86_R28,
> +       PERF_REG_X86_R29,
> +       PERF_REG_X86_R30,
> +       PERF_REG_X86_R31,
> +       PERF_REG_X86_SSP,
>         /* These are the limits for the GPRs. */
>         PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
>         PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1,
> +       PERF_REG_MISC_MAX = PERF_REG_X86_SSP + 1,
>
>         /* These all need two bits set because they are 128bit */
>         PERF_REG_X86_XMM0  = 32,
> @@ -54,5 +79,42 @@ enum perf_event_x86_regs {
>  };
>
>  #define PERF_REG_EXTENDED_MASK (~((1ULL << PERF_REG_X86_XMM0) - 1))
> +#define PERF_X86_EGPRS_MASK    GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16)
> +
> +enum {
> +       PERF_REG_X86_XMM,
> +       PERF_REG_X86_YMM,
> +       PERF_REG_X86_ZMM,
> +       PERF_REG_X86_MAX_SIMD_REGS,
> +
> +       PERF_REG_X86_OPMASK = 0,
> +       PERF_REG_X86_MAX_PRED_REGS = 1,
> +};
> +
> +enum {
> +       PERF_X86_SIMD_XMM_REGS      = 16,
> +       PERF_X86_SIMD_YMM_REGS      = 16,
> +       PERF_X86_SIMD_ZMMH_REGS     = 16,
> +       PERF_X86_SIMD_ZMM_REGS      = 32,
> +       PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_ZMM_REGS,
> +
> +       PERF_X86_SIMD_OPMASK_REGS   = 8,
> +       PERF_X86_SIMD_PRED_REGS_MAX = PERF_X86_SIMD_OPMASK_REGS,
> +};
> +
> +#define PERF_X86_SIMD_PRED_MASK                GENMASK(PERF_X86_SIMD_PRED_REGS_MAX - 1, 0)
> +#define PERF_X86_SIMD_VEC_MASK         GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
> +
> +#define PERF_X86_H16ZMM_BASE           PERF_X86_SIMD_ZMMH_REGS
> +
> +enum {
> +       PERF_X86_OPMASK_QWORDS   = 1,
> +       PERF_X86_XMM_QWORDS      = 2,
> +       PERF_X86_YMMH_QWORDS     = 2,
> +       PERF_X86_YMM_QWORDS      = 4,
> +       PERF_X86_ZMMH_QWORDS     = 4,
> +       PERF_X86_ZMM_QWORDS      = 8,
> +       PERF_X86_SIMD_QWORDS_MAX = PERF_X86_ZMM_QWORDS,
> +};
>
>  #endif /* _ASM_X86_PERF_REGS_H */
> diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
> index d292f96bc06f..f1474da32622 100644
> --- a/tools/include/uapi/linux/perf_event.h
> +++ b/tools/include/uapi/linux/perf_event.h
> @@ -314,8 +314,9 @@ enum {
>   */
>  enum perf_sample_regs_abi {
>         PERF_SAMPLE_REGS_ABI_NONE               = 0,
> -       PERF_SAMPLE_REGS_ABI_32                 = 1,
> -       PERF_SAMPLE_REGS_ABI_64                 = 2,
> +       PERF_SAMPLE_REGS_ABI_32                 = (1 << 0),
> +       PERF_SAMPLE_REGS_ABI_64                 = (1 << 1),
> +       PERF_SAMPLE_REGS_ABI_SIMD               = (1 << 2),
>  };
>
>  /*
> @@ -382,6 +383,7 @@ enum perf_event_read_format {
>  #define PERF_ATTR_SIZE_VER6                    120     /* Add: aux_sample_size */
>  #define PERF_ATTR_SIZE_VER7                    128     /* Add: sig_data */
>  #define PERF_ATTR_SIZE_VER8                    136     /* Add: config3 */
> +#define PERF_ATTR_SIZE_VER9                    168     /* Add: sample_simd_{pred,vec}_reg_* */
>
>  /*
>   * 'struct perf_event_attr' contains various attributes that define
> @@ -545,6 +547,25 @@ struct perf_event_attr {
>         __u64   sig_data;
>
>         __u64   config3; /* extension of config2 */
> +
> +
> +       /*
> +        * Defines set of SIMD registers to dump on samples.
> +        * The sample_simd_regs_enabled !=0 implies the
> +        * set of SIMD registers is used to config all SIMD registers.
> +        * If !sample_simd_regs_enabled, sample_regs_XXX may be used to
> +        * config some SIMD registers on X86.
> +        */
> +       union {
> +               __u16 sample_simd_regs_enabled;
> +               __u16 sample_simd_pred_reg_qwords;
> +       };
> +       __u32 sample_simd_pred_reg_intr;
> +       __u32 sample_simd_pred_reg_user;
> +       __u16 sample_simd_vec_reg_qwords;
> +       __u64 sample_simd_vec_reg_intr;
> +       __u64 sample_simd_vec_reg_user;
> +       __u32 __reserved_4;
>  };
>
>  /*
> @@ -1018,7 +1039,15 @@ enum perf_event_type {
>          *      } && PERF_SAMPLE_BRANCH_STACK
>          *
>          *      { u64                   abi; # enum perf_sample_regs_abi
> -        *        u64                   regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
> +        *        u64                   regs[weight(mask)];
> +        *        struct {
> +        *              u16 nr_vectors;
> +        *              u16 vector_qwords;
> +        *              u16 nr_pred;
> +        *              u16 pred_qwords;
> +        *              u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
> +        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)

Why can't these values be taken from the perf_event_attr? The abi is
needed, as there could be both 32-bit and 64-bit samples for the same
event - presumably x32 appears as 64-bit. If the ABI has SIMD within
it (implied by the "} && (abi & PERF_SAMPLE_REGS_ABI_SIMD)" above),
then why can't we just use the perf_event_attr values? For example,
data could be "data[weight(sample_simd_vec_reg_user) *
sample_simd_vec_reg_qwords + weight(sample_simd_pred_reg_user) *
sample_simd_pred_reg_qwords]".

> +        *      } && PERF_SAMPLE_REGS_USER
>          *
>          *      { u64                   size;
>          *        char                  data[size];
> @@ -1045,7 +1074,15 @@ enum perf_event_type {
>          *      { u64                   data_src; } && PERF_SAMPLE_DATA_SRC
>          *      { u64                   transaction; } && PERF_SAMPLE_TRANSACTION
>          *      { u64                   abi; # enum perf_sample_regs_abi
> -        *        u64                   regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
> +        *        u64                   regs[weight(mask)];
> +        *        struct {
> +        *              u16 nr_vectors;
> +        *              u16 vector_qwords;
> +        *              u16 nr_pred;
> +        *              u16 pred_qwords;
> +        *              u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
> +        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)

Same comment.

Thanks,
Ian

> +        *      } && PERF_SAMPLE_REGS_INTR
>          *      { u64                   phys_addr;} && PERF_SAMPLE_PHYS_ADDR
>          *      { u64                   cgroup;} && PERF_SAMPLE_CGROUP
>          *      { u64                   data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 17/19] perf headers: Sync with the kernel headers
  2026-01-20  7:01   ` Ian Rogers
@ 2026-01-20  7:25     ` Mi, Dapeng
  0 siblings, 0 replies; 86+ messages in thread
From: Mi, Dapeng @ 2026-01-20  7:25 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang


On 1/20/2026 3:01 PM, Ian Rogers wrote:
> On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> Update include/uapi/linux/perf_event.h and
>> arch/x86/include/uapi/asm/perf_regs.h to support extended regs.
>>
>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> ---
>>  tools/arch/x86/include/uapi/asm/perf_regs.h | 62 +++++++++++++++++++++
>>  tools/include/uapi/linux/perf_event.h       | 45 +++++++++++++--
>>  2 files changed, 103 insertions(+), 4 deletions(-)
>>
>> diff --git a/tools/arch/x86/include/uapi/asm/perf_regs.h b/tools/arch/x86/include/uapi/asm/perf_regs.h
>> index 7c9d2bb3833b..f3561ed10041 100644
>> --- a/tools/arch/x86/include/uapi/asm/perf_regs.h
>> +++ b/tools/arch/x86/include/uapi/asm/perf_regs.h
>> @@ -27,9 +27,34 @@ enum perf_event_x86_regs {
>>         PERF_REG_X86_R13,
>>         PERF_REG_X86_R14,
>>         PERF_REG_X86_R15,
>> +       /*
>> +        * The EGPRs/SSP and XMM have overlaps. Only one can be used
>> +        * at a time. For the ABI type PERF_SAMPLE_REGS_ABI_SIMD,
>> +        * utilize EGPRs/SSP. For the other ABI type, XMM is used.
>> +        *
>> +        * Extended GPRs (EGPRs)
>> +        */
>> +       PERF_REG_X86_R16,
>> +       PERF_REG_X86_R17,
>> +       PERF_REG_X86_R18,
>> +       PERF_REG_X86_R19,
>> +       PERF_REG_X86_R20,
>> +       PERF_REG_X86_R21,
>> +       PERF_REG_X86_R22,
>> +       PERF_REG_X86_R23,
>> +       PERF_REG_X86_R24,
>> +       PERF_REG_X86_R25,
>> +       PERF_REG_X86_R26,
>> +       PERF_REG_X86_R27,
>> +       PERF_REG_X86_R28,
>> +       PERF_REG_X86_R29,
>> +       PERF_REG_X86_R30,
>> +       PERF_REG_X86_R31,
>> +       PERF_REG_X86_SSP,
>>         /* These are the limits for the GPRs. */
>>         PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
>>         PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1,
>> +       PERF_REG_MISC_MAX = PERF_REG_X86_SSP + 1,
>>
>>         /* These all need two bits set because they are 128bit */
>>         PERF_REG_X86_XMM0  = 32,
>> @@ -54,5 +79,42 @@ enum perf_event_x86_regs {
>>  };
>>
>>  #define PERF_REG_EXTENDED_MASK (~((1ULL << PERF_REG_X86_XMM0) - 1))
>> +#define PERF_X86_EGPRS_MASK    GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16)
>> +
>> +enum {
>> +       PERF_REG_X86_XMM,
>> +       PERF_REG_X86_YMM,
>> +       PERF_REG_X86_ZMM,
>> +       PERF_REG_X86_MAX_SIMD_REGS,
>> +
>> +       PERF_REG_X86_OPMASK = 0,
>> +       PERF_REG_X86_MAX_PRED_REGS = 1,
>> +};
>> +
>> +enum {
>> +       PERF_X86_SIMD_XMM_REGS      = 16,
>> +       PERF_X86_SIMD_YMM_REGS      = 16,
>> +       PERF_X86_SIMD_ZMMH_REGS     = 16,
>> +       PERF_X86_SIMD_ZMM_REGS      = 32,
>> +       PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_ZMM_REGS,
>> +
>> +       PERF_X86_SIMD_OPMASK_REGS   = 8,
>> +       PERF_X86_SIMD_PRED_REGS_MAX = PERF_X86_SIMD_OPMASK_REGS,
>> +};
>> +
>> +#define PERF_X86_SIMD_PRED_MASK                GENMASK(PERF_X86_SIMD_PRED_REGS_MAX - 1, 0)
>> +#define PERF_X86_SIMD_VEC_MASK         GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
>> +
>> +#define PERF_X86_H16ZMM_BASE           PERF_X86_SIMD_ZMMH_REGS
>> +
>> +enum {
>> +       PERF_X86_OPMASK_QWORDS   = 1,
>> +       PERF_X86_XMM_QWORDS      = 2,
>> +       PERF_X86_YMMH_QWORDS     = 2,
>> +       PERF_X86_YMM_QWORDS      = 4,
>> +       PERF_X86_ZMMH_QWORDS     = 4,
>> +       PERF_X86_ZMM_QWORDS      = 8,
>> +       PERF_X86_SIMD_QWORDS_MAX = PERF_X86_ZMM_QWORDS,
>> +};
>>
>>  #endif /* _ASM_X86_PERF_REGS_H */
>> diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
>> index d292f96bc06f..f1474da32622 100644
>> --- a/tools/include/uapi/linux/perf_event.h
>> +++ b/tools/include/uapi/linux/perf_event.h
>> @@ -314,8 +314,9 @@ enum {
>>   */
>>  enum perf_sample_regs_abi {
>>         PERF_SAMPLE_REGS_ABI_NONE               = 0,
>> -       PERF_SAMPLE_REGS_ABI_32                 = 1,
>> -       PERF_SAMPLE_REGS_ABI_64                 = 2,
>> +       PERF_SAMPLE_REGS_ABI_32                 = (1 << 0),
>> +       PERF_SAMPLE_REGS_ABI_64                 = (1 << 1),
>> +       PERF_SAMPLE_REGS_ABI_SIMD               = (1 << 2),
>>  };
>>
>>  /*
>> @@ -382,6 +383,7 @@ enum perf_event_read_format {
>>  #define PERF_ATTR_SIZE_VER6                    120     /* Add: aux_sample_size */
>>  #define PERF_ATTR_SIZE_VER7                    128     /* Add: sig_data */
>>  #define PERF_ATTR_SIZE_VER8                    136     /* Add: config3 */
>> +#define PERF_ATTR_SIZE_VER9                    168     /* Add: sample_simd_{pred,vec}_reg_* */
>>
>>  /*
>>   * 'struct perf_event_attr' contains various attributes that define
>> @@ -545,6 +547,25 @@ struct perf_event_attr {
>>         __u64   sig_data;
>>
>>         __u64   config3; /* extension of config2 */
>> +
>> +
>> +       /*
>> +        * Defines set of SIMD registers to dump on samples.
>> +        * The sample_simd_regs_enabled !=0 implies the
>> +        * set of SIMD registers is used to config all SIMD registers.
>> +        * If !sample_simd_regs_enabled, sample_regs_XXX may be used to
>> +        * config some SIMD registers on X86.
>> +        */
>> +       union {
>> +               __u16 sample_simd_regs_enabled;
>> +               __u16 sample_simd_pred_reg_qwords;
>> +       };
> Shouldn't there be 16-bits of reserved data here as the next __u32
> will require it for alignment?
>
>> +       __u32 sample_simd_pred_reg_intr;
>> +       __u32 sample_simd_pred_reg_user;
>> +       __u16 sample_simd_vec_reg_qwords;
> Shouldn't there be 48-bits of reserved data here as the next __u64
> will require it for alignment?
>
> I wonder if the order should be:
> ```
> union {
> __u16 sample_simd_regs_enabled;
> __u16 sample_simd_pred_reg_qwords;
> };
> __u16 sample_simd_vec_reg_qwords;
> __u32 sample_simd_pred_reg_intr;
> __u32 sample_simd_pred_reg_user;
> __u32 __reserved_4;
> __u64 sample_simd_vec_reg_intr;
> __u64 sample_simd_vec_reg_user;

Yes, Peter raised the same question; it will be fixed in the next version.


> ```
>
> Thanks,
> Ian
>
>> +       __u64 sample_simd_vec_reg_intr;
>> +       __u64 sample_simd_vec_reg_user;
>> +       __u32 __reserved_4;
>>  };
>>
>>  /*
>> @@ -1018,7 +1039,15 @@ enum perf_event_type {
>>          *      } && PERF_SAMPLE_BRANCH_STACK
>>          *
>>          *      { u64                   abi; # enum perf_sample_regs_abi
>> -        *        u64                   regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
>> +        *        u64                   regs[weight(mask)];
>> +        *        struct {
>> +        *              u16 nr_vectors;
>> +        *              u16 vector_qwords;
>> +        *              u16 nr_pred;
>> +        *              u16 pred_qwords;
>> +        *              u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
>> +        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
>> +        *      } && PERF_SAMPLE_REGS_USER
>>          *
>>          *      { u64                   size;
>>          *        char                  data[size];
>> @@ -1045,7 +1074,15 @@ enum perf_event_type {
>>          *      { u64                   data_src; } && PERF_SAMPLE_DATA_SRC
>>          *      { u64                   transaction; } && PERF_SAMPLE_TRANSACTION
>>          *      { u64                   abi; # enum perf_sample_regs_abi
>> -        *        u64                   regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
>> +        *        u64                   regs[weight(mask)];
>> +        *        struct {
>> +        *              u16 nr_vectors;
>> +        *              u16 vector_qwords;
>> +        *              u16 nr_pred;
>> +        *              u16 pred_qwords;
>> +        *              u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
>> +        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
>> +        *      } && PERF_SAMPLE_REGS_INTR
>>          *      { u64                   phys_addr;} && PERF_SAMPLE_PHYS_ADDR
>>          *      { u64                   cgroup;} && PERF_SAMPLE_CGROUP
>>          *      { u64                   data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
>> --
>> 2.34.1
>>


* Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
  2025-12-03  6:54 ` [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format Dapeng Mi
  2025-12-04  0:17   ` Ian Rogers
@ 2026-01-20  7:39   ` Ian Rogers
  2026-01-20  9:04     ` Mi, Dapeng
  1 sibling, 1 reply; 86+ messages in thread
From: Ian Rogers @ 2026-01-20  7:39 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang

On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
>
> From: Kan Liang <kan.liang@linux.intel.com>
>
> This patch adds support for the newly introduced SIMD register sampling
> format by adding the following functions:
>
> uint64_t arch__intr_simd_reg_mask(void);
> uint64_t arch__user_simd_reg_mask(void);
> uint64_t arch__intr_pred_reg_mask(void);
> uint64_t arch__user_pred_reg_mask(void);
> uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>
> The arch__{intr|user}_simd_reg_mask() functions retrieve the bitmap of
> supported SIMD registers, such as XMM/YMM/ZMM on x86 platforms.
>
> The arch__{intr|user}_pred_reg_mask() functions retrieve the bitmap of
> supported PRED registers, such as OPMASK on x86 platforms.
>
> The arch__{intr|user}_simd_reg_bitmap_qwords() functions provide the
> exact bitmap and number of qwords for a specific type of SIMD register.
> For example, for XMM registers on x86 platforms, the returned bitmap is
> 0xffff (XMM0 ~ XMM15) and the qwords number is 2 (128 bits for each XMM).
>
> The arch__{intr|user}_pred_reg_bitmap_qwords() functions provide the
> exact bitmap and number of qwords for a specific type of PRED register.
> For example, for OPMASK registers on x86 platforms, the returned bitmap
> is 0xff (OPMASK0 ~ OPMASK7) and the qwords number is 1 (64 bits for each
> OPMASK).
>
> Additionally, the function __parse_regs() is enhanced to support parsing
> these newly introduced SIMD registers. Currently, each type of register
> can only be sampled collectively; sampling a specific SIMD register is
> not supported. For example, all XMM registers are sampled together rather
> than sampling only XMM0.
>
> When multiple overlapping register types, such as XMM and YMM, are
> sampled simultaneously, only the superset (YMM registers) is sampled.
>
> With this patch, all supported sampling registers on x86 platforms are
> displayed as follows.
>
>  $perf record -I?
>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>
>  $perf record --user-regs=?
>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> ---
>  tools/perf/arch/x86/util/perf_regs.c      | 470 +++++++++++++++++++++-
>  tools/perf/util/evsel.c                   |  27 ++
>  tools/perf/util/parse-regs-options.c      | 151 ++++++-
>  tools/perf/util/perf_event_attr_fprintf.c |   6 +
>  tools/perf/util/perf_regs.c               |  59 +++
>  tools/perf/util/perf_regs.h               |  11 +
>  tools/perf/util/record.h                  |   6 +
>  7 files changed, 714 insertions(+), 16 deletions(-)
>
> diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
> index 12fd93f04802..db41430f3b07 100644
> --- a/tools/perf/arch/x86/util/perf_regs.c
> +++ b/tools/perf/arch/x86/util/perf_regs.c
> @@ -13,6 +13,49 @@
>  #include "../../../util/pmu.h"
>  #include "../../../util/pmus.h"
>
> +static const struct sample_reg sample_reg_masks_ext[] = {
> +       SMPL_REG(AX, PERF_REG_X86_AX),
> +       SMPL_REG(BX, PERF_REG_X86_BX),
> +       SMPL_REG(CX, PERF_REG_X86_CX),
> +       SMPL_REG(DX, PERF_REG_X86_DX),
> +       SMPL_REG(SI, PERF_REG_X86_SI),
> +       SMPL_REG(DI, PERF_REG_X86_DI),
> +       SMPL_REG(BP, PERF_REG_X86_BP),
> +       SMPL_REG(SP, PERF_REG_X86_SP),
> +       SMPL_REG(IP, PERF_REG_X86_IP),
> +       SMPL_REG(FLAGS, PERF_REG_X86_FLAGS),
> +       SMPL_REG(CS, PERF_REG_X86_CS),
> +       SMPL_REG(SS, PERF_REG_X86_SS),
> +#ifdef HAVE_ARCH_X86_64_SUPPORT
> +       SMPL_REG(R8, PERF_REG_X86_R8),
> +       SMPL_REG(R9, PERF_REG_X86_R9),
> +       SMPL_REG(R10, PERF_REG_X86_R10),
> +       SMPL_REG(R11, PERF_REG_X86_R11),
> +       SMPL_REG(R12, PERF_REG_X86_R12),
> +       SMPL_REG(R13, PERF_REG_X86_R13),
> +       SMPL_REG(R14, PERF_REG_X86_R14),
> +       SMPL_REG(R15, PERF_REG_X86_R15),
> +       SMPL_REG(R16, PERF_REG_X86_R16),
> +       SMPL_REG(R17, PERF_REG_X86_R17),
> +       SMPL_REG(R18, PERF_REG_X86_R18),
> +       SMPL_REG(R19, PERF_REG_X86_R19),
> +       SMPL_REG(R20, PERF_REG_X86_R20),
> +       SMPL_REG(R21, PERF_REG_X86_R21),
> +       SMPL_REG(R22, PERF_REG_X86_R22),
> +       SMPL_REG(R23, PERF_REG_X86_R23),
> +       SMPL_REG(R24, PERF_REG_X86_R24),
> +       SMPL_REG(R25, PERF_REG_X86_R25),
> +       SMPL_REG(R26, PERF_REG_X86_R26),
> +       SMPL_REG(R27, PERF_REG_X86_R27),
> +       SMPL_REG(R28, PERF_REG_X86_R28),
> +       SMPL_REG(R29, PERF_REG_X86_R29),
> +       SMPL_REG(R30, PERF_REG_X86_R30),
> +       SMPL_REG(R31, PERF_REG_X86_R31),
> +       SMPL_REG(SSP, PERF_REG_X86_SSP),
> +#endif
> +       SMPL_REG_END
> +};
> +
>  static const struct sample_reg sample_reg_masks[] = {
>         SMPL_REG(AX, PERF_REG_X86_AX),
>         SMPL_REG(BX, PERF_REG_X86_BX),
> @@ -276,27 +319,404 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
>         return SDT_ARG_VALID;
>  }
>
> +static bool support_simd_reg(u64 sample_type, u16 qwords, u64 mask, bool pred)
> +{
> +       struct perf_event_attr attr = {
> +               .type                           = PERF_TYPE_HARDWARE,
> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
> +               .sample_type                    = sample_type,
> +               .disabled                       = 1,
> +               .exclude_kernel                 = 1,
> +               .sample_simd_regs_enabled       = 1,
> +       };
> +       int fd;
> +
> +       attr.sample_period = 1;
> +
> +       if (!pred) {
> +               attr.sample_simd_vec_reg_qwords = qwords;
> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> +                       attr.sample_simd_vec_reg_intr = mask;
> +               else
> +                       attr.sample_simd_vec_reg_user = mask;
> +       } else {
> +               attr.sample_simd_pred_reg_qwords = PERF_X86_OPMASK_QWORDS;
> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> +                       attr.sample_simd_pred_reg_intr = PERF_X86_SIMD_PRED_MASK;
> +               else
> +                       attr.sample_simd_pred_reg_user = PERF_X86_SIMD_PRED_MASK;
> +       }
> +
> +       if (perf_pmus__num_core_pmus() > 1) {
> +               struct perf_pmu *pmu = NULL;
> +               __u64 type = PERF_TYPE_RAW;
> +
> +               /*
> +                * The same register set is supported among different hybrid PMUs.
> +                * Only check the first available one.
> +                */
> +               while ((pmu = perf_pmus__scan_core(pmu)) != NULL) {
> +                       type = pmu->type;
> +                       break;
> +               }
> +               attr.config |= type << PERF_PMU_TYPE_SHIFT;
> +       }
> +
> +       event_attr_init(&attr);
> +
> +       fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
> +       if (fd != -1) {
> +               close(fd);
> +               return true;
> +       }
> +
> +       return false;
> +}
> +
> +static bool __arch_simd_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
> +{
> +       bool supported = false;
> +       u64 bits;
> +
> +       *mask = 0;
> +       *qwords = 0;
> +
> +       switch (reg) {
> +       case PERF_REG_X86_XMM:
> +               bits = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
> +               supported = support_simd_reg(sample_type, PERF_X86_XMM_QWORDS, bits, false);
> +               if (supported) {
> +                       *mask = bits;
> +                       *qwords = PERF_X86_XMM_QWORDS;
> +               }
> +               break;
> +       case PERF_REG_X86_YMM:
> +               bits = BIT_ULL(PERF_X86_SIMD_YMM_REGS) - 1;
> +               supported = support_simd_reg(sample_type, PERF_X86_YMM_QWORDS, bits, false);
> +               if (supported) {
> +                       *mask = bits;
> +                       *qwords = PERF_X86_YMM_QWORDS;
> +               }
> +               break;
> +       case PERF_REG_X86_ZMM:
> +               bits = BIT_ULL(PERF_X86_SIMD_ZMM_REGS) - 1;
> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
> +               if (supported) {
> +                       *mask = bits;
> +                       *qwords = PERF_X86_ZMM_QWORDS;
> +                       break;
> +               }
> +
> +               bits = BIT_ULL(PERF_X86_SIMD_ZMMH_REGS) - 1;
> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
> +               if (supported) {
> +                       *mask = bits;
> +                       *qwords = PERF_X86_ZMMH_QWORDS;
> +               }
> +               break;
> +       default:
> +               break;
> +       }
> +
> +       return supported;
> +}
> +
> +static bool __arch_pred_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
> +{
> +       bool supported = false;
> +       u64 bits;
> +
> +       *mask = 0;
> +       *qwords = 0;
> +
> +       switch (reg) {
> +       case PERF_REG_X86_OPMASK:
> +               bits = BIT_ULL(PERF_X86_SIMD_OPMASK_REGS) - 1;
> +               supported = support_simd_reg(sample_type, PERF_X86_OPMASK_QWORDS, bits, true);
> +               if (supported) {
> +                       *mask = bits;
> +                       *qwords = PERF_X86_OPMASK_QWORDS;
> +               }
> +               break;
> +       default:
> +               break;
> +       }
> +
> +       return supported;
> +}
> +
> +static bool has_cap_simd_regs(void)
> +{
> +       uint64_t mask = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
> +       u16 qwords = PERF_X86_XMM_QWORDS;
> +       static bool has_cap_simd_regs;
> +       static bool cached;
> +
> +       if (cached)
> +               return has_cap_simd_regs;
> +
> +       has_cap_simd_regs = __arch_simd_reg_mask(PERF_SAMPLE_REGS_INTR,
> +                                                PERF_REG_X86_XMM, &mask, &qwords);
> +       has_cap_simd_regs |= __arch_simd_reg_mask(PERF_SAMPLE_REGS_USER,
> +                                                PERF_REG_X86_XMM, &mask, &qwords);
> +       cached = true;
> +
> +       return has_cap_simd_regs;
> +}
> +
> +bool arch_has_simd_regs(u64 mask)
> +{
> +       return has_cap_simd_regs() &&
> +              mask & GENMASK_ULL(PERF_REG_X86_SSP, PERF_REG_X86_R16);
> +}
> +
> +static const struct sample_reg sample_simd_reg_masks[] = {
> +       SMPL_REG(XMM, PERF_REG_X86_XMM),
> +       SMPL_REG(YMM, PERF_REG_X86_YMM),
> +       SMPL_REG(ZMM, PERF_REG_X86_ZMM),
> +       SMPL_REG_END
> +};
> +
> +static const struct sample_reg sample_pred_reg_masks[] = {
> +       SMPL_REG(OPMASK, PERF_REG_X86_OPMASK),
> +       SMPL_REG_END
> +};
> +
> +const struct sample_reg *arch__sample_simd_reg_masks(void)
> +{
> +       return sample_simd_reg_masks;
> +}
> +
> +const struct sample_reg *arch__sample_pred_reg_masks(void)
> +{
> +       return sample_pred_reg_masks;
> +}
> +
> +static bool x86_intr_simd_updated;
> +static u64 x86_intr_simd_reg_mask;
> +static u64 x86_intr_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
> +static u16 x86_intr_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
> +static bool x86_user_simd_updated;
> +static u64 x86_user_simd_reg_mask;
> +static u64 x86_user_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
> +static u16 x86_user_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
> +
> +static bool x86_intr_pred_updated;
> +static u64 x86_intr_pred_reg_mask;
> +static u64 x86_intr_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
> +static u16 x86_intr_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
> +static bool x86_user_pred_updated;
> +static u64 x86_user_pred_reg_mask;
> +static u64 x86_user_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
> +static u16 x86_user_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
> +
> +static uint64_t __arch__simd_reg_mask(u64 sample_type)
> +{
> +       const struct sample_reg *r = NULL;
> +       bool supported;
> +       u64 mask = 0;
> +       int reg;
> +
> +       if (!has_cap_simd_regs())
> +               return 0;
> +
> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_simd_updated)
> +               return x86_intr_simd_reg_mask;
> +
> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_simd_updated)
> +               return x86_user_simd_reg_mask;
> +
> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> +               supported = false;
> +
> +               if (!r->mask)
> +                       continue;
> +               reg = fls64(r->mask) - 1;
> +
> +               if (reg >= PERF_REG_X86_MAX_SIMD_REGS)
> +                       break;
> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> +                       supported = __arch_simd_reg_mask(sample_type, reg,
> +                                                        &x86_intr_simd_mask[reg],
> +                                                        &x86_intr_simd_qwords[reg]);
> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
> +                       supported = __arch_simd_reg_mask(sample_type, reg,
> +                                                        &x86_user_simd_mask[reg],
> +                                                        &x86_user_simd_qwords[reg]);
> +               if (supported)
> +                       mask |= BIT_ULL(reg);
> +       }
> +
> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
> +               x86_intr_simd_reg_mask = mask;
> +               x86_intr_simd_updated = true;
> +       } else {
> +               x86_user_simd_reg_mask = mask;
> +               x86_user_simd_updated = true;
> +       }
> +
> +       return mask;
> +}
> +
> +static uint64_t __arch__pred_reg_mask(u64 sample_type)
> +{
> +       const struct sample_reg *r = NULL;
> +       bool supported;
> +       u64 mask = 0;
> +       int reg;
> +
> +       if (!has_cap_simd_regs())
> +               return 0;
> +
> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_pred_updated)
> +               return x86_intr_pred_reg_mask;
> +
> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_pred_updated)
> +               return x86_user_pred_reg_mask;
> +
> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> +               supported = false;
> +
> +               if (!r->mask)
> +                       continue;
> +               reg = fls64(r->mask) - 1;
> +
> +               if (reg >= PERF_REG_X86_MAX_PRED_REGS)
> +                       break;
> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> +                       supported = __arch_pred_reg_mask(sample_type, reg,
> +                                                        &x86_intr_pred_mask[reg],
> +                                                        &x86_intr_pred_qwords[reg]);
> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
> +                       supported = __arch_pred_reg_mask(sample_type, reg,
> +                                                        &x86_user_pred_mask[reg],
> +                                                        &x86_user_pred_qwords[reg]);
> +               if (supported)
> +                       mask |= BIT_ULL(reg);
> +       }
> +
> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
> +               x86_intr_pred_reg_mask = mask;
> +               x86_intr_pred_updated = true;
> +       } else {
> +               x86_user_pred_reg_mask = mask;
> +               x86_user_pred_updated = true;
> +       }
> +
> +       return mask;
> +}
> +
> +uint64_t arch__intr_simd_reg_mask(void)
> +{
> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_INTR);
> +}
> +
> +uint64_t arch__user_simd_reg_mask(void)
> +{
> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_USER);
> +}
> +
> +uint64_t arch__intr_pred_reg_mask(void)
> +{
> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_INTR);
> +}
> +
> +uint64_t arch__user_pred_reg_mask(void)
> +{
> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_USER);
> +}
> +
> +static uint64_t arch__simd_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
> +{
> +       uint64_t mask = 0;
> +
> +       *qwords = 0;
> +       if (reg < PERF_REG_X86_MAX_SIMD_REGS) {
> +               if (intr) {
> +                       *qwords = x86_intr_simd_qwords[reg];
> +                       mask = x86_intr_simd_mask[reg];
> +               } else {
> +                       *qwords = x86_user_simd_qwords[reg];
> +                       mask = x86_user_simd_mask[reg];
> +               }
> +       }
> +
> +       return mask;
> +}
> +
> +static uint64_t arch__pred_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
> +{
> +       uint64_t mask = 0;
> +
> +       *qwords = 0;
> +       if (reg < PERF_REG_X86_MAX_PRED_REGS) {
> +               if (intr) {
> +                       *qwords = x86_intr_pred_qwords[reg];
> +                       mask = x86_intr_pred_mask[reg];
> +               } else {
> +                       *qwords = x86_user_pred_qwords[reg];
> +                       mask = x86_user_pred_mask[reg];
> +               }
> +       }
> +
> +       return mask;
> +}
> +
> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords)
> +{
> +       if (!x86_intr_simd_updated)
> +               arch__intr_simd_reg_mask();
> +       return arch__simd_reg_bitmap_qwords(reg, qwords, true);
> +}
> +
> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords)
> +{
> +       if (!x86_user_simd_updated)
> +               arch__user_simd_reg_mask();
> +       return arch__simd_reg_bitmap_qwords(reg, qwords, false);
> +}
> +
> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords)
> +{
> +       if (!x86_intr_pred_updated)
> +               arch__intr_pred_reg_mask();
> +       return arch__pred_reg_bitmap_qwords(reg, qwords, true);
> +}
> +
> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords)
> +{
> +       if (!x86_user_pred_updated)
> +               arch__user_pred_reg_mask();
> +       return arch__pred_reg_bitmap_qwords(reg, qwords, false);
> +}
> +
>  const struct sample_reg *arch__sample_reg_masks(void)
>  {
> +       if (has_cap_simd_regs())
> +               return sample_reg_masks_ext;
>         return sample_reg_masks;
>  }
>
> -uint64_t arch__intr_reg_mask(void)
> +static uint64_t __arch__reg_mask(u64 sample_type, u64 mask, bool has_simd_regs)
>  {
>         struct perf_event_attr attr = {
> -               .type                   = PERF_TYPE_HARDWARE,
> -               .config                 = PERF_COUNT_HW_CPU_CYCLES,
> -               .sample_type            = PERF_SAMPLE_REGS_INTR,
> -               .sample_regs_intr       = PERF_REG_EXTENDED_MASK,
> -               .precise_ip             = 1,
> -               .disabled               = 1,
> -               .exclude_kernel         = 1,
> +               .type                           = PERF_TYPE_HARDWARE,
> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
> +               .sample_type                    = sample_type,
> +               .precise_ip                     = 1,
> +               .disabled                       = 1,
> +               .exclude_kernel                 = 1,
> +               .sample_simd_regs_enabled       = has_simd_regs,
>         };
>         int fd;
>         /*
>          * In an unnamed union, init it here to build on older gcc versions
>          */
>         attr.sample_period = 1;
> +       if (sample_type == PERF_SAMPLE_REGS_INTR)
> +               attr.sample_regs_intr = mask;
> +       else
> +               attr.sample_regs_user = mask;
>
>         if (perf_pmus__num_core_pmus() > 1) {
>                 struct perf_pmu *pmu = NULL;
> @@ -318,13 +738,41 @@ uint64_t arch__intr_reg_mask(void)
>         fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
>         if (fd != -1) {
>                 close(fd);
> -               return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
> +               return mask;
>         }
>
> -       return PERF_REGS_MASK;
> +       return 0;
> +}
> +
> +uint64_t arch__intr_reg_mask(void)
> +{
> +       uint64_t mask = PERF_REGS_MASK;
> +
> +       if (has_cap_simd_regs()) {
> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
> +                                        true);
> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
> +                                        BIT_ULL(PERF_REG_X86_SSP),
> +                                        true);
> +       } else
> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR, PERF_REG_EXTENDED_MASK, false);
> +
> +       return mask;
>  }
>
>  uint64_t arch__user_reg_mask(void)
>  {
> -       return PERF_REGS_MASK;
> +       uint64_t mask = PERF_REGS_MASK;
> +
> +       if (has_cap_simd_regs()) {
> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
> +                                        true);
> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
> +                                        BIT_ULL(PERF_REG_X86_SSP),
> +                                        true);
> +       }
> +
> +       return mask;
>  }
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index 56ebefd075f2..5d1d90cf9488 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -1461,12 +1461,39 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
>         if (opts->sample_intr_regs && !evsel->no_aux_samples &&
>             !evsel__is_dummy_event(evsel)) {
>                 attr->sample_regs_intr = opts->sample_intr_regs;
> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_intr);
> +               evsel__set_sample_bit(evsel, REGS_INTR);
> +       }
> +
> +       if ((opts->sample_intr_vec_regs || opts->sample_intr_pred_regs) &&
> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
> +               /* Non-zero pred qwords implies the set of SIMD registers is used */
> +               if (opts->sample_pred_regs_qwords)
> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
> +               else
> +                       attr->sample_simd_pred_reg_qwords = 1;
> +               attr->sample_simd_vec_reg_intr = opts->sample_intr_vec_regs;
> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
> +               attr->sample_simd_pred_reg_intr = opts->sample_intr_pred_regs;
>                 evsel__set_sample_bit(evsel, REGS_INTR);
>         }
>
>         if (opts->sample_user_regs && !evsel->no_aux_samples &&
>             !evsel__is_dummy_event(evsel)) {
>                 attr->sample_regs_user |= opts->sample_user_regs;
> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_user);
> +               evsel__set_sample_bit(evsel, REGS_USER);
> +       }
> +
> +       if ((opts->sample_user_vec_regs || opts->sample_user_pred_regs) &&
> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
> +               if (opts->sample_pred_regs_qwords)
> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
> +               else
> +                       attr->sample_simd_pred_reg_qwords = 1;
> +               attr->sample_simd_vec_reg_user = opts->sample_user_vec_regs;
> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
> +               attr->sample_simd_pred_reg_user = opts->sample_user_pred_regs;
>                 evsel__set_sample_bit(evsel, REGS_USER);
>         }
>
> diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
> index cda1c620968e..0bd100392889 100644
> --- a/tools/perf/util/parse-regs-options.c
> +++ b/tools/perf/util/parse-regs-options.c
> @@ -4,19 +4,139 @@
>  #include <stdint.h>
>  #include <string.h>
>  #include <stdio.h>
> +#include <linux/bitops.h>
>  #include "util/debug.h"
>  #include <subcmd/parse-options.h>
>  #include "util/perf_regs.h"
>  #include "util/parse-regs-options.h"
> +#include "record.h"
> +
> +static void __print_simd_regs(bool intr, uint64_t simd_mask)
> +{
> +       const struct sample_reg *r = NULL;
> +       uint64_t bitmap = 0;
> +       u16 qwords = 0;
> +       int reg_idx;
> +
> +       if (!simd_mask)
> +               return;
> +
> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> +               if (!(r->mask & simd_mask))
> +                       continue;
> +               reg_idx = fls64(r->mask) - 1;
> +               if (intr)
> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
> +               else
> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
> +               if (bitmap)
> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
> +       }
> +}
> +
> +static void __print_pred_regs(bool intr, uint64_t pred_mask)
> +{
> +       const struct sample_reg *r = NULL;
> +       uint64_t bitmap = 0;
> +       u16 qwords = 0;
> +       int reg_idx;
> +
> +       if (!pred_mask)
> +               return;
> +
> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> +               if (!(r->mask & pred_mask))
> +                       continue;
> +               reg_idx = fls64(r->mask) - 1;
> +               if (intr)
> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
> +               else
> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
> +               if (bitmap)
> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
> +       }
> +}
> +
> +static bool __parse_simd_regs(struct record_opts *opts, char *s, bool intr)
> +{
> +       const struct sample_reg *r = NULL;
> +       bool matched = false;
> +       uint64_t bitmap = 0;
> +       u16 qwords = 0;
> +       int reg_idx;
> +
> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> +               if (strcasecmp(s, r->name))
> +                       continue;
> +               if (!fls64(r->mask))
> +                       continue;
> +               reg_idx = fls64(r->mask) - 1;
> +               if (intr)
> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
> +               else
> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
> +               matched = true;
> +               break;
> +       }
> +
> +       /* Just need the highest qwords */
> +       if (qwords > opts->sample_vec_regs_qwords) {
> +               opts->sample_vec_regs_qwords = qwords;
> +               if (intr)
> +                       opts->sample_intr_vec_regs = bitmap;
> +               else
> +                       opts->sample_user_vec_regs = bitmap;
> +       }
> +
> +       return matched;
> +}
> +
> +static bool __parse_pred_regs(struct record_opts *opts, char *s, bool intr)
> +{
> +       const struct sample_reg *r = NULL;
> +       bool matched = false;
> +       uint64_t bitmap = 0;
> +       u16 qwords = 0;
> +       int reg_idx;
> +
> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> +               if (strcasecmp(s, r->name))
> +                       continue;
> +               if (!fls64(r->mask))
> +                       continue;
> +               reg_idx = fls64(r->mask) - 1;
> +               if (intr)
> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
> +               else
> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
> +               matched = true;
> +               break;
> +       }
> +
> +       /* Just need the highest qwords */
> +       if (qwords > opts->sample_pred_regs_qwords) {
> +               opts->sample_pred_regs_qwords = qwords;
> +               if (intr)
> +                       opts->sample_intr_pred_regs = bitmap;
> +               else
> +                       opts->sample_user_pred_regs = bitmap;
> +       }
> +
> +       return matched;
> +}
>
>  static int
>  __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>  {
>         uint64_t *mode = (uint64_t *)opt->value;
>         const struct sample_reg *r = NULL;
> +       struct record_opts *opts;
>         char *s, *os = NULL, *p;
> -       int ret = -1;
> +       bool has_simd_regs = false;
>         uint64_t mask;
> +       uint64_t simd_mask;
> +       uint64_t pred_mask;
> +       int ret = -1;
>
>         if (unset)
>                 return 0;
> @@ -27,10 +147,17 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>         if (*mode)
>                 return -1;
>
> -       if (intr)
> +       if (intr) {
> +               opts = container_of(opt->value, struct record_opts, sample_intr_regs);
>                 mask = arch__intr_reg_mask();
> -       else
> +               simd_mask = arch__intr_simd_reg_mask();
> +               pred_mask = arch__intr_pred_reg_mask();
> +       } else {
> +               opts = container_of(opt->value, struct record_opts, sample_user_regs);
>                 mask = arch__user_reg_mask();
> +               simd_mask = arch__user_simd_reg_mask();
> +               pred_mask = arch__user_pred_reg_mask();
> +       }
>
>         /* str may be NULL in case no arg is passed to -I */
>         if (str) {
> @@ -50,10 +177,24 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>                                         if (r->mask & mask)
>                                                 fprintf(stderr, "%s ", r->name);
>                                 }
> +                               __print_simd_regs(intr, simd_mask);
> +                               __print_pred_regs(intr, pred_mask);
>                                 fputc('\n', stderr);
>                                 /* just printing available regs */
>                                 goto error;
>                         }
> +
> +                       if (simd_mask) {
> +                               has_simd_regs = __parse_simd_regs(opts, s, intr);
> +                               if (has_simd_regs)
> +                                       goto next;
> +                       }
> +                       if (pred_mask) {
> +                               has_simd_regs = __parse_pred_regs(opts, s, intr);
> +                               if (has_simd_regs)
> +                                       goto next;
> +                       }
> +
>                         for (r = arch__sample_reg_masks(); r->name; r++) {
>                                 if ((r->mask & mask) && !strcasecmp(s, r->name))
>                                         break;
> @@ -65,7 +206,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>                         }
>
>                         *mode |= r->mask;
> -
> +next:
>                         if (!p)
>                                 break;
>
> @@ -75,7 +216,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>         ret = 0;
>
>         /* default to all possible regs */
> -       if (*mode == 0)
> +       if (*mode == 0 && !has_simd_regs)
>                 *mode = mask;
>  error:
>         free(os);
> diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
> index 66b666d9ce64..fb0366d050cf 100644
> --- a/tools/perf/util/perf_event_attr_fprintf.c
> +++ b/tools/perf/util/perf_event_attr_fprintf.c
> @@ -360,6 +360,12 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
>         PRINT_ATTRf(aux_start_paused, p_unsigned);
>         PRINT_ATTRf(aux_pause, p_unsigned);
>         PRINT_ATTRf(aux_resume, p_unsigned);
> +       PRINT_ATTRf(sample_simd_pred_reg_qwords, p_unsigned);
> +       PRINT_ATTRf(sample_simd_pred_reg_intr, p_hex);
> +       PRINT_ATTRf(sample_simd_pred_reg_user, p_hex);
> +       PRINT_ATTRf(sample_simd_vec_reg_qwords, p_unsigned);
> +       PRINT_ATTRf(sample_simd_vec_reg_intr, p_hex);
> +       PRINT_ATTRf(sample_simd_vec_reg_user, p_hex);
>
>         return ret;
>  }
> diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
> index 44b90bbf2d07..e8a9fabc92e6 100644
> --- a/tools/perf/util/perf_regs.c
> +++ b/tools/perf/util/perf_regs.c
> @@ -11,6 +11,11 @@ int __weak arch_sdt_arg_parse_op(char *old_op __maybe_unused,
>         return SDT_ARG_SKIP;
>  }
>
> +bool __weak arch_has_simd_regs(u64 mask __maybe_unused)
> +{
> +       return false;
> +}
> +
>  uint64_t __weak arch__intr_reg_mask(void)
>  {
>         return 0;
> @@ -21,6 +26,50 @@ uint64_t __weak arch__user_reg_mask(void)
>         return 0;
>  }
>
> +uint64_t __weak arch__intr_simd_reg_mask(void)
> +{
> +       return 0;
> +}
> +
> +uint64_t __weak arch__user_simd_reg_mask(void)
> +{
> +       return 0;
> +}
> +
> +uint64_t __weak arch__intr_pred_reg_mask(void)
> +{
> +       return 0;
> +}
> +
> +uint64_t __weak arch__user_pred_reg_mask(void)
> +{
> +       return 0;
> +}
> +
> +uint64_t __weak arch__intr_simd_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
> +{
> +       *qwords = 0;
> +       return 0;
> +}
> +
> +uint64_t __weak arch__user_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> +{
> +       *qwords = 0;
> +       return 0;
> +}
> +
> +uint64_t __weak arch__intr_pred_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
> +{
> +       *qwords = 0;
> +       return 0;
> +}
> +
> +uint64_t __weak arch__user_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> +{
> +       *qwords = 0;
> +       return 0;
> +}
> +
>  static const struct sample_reg sample_reg_masks[] = {
>         SMPL_REG_END
>  };
> @@ -30,6 +79,16 @@ const struct sample_reg * __weak arch__sample_reg_masks(void)
>         return sample_reg_masks;
>  }
>
> +const struct sample_reg * __weak arch__sample_simd_reg_masks(void)
> +{
> +       return sample_reg_masks;
> +}
> +
> +const struct sample_reg * __weak arch__sample_pred_reg_masks(void)
> +{
> +       return sample_reg_masks;
> +}
> +
>  const char *perf_reg_name(int id, const char *arch)
>  {
>         const char *reg_name = NULL;
> diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
> index f2d0736d65cc..bce9c4cfd1bf 100644
> --- a/tools/perf/util/perf_regs.h
> +++ b/tools/perf/util/perf_regs.h
> @@ -24,9 +24,20 @@ enum {
>  };
>
>  int arch_sdt_arg_parse_op(char *old_op, char **new_op);
> +bool arch_has_simd_regs(u64 mask);
>  uint64_t arch__intr_reg_mask(void);
>  uint64_t arch__user_reg_mask(void);
>  const struct sample_reg *arch__sample_reg_masks(void);
> +const struct sample_reg *arch__sample_simd_reg_masks(void);
> +const struct sample_reg *arch__sample_pred_reg_masks(void);

I wonder if we can remove these functions. perf_reg_name(int id, uint16_t
e_machine) maps a perf register number and e_machine to a string, so
the sample_reg array could be replaced with:
```
for (int perf_reg = 0; perf_reg < 64; perf_reg++) {
  uint64_t mask = 1LL << perf_reg;
  const char *name = perf_reg_name(perf_reg, EM_HOST);
  if (name == NULL)
    break;
  // use mask and name
}
```
To make it work for SIMD and PRED, I guess we would need to iterate
through the ABIs of enum perf_sample_regs_abi.
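A compilable sketch of that iteration (stub_reg_name is a hypothetical
stand-in for perf's perf_reg_name(), and the three-register table is
made up purely for illustration):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-in for perf_reg_name(): returns NULL past the
 * last supported register; the real function also takes an e_machine. */
static const char *stub_reg_name(int perf_reg)
{
	static const char * const names[] = { "AX", "BX", "CX" };

	return (perf_reg >= 0 && perf_reg < 3) ? names[perf_reg] : NULL;
}

/* Walk (mask, name) pairs instead of keeping a static sample_reg
 * array; returns how many registers were visited. */
static int count_named_regs(void)
{
	int count = 0;

	for (int perf_reg = 0; perf_reg < 64; perf_reg++) {
		uint64_t mask = 1ULL << perf_reg;
		const char *name = stub_reg_name(perf_reg);

		if (name == NULL)
			break;
		(void)mask; /* use mask and name here */
		count++;
	}
	return count;
}
```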

> +uint64_t arch__intr_simd_reg_mask(void);
> +uint64_t arch__user_simd_reg_mask(void);
> +uint64_t arch__intr_pred_reg_mask(void);
> +uint64_t arch__user_pred_reg_mask(void);

I think some comments would be useful here like:
```
/* Perf register bit map with valid bits for
perf_event_attr.sample_regs_intr. */
uint64_t arch__intr_reg_mask(void);
/* Perf register bit map with valid bits for
perf_event_attr.sample_regs_user. */
uint64_t arch__user_reg_mask(void);
/* Perf register bit map with valid bits for
perf_event_attr.sample_simd_vec_reg_intr. */
uint64_t arch__intr_simd_reg_mask(void);
/* Perf register bit map with valid bits for
perf_event_attr.sample_simd_vec_reg_user. */
uint64_t arch__user_simd_reg_mask(void);
/* Perf register bit map with valid bits for
perf_event_attr.sample_simd_pred_reg_intr. */
uint64_t arch__intr_pred_reg_mask(void);
/* Perf register bit map with valid bits for
perf_event_attr.sample_simd_pred_reg_user. */
uint64_t arch__user_pred_reg_mask(void);
```

Why does arch__user_pred_reg_mask return a uint64_t when the
corresponding perf_event_attr field is a __u32?

> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);

I don't understand these functions. The qwords value is specific to a
perf_event_attr. We could have an evlist with one evsel set up to
sample, say, XMM registers and another evsel set up to sample ZMM
registers. Are the qwords here always for the ZMM case, or XMM,
YMM, or ZMM depending on architecture support? Why does it vary per
register? The surrounding code uses the term mask but here bitmap is
used; is the inconsistency deliberate? Why are there user and intr
functions when the perf_event_attr only has single
sample_simd_pred_reg_qwords and sample_simd_vec_reg_qwords variables?

Perhaps these functions should be something more like:
```
/* Maximum value that can be assigned to
perf_event_attr.sample_simd_pred_reg_qwords. */
uint16_t arch__simd_pred_reg_qwords_max(void);
/* Maximum value that can be assigned to
perf_event_attr.sample_simd_vec_reg_qwords. */
uint16_t arch__simd_vec_reg_qwords_max(void);
```
Then the bitmap computation logic can all be moved into parse-regs-options.c.
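For instance, the derivation in parse-regs-options.c might then look
something like this (a sketch assuming the proposed qwords-max API and
x86's XMM/YMM/ZMM register counts; vec_reg_bitmap is a hypothetical
helper name):

```c
#include <stdint.h>

/* Hypothetical helper: derive the vector register bitmap from the
 * maximum supported qwords (2 = XMM, 4 = YMM, 8 = ZMM on x86). */
static uint64_t vec_reg_bitmap(uint16_t vec_reg_qwords_max)
{
	switch (vec_reg_qwords_max) {
	case 2: /* XMM0-XMM15 */
	case 4: /* YMM0-YMM15 */
		return (1ULL << 16) - 1;
	case 8: /* ZMM0-ZMM31 */
		return (1ULL << 32) - 1;
	default:
		return 0;
	}
}
```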

Thanks,
Ian

>  const char *perf_reg_name(int id, const char *arch);
>  int perf_reg_value(u64 *valp, struct regs_dump *regs, int id);
> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
> index ea3a6c4657ee..825ffb4cc53f 100644
> --- a/tools/perf/util/record.h
> +++ b/tools/perf/util/record.h
> @@ -59,7 +59,13 @@ struct record_opts {
>         unsigned int  user_freq;
>         u64           branch_stack;
>         u64           sample_intr_regs;
> +       u64           sample_intr_vec_regs;
>         u64           sample_user_regs;
> +       u64           sample_user_vec_regs;
> +       u16           sample_pred_regs_qwords;
> +       u16           sample_vec_regs_qwords;
> +       u16           sample_intr_pred_regs;
> +       u16           sample_user_pred_regs;
>         u64           default_interval;
>         u64           user_interval;
>         size_t        auxtrace_snapshot_size;
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 17/19] perf headers: Sync with the kernel headers
  2026-01-20  7:16   ` Ian Rogers
@ 2026-01-20  7:43     ` Mi, Dapeng
  2026-01-20  8:00       ` Ian Rogers
  0 siblings, 1 reply; 86+ messages in thread
From: Mi, Dapeng @ 2026-01-20  7:43 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang


On 1/20/2026 3:16 PM, Ian Rogers wrote:
> On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> Update include/uapi/linux/perf_event.h and
>> arch/x86/include/uapi/asm/perf_regs.h to support extended regs.
>>
>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> ---
>>  tools/arch/x86/include/uapi/asm/perf_regs.h | 62 +++++++++++++++++++++
>>  tools/include/uapi/linux/perf_event.h       | 45 +++++++++++++--
>>  2 files changed, 103 insertions(+), 4 deletions(-)
>>
>> diff --git a/tools/arch/x86/include/uapi/asm/perf_regs.h b/tools/arch/x86/include/uapi/asm/perf_regs.h
>> index 7c9d2bb3833b..f3561ed10041 100644
>> --- a/tools/arch/x86/include/uapi/asm/perf_regs.h
>> +++ b/tools/arch/x86/include/uapi/asm/perf_regs.h
>> @@ -27,9 +27,34 @@ enum perf_event_x86_regs {
>>         PERF_REG_X86_R13,
>>         PERF_REG_X86_R14,
>>         PERF_REG_X86_R15,
>> +       /*
>> +        * The EGPRs/SSP and XMM have overlaps. Only one can be used
>> +        * at a time. For the ABI type PERF_SAMPLE_REGS_ABI_SIMD,
>> +        * utilize EGPRs/SSP. For the other ABI type, XMM is used.
>> +        *
>> +        * Extended GPRs (EGPRs)
>> +        */
>> +       PERF_REG_X86_R16,
>> +       PERF_REG_X86_R17,
>> +       PERF_REG_X86_R18,
>> +       PERF_REG_X86_R19,
>> +       PERF_REG_X86_R20,
>> +       PERF_REG_X86_R21,
>> +       PERF_REG_X86_R22,
>> +       PERF_REG_X86_R23,
>> +       PERF_REG_X86_R24,
>> +       PERF_REG_X86_R25,
>> +       PERF_REG_X86_R26,
>> +       PERF_REG_X86_R27,
>> +       PERF_REG_X86_R28,
>> +       PERF_REG_X86_R29,
>> +       PERF_REG_X86_R30,
>> +       PERF_REG_X86_R31,
>> +       PERF_REG_X86_SSP,
>>         /* These are the limits for the GPRs. */
>>         PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
>>         PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1,
>> +       PERF_REG_MISC_MAX = PERF_REG_X86_SSP + 1,
>>
>>         /* These all need two bits set because they are 128bit */
>>         PERF_REG_X86_XMM0  = 32,
>> @@ -54,5 +79,42 @@ enum perf_event_x86_regs {
>>  };
>>
>>  #define PERF_REG_EXTENDED_MASK (~((1ULL << PERF_REG_X86_XMM0) - 1))
>> +#define PERF_X86_EGPRS_MASK    GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16)
>> +
>> +enum {
>> +       PERF_REG_X86_XMM,
>> +       PERF_REG_X86_YMM,
>> +       PERF_REG_X86_ZMM,
>> +       PERF_REG_X86_MAX_SIMD_REGS,
>> +
>> +       PERF_REG_X86_OPMASK = 0,
>> +       PERF_REG_X86_MAX_PRED_REGS = 1,
>> +};
>> +
>> +enum {
>> +       PERF_X86_SIMD_XMM_REGS      = 16,
>> +       PERF_X86_SIMD_YMM_REGS      = 16,
>> +       PERF_X86_SIMD_ZMMH_REGS     = 16,
>> +       PERF_X86_SIMD_ZMM_REGS      = 32,
>> +       PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_ZMM_REGS,
>> +
>> +       PERF_X86_SIMD_OPMASK_REGS   = 8,
>> +       PERF_X86_SIMD_PRED_REGS_MAX = PERF_X86_SIMD_OPMASK_REGS,
>> +};
>> +
>> +#define PERF_X86_SIMD_PRED_MASK                GENMASK(PERF_X86_SIMD_PRED_REGS_MAX - 1, 0)
>> +#define PERF_X86_SIMD_VEC_MASK         GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
>> +
>> +#define PERF_X86_H16ZMM_BASE           PERF_X86_SIMD_ZMMH_REGS
>> +
>> +enum {
>> +       PERF_X86_OPMASK_QWORDS   = 1,
>> +       PERF_X86_XMM_QWORDS      = 2,
>> +       PERF_X86_YMMH_QWORDS     = 2,
>> +       PERF_X86_YMM_QWORDS      = 4,
>> +       PERF_X86_ZMMH_QWORDS     = 4,
>> +       PERF_X86_ZMM_QWORDS      = 8,
>> +       PERF_X86_SIMD_QWORDS_MAX = PERF_X86_ZMM_QWORDS,
>> +};
>>
>>  #endif /* _ASM_X86_PERF_REGS_H */
>> diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
>> index d292f96bc06f..f1474da32622 100644
>> --- a/tools/include/uapi/linux/perf_event.h
>> +++ b/tools/include/uapi/linux/perf_event.h
>> @@ -314,8 +314,9 @@ enum {
>>   */
>>  enum perf_sample_regs_abi {
>>         PERF_SAMPLE_REGS_ABI_NONE               = 0,
>> -       PERF_SAMPLE_REGS_ABI_32                 = 1,
>> -       PERF_SAMPLE_REGS_ABI_64                 = 2,
>> +       PERF_SAMPLE_REGS_ABI_32                 = (1 << 0),
>> +       PERF_SAMPLE_REGS_ABI_64                 = (1 << 1),
>> +       PERF_SAMPLE_REGS_ABI_SIMD               = (1 << 2),
>>  };
>>
>>  /*
>> @@ -382,6 +383,7 @@ enum perf_event_read_format {
>>  #define PERF_ATTR_SIZE_VER6                    120     /* Add: aux_sample_size */
>>  #define PERF_ATTR_SIZE_VER7                    128     /* Add: sig_data */
>>  #define PERF_ATTR_SIZE_VER8                    136     /* Add: config3 */
>> +#define PERF_ATTR_SIZE_VER9                    168     /* Add: sample_simd_{pred,vec}_reg_* */
>>
>>  /*
>>   * 'struct perf_event_attr' contains various attributes that define
>> @@ -545,6 +547,25 @@ struct perf_event_attr {
>>         __u64   sig_data;
>>
>>         __u64   config3; /* extension of config2 */
>> +
>> +
>> +       /*
>> +        * Defines set of SIMD registers to dump on samples.
>> +        * The sample_simd_regs_enabled !=0 implies the
>> +        * set of SIMD registers is used to config all SIMD registers.
>> +        * If !sample_simd_regs_enabled, sample_regs_XXX may be used to
>> +        * config some SIMD registers on X86.
>> +        */
>> +       union {
>> +               __u16 sample_simd_regs_enabled;
>> +               __u16 sample_simd_pred_reg_qwords;
>> +       };
>> +       __u32 sample_simd_pred_reg_intr;
>> +       __u32 sample_simd_pred_reg_user;
>> +       __u16 sample_simd_vec_reg_qwords;
>> +       __u64 sample_simd_vec_reg_intr;
>> +       __u64 sample_simd_vec_reg_user;
>> +       __u32 __reserved_4;
>>  };
>>
>>  /*
>> @@ -1018,7 +1039,15 @@ enum perf_event_type {
>>          *      } && PERF_SAMPLE_BRANCH_STACK
>>          *
>>          *      { u64                   abi; # enum perf_sample_regs_abi
>> -        *        u64                   regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
>> +        *        u64                   regs[weight(mask)];
>> +        *        struct {
>> +        *              u16 nr_vectors;
>> +        *              u16 vector_qwords;
>> +        *              u16 nr_pred;
>> +        *              u16 pred_qwords;
>> +        *              u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
>> +        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
> Why can't these values be taken from the perf_event_attr? The abi is
> needed as there could be both 32-bit and 64-bit samples for the same
> event - presumably x32 appears as 64-bit. If the ABI has SIMD within
> it (implied by the "} && (abi & PERF_SAMPLE_REGS_ABI_SIMD)" below)
> then why can't we just use the perf_event_attr values? For example,
> data could be "data[weight(sample_simd_vec_reg_user) *
> sample_simd_vec_reg_qwords + weight(sample_simd_pred_reg_user) *
> sample_simd_pred_reg_qwords]".

The main reason is that the sampled SIMD regs may be only a subset of the
requested SIMD regs in perf_event_attr, so we need to show the bitmask
and qwords length explicitly in the sample record.
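Given that layout, sizing the SIMD block on the consumer side would
look roughly like this (struct simd_hdr mirrors the header fields
documented in the patch; the struct and function names here are made
up for the sketch):

```c
#include <stdint.h>
#include <stddef.h>

/* Header of the SIMD block that follows regs[] when
 * abi & PERF_SAMPLE_REGS_ABI_SIMD, per the documented sample layout. */
struct simd_hdr {
	uint16_t nr_vectors;
	uint16_t vector_qwords;
	uint16_t nr_pred;
	uint16_t pred_qwords;
};

/* Number of u64 payload words following the header. */
static size_t simd_data_qwords(const struct simd_hdr *h)
{
	return (size_t)h->nr_vectors * h->vector_qwords +
	       (size_t)h->nr_pred * h->pred_qwords;
}
```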


>
>> +        *      } && PERF_SAMPLE_REGS_USER
>>          *
>>          *      { u64                   size;
>>          *        char                  data[size];
>> @@ -1045,7 +1074,15 @@ enum perf_event_type {
>>          *      { u64                   data_src; } && PERF_SAMPLE_DATA_SRC
>>          *      { u64                   transaction; } && PERF_SAMPLE_TRANSACTION
>>          *      { u64                   abi; # enum perf_sample_regs_abi
>> -        *        u64                   regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
>> +        *        u64                   regs[weight(mask)];
>> +        *        struct {
>> +        *              u16 nr_vectors;
>> +        *              u16 vector_qwords;
>> +        *              u16 nr_pred;
>> +        *              u16 pred_qwords;
>> +        *              u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
>> +        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
> Same comment.
>
> Thanks,
> Ian
>
>> +        *      } && PERF_SAMPLE_REGS_INTR
>>          *      { u64                   phys_addr;} && PERF_SAMPLE_PHYS_ADDR
>>          *      { u64                   cgroup;} && PERF_SAMPLE_CGROUP
>>          *      { u64                   data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
>> --
>> 2.34.1
>>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 17/19] perf headers: Sync with the kernel headers
  2026-01-20  7:43     ` Mi, Dapeng
@ 2026-01-20  8:00       ` Ian Rogers
  2026-01-20  9:22         ` Mi, Dapeng
  0 siblings, 1 reply; 86+ messages in thread
From: Ian Rogers @ 2026-01-20  8:00 UTC (permalink / raw)
  To: Mi, Dapeng
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang

On Mon, Jan 19, 2026 at 11:43 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>
>
> On 1/20/2026 3:16 PM, Ian Rogers wrote:
> > On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
> >> From: Kan Liang <kan.liang@linux.intel.com>
> >>
> >> Update include/uapi/linux/perf_event.h and
> >> arch/x86/include/uapi/asm/perf_regs.h to support extended regs.
> >>
> >> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> >> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >> ---
> >>  tools/arch/x86/include/uapi/asm/perf_regs.h | 62 +++++++++++++++++++++
> >>  tools/include/uapi/linux/perf_event.h       | 45 +++++++++++++--
> >>  2 files changed, 103 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/tools/arch/x86/include/uapi/asm/perf_regs.h b/tools/arch/x86/include/uapi/asm/perf_regs.h
> >> index 7c9d2bb3833b..f3561ed10041 100644
> >> --- a/tools/arch/x86/include/uapi/asm/perf_regs.h
> >> +++ b/tools/arch/x86/include/uapi/asm/perf_regs.h
> >> @@ -27,9 +27,34 @@ enum perf_event_x86_regs {
> >>         PERF_REG_X86_R13,
> >>         PERF_REG_X86_R14,
> >>         PERF_REG_X86_R15,
> >> +       /*
> >> +        * The EGPRs/SSP and XMM have overlaps. Only one can be used
> >> +        * at a time. For the ABI type PERF_SAMPLE_REGS_ABI_SIMD,
> >> +        * utilize EGPRs/SSP. For the other ABI type, XMM is used.
> >> +        *
> >> +        * Extended GPRs (EGPRs)
> >> +        */
> >> +       PERF_REG_X86_R16,
> >> +       PERF_REG_X86_R17,
> >> +       PERF_REG_X86_R18,
> >> +       PERF_REG_X86_R19,
> >> +       PERF_REG_X86_R20,
> >> +       PERF_REG_X86_R21,
> >> +       PERF_REG_X86_R22,
> >> +       PERF_REG_X86_R23,
> >> +       PERF_REG_X86_R24,
> >> +       PERF_REG_X86_R25,
> >> +       PERF_REG_X86_R26,
> >> +       PERF_REG_X86_R27,
> >> +       PERF_REG_X86_R28,
> >> +       PERF_REG_X86_R29,
> >> +       PERF_REG_X86_R30,
> >> +       PERF_REG_X86_R31,
> >> +       PERF_REG_X86_SSP,
> >>         /* These are the limits for the GPRs. */
> >>         PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
> >>         PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1,
> >> +       PERF_REG_MISC_MAX = PERF_REG_X86_SSP + 1,
> >>
> >>         /* These all need two bits set because they are 128bit */
> >>         PERF_REG_X86_XMM0  = 32,
> >> @@ -54,5 +79,42 @@ enum perf_event_x86_regs {
> >>  };
> >>
> >>  #define PERF_REG_EXTENDED_MASK (~((1ULL << PERF_REG_X86_XMM0) - 1))
> >> +#define PERF_X86_EGPRS_MASK    GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16)
> >> +
> >> +enum {
> >> +       PERF_REG_X86_XMM,
> >> +       PERF_REG_X86_YMM,
> >> +       PERF_REG_X86_ZMM,
> >> +       PERF_REG_X86_MAX_SIMD_REGS,
> >> +
> >> +       PERF_REG_X86_OPMASK = 0,
> >> +       PERF_REG_X86_MAX_PRED_REGS = 1,
> >> +};
> >> +
> >> +enum {
> >> +       PERF_X86_SIMD_XMM_REGS      = 16,
> >> +       PERF_X86_SIMD_YMM_REGS      = 16,
> >> +       PERF_X86_SIMD_ZMMH_REGS     = 16,
> >> +       PERF_X86_SIMD_ZMM_REGS      = 32,
> >> +       PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_ZMM_REGS,
> >> +
> >> +       PERF_X86_SIMD_OPMASK_REGS   = 8,
> >> +       PERF_X86_SIMD_PRED_REGS_MAX = PERF_X86_SIMD_OPMASK_REGS,
> >> +};
> >> +
> >> +#define PERF_X86_SIMD_PRED_MASK                GENMASK(PERF_X86_SIMD_PRED_REGS_MAX - 1, 0)
> >> +#define PERF_X86_SIMD_VEC_MASK         GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
> >> +
> >> +#define PERF_X86_H16ZMM_BASE           PERF_X86_SIMD_ZMMH_REGS
> >> +
> >> +enum {
> >> +       PERF_X86_OPMASK_QWORDS   = 1,
> >> +       PERF_X86_XMM_QWORDS      = 2,
> >> +       PERF_X86_YMMH_QWORDS     = 2,
> >> +       PERF_X86_YMM_QWORDS      = 4,
> >> +       PERF_X86_ZMMH_QWORDS     = 4,
> >> +       PERF_X86_ZMM_QWORDS      = 8,
> >> +       PERF_X86_SIMD_QWORDS_MAX = PERF_X86_ZMM_QWORDS,
> >> +};
> >>
> >>  #endif /* _ASM_X86_PERF_REGS_H */
> >> diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
> >> index d292f96bc06f..f1474da32622 100644
> >> --- a/tools/include/uapi/linux/perf_event.h
> >> +++ b/tools/include/uapi/linux/perf_event.h
> >> @@ -314,8 +314,9 @@ enum {
> >>   */
> >>  enum perf_sample_regs_abi {
> >>         PERF_SAMPLE_REGS_ABI_NONE               = 0,
> >> -       PERF_SAMPLE_REGS_ABI_32                 = 1,
> >> -       PERF_SAMPLE_REGS_ABI_64                 = 2,
> >> +       PERF_SAMPLE_REGS_ABI_32                 = (1 << 0),
> >> +       PERF_SAMPLE_REGS_ABI_64                 = (1 << 1),
> >> +       PERF_SAMPLE_REGS_ABI_SIMD               = (1 << 2),
> >>  };
> >>
> >>  /*
> >> @@ -382,6 +383,7 @@ enum perf_event_read_format {
> >>  #define PERF_ATTR_SIZE_VER6                    120     /* Add: aux_sample_size */
> >>  #define PERF_ATTR_SIZE_VER7                    128     /* Add: sig_data */
> >>  #define PERF_ATTR_SIZE_VER8                    136     /* Add: config3 */
> >> +#define PERF_ATTR_SIZE_VER9                    168     /* Add: sample_simd_{pred,vec}_reg_* */
> >>
> >>  /*
> >>   * 'struct perf_event_attr' contains various attributes that define
> >> @@ -545,6 +547,25 @@ struct perf_event_attr {
> >>         __u64   sig_data;
> >>
> >>         __u64   config3; /* extension of config2 */
> >> +
> >> +
> >> +       /*
> >> +        * Defines set of SIMD registers to dump on samples.
> >> +        * The sample_simd_regs_enabled !=0 implies the
> >> +        * set of SIMD registers is used to config all SIMD registers.
> >> +        * If !sample_simd_regs_enabled, sample_regs_XXX may be used to
> >> +        * config some SIMD registers on X86.
> >> +        */
> >> +       union {
> >> +               __u16 sample_simd_regs_enabled;
> >> +               __u16 sample_simd_pred_reg_qwords;
> >> +       };
> >> +       __u32 sample_simd_pred_reg_intr;
> >> +       __u32 sample_simd_pred_reg_user;
> >> +       __u16 sample_simd_vec_reg_qwords;
> >> +       __u64 sample_simd_vec_reg_intr;
> >> +       __u64 sample_simd_vec_reg_user;
> >> +       __u32 __reserved_4;
> >>  };
> >>
> >>  /*
> >> @@ -1018,7 +1039,15 @@ enum perf_event_type {
> >>          *      } && PERF_SAMPLE_BRANCH_STACK
> >>          *
> >>          *      { u64                   abi; # enum perf_sample_regs_abi
> >> -        *        u64                   regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
> >> +        *        u64                   regs[weight(mask)];
> >> +        *        struct {
> >> +        *              u16 nr_vectors;
> >> +        *              u16 vector_qwords;
> >> +        *              u16 nr_pred;
> >> +        *              u16 pred_qwords;
> >> +        *              u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
> >> +        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
> > Why can't these values be taken from the perf_event_attr? The abi is
> > needed as there could be both 32-bit and 64-bit samples for the same
> > event - presumably x32 appears as 64-bit. If the ABI has SIMD within
> > it (implied by the "} && (abi & PERF_SAMPLE_REGS_ABI_SIMD)" below)
> > then why can't we just use the perf_event_attr values? For example,
> > data could be "data[weight(sample_simd_vec_reg_user) *
> > sample_simd_vec_reg_qwords + weight(sample_simd_pred_reg_user) *
> > sample_simd_pred_reg_qwords]".
>
> The main reason is that the sampled SIMD regs could only be a subset of the
> requested SIMD regs in perf_event_attr, so we need to show the bitmask and
> qwords length explicitly in the sample record.

But this doesn't happen with any other register sampling; why does it in this case?

Perhaps add comments along the lines of:
u16 nr_vectors;  // weight(sample_simd_vec_reg_user) except when ...

My random guess as to why the value differs from the weight would be
some kind of optimization around register values of 0. And even if the
number of registers is reduced, why is the number of qwords being
altered?

Thanks,
Ian

> >
> >> +        *      } && PERF_SAMPLE_REGS_USER
> >>          *
> >>          *      { u64                   size;
> >>          *        char                  data[size];
> >> @@ -1045,7 +1074,15 @@ enum perf_event_type {
> >>          *      { u64                   data_src; } && PERF_SAMPLE_DATA_SRC
> >>          *      { u64                   transaction; } && PERF_SAMPLE_TRANSACTION
> >>          *      { u64                   abi; # enum perf_sample_regs_abi
> >> -        *        u64                   regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
> >> +        *        u64                   regs[weight(mask)];
> >> +        *        struct {
> >> +        *              u16 nr_vectors;
> >> +        *              u16 vector_qwords;
> >> +        *              u16 nr_pred;
> >> +        *              u16 pred_qwords;
> >> +        *              u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
> >> +        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
> > Same comment.
> >
> > Thanks,
> > Ian
> >
> >> +        *      } && PERF_SAMPLE_REGS_INTR
> >>          *      { u64                   phys_addr;} && PERF_SAMPLE_PHYS_ADDR
> >>          *      { u64                   cgroup;} && PERF_SAMPLE_CGROUP
> >>          *      { u64                   data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
> >> --
> >> 2.34.1
> >>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
  2026-01-20  7:39   ` Ian Rogers
@ 2026-01-20  9:04     ` Mi, Dapeng
  2026-01-20 18:20       ` Ian Rogers
  0 siblings, 1 reply; 86+ messages in thread
From: Mi, Dapeng @ 2026-01-20  9:04 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang


On 1/20/2026 3:39 PM, Ian Rogers wrote:
> On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> This patch adds support for the newly introduced SIMD register sampling
>> format by adding the following functions:
>>
>> uint64_t arch__intr_simd_reg_mask(void);
>> uint64_t arch__user_simd_reg_mask(void);
>> uint64_t arch__intr_pred_reg_mask(void);
>> uint64_t arch__user_pred_reg_mask(void);
>> uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>> uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>> uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>> uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>
>> The arch__{intr|user}_simd_reg_mask() functions retrieve the bitmap of
>> supported SIMD registers, such as XMM/YMM/ZMM on x86 platforms.
>>
>> The arch__{intr|user}_pred_reg_mask() functions retrieve the bitmap of
>> supported PRED registers, such as OPMASK on x86 platforms.
>>
>> The arch__{intr|user}_simd_reg_bitmap_qwords() functions provide the
>> exact bitmap and number of qwords for a specific type of SIMD register.
>> For example, for XMM registers on x86 platforms, the returned bitmap is
>> 0xffff (XMM0 ~ XMM15) and the qwords number is 2 (128 bits for each XMM).
>>
>> The arch__{intr|user}_pred_reg_bitmap_qwords() functions provide the
>> exact bitmap and number of qwords for a specific type of PRED register.
>> For example, for OPMASK registers on x86 platforms, the returned bitmap
>> is 0xff (OPMASK0 ~ OPMASK7) and the qwords number is 1 (64 bits for each
>> OPMASK).
>>
>> Additionally, the function __parse_regs() is enhanced to support parsing
>> these newly introduced SIMD registers. Currently, each type of register
>> can only be sampled collectively; sampling a specific SIMD register is
>> not supported. For example, all XMM registers are sampled together rather
>> than sampling only XMM0.
>>
>> When multiple overlapping register types, such as XMM and YMM, are
>> sampled simultaneously, only the superset (YMM registers) is sampled.
>>
>> With this patch, all supported sampling registers on x86 platforms are
>> displayed as follows.
>>
>>  $perf record -I?
>>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>>
>>  $perf record --user-regs=?
>>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>>
>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> ---
>>  tools/perf/arch/x86/util/perf_regs.c      | 470 +++++++++++++++++++++-
>>  tools/perf/util/evsel.c                   |  27 ++
>>  tools/perf/util/parse-regs-options.c      | 151 ++++++-
>>  tools/perf/util/perf_event_attr_fprintf.c |   6 +
>>  tools/perf/util/perf_regs.c               |  59 +++
>>  tools/perf/util/perf_regs.h               |  11 +
>>  tools/perf/util/record.h                  |   6 +
>>  7 files changed, 714 insertions(+), 16 deletions(-)
>>
>> diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
>> index 12fd93f04802..db41430f3b07 100644
>> --- a/tools/perf/arch/x86/util/perf_regs.c
>> +++ b/tools/perf/arch/x86/util/perf_regs.c
>> @@ -13,6 +13,49 @@
>>  #include "../../../util/pmu.h"
>>  #include "../../../util/pmus.h"
>>
>> +static const struct sample_reg sample_reg_masks_ext[] = {
>> +       SMPL_REG(AX, PERF_REG_X86_AX),
>> +       SMPL_REG(BX, PERF_REG_X86_BX),
>> +       SMPL_REG(CX, PERF_REG_X86_CX),
>> +       SMPL_REG(DX, PERF_REG_X86_DX),
>> +       SMPL_REG(SI, PERF_REG_X86_SI),
>> +       SMPL_REG(DI, PERF_REG_X86_DI),
>> +       SMPL_REG(BP, PERF_REG_X86_BP),
>> +       SMPL_REG(SP, PERF_REG_X86_SP),
>> +       SMPL_REG(IP, PERF_REG_X86_IP),
>> +       SMPL_REG(FLAGS, PERF_REG_X86_FLAGS),
>> +       SMPL_REG(CS, PERF_REG_X86_CS),
>> +       SMPL_REG(SS, PERF_REG_X86_SS),
>> +#ifdef HAVE_ARCH_X86_64_SUPPORT
>> +       SMPL_REG(R8, PERF_REG_X86_R8),
>> +       SMPL_REG(R9, PERF_REG_X86_R9),
>> +       SMPL_REG(R10, PERF_REG_X86_R10),
>> +       SMPL_REG(R11, PERF_REG_X86_R11),
>> +       SMPL_REG(R12, PERF_REG_X86_R12),
>> +       SMPL_REG(R13, PERF_REG_X86_R13),
>> +       SMPL_REG(R14, PERF_REG_X86_R14),
>> +       SMPL_REG(R15, PERF_REG_X86_R15),
>> +       SMPL_REG(R16, PERF_REG_X86_R16),
>> +       SMPL_REG(R17, PERF_REG_X86_R17),
>> +       SMPL_REG(R18, PERF_REG_X86_R18),
>> +       SMPL_REG(R19, PERF_REG_X86_R19),
>> +       SMPL_REG(R20, PERF_REG_X86_R20),
>> +       SMPL_REG(R21, PERF_REG_X86_R21),
>> +       SMPL_REG(R22, PERF_REG_X86_R22),
>> +       SMPL_REG(R23, PERF_REG_X86_R23),
>> +       SMPL_REG(R24, PERF_REG_X86_R24),
>> +       SMPL_REG(R25, PERF_REG_X86_R25),
>> +       SMPL_REG(R26, PERF_REG_X86_R26),
>> +       SMPL_REG(R27, PERF_REG_X86_R27),
>> +       SMPL_REG(R28, PERF_REG_X86_R28),
>> +       SMPL_REG(R29, PERF_REG_X86_R29),
>> +       SMPL_REG(R30, PERF_REG_X86_R30),
>> +       SMPL_REG(R31, PERF_REG_X86_R31),
>> +       SMPL_REG(SSP, PERF_REG_X86_SSP),
>> +#endif
>> +       SMPL_REG_END
>> +};
>> +
>>  static const struct sample_reg sample_reg_masks[] = {
>>         SMPL_REG(AX, PERF_REG_X86_AX),
>>         SMPL_REG(BX, PERF_REG_X86_BX),
>> @@ -276,27 +319,404 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
>>         return SDT_ARG_VALID;
>>  }
>>
>> +static bool support_simd_reg(u64 sample_type, u16 qwords, u64 mask, bool pred)
>> +{
>> +       struct perf_event_attr attr = {
>> +               .type                           = PERF_TYPE_HARDWARE,
>> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
>> +               .sample_type                    = sample_type,
>> +               .disabled                       = 1,
>> +               .exclude_kernel                 = 1,
>> +               .sample_simd_regs_enabled       = 1,
>> +       };
>> +       int fd;
>> +
>> +       attr.sample_period = 1;
>> +
>> +       if (!pred) {
>> +               attr.sample_simd_vec_reg_qwords = qwords;
>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>> +                       attr.sample_simd_vec_reg_intr = mask;
>> +               else
>> +                       attr.sample_simd_vec_reg_user = mask;
>> +       } else {
>> +               attr.sample_simd_pred_reg_qwords = PERF_X86_OPMASK_QWORDS;
>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>> +                       attr.sample_simd_pred_reg_intr = PERF_X86_SIMD_PRED_MASK;
>> +               else
>> +                       attr.sample_simd_pred_reg_user = PERF_X86_SIMD_PRED_MASK;
>> +       }
>> +
>> +       if (perf_pmus__num_core_pmus() > 1) {
>> +               struct perf_pmu *pmu = NULL;
>> +               __u64 type = PERF_TYPE_RAW;
>> +
>> +               /*
>> +                * The same register set is supported among different hybrid PMUs.
>> +                * Only check the first available one.
>> +                */
>> +               while ((pmu = perf_pmus__scan_core(pmu)) != NULL) {
>> +                       type = pmu->type;
>> +                       break;
>> +               }
>> +               attr.config |= type << PERF_PMU_TYPE_SHIFT;
>> +       }
>> +
>> +       event_attr_init(&attr);
>> +
>> +       fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
>> +       if (fd != -1) {
>> +               close(fd);
>> +               return true;
>> +       }
>> +
>> +       return false;
>> +}
>> +
>> +static bool __arch_simd_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
>> +{
>> +       bool supported = false;
>> +       u64 bits;
>> +
>> +       *mask = 0;
>> +       *qwords = 0;
>> +
>> +       switch (reg) {
>> +       case PERF_REG_X86_XMM:
>> +               bits = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
>> +               supported = support_simd_reg(sample_type, PERF_X86_XMM_QWORDS, bits, false);
>> +               if (supported) {
>> +                       *mask = bits;
>> +                       *qwords = PERF_X86_XMM_QWORDS;
>> +               }
>> +               break;
>> +       case PERF_REG_X86_YMM:
>> +               bits = BIT_ULL(PERF_X86_SIMD_YMM_REGS) - 1;
>> +               supported = support_simd_reg(sample_type, PERF_X86_YMM_QWORDS, bits, false);
>> +               if (supported) {
>> +                       *mask = bits;
>> +                       *qwords = PERF_X86_YMM_QWORDS;
>> +               }
>> +               break;
>> +       case PERF_REG_X86_ZMM:
>> +               bits = BIT_ULL(PERF_X86_SIMD_ZMM_REGS) - 1;
>> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
>> +               if (supported) {
>> +                       *mask = bits;
>> +                       *qwords = PERF_X86_ZMM_QWORDS;
>> +                       break;
>> +               }
>> +
>> +               bits = BIT_ULL(PERF_X86_SIMD_ZMMH_REGS) - 1;
>> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
>> +               if (supported) {
>> +                       *mask = bits;
>> +                       *qwords = PERF_X86_ZMMH_QWORDS;
>> +               }
>> +               break;
>> +       default:
>> +               break;
>> +       }
>> +
>> +       return supported;
>> +}
>> +
>> +static bool __arch_pred_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
>> +{
>> +       bool supported = false;
>> +       u64 bits;
>> +
>> +       *mask = 0;
>> +       *qwords = 0;
>> +
>> +       switch (reg) {
>> +       case PERF_REG_X86_OPMASK:
>> +               bits = BIT_ULL(PERF_X86_SIMD_OPMASK_REGS) - 1;
>> +               supported = support_simd_reg(sample_type, PERF_X86_OPMASK_QWORDS, bits, true);
>> +               if (supported) {
>> +                       *mask = bits;
>> +                       *qwords = PERF_X86_OPMASK_QWORDS;
>> +               }
>> +               break;
>> +       default:
>> +               break;
>> +       }
>> +
>> +       return supported;
>> +}
>> +
>> +static bool has_cap_simd_regs(void)
>> +{
>> +       uint64_t mask = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
>> +       u16 qwords = PERF_X86_XMM_QWORDS;
>> +       static bool has_cap_simd_regs;
>> +       static bool cached;
>> +
>> +       if (cached)
>> +               return has_cap_simd_regs;
>> +
>> +       has_cap_simd_regs = __arch_simd_reg_mask(PERF_SAMPLE_REGS_INTR,
>> +                                                PERF_REG_X86_XMM, &mask, &qwords);
>> +       has_cap_simd_regs |= __arch_simd_reg_mask(PERF_SAMPLE_REGS_USER,
>> +                                                PERF_REG_X86_XMM, &mask, &qwords);
>> +       cached = true;
>> +
>> +       return has_cap_simd_regs;
>> +}
>> +
>> +bool arch_has_simd_regs(u64 mask)
>> +{
>> +       return has_cap_simd_regs() &&
>> +              mask & GENMASK_ULL(PERF_REG_X86_SSP, PERF_REG_X86_R16);
>> +}
>> +
>> +static const struct sample_reg sample_simd_reg_masks[] = {
>> +       SMPL_REG(XMM, PERF_REG_X86_XMM),
>> +       SMPL_REG(YMM, PERF_REG_X86_YMM),
>> +       SMPL_REG(ZMM, PERF_REG_X86_ZMM),
>> +       SMPL_REG_END
>> +};
>> +
>> +static const struct sample_reg sample_pred_reg_masks[] = {
>> +       SMPL_REG(OPMASK, PERF_REG_X86_OPMASK),
>> +       SMPL_REG_END
>> +};
>> +
>> +const struct sample_reg *arch__sample_simd_reg_masks(void)
>> +{
>> +       return sample_simd_reg_masks;
>> +}
>> +
>> +const struct sample_reg *arch__sample_pred_reg_masks(void)
>> +{
>> +       return sample_pred_reg_masks;
>> +}
>> +
>> +static bool x86_intr_simd_updated;
>> +static u64 x86_intr_simd_reg_mask;
>> +static u64 x86_intr_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
>> +static u16 x86_intr_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
>> +static bool x86_user_simd_updated;
>> +static u64 x86_user_simd_reg_mask;
>> +static u64 x86_user_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
>> +static u16 x86_user_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
>> +
>> +static bool x86_intr_pred_updated;
>> +static u64 x86_intr_pred_reg_mask;
>> +static u64 x86_intr_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
>> +static u16 x86_intr_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
>> +static bool x86_user_pred_updated;
>> +static u64 x86_user_pred_reg_mask;
>> +static u64 x86_user_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
>> +static u16 x86_user_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
>> +
>> +static uint64_t __arch__simd_reg_mask(u64 sample_type)
>> +{
>> +       const struct sample_reg *r = NULL;
>> +       bool supported;
>> +       u64 mask = 0;
>> +       int reg;
>> +
>> +       if (!has_cap_simd_regs())
>> +               return 0;
>> +
>> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_simd_updated)
>> +               return x86_intr_simd_reg_mask;
>> +
>> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_simd_updated)
>> +               return x86_user_simd_reg_mask;
>> +
>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>> +               supported = false;
>> +
>> +               if (!r->mask)
>> +                       continue;
>> +               reg = fls64(r->mask) - 1;
>> +
>> +               if (reg >= PERF_REG_X86_MAX_SIMD_REGS)
>> +                       break;
>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>> +                       supported = __arch_simd_reg_mask(sample_type, reg,
>> +                                                        &x86_intr_simd_mask[reg],
>> +                                                        &x86_intr_simd_qwords[reg]);
>> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
>> +                       supported = __arch_simd_reg_mask(sample_type, reg,
>> +                                                        &x86_user_simd_mask[reg],
>> +                                                        &x86_user_simd_qwords[reg]);
>> +               if (supported)
>> +                       mask |= BIT_ULL(reg);
>> +       }
>> +
>> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
>> +               x86_intr_simd_reg_mask = mask;
>> +               x86_intr_simd_updated = true;
>> +       } else {
>> +               x86_user_simd_reg_mask = mask;
>> +               x86_user_simd_updated = true;
>> +       }
>> +
>> +       return mask;
>> +}
>> +
>> +static uint64_t __arch__pred_reg_mask(u64 sample_type)
>> +{
>> +       const struct sample_reg *r = NULL;
>> +       bool supported;
>> +       u64 mask = 0;
>> +       int reg;
>> +
>> +       if (!has_cap_simd_regs())
>> +               return 0;
>> +
>> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_pred_updated)
>> +               return x86_intr_pred_reg_mask;
>> +
>> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_pred_updated)
>> +               return x86_user_pred_reg_mask;
>> +
>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>> +               supported = false;
>> +
>> +               if (!r->mask)
>> +                       continue;
>> +               reg = fls64(r->mask) - 1;
>> +
>> +               if (reg >= PERF_REG_X86_MAX_PRED_REGS)
>> +                       break;
>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>> +                       supported = __arch_pred_reg_mask(sample_type, reg,
>> +                                                        &x86_intr_pred_mask[reg],
>> +                                                        &x86_intr_pred_qwords[reg]);
>> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
>> +                       supported = __arch_pred_reg_mask(sample_type, reg,
>> +                                                        &x86_user_pred_mask[reg],
>> +                                                        &x86_user_pred_qwords[reg]);
>> +               if (supported)
>> +                       mask |= BIT_ULL(reg);
>> +       }
>> +
>> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
>> +               x86_intr_pred_reg_mask = mask;
>> +               x86_intr_pred_updated = true;
>> +       } else {
>> +               x86_user_pred_reg_mask = mask;
>> +               x86_user_pred_updated = true;
>> +       }
>> +
>> +       return mask;
>> +}
>> +
>> +uint64_t arch__intr_simd_reg_mask(void)
>> +{
>> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_INTR);
>> +}
>> +
>> +uint64_t arch__user_simd_reg_mask(void)
>> +{
>> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_USER);
>> +}
>> +
>> +uint64_t arch__intr_pred_reg_mask(void)
>> +{
>> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_INTR);
>> +}
>> +
>> +uint64_t arch__user_pred_reg_mask(void)
>> +{
>> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_USER);
>> +}
>> +
>> +static uint64_t arch__simd_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
>> +{
>> +       uint64_t mask = 0;
>> +
>> +       *qwords = 0;
>> +       if (reg < PERF_REG_X86_MAX_SIMD_REGS) {
>> +               if (intr) {
>> +                       *qwords = x86_intr_simd_qwords[reg];
>> +                       mask = x86_intr_simd_mask[reg];
>> +               } else {
>> +                       *qwords = x86_user_simd_qwords[reg];
>> +                       mask = x86_user_simd_mask[reg];
>> +               }
>> +       }
>> +
>> +       return mask;
>> +}
>> +
>> +static uint64_t arch__pred_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
>> +{
>> +       uint64_t mask = 0;
>> +
>> +       *qwords = 0;
>> +       if (reg < PERF_REG_X86_MAX_PRED_REGS) {
>> +               if (intr) {
>> +                       *qwords = x86_intr_pred_qwords[reg];
>> +                       mask = x86_intr_pred_mask[reg];
>> +               } else {
>> +                       *qwords = x86_user_pred_qwords[reg];
>> +                       mask = x86_user_pred_mask[reg];
>> +               }
>> +       }
>> +
>> +       return mask;
>> +}
>> +
>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords)
>> +{
>> +       if (!x86_intr_simd_updated)
>> +               arch__intr_simd_reg_mask();
>> +       return arch__simd_reg_bitmap_qwords(reg, qwords, true);
>> +}
>> +
>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords)
>> +{
>> +       if (!x86_user_simd_updated)
>> +               arch__user_simd_reg_mask();
>> +       return arch__simd_reg_bitmap_qwords(reg, qwords, false);
>> +}
>> +
>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords)
>> +{
>> +       if (!x86_intr_pred_updated)
>> +               arch__intr_pred_reg_mask();
>> +       return arch__pred_reg_bitmap_qwords(reg, qwords, true);
>> +}
>> +
>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords)
>> +{
>> +       if (!x86_user_pred_updated)
>> +               arch__user_pred_reg_mask();
>> +       return arch__pred_reg_bitmap_qwords(reg, qwords, false);
>> +}
>> +
>>  const struct sample_reg *arch__sample_reg_masks(void)
>>  {
>> +       if (has_cap_simd_regs())
>> +               return sample_reg_masks_ext;
>>         return sample_reg_masks;
>>  }
>>
>> -uint64_t arch__intr_reg_mask(void)
>> +static uint64_t __arch__reg_mask(u64 sample_type, u64 mask, bool has_simd_regs)
>>  {
>>         struct perf_event_attr attr = {
>> -               .type                   = PERF_TYPE_HARDWARE,
>> -               .config                 = PERF_COUNT_HW_CPU_CYCLES,
>> -               .sample_type            = PERF_SAMPLE_REGS_INTR,
>> -               .sample_regs_intr       = PERF_REG_EXTENDED_MASK,
>> -               .precise_ip             = 1,
>> -               .disabled               = 1,
>> -               .exclude_kernel         = 1,
>> +               .type                           = PERF_TYPE_HARDWARE,
>> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
>> +               .sample_type                    = sample_type,
>> +               .precise_ip                     = 1,
>> +               .disabled                       = 1,
>> +               .exclude_kernel                 = 1,
>> +               .sample_simd_regs_enabled       = has_simd_regs,
>>         };
>>         int fd;
>>         /*
>>          * In an unnamed union, init it here to build on older gcc versions
>>          */
>>         attr.sample_period = 1;
>> +       if (sample_type == PERF_SAMPLE_REGS_INTR)
>> +               attr.sample_regs_intr = mask;
>> +       else
>> +               attr.sample_regs_user = mask;
>>
>>         if (perf_pmus__num_core_pmus() > 1) {
>>                 struct perf_pmu *pmu = NULL;
>> @@ -318,13 +738,41 @@ uint64_t arch__intr_reg_mask(void)
>>         fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
>>         if (fd != -1) {
>>                 close(fd);
>> -               return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
>> +               return mask;
>>         }
>>
>> -       return PERF_REGS_MASK;
>> +       return 0;
>> +}
>> +
>> +uint64_t arch__intr_reg_mask(void)
>> +{
>> +       uint64_t mask = PERF_REGS_MASK;
>> +
>> +       if (has_cap_simd_regs()) {
>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
>> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
>> +                                        true);
>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
>> +                                        BIT_ULL(PERF_REG_X86_SSP),
>> +                                        true);
>> +       } else
>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR, PERF_REG_EXTENDED_MASK, false);
>> +
>> +       return mask;
>>  }
>>
>>  uint64_t arch__user_reg_mask(void)
>>  {
>> -       return PERF_REGS_MASK;
>> +       uint64_t mask = PERF_REGS_MASK;
>> +
>> +       if (has_cap_simd_regs()) {
>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
>> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
>> +                                        true);
>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
>> +                                        BIT_ULL(PERF_REG_X86_SSP),
>> +                                        true);
>> +       }
>> +
>> +       return mask;
>>  }
>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>> index 56ebefd075f2..5d1d90cf9488 100644
>> --- a/tools/perf/util/evsel.c
>> +++ b/tools/perf/util/evsel.c
>> @@ -1461,12 +1461,39 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
>>         if (opts->sample_intr_regs && !evsel->no_aux_samples &&
>>             !evsel__is_dummy_event(evsel)) {
>>                 attr->sample_regs_intr = opts->sample_intr_regs;
>> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_intr);
>> +               evsel__set_sample_bit(evsel, REGS_INTR);
>> +       }
>> +
>> +       if ((opts->sample_intr_vec_regs || opts->sample_intr_pred_regs) &&
>> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
>> +               /* A non-zero pred qwords implies the set of SIMD registers is used */
>> +               if (opts->sample_pred_regs_qwords)
>> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
>> +               else
>> +                       attr->sample_simd_pred_reg_qwords = 1;
>> +               attr->sample_simd_vec_reg_intr = opts->sample_intr_vec_regs;
>> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
>> +               attr->sample_simd_pred_reg_intr = opts->sample_intr_pred_regs;
>>                 evsel__set_sample_bit(evsel, REGS_INTR);
>>         }
>>
>>         if (opts->sample_user_regs && !evsel->no_aux_samples &&
>>             !evsel__is_dummy_event(evsel)) {
>>                 attr->sample_regs_user |= opts->sample_user_regs;
>> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_user);
>> +               evsel__set_sample_bit(evsel, REGS_USER);
>> +       }
>> +
>> +       if ((opts->sample_user_vec_regs || opts->sample_user_pred_regs) &&
>> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
>> +               if (opts->sample_pred_regs_qwords)
>> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
>> +               else
>> +                       attr->sample_simd_pred_reg_qwords = 1;
>> +               attr->sample_simd_vec_reg_user = opts->sample_user_vec_regs;
>> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
>> +               attr->sample_simd_pred_reg_user = opts->sample_user_pred_regs;
>>                 evsel__set_sample_bit(evsel, REGS_USER);
>>         }
>>
>> diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
>> index cda1c620968e..0bd100392889 100644
>> --- a/tools/perf/util/parse-regs-options.c
>> +++ b/tools/perf/util/parse-regs-options.c
>> @@ -4,19 +4,139 @@
>>  #include <stdint.h>
>>  #include <string.h>
>>  #include <stdio.h>
>> +#include <linux/bitops.h>
>>  #include "util/debug.h"
>>  #include <subcmd/parse-options.h>
>>  #include "util/perf_regs.h"
>>  #include "util/parse-regs-options.h"
>> +#include "record.h"
>> +
>> +static void __print_simd_regs(bool intr, uint64_t simd_mask)
>> +{
>> +       const struct sample_reg *r = NULL;
>> +       uint64_t bitmap = 0;
>> +       u16 qwords = 0;
>> +       int reg_idx;
>> +
>> +       if (!simd_mask)
>> +               return;
>> +
>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>> +               if (!(r->mask & simd_mask))
>> +                       continue;
>> +               reg_idx = fls64(r->mask) - 1;
>> +               if (intr)
>> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
>> +               else
>> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
>> +               if (bitmap)
>> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
>> +       }
>> +}
>> +
>> +static void __print_pred_regs(bool intr, uint64_t pred_mask)
>> +{
>> +       const struct sample_reg *r = NULL;
>> +       uint64_t bitmap = 0;
>> +       u16 qwords = 0;
>> +       int reg_idx;
>> +
>> +       if (!pred_mask)
>> +               return;
>> +
>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>> +               if (!(r->mask & pred_mask))
>> +                       continue;
>> +               reg_idx = fls64(r->mask) - 1;
>> +               if (intr)
>> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
>> +               else
>> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
>> +               if (bitmap)
>> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
>> +       }
>> +}
>> +
>> +static bool __parse_simd_regs(struct record_opts *opts, char *s, bool intr)
>> +{
>> +       const struct sample_reg *r = NULL;
>> +       bool matched = false;
>> +       uint64_t bitmap = 0;
>> +       u16 qwords = 0;
>> +       int reg_idx;
>> +
>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>> +               if (strcasecmp(s, r->name))
>> +                       continue;
>> +               if (!fls64(r->mask))
>> +                       continue;
>> +               reg_idx = fls64(r->mask) - 1;
>> +               if (intr)
>> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
>> +               else
>> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
>> +               matched = true;
>> +               break;
>> +       }
>> +
>> +       /* Just need the highest qwords */
>> +       if (qwords > opts->sample_vec_regs_qwords) {
>> +               opts->sample_vec_regs_qwords = qwords;
>> +               if (intr)
>> +                       opts->sample_intr_vec_regs = bitmap;
>> +               else
>> +                       opts->sample_user_vec_regs = bitmap;
>> +       }
>> +
>> +       return matched;
>> +}
>> +
>> +static bool __parse_pred_regs(struct record_opts *opts, char *s, bool intr)
>> +{
>> +       const struct sample_reg *r = NULL;
>> +       bool matched = false;
>> +       uint64_t bitmap = 0;
>> +       u16 qwords = 0;
>> +       int reg_idx;
>> +
>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>> +               if (strcasecmp(s, r->name))
>> +                       continue;
>> +               if (!fls64(r->mask))
>> +                       continue;
>> +               reg_idx = fls64(r->mask) - 1;
>> +               if (intr)
>> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
>> +               else
>> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
>> +               matched = true;
>> +               break;
>> +       }
>> +
>> +       /* Just need the highest qwords */
>> +       if (qwords > opts->sample_pred_regs_qwords) {
>> +               opts->sample_pred_regs_qwords = qwords;
>> +               if (intr)
>> +                       opts->sample_intr_pred_regs = bitmap;
>> +               else
>> +                       opts->sample_user_pred_regs = bitmap;
>> +       }
>> +
>> +       return matched;
>> +}
>>
>>  static int
>>  __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>  {
>>         uint64_t *mode = (uint64_t *)opt->value;
>>         const struct sample_reg *r = NULL;
>> +       struct record_opts *opts;
>>         char *s, *os = NULL, *p;
>> -       int ret = -1;
>> +       bool has_simd_regs = false;
>>         uint64_t mask;
>> +       uint64_t simd_mask;
>> +       uint64_t pred_mask;
>> +       int ret = -1;
>>
>>         if (unset)
>>                 return 0;
>> @@ -27,10 +147,17 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>         if (*mode)
>>                 return -1;
>>
>> -       if (intr)
>> +       if (intr) {
>> +               opts = container_of(opt->value, struct record_opts, sample_intr_regs);
>>                 mask = arch__intr_reg_mask();
>> -       else
>> +               simd_mask = arch__intr_simd_reg_mask();
>> +               pred_mask = arch__intr_pred_reg_mask();
>> +       } else {
>> +               opts = container_of(opt->value, struct record_opts, sample_user_regs);
>>                 mask = arch__user_reg_mask();
>> +               simd_mask = arch__user_simd_reg_mask();
>> +               pred_mask = arch__user_pred_reg_mask();
>> +       }
>>
>>         /* str may be NULL in case no arg is passed to -I */
>>         if (str) {
>> @@ -50,10 +177,24 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>                                         if (r->mask & mask)
>>                                                 fprintf(stderr, "%s ", r->name);
>>                                 }
>> +                               __print_simd_regs(intr, simd_mask);
>> +                               __print_pred_regs(intr, pred_mask);
>>                                 fputc('\n', stderr);
>>                                 /* just printing available regs */
>>                                 goto error;
>>                         }
>> +
>> +                       if (simd_mask) {
>> +                               has_simd_regs = __parse_simd_regs(opts, s, intr);
>> +                               if (has_simd_regs)
>> +                                       goto next;
>> +                       }
>> +                       if (pred_mask) {
>> +                               has_simd_regs = __parse_pred_regs(opts, s, intr);
>> +                               if (has_simd_regs)
>> +                                       goto next;
>> +                       }
>> +
>>                         for (r = arch__sample_reg_masks(); r->name; r++) {
>>                                 if ((r->mask & mask) && !strcasecmp(s, r->name))
>>                                         break;
>> @@ -65,7 +206,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>                         }
>>
>>                         *mode |= r->mask;
>> -
>> +next:
>>                         if (!p)
>>                                 break;
>>
>> @@ -75,7 +216,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>         ret = 0;
>>
>>         /* default to all possible regs */
>> -       if (*mode == 0)
>> +       if (*mode == 0 && !has_simd_regs)
>>                 *mode = mask;
>>  error:
>>         free(os);
>> diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
>> index 66b666d9ce64..fb0366d050cf 100644
>> --- a/tools/perf/util/perf_event_attr_fprintf.c
>> +++ b/tools/perf/util/perf_event_attr_fprintf.c
>> @@ -360,6 +360,12 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
>>         PRINT_ATTRf(aux_start_paused, p_unsigned);
>>         PRINT_ATTRf(aux_pause, p_unsigned);
>>         PRINT_ATTRf(aux_resume, p_unsigned);
>> +       PRINT_ATTRf(sample_simd_pred_reg_qwords, p_unsigned);
>> +       PRINT_ATTRf(sample_simd_pred_reg_intr, p_hex);
>> +       PRINT_ATTRf(sample_simd_pred_reg_user, p_hex);
>> +       PRINT_ATTRf(sample_simd_vec_reg_qwords, p_unsigned);
>> +       PRINT_ATTRf(sample_simd_vec_reg_intr, p_hex);
>> +       PRINT_ATTRf(sample_simd_vec_reg_user, p_hex);
>>
>>         return ret;
>>  }
>> diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
>> index 44b90bbf2d07..e8a9fabc92e6 100644
>> --- a/tools/perf/util/perf_regs.c
>> +++ b/tools/perf/util/perf_regs.c
>> @@ -11,6 +11,11 @@ int __weak arch_sdt_arg_parse_op(char *old_op __maybe_unused,
>>         return SDT_ARG_SKIP;
>>  }
>>
>> +bool __weak arch_has_simd_regs(u64 mask __maybe_unused)
>> +{
>> +       return false;
>> +}
>> +
>>  uint64_t __weak arch__intr_reg_mask(void)
>>  {
>>         return 0;
>> @@ -21,6 +26,50 @@ uint64_t __weak arch__user_reg_mask(void)
>>         return 0;
>>  }
>>
>> +uint64_t __weak arch__intr_simd_reg_mask(void)
>> +{
>> +       return 0;
>> +}
>> +
>> +uint64_t __weak arch__user_simd_reg_mask(void)
>> +{
>> +       return 0;
>> +}
>> +
>> +uint64_t __weak arch__intr_pred_reg_mask(void)
>> +{
>> +       return 0;
>> +}
>> +
>> +uint64_t __weak arch__user_pred_reg_mask(void)
>> +{
>> +       return 0;
>> +}
>> +
>> +uint64_t __weak arch__intr_simd_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
>> +{
>> +       *qwords = 0;
>> +       return 0;
>> +}
>> +
>> +uint64_t __weak arch__user_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>> +{
>> +       *qwords = 0;
>> +       return 0;
>> +}
>> +
>> +uint64_t __weak arch__intr_pred_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
>> +{
>> +       *qwords = 0;
>> +       return 0;
>> +}
>> +
>> +uint64_t __weak arch__user_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>> +{
>> +       *qwords = 0;
>> +       return 0;
>> +}
>> +
>>  static const struct sample_reg sample_reg_masks[] = {
>>         SMPL_REG_END
>>  };
>> @@ -30,6 +79,16 @@ const struct sample_reg * __weak arch__sample_reg_masks(void)
>>         return sample_reg_masks;
>>  }
>>
>> +const struct sample_reg * __weak arch__sample_simd_reg_masks(void)
>> +{
>> +       return sample_reg_masks;
>> +}
>> +
>> +const struct sample_reg * __weak arch__sample_pred_reg_masks(void)
>> +{
>> +       return sample_reg_masks;
>> +}
>> +
>>  const char *perf_reg_name(int id, const char *arch)
>>  {
>>         const char *reg_name = NULL;
>> diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
>> index f2d0736d65cc..bce9c4cfd1bf 100644
>> --- a/tools/perf/util/perf_regs.h
>> +++ b/tools/perf/util/perf_regs.h
>> @@ -24,9 +24,20 @@ enum {
>>  };
>>
>>  int arch_sdt_arg_parse_op(char *old_op, char **new_op);
>> +bool arch_has_simd_regs(u64 mask);
>>  uint64_t arch__intr_reg_mask(void);
>>  uint64_t arch__user_reg_mask(void);
>>  const struct sample_reg *arch__sample_reg_masks(void);
>> +const struct sample_reg *arch__sample_simd_reg_masks(void);
>> +const struct sample_reg *arch__sample_pred_reg_masks(void);
> I wonder we can remove these functions. perf_reg_name(int id, uint16_t
> e_machine) maps a perf register number and e_machine to a string. So
> the sample_reg array could be replaced with:
> ```
> for (int perf_reg = 0; perf_reg < 64; perf_reg++) {
>   uint64_t mask = 1LL << perf_reg;
>   const char *name = perf_reg_name(perf_reg, EM_HOST);
>   if (name == NULL)
>     break;
>   // use mask and name
> }
> ```
> To make it work for SIMD and PRED then I guess we need to iterate
> through the ABIs of enum perf_sample_regs_abi.

Suppose so.


>
>> +uint64_t arch__intr_simd_reg_mask(void);
>> +uint64_t arch__user_simd_reg_mask(void);
>> +uint64_t arch__intr_pred_reg_mask(void);
>> +uint64_t arch__user_pred_reg_mask(void);
> I think some comments would be useful here like:
> ```
> /* Perf register bit map with valid bits for
> perf_event_attr.sample_regs_user. */
> uint64_t arch__intr_reg_mask(void);
> /* Perf register bit map with valid bits for
> perf_event_attr.sample_regs_intr. */
> uint64_t arch__user_reg_mask(void);
> /* Perf register bit map with valid bits for
> perf_event_attr.sample_simd_vec_reg_intr. */
> uint64_t arch__intr_simd_reg_mask(void);
> /* Perf register bit map with valid bits for
> perf_event_attr.sample_simd_vec_reg_user. */
> uint64_t arch__user_simd_reg_mask(void);
> /* Perf register bit map with valid bits for
> perf_event_attr.sample_simd_pred_reg_intr. */
> uint64_t arch__intr_pred_reg_mask(void);
> /* Perf register bit map with valid bits for
> perf_event_attr.sample_simd_pred_reg_user. */
> uint64_t arch__user_pred_reg_mask(void);

Sure. Thanks.


> ```
>
> Why do the arch__user_pred_reg_mask return a uint64_t when the
> perf_event_attr variable is a __u32?

Suppose it's a bug. :)


>
>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> I don't understand this function. The qwords is specific to a
> perf_event_attr. We could have an evlist with an evsel set up to
> sample say XMM registers and another evsel set up to sample ZMM
> registers. Are the qwords here always for the ZMM case, or is XMM,
> YMM, ZMM depending on architecture support? Why does it vary per
> register? The surrounding code uses the term mask but here bitmap is
> used, is the inconsistency deliberate? Why are there user and intr
> functions when in the perf_event_attr there are only
> sample_simd_pred_reg_qwords and sample_simd_ved_reg_qwords variables?

These 4 functions are designed to get the bitmask and qwords length for a
specific SIMD register type. E.g., for XMM on x86 platforms, the returned
bitmask is 0xffff (xmm0 ~ xmm15) and the qwords length is 2 (128 bits). For
ZMM on x86 platforms, if the platform only supports 16 ZMM registers, the
returned bitmask is 0xffff (zmm0 ~ zmm15) and the qwords length is 8 (512
bits). If the platform supports 32 ZMM registers, the returned bitmask is
0xffffffff (zmm0 ~ zmm31) and the qwords length is still 8 (512 bits).

Since the qwords length is always fixed for any given SIMD register type
regardless of intr or user, there is only one sample_simd_pred_reg_qwords
or sample_simd_vec_reg_qwords variable.


>
> Perhaps these functions should be something more like:
> ```
> /* Maximum value that can be assigned to
> perf_event_atttr.sample_simd_pred_reg_qwords. */
> uint16_t arch__simd_pred_reg_qwords_max(void);
> /* Maximum value that can be assigned to
> perf_event_atttr.sample_simd_vec_reg_qwords. */
> uint16_t arch__simd_vec_reg_qwords_max(void);
> ```
> Then the bitmap computation logic can all be moved into parse-regs-options.c.
>
> Thanks,
> Ian
>
>>  const char *perf_reg_name(int id, const char *arch);
>>  int perf_reg_value(u64 *valp, struct regs_dump *regs, int id);
>> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
>> index ea3a6c4657ee..825ffb4cc53f 100644
>> --- a/tools/perf/util/record.h
>> +++ b/tools/perf/util/record.h
>> @@ -59,7 +59,13 @@ struct record_opts {
>>         unsigned int  user_freq;
>>         u64           branch_stack;
>>         u64           sample_intr_regs;
>> +       u64           sample_intr_vec_regs;
>>         u64           sample_user_regs;
>> +       u64           sample_user_vec_regs;
>> +       u16           sample_pred_regs_qwords;
>> +       u16           sample_vec_regs_qwords;
>> +       u16           sample_intr_pred_regs;
>> +       u16           sample_user_pred_regs;
>>         u64           default_interval;
>>         u64           user_interval;
>>         size_t        auxtrace_snapshot_size;
>> --
>> 2.34.1
>>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 17/19] perf headers: Sync with the kernel headers
  2026-01-20  8:00       ` Ian Rogers
@ 2026-01-20  9:22         ` Mi, Dapeng
  2026-01-20 18:11           ` Ian Rogers
  0 siblings, 1 reply; 86+ messages in thread
From: Mi, Dapeng @ 2026-01-20  9:22 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang


On 1/20/2026 4:00 PM, Ian Rogers wrote:
> On Mon, Jan 19, 2026 at 11:43 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>
>> On 1/20/2026 3:16 PM, Ian Rogers wrote:
>>> On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
>>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>>
>>>> Update include/uapi/linux/perf_event.h and
>>>> arch/x86/include/uapi/asm/perf_regs.h to support extended regs.
>>>>
>>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>>>> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>>> ---
>>>>  tools/arch/x86/include/uapi/asm/perf_regs.h | 62 +++++++++++++++++++++
>>>>  tools/include/uapi/linux/perf_event.h       | 45 +++++++++++++--
>>>>  2 files changed, 103 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/tools/arch/x86/include/uapi/asm/perf_regs.h b/tools/arch/x86/include/uapi/asm/perf_regs.h
>>>> index 7c9d2bb3833b..f3561ed10041 100644
>>>> --- a/tools/arch/x86/include/uapi/asm/perf_regs.h
>>>> +++ b/tools/arch/x86/include/uapi/asm/perf_regs.h
>>>> @@ -27,9 +27,34 @@ enum perf_event_x86_regs {
>>>>         PERF_REG_X86_R13,
>>>>         PERF_REG_X86_R14,
>>>>         PERF_REG_X86_R15,
>>>> +       /*
>>>> +        * The EGPRs/SSP and XMM have overlaps. Only one can be used
>>>> +        * at a time. For the ABI type PERF_SAMPLE_REGS_ABI_SIMD,
>>>> +        * utilize EGPRs/SSP. For the other ABI type, XMM is used.
>>>> +        *
>>>> +        * Extended GPRs (EGPRs)
>>>> +        */
>>>> +       PERF_REG_X86_R16,
>>>> +       PERF_REG_X86_R17,
>>>> +       PERF_REG_X86_R18,
>>>> +       PERF_REG_X86_R19,
>>>> +       PERF_REG_X86_R20,
>>>> +       PERF_REG_X86_R21,
>>>> +       PERF_REG_X86_R22,
>>>> +       PERF_REG_X86_R23,
>>>> +       PERF_REG_X86_R24,
>>>> +       PERF_REG_X86_R25,
>>>> +       PERF_REG_X86_R26,
>>>> +       PERF_REG_X86_R27,
>>>> +       PERF_REG_X86_R28,
>>>> +       PERF_REG_X86_R29,
>>>> +       PERF_REG_X86_R30,
>>>> +       PERF_REG_X86_R31,
>>>> +       PERF_REG_X86_SSP,
>>>>         /* These are the limits for the GPRs. */
>>>>         PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
>>>>         PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1,
>>>> +       PERF_REG_MISC_MAX = PERF_REG_X86_SSP + 1,
>>>>
>>>>         /* These all need two bits set because they are 128bit */
>>>>         PERF_REG_X86_XMM0  = 32,
>>>> @@ -54,5 +79,42 @@ enum perf_event_x86_regs {
>>>>  };
>>>>
>>>>  #define PERF_REG_EXTENDED_MASK (~((1ULL << PERF_REG_X86_XMM0) - 1))
>>>> +#define PERF_X86_EGPRS_MASK    GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16)
>>>> +
>>>> +enum {
>>>> +       PERF_REG_X86_XMM,
>>>> +       PERF_REG_X86_YMM,
>>>> +       PERF_REG_X86_ZMM,
>>>> +       PERF_REG_X86_MAX_SIMD_REGS,
>>>> +
>>>> +       PERF_REG_X86_OPMASK = 0,
>>>> +       PERF_REG_X86_MAX_PRED_REGS = 1,
>>>> +};
>>>> +
>>>> +enum {
>>>> +       PERF_X86_SIMD_XMM_REGS      = 16,
>>>> +       PERF_X86_SIMD_YMM_REGS      = 16,
>>>> +       PERF_X86_SIMD_ZMMH_REGS     = 16,
>>>> +       PERF_X86_SIMD_ZMM_REGS      = 32,
>>>> +       PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_ZMM_REGS,
>>>> +
>>>> +       PERF_X86_SIMD_OPMASK_REGS   = 8,
>>>> +       PERF_X86_SIMD_PRED_REGS_MAX = PERF_X86_SIMD_OPMASK_REGS,
>>>> +};
>>>> +
>>>> +#define PERF_X86_SIMD_PRED_MASK                GENMASK(PERF_X86_SIMD_PRED_REGS_MAX - 1, 0)
>>>> +#define PERF_X86_SIMD_VEC_MASK         GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
>>>> +
>>>> +#define PERF_X86_H16ZMM_BASE           PERF_X86_SIMD_ZMMH_REGS
>>>> +
>>>> +enum {
>>>> +       PERF_X86_OPMASK_QWORDS   = 1,
>>>> +       PERF_X86_XMM_QWORDS      = 2,
>>>> +       PERF_X86_YMMH_QWORDS     = 2,
>>>> +       PERF_X86_YMM_QWORDS      = 4,
>>>> +       PERF_X86_ZMMH_QWORDS     = 4,
>>>> +       PERF_X86_ZMM_QWORDS      = 8,
>>>> +       PERF_X86_SIMD_QWORDS_MAX = PERF_X86_ZMM_QWORDS,
>>>> +};
>>>>
>>>>  #endif /* _ASM_X86_PERF_REGS_H */
>>>> diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
>>>> index d292f96bc06f..f1474da32622 100644
>>>> --- a/tools/include/uapi/linux/perf_event.h
>>>> +++ b/tools/include/uapi/linux/perf_event.h
>>>> @@ -314,8 +314,9 @@ enum {
>>>>   */
>>>>  enum perf_sample_regs_abi {
>>>>         PERF_SAMPLE_REGS_ABI_NONE               = 0,
>>>> -       PERF_SAMPLE_REGS_ABI_32                 = 1,
>>>> -       PERF_SAMPLE_REGS_ABI_64                 = 2,
>>>> +       PERF_SAMPLE_REGS_ABI_32                 = (1 << 0),
>>>> +       PERF_SAMPLE_REGS_ABI_64                 = (1 << 1),
>>>> +       PERF_SAMPLE_REGS_ABI_SIMD               = (1 << 2),
>>>>  };
>>>>
>>>>  /*
>>>> @@ -382,6 +383,7 @@ enum perf_event_read_format {
>>>>  #define PERF_ATTR_SIZE_VER6                    120     /* Add: aux_sample_size */
>>>>  #define PERF_ATTR_SIZE_VER7                    128     /* Add: sig_data */
>>>>  #define PERF_ATTR_SIZE_VER8                    136     /* Add: config3 */
>>>> +#define PERF_ATTR_SIZE_VER9                    168     /* Add: sample_simd_{pred,vec}_reg_* */
>>>>
>>>>  /*
>>>>   * 'struct perf_event_attr' contains various attributes that define
>>>> @@ -545,6 +547,25 @@ struct perf_event_attr {
>>>>         __u64   sig_data;
>>>>
>>>>         __u64   config3; /* extension of config2 */
>>>> +
>>>> +
>>>> +       /*
>>>> +        * Defines set of SIMD registers to dump on samples.
>>>> +        * The sample_simd_regs_enabled !=0 implies the
>>>> +        * set of SIMD registers is used to config all SIMD registers.
>>>> +        * If !sample_simd_regs_enabled, sample_regs_XXX may be used to
>>>> +        * config some SIMD registers on X86.
>>>> +        */
>>>> +       union {
>>>> +               __u16 sample_simd_regs_enabled;
>>>> +               __u16 sample_simd_pred_reg_qwords;
>>>> +       };
>>>> +       __u32 sample_simd_pred_reg_intr;
>>>> +       __u32 sample_simd_pred_reg_user;
>>>> +       __u16 sample_simd_vec_reg_qwords;
>>>> +       __u64 sample_simd_vec_reg_intr;
>>>> +       __u64 sample_simd_vec_reg_user;
>>>> +       __u32 __reserved_4;
>>>>  };
>>>>
>>>>  /*
>>>> @@ -1018,7 +1039,15 @@ enum perf_event_type {
>>>>          *      } && PERF_SAMPLE_BRANCH_STACK
>>>>          *
>>>>          *      { u64                   abi; # enum perf_sample_regs_abi
>>>> -        *        u64                   regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
>>>> +        *        u64                   regs[weight(mask)];
>>>> +        *        struct {
>>>> +        *              u16 nr_vectors;
>>>> +        *              u16 vector_qwords;
>>>> +        *              u16 nr_pred;
>>>> +        *              u16 pred_qwords;
>>>> +        *              u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
>>>> +        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
>>> Why can't these values be taken from the perf_event_attr? The abi is
>>> needed as there could be both 32-bit and 64-bit samples for the same
>>> event - presumably x32 appears as 64-bit. If the ABI has SIMD within
>>> it (implied by the "} && (abi & PERF_SAMPLE_REGS_ABI_SIMD)" below)
>>> then why can't we just use the perf_event_attr values? For example,
>>> data could be "data[weight(sample_simd_vec_reg_user) *
>>> sample_simd_vec_reg_qwords + weight(sample_simd_pred_reg_user) *
>>> sample_simd_pred_reg_qwords]".
>> The main reason is that the sampled SIMD regs could only be a subset of the
>> requested SIMD regs in perf_event_attr, so we need to show the bitmask and
>> qwords length explicitly in the sample record.
> But this doesn't happen in any other register sampling, why in this case?
>
> Perhaps add comments along the lines:
> u16 nr_vectors;  // weight(sample_simd_vec_reg_user) except when ...
>
> My random guess as to why the value differs from the weight would be
> some kind of optimization around register values of 0. And even if the
> number of registers is reduced, why is the number of qwords being
> altered?

Yes. E.g., the user may want to sample the ZMM registers (ZMM0 ~ ZMM31),
but sometimes only the XMM registers (XMM0 ~ XMM15) are actually sampled,
so in some sample records both the register count and the qwords length
differ from the perf_event_attr values. Thus we need to explicitly
indicate the sampled register count and length in the record itself.

Besides, carrying these 4 fields in the sample records makes them easier
to parse, since there is no need to retrieve the information from the
corresponding perf_event_attr. Thanks.


>
> Thanks,
> Ian
>
>>>> +        *      } && PERF_SAMPLE_REGS_USER
>>>>          *
>>>>          *      { u64                   size;
>>>>          *        char                  data[size];
>>>> @@ -1045,7 +1074,15 @@ enum perf_event_type {
>>>>          *      { u64                   data_src; } && PERF_SAMPLE_DATA_SRC
>>>>          *      { u64                   transaction; } && PERF_SAMPLE_TRANSACTION
>>>>          *      { u64                   abi; # enum perf_sample_regs_abi
>>>> -        *        u64                   regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
>>>> +        *        u64                   regs[weight(mask)];
>>>> +        *        struct {
>>>> +        *              u16 nr_vectors;
>>>> +        *              u16 vector_qwords;
>>>> +        *              u16 nr_pred;
>>>> +        *              u16 pred_qwords;
>>>> +        *              u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
>>>> +        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
>>> Same comment.
>>>
>>> Thanks,
>>> Ian
>>>
>>>> +        *      } && PERF_SAMPLE_REGS_INTR
>>>>          *      { u64                   phys_addr;} && PERF_SAMPLE_PHYS_ADDR
>>>>          *      { u64                   cgroup;} && PERF_SAMPLE_CGROUP
>>>>          *      { u64                   data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
>>>> --
>>>> 2.34.1
>>>>


* Re: [Patch v5 17/19] perf headers: Sync with the kernel headers
  2026-01-20  9:22         ` Mi, Dapeng
@ 2026-01-20 18:11           ` Ian Rogers
  2026-01-21  2:03             ` Mi, Dapeng
  0 siblings, 1 reply; 86+ messages in thread
From: Ian Rogers @ 2026-01-20 18:11 UTC (permalink / raw)
  To: Mi, Dapeng
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang

On Tue, Jan 20, 2026 at 1:22 AM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>
>
> On 1/20/2026 4:00 PM, Ian Rogers wrote:
> > On Mon, Jan 19, 2026 at 11:43 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> >>
> >> On 1/20/2026 3:16 PM, Ian Rogers wrote:
> >>> On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
> >>>> From: Kan Liang <kan.liang@linux.intel.com>
> >>>>
> >>>> Update include/uapi/linux/perf_event.h and
> >>>> arch/x86/include/uapi/asm/perf_regs.h to support extended regs.
> >>>>
> >>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> >>>> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >>>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >>>> ---
> >>>>  tools/arch/x86/include/uapi/asm/perf_regs.h | 62 +++++++++++++++++++++
> >>>>  tools/include/uapi/linux/perf_event.h       | 45 +++++++++++++--
> >>>>  2 files changed, 103 insertions(+), 4 deletions(-)
> >>>>
> >>>> diff --git a/tools/arch/x86/include/uapi/asm/perf_regs.h b/tools/arch/x86/include/uapi/asm/perf_regs.h
> >>>> index 7c9d2bb3833b..f3561ed10041 100644
> >>>> --- a/tools/arch/x86/include/uapi/asm/perf_regs.h
> >>>> +++ b/tools/arch/x86/include/uapi/asm/perf_regs.h
> >>>> @@ -27,9 +27,34 @@ enum perf_event_x86_regs {
> >>>>         PERF_REG_X86_R13,
> >>>>         PERF_REG_X86_R14,
> >>>>         PERF_REG_X86_R15,
> >>>> +       /*
> >>>> +        * The EGPRs/SSP and XMM have overlaps. Only one can be used
> >>>> +        * at a time. For the ABI type PERF_SAMPLE_REGS_ABI_SIMD,
> >>>> +        * utilize EGPRs/SSP. For the other ABI type, XMM is used.
> >>>> +        *
> >>>> +        * Extended GPRs (EGPRs)
> >>>> +        */
> >>>> +       PERF_REG_X86_R16,
> >>>> +       PERF_REG_X86_R17,
> >>>> +       PERF_REG_X86_R18,
> >>>> +       PERF_REG_X86_R19,
> >>>> +       PERF_REG_X86_R20,
> >>>> +       PERF_REG_X86_R21,
> >>>> +       PERF_REG_X86_R22,
> >>>> +       PERF_REG_X86_R23,
> >>>> +       PERF_REG_X86_R24,
> >>>> +       PERF_REG_X86_R25,
> >>>> +       PERF_REG_X86_R26,
> >>>> +       PERF_REG_X86_R27,
> >>>> +       PERF_REG_X86_R28,
> >>>> +       PERF_REG_X86_R29,
> >>>> +       PERF_REG_X86_R30,
> >>>> +       PERF_REG_X86_R31,
> >>>> +       PERF_REG_X86_SSP,
> >>>>         /* These are the limits for the GPRs. */
> >>>>         PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
> >>>>         PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1,
> >>>> +       PERF_REG_MISC_MAX = PERF_REG_X86_SSP + 1,
> >>>>
> >>>>         /* These all need two bits set because they are 128bit */
> >>>>         PERF_REG_X86_XMM0  = 32,
> >>>> @@ -54,5 +79,42 @@ enum perf_event_x86_regs {
> >>>>  };
> >>>>
> >>>>  #define PERF_REG_EXTENDED_MASK (~((1ULL << PERF_REG_X86_XMM0) - 1))
> >>>> +#define PERF_X86_EGPRS_MASK    GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16)
> >>>> +
> >>>> +enum {
> >>>> +       PERF_REG_X86_XMM,
> >>>> +       PERF_REG_X86_YMM,
> >>>> +       PERF_REG_X86_ZMM,
> >>>> +       PERF_REG_X86_MAX_SIMD_REGS,
> >>>> +
> >>>> +       PERF_REG_X86_OPMASK = 0,
> >>>> +       PERF_REG_X86_MAX_PRED_REGS = 1,
> >>>> +};
> >>>> +
> >>>> +enum {
> >>>> +       PERF_X86_SIMD_XMM_REGS      = 16,
> >>>> +       PERF_X86_SIMD_YMM_REGS      = 16,
> >>>> +       PERF_X86_SIMD_ZMMH_REGS     = 16,
> >>>> +       PERF_X86_SIMD_ZMM_REGS      = 32,
> >>>> +       PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_ZMM_REGS,
> >>>> +
> >>>> +       PERF_X86_SIMD_OPMASK_REGS   = 8,
> >>>> +       PERF_X86_SIMD_PRED_REGS_MAX = PERF_X86_SIMD_OPMASK_REGS,
> >>>> +};
> >>>> +
> >>>> +#define PERF_X86_SIMD_PRED_MASK                GENMASK(PERF_X86_SIMD_PRED_REGS_MAX - 1, 0)
> >>>> +#define PERF_X86_SIMD_VEC_MASK         GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
> >>>> +
> >>>> +#define PERF_X86_H16ZMM_BASE           PERF_X86_SIMD_ZMMH_REGS
> >>>> +
> >>>> +enum {
> >>>> +       PERF_X86_OPMASK_QWORDS   = 1,
> >>>> +       PERF_X86_XMM_QWORDS      = 2,
> >>>> +       PERF_X86_YMMH_QWORDS     = 2,
> >>>> +       PERF_X86_YMM_QWORDS      = 4,
> >>>> +       PERF_X86_ZMMH_QWORDS     = 4,
> >>>> +       PERF_X86_ZMM_QWORDS      = 8,
> >>>> +       PERF_X86_SIMD_QWORDS_MAX = PERF_X86_ZMM_QWORDS,
> >>>> +};
> >>>>
> >>>>  #endif /* _ASM_X86_PERF_REGS_H */
> >>>> diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
> >>>> index d292f96bc06f..f1474da32622 100644
> >>>> --- a/tools/include/uapi/linux/perf_event.h
> >>>> +++ b/tools/include/uapi/linux/perf_event.h
> >>>> @@ -314,8 +314,9 @@ enum {
> >>>>   */
> >>>>  enum perf_sample_regs_abi {
> >>>>         PERF_SAMPLE_REGS_ABI_NONE               = 0,
> >>>> -       PERF_SAMPLE_REGS_ABI_32                 = 1,
> >>>> -       PERF_SAMPLE_REGS_ABI_64                 = 2,
> >>>> +       PERF_SAMPLE_REGS_ABI_32                 = (1 << 0),
> >>>> +       PERF_SAMPLE_REGS_ABI_64                 = (1 << 1),
> >>>> +       PERF_SAMPLE_REGS_ABI_SIMD               = (1 << 2),
> >>>>  };
> >>>>
> >>>>  /*
> >>>> @@ -382,6 +383,7 @@ enum perf_event_read_format {
> >>>>  #define PERF_ATTR_SIZE_VER6                    120     /* Add: aux_sample_size */
> >>>>  #define PERF_ATTR_SIZE_VER7                    128     /* Add: sig_data */
> >>>>  #define PERF_ATTR_SIZE_VER8                    136     /* Add: config3 */
> >>>> +#define PERF_ATTR_SIZE_VER9                    168     /* Add: sample_simd_{pred,vec}_reg_* */
> >>>>
> >>>>  /*
> >>>>   * 'struct perf_event_attr' contains various attributes that define
> >>>> @@ -545,6 +547,25 @@ struct perf_event_attr {
> >>>>         __u64   sig_data;
> >>>>
> >>>>         __u64   config3; /* extension of config2 */
> >>>> +
> >>>> +
> >>>> +       /*
> >>>> +        * Defines set of SIMD registers to dump on samples.
> >>>> +        * The sample_simd_regs_enabled !=0 implies the
> >>>> +        * set of SIMD registers is used to config all SIMD registers.
> >>>> +        * If !sample_simd_regs_enabled, sample_regs_XXX may be used to
> >>>> +        * config some SIMD registers on X86.
> >>>> +        */
> >>>> +       union {
> >>>> +               __u16 sample_simd_regs_enabled;
> >>>> +               __u16 sample_simd_pred_reg_qwords;
> >>>> +       };
> >>>> +       __u32 sample_simd_pred_reg_intr;
> >>>> +       __u32 sample_simd_pred_reg_user;
> >>>> +       __u16 sample_simd_vec_reg_qwords;
> >>>> +       __u64 sample_simd_vec_reg_intr;
> >>>> +       __u64 sample_simd_vec_reg_user;
> >>>> +       __u32 __reserved_4;
> >>>>  };
> >>>>
> >>>>  /*
> >>>> @@ -1018,7 +1039,15 @@ enum perf_event_type {
> >>>>          *      } && PERF_SAMPLE_BRANCH_STACK
> >>>>          *
> >>>>          *      { u64                   abi; # enum perf_sample_regs_abi
> >>>> -        *        u64                   regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
> >>>> +        *        u64                   regs[weight(mask)];
> >>>> +        *        struct {
> >>>> +        *              u16 nr_vectors;
> >>>> +        *              u16 vector_qwords;
> >>>> +        *              u16 nr_pred;
> >>>> +        *              u16 pred_qwords;
> >>>> +        *              u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
> >>>> +        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
> >>> Why can't these values be taken from the perf_event_attr? The abi is
> >>> needed as there could be both 32-bit and 64-bit samples for the same
> >>> event - presumably x32 appears as 64-bit. If the ABI has SIMD within
> >>> it (implied by the "} && (abi & PERF_SAMPLE_REGS_ABI_SIMD)" below)
> >>> then why can't we just use the perf_event_attr values? For example,
> >>> data could be "data[weight(sample_simd_vec_reg_user) *
> >>> sample_simd_vec_reg_qwords + weight(sample_simd_pred_reg_user) *
> >>> sample_simd_pred_reg_qwords]".
> >> The main reason is that the sampled SIMD regs may be only a subset of the
> >> requested SIMD regs in perf_event_attr, so we need to show the bitmask and
> >> qwords length explicitly in the sample record.
> > But this doesn't happen in any other register sampling, why in this case?
> >
> > Perhaps add comments along the lines:
> > u16 nr_vectors;  // weight(sample_simd_vec_reg_user) except when ...
> >
> > My random guess as to why the value differs from the weight would be
> > some kind of optimization around register values of 0. And even if the
> > number of registers is reduced, why is the number of qwords being
> > altered?
>
> Yes. E.g., the user may request the ZMM registers (ZMM0 ~ ZMM31), but
> only the XMM registers (XMM0 ~ XMM15) may actually be sampled at some
> point, so both the register count and the qwords length can differ from
> the perf_event_attr values in some sample records. Thus we need to
> indicate the sampled register count and length explicitly.
>
> Besides, carrying these 4 fields in the sample records makes them easier
> to parse, without needing to retrieve the information from the
> corresponding perf_event_attr. Thanks.

Sgtm (well you still need to look at the perf_event_attr for
regs[weight(mask)] immediately before this, but anyway). Can we add
comments to that effect? Something like:
```
       *              u16 nr_vectors;  # 0..weight(sample_simd_vec_reg_user)
       *              u16 vector_qwords; # 0..sample_simd_vec_reg_qwords
       *              u16 nr_pred; # 0..weight(sample_simd_pred_reg_user)
       *              u16 pred_qwords; # 0..sample_simd_pred_reg_qwords
```
At least this hints at an optimization rather than a duplication bug.
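For illustration, a consumer could walk the proposed record tail roughly
like this (just a sketch against the uapi fields quoted above; the struct
name and the helper are made up here, not part of the patch):

```c
#include <stdint.h>
#include <stddef.h>

/* Matches the record tail guarded by (abi & PERF_SAMPLE_REGS_ABI_SIMD). */
struct simd_regs_hdr {
	uint16_t nr_vectors;    /* 0..weight(sample_simd_vec_reg_user) */
	uint16_t vector_qwords; /* 0..sample_simd_vec_reg_qwords */
	uint16_t nr_pred;       /* 0..weight(sample_simd_pred_reg_user) */
	uint16_t pred_qwords;   /* 0..sample_simd_pred_reg_qwords */
};

/* Number of u64 payload words following the header in the sample. */
static size_t simd_regs_payload_qwords(const struct simd_regs_hdr *hdr)
{
	return (size_t)hdr->nr_vectors * hdr->vector_qwords +
	       (size_t)hdr->nr_pred * hdr->pred_qwords;
}
```

The point being that the parser only needs the four u16 fields, not the
originating perf_event_attr.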

Thanks,
Ian

> >
> > Thanks,
> > Ian
> >
> >>>> +        *      } && PERF_SAMPLE_REGS_USER
> >>>>          *
> >>>>          *      { u64                   size;
> >>>>          *        char                  data[size];
> >>>> @@ -1045,7 +1074,15 @@ enum perf_event_type {
> >>>>          *      { u64                   data_src; } && PERF_SAMPLE_DATA_SRC
> >>>>          *      { u64                   transaction; } && PERF_SAMPLE_TRANSACTION
> >>>>          *      { u64                   abi; # enum perf_sample_regs_abi
> >>>> -        *        u64                   regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
> >>>> +        *        u64                   regs[weight(mask)];
> >>>> +        *        struct {
> >>>> +        *              u16 nr_vectors;
> >>>> +        *              u16 vector_qwords;
> >>>> +        *              u16 nr_pred;
> >>>> +        *              u16 pred_qwords;
> >>>> +        *              u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
> >>>> +        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
> >>> Same comment.
> >>>
> >>> Thanks,
> >>> Ian
> >>>
> >>>> +        *      } && PERF_SAMPLE_REGS_INTR
> >>>>          *      { u64                   phys_addr;} && PERF_SAMPLE_PHYS_ADDR
> >>>>          *      { u64                   cgroup;} && PERF_SAMPLE_CGROUP
> >>>>          *      { u64                   data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
> >>>> --
> >>>> 2.34.1
> >>>>


* Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
  2026-01-20  9:04     ` Mi, Dapeng
@ 2026-01-20 18:20       ` Ian Rogers
  2026-01-21  5:17         ` Mi, Dapeng
  0 siblings, 1 reply; 86+ messages in thread
From: Ian Rogers @ 2026-01-20 18:20 UTC (permalink / raw)
  To: Mi, Dapeng
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang

On Tue, Jan 20, 2026 at 1:04 AM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>
>
> On 1/20/2026 3:39 PM, Ian Rogers wrote:
> > On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
> >> From: Kan Liang <kan.liang@linux.intel.com>
> >>
> >> This patch adds support for the newly introduced SIMD register sampling
> >> format by adding the following functions:
> >>
> >> uint64_t arch__intr_simd_reg_mask(void);
> >> uint64_t arch__user_simd_reg_mask(void);
> >> uint64_t arch__intr_pred_reg_mask(void);
> >> uint64_t arch__user_pred_reg_mask(void);
> >> uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >> uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >> uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >> uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>
> >> The arch__{intr|user}_simd_reg_mask() functions retrieve the bitmap of
> >> supported SIMD registers, such as XMM/YMM/ZMM on x86 platforms.
> >>
> >> The arch__{intr|user}_pred_reg_mask() functions retrieve the bitmap of
> >> supported PRED registers, such as OPMASK on x86 platforms.
> >>
> >> The arch__{intr|user}_simd_reg_bitmap_qwords() functions provide the
> >> exact bitmap and number of qwords for a specific type of SIMD register.
> >> For example, for XMM registers on x86 platforms, the returned bitmap is
> >> 0xffff (XMM0 ~ XMM15) and the qwords number is 2 (128 bits for each XMM).
> >>
> >> The arch__{intr|user}_pred_reg_bitmap_qwords() functions provide the
> >> exact bitmap and number of qwords for a specific type of PRED register.
> >> For example, for OPMASK registers on x86 platforms, the returned bitmap
> >> is 0xff (OPMASK0 ~ OPMASK7) and the qwords number is 1 (64 bits for each
> >> OPMASK).
> >>
> >> Additionally, the function __parse_regs() is enhanced to support parsing
> >> these newly introduced SIMD registers. Currently, each type of register
> >> can only be sampled collectively; sampling a specific SIMD register is
> >> not supported. For example, all XMM registers are sampled together rather
> >> than sampling only XMM0.
> >>
> >> When multiple overlapping register types, such as XMM and YMM, are
> >> sampled simultaneously, only the superset (YMM registers) is sampled.
> >>
> >> With this patch, all supported sampling registers on x86 platforms are
> >> displayed as follows.
> >>
> >>  $perf record -I?
> >>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
> >>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
> >>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
> >>
> >>  $perf record --user-regs=?
> >>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
> >>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
> >>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
> >>
> >> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> >> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >> ---
> >>  tools/perf/arch/x86/util/perf_regs.c      | 470 +++++++++++++++++++++-
> >>  tools/perf/util/evsel.c                   |  27 ++
> >>  tools/perf/util/parse-regs-options.c      | 151 ++++++-
> >>  tools/perf/util/perf_event_attr_fprintf.c |   6 +
> >>  tools/perf/util/perf_regs.c               |  59 +++
> >>  tools/perf/util/perf_regs.h               |  11 +
> >>  tools/perf/util/record.h                  |   6 +
> >>  7 files changed, 714 insertions(+), 16 deletions(-)
> >>
> >> diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
> >> index 12fd93f04802..db41430f3b07 100644
> >> --- a/tools/perf/arch/x86/util/perf_regs.c
> >> +++ b/tools/perf/arch/x86/util/perf_regs.c
> >> @@ -13,6 +13,49 @@
> >>  #include "../../../util/pmu.h"
> >>  #include "../../../util/pmus.h"
> >>
> >> +static const struct sample_reg sample_reg_masks_ext[] = {
> >> +       SMPL_REG(AX, PERF_REG_X86_AX),
> >> +       SMPL_REG(BX, PERF_REG_X86_BX),
> >> +       SMPL_REG(CX, PERF_REG_X86_CX),
> >> +       SMPL_REG(DX, PERF_REG_X86_DX),
> >> +       SMPL_REG(SI, PERF_REG_X86_SI),
> >> +       SMPL_REG(DI, PERF_REG_X86_DI),
> >> +       SMPL_REG(BP, PERF_REG_X86_BP),
> >> +       SMPL_REG(SP, PERF_REG_X86_SP),
> >> +       SMPL_REG(IP, PERF_REG_X86_IP),
> >> +       SMPL_REG(FLAGS, PERF_REG_X86_FLAGS),
> >> +       SMPL_REG(CS, PERF_REG_X86_CS),
> >> +       SMPL_REG(SS, PERF_REG_X86_SS),
> >> +#ifdef HAVE_ARCH_X86_64_SUPPORT
> >> +       SMPL_REG(R8, PERF_REG_X86_R8),
> >> +       SMPL_REG(R9, PERF_REG_X86_R9),
> >> +       SMPL_REG(R10, PERF_REG_X86_R10),
> >> +       SMPL_REG(R11, PERF_REG_X86_R11),
> >> +       SMPL_REG(R12, PERF_REG_X86_R12),
> >> +       SMPL_REG(R13, PERF_REG_X86_R13),
> >> +       SMPL_REG(R14, PERF_REG_X86_R14),
> >> +       SMPL_REG(R15, PERF_REG_X86_R15),
> >> +       SMPL_REG(R16, PERF_REG_X86_R16),
> >> +       SMPL_REG(R17, PERF_REG_X86_R17),
> >> +       SMPL_REG(R18, PERF_REG_X86_R18),
> >> +       SMPL_REG(R19, PERF_REG_X86_R19),
> >> +       SMPL_REG(R20, PERF_REG_X86_R20),
> >> +       SMPL_REG(R21, PERF_REG_X86_R21),
> >> +       SMPL_REG(R22, PERF_REG_X86_R22),
> >> +       SMPL_REG(R23, PERF_REG_X86_R23),
> >> +       SMPL_REG(R24, PERF_REG_X86_R24),
> >> +       SMPL_REG(R25, PERF_REG_X86_R25),
> >> +       SMPL_REG(R26, PERF_REG_X86_R26),
> >> +       SMPL_REG(R27, PERF_REG_X86_R27),
> >> +       SMPL_REG(R28, PERF_REG_X86_R28),
> >> +       SMPL_REG(R29, PERF_REG_X86_R29),
> >> +       SMPL_REG(R30, PERF_REG_X86_R30),
> >> +       SMPL_REG(R31, PERF_REG_X86_R31),
> >> +       SMPL_REG(SSP, PERF_REG_X86_SSP),
> >> +#endif
> >> +       SMPL_REG_END
> >> +};
> >> +
> >>  static const struct sample_reg sample_reg_masks[] = {
> >>         SMPL_REG(AX, PERF_REG_X86_AX),
> >>         SMPL_REG(BX, PERF_REG_X86_BX),
> >> @@ -276,27 +319,404 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
> >>         return SDT_ARG_VALID;
> >>  }
> >>
> >> +static bool support_simd_reg(u64 sample_type, u16 qwords, u64 mask, bool pred)
> >> +{
> >> +       struct perf_event_attr attr = {
> >> +               .type                           = PERF_TYPE_HARDWARE,
> >> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
> >> +               .sample_type                    = sample_type,
> >> +               .disabled                       = 1,
> >> +               .exclude_kernel                 = 1,
> >> +               .sample_simd_regs_enabled       = 1,
> >> +       };
> >> +       int fd;
> >> +
> >> +       attr.sample_period = 1;
> >> +
> >> +       if (!pred) {
> >> +               attr.sample_simd_vec_reg_qwords = qwords;
> >> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> >> +                       attr.sample_simd_vec_reg_intr = mask;
> >> +               else
> >> +                       attr.sample_simd_vec_reg_user = mask;
> >> +       } else {
> >> +               attr.sample_simd_pred_reg_qwords = PERF_X86_OPMASK_QWORDS;
> >> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> >> +                       attr.sample_simd_pred_reg_intr = PERF_X86_SIMD_PRED_MASK;
> >> +               else
> >> +                       attr.sample_simd_pred_reg_user = PERF_X86_SIMD_PRED_MASK;
> >> +       }
> >> +
> >> +       if (perf_pmus__num_core_pmus() > 1) {
> >> +               struct perf_pmu *pmu = NULL;
> >> +               __u64 type = PERF_TYPE_RAW;
> >> +
> >> +               /*
> >> +                * The same register set is supported among different hybrid PMUs.
> >> +                * Only check the first available one.
> >> +                */
> >> +               while ((pmu = perf_pmus__scan_core(pmu)) != NULL) {
> >> +                       type = pmu->type;
> >> +                       break;
> >> +               }
> >> +               attr.config |= type << PERF_PMU_TYPE_SHIFT;
> >> +       }
> >> +
> >> +       event_attr_init(&attr);
> >> +
> >> +       fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
> >> +       if (fd != -1) {
> >> +               close(fd);
> >> +               return true;
> >> +       }
> >> +
> >> +       return false;
> >> +}
> >> +
> >> +static bool __arch_simd_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
> >> +{
> >> +       bool supported = false;
> >> +       u64 bits;
> >> +
> >> +       *mask = 0;
> >> +       *qwords = 0;
> >> +
> >> +       switch (reg) {
> >> +       case PERF_REG_X86_XMM:
> >> +               bits = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
> >> +               supported = support_simd_reg(sample_type, PERF_X86_XMM_QWORDS, bits, false);
> >> +               if (supported) {
> >> +                       *mask = bits;
> >> +                       *qwords = PERF_X86_XMM_QWORDS;
> >> +               }
> >> +               break;
> >> +       case PERF_REG_X86_YMM:
> >> +               bits = BIT_ULL(PERF_X86_SIMD_YMM_REGS) - 1;
> >> +               supported = support_simd_reg(sample_type, PERF_X86_YMM_QWORDS, bits, false);
> >> +               if (supported) {
> >> +                       *mask = bits;
> >> +                       *qwords = PERF_X86_YMM_QWORDS;
> >> +               }
> >> +               break;
> >> +       case PERF_REG_X86_ZMM:
> >> +               bits = BIT_ULL(PERF_X86_SIMD_ZMM_REGS) - 1;
> >> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
> >> +               if (supported) {
> >> +                       *mask = bits;
> >> +                       *qwords = PERF_X86_ZMM_QWORDS;
> >> +                       break;
> >> +               }
> >> +
> >> +               bits = BIT_ULL(PERF_X86_SIMD_ZMMH_REGS) - 1;
> >> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
> >> +               if (supported) {
> >> +                       *mask = bits;
> >> +                       *qwords = PERF_X86_ZMMH_QWORDS;
> >> +               }
> >> +               break;
> >> +       default:
> >> +               break;
> >> +       }
> >> +
> >> +       return supported;
> >> +}
> >> +
> >> +static bool __arch_pred_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
> >> +{
> >> +       bool supported = false;
> >> +       u64 bits;
> >> +
> >> +       *mask = 0;
> >> +       *qwords = 0;
> >> +
> >> +       switch (reg) {
> >> +       case PERF_REG_X86_OPMASK:
> >> +               bits = BIT_ULL(PERF_X86_SIMD_OPMASK_REGS) - 1;
> >> +               supported = support_simd_reg(sample_type, PERF_X86_OPMASK_QWORDS, bits, true);
> >> +               if (supported) {
> >> +                       *mask = bits;
> >> +                       *qwords = PERF_X86_OPMASK_QWORDS;
> >> +               }
> >> +               break;
> >> +       default:
> >> +               break;
> >> +       }
> >> +
> >> +       return supported;
> >> +}
> >> +
> >> +static bool has_cap_simd_regs(void)
> >> +{
> >> +       uint64_t mask = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
> >> +       u16 qwords = PERF_X86_XMM_QWORDS;
> >> +       static bool has_cap_simd_regs;
> >> +       static bool cached;
> >> +
> >> +       if (cached)
> >> +               return has_cap_simd_regs;
> >> +
> >> +       has_cap_simd_regs = __arch_simd_reg_mask(PERF_SAMPLE_REGS_INTR,
> >> +                                                PERF_REG_X86_XMM, &mask, &qwords);
> >> +       has_cap_simd_regs |= __arch_simd_reg_mask(PERF_SAMPLE_REGS_USER,
> >> +                                                PERF_REG_X86_XMM, &mask, &qwords);
> >> +       cached = true;
> >> +
> >> +       return has_cap_simd_regs;
> >> +}
> >> +
> >> +bool arch_has_simd_regs(u64 mask)
> >> +{
> >> +       return has_cap_simd_regs() &&
> >> +              mask & GENMASK_ULL(PERF_REG_X86_SSP, PERF_REG_X86_R16);
> >> +}
> >> +
> >> +static const struct sample_reg sample_simd_reg_masks[] = {
> >> +       SMPL_REG(XMM, PERF_REG_X86_XMM),
> >> +       SMPL_REG(YMM, PERF_REG_X86_YMM),
> >> +       SMPL_REG(ZMM, PERF_REG_X86_ZMM),
> >> +       SMPL_REG_END
> >> +};
> >> +
> >> +static const struct sample_reg sample_pred_reg_masks[] = {
> >> +       SMPL_REG(OPMASK, PERF_REG_X86_OPMASK),
> >> +       SMPL_REG_END
> >> +};
> >> +
> >> +const struct sample_reg *arch__sample_simd_reg_masks(void)
> >> +{
> >> +       return sample_simd_reg_masks;
> >> +}
> >> +
> >> +const struct sample_reg *arch__sample_pred_reg_masks(void)
> >> +{
> >> +       return sample_pred_reg_masks;
> >> +}
> >> +
> >> +static bool x86_intr_simd_updated;
> >> +static u64 x86_intr_simd_reg_mask;
> >> +static u64 x86_intr_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
> >> +static u16 x86_intr_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
> >> +static bool x86_user_simd_updated;
> >> +static u64 x86_user_simd_reg_mask;
> >> +static u64 x86_user_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
> >> +static u16 x86_user_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
> >> +
> >> +static bool x86_intr_pred_updated;
> >> +static u64 x86_intr_pred_reg_mask;
> >> +static u64 x86_intr_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
> >> +static u16 x86_intr_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
> >> +static bool x86_user_pred_updated;
> >> +static u64 x86_user_pred_reg_mask;
> >> +static u64 x86_user_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
> >> +static u16 x86_user_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
> >> +
> >> +static uint64_t __arch__simd_reg_mask(u64 sample_type)
> >> +{
> >> +       const struct sample_reg *r = NULL;
> >> +       bool supported;
> >> +       u64 mask = 0;
> >> +       int reg;
> >> +
> >> +       if (!has_cap_simd_regs())
> >> +               return 0;
> >> +
> >> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_simd_updated)
> >> +               return x86_intr_simd_reg_mask;
> >> +
> >> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_simd_updated)
> >> +               return x86_user_simd_reg_mask;
> >> +
> >> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> >> +               supported = false;
> >> +
> >> +               if (!r->mask)
> >> +                       continue;
> >> +               reg = fls64(r->mask) - 1;
> >> +
> >> +               if (reg >= PERF_REG_X86_MAX_SIMD_REGS)
> >> +                       break;
> >> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> >> +                       supported = __arch_simd_reg_mask(sample_type, reg,
> >> +                                                        &x86_intr_simd_mask[reg],
> >> +                                                        &x86_intr_simd_qwords[reg]);
> >> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
> >> +                       supported = __arch_simd_reg_mask(sample_type, reg,
> >> +                                                        &x86_user_simd_mask[reg],
> >> +                                                        &x86_user_simd_qwords[reg]);
> >> +               if (supported)
> >> +                       mask |= BIT_ULL(reg);
> >> +       }
> >> +
> >> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
> >> +               x86_intr_simd_reg_mask = mask;
> >> +               x86_intr_simd_updated = true;
> >> +       } else {
> >> +               x86_user_simd_reg_mask = mask;
> >> +               x86_user_simd_updated = true;
> >> +       }
> >> +
> >> +       return mask;
> >> +}
> >> +
> >> +static uint64_t __arch__pred_reg_mask(u64 sample_type)
> >> +{
> >> +       const struct sample_reg *r = NULL;
> >> +       bool supported;
> >> +       u64 mask = 0;
> >> +       int reg;
> >> +
> >> +       if (!has_cap_simd_regs())
> >> +               return 0;
> >> +
> >> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_pred_updated)
> >> +               return x86_intr_pred_reg_mask;
> >> +
> >> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_pred_updated)
> >> +               return x86_user_pred_reg_mask;
> >> +
> >> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> >> +               supported = false;
> >> +
> >> +               if (!r->mask)
> >> +                       continue;
> >> +               reg = fls64(r->mask) - 1;
> >> +
> >> +               if (reg >= PERF_REG_X86_MAX_PRED_REGS)
> >> +                       break;
> >> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> >> +                       supported = __arch_pred_reg_mask(sample_type, reg,
> >> +                                                        &x86_intr_pred_mask[reg],
> >> +                                                        &x86_intr_pred_qwords[reg]);
> >> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
> >> +                       supported = __arch_pred_reg_mask(sample_type, reg,
> >> +                                                        &x86_user_pred_mask[reg],
> >> +                                                        &x86_user_pred_qwords[reg]);
> >> +               if (supported)
> >> +                       mask |= BIT_ULL(reg);
> >> +       }
> >> +
> >> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
> >> +               x86_intr_pred_reg_mask = mask;
> >> +               x86_intr_pred_updated = true;
> >> +       } else {
> >> +               x86_user_pred_reg_mask = mask;
> >> +               x86_user_pred_updated = true;
> >> +       }
> >> +
> >> +       return mask;
> >> +}
> >> +
> >> +uint64_t arch__intr_simd_reg_mask(void)
> >> +{
> >> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_INTR);
> >> +}
> >> +
> >> +uint64_t arch__user_simd_reg_mask(void)
> >> +{
> >> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_USER);
> >> +}
> >> +
> >> +uint64_t arch__intr_pred_reg_mask(void)
> >> +{
> >> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_INTR);
> >> +}
> >> +
> >> +uint64_t arch__user_pred_reg_mask(void)
> >> +{
> >> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_USER);
> >> +}
> >> +
> >> +static uint64_t arch__simd_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
> >> +{
> >> +       uint64_t mask = 0;
> >> +
> >> +       *qwords = 0;
> >> +       if (reg < PERF_REG_X86_MAX_SIMD_REGS) {
> >> +               if (intr) {
> >> +                       *qwords = x86_intr_simd_qwords[reg];
> >> +                       mask = x86_intr_simd_mask[reg];
> >> +               } else {
> >> +                       *qwords = x86_user_simd_qwords[reg];
> >> +                       mask = x86_user_simd_mask[reg];
> >> +               }
> >> +       }
> >> +
> >> +       return mask;
> >> +}
> >> +
> >> +static uint64_t arch__pred_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
> >> +{
> >> +       uint64_t mask = 0;
> >> +
> >> +       *qwords = 0;
> >> +       if (reg < PERF_REG_X86_MAX_PRED_REGS) {
> >> +               if (intr) {
> >> +                       *qwords = x86_intr_pred_qwords[reg];
> >> +                       mask = x86_intr_pred_mask[reg];
> >> +               } else {
> >> +                       *qwords = x86_user_pred_qwords[reg];
> >> +                       mask = x86_user_pred_mask[reg];
> >> +               }
> >> +       }
> >> +
> >> +       return mask;
> >> +}
> >> +
> >> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords)
> >> +{
> >> +       if (!x86_intr_simd_updated)
> >> +               arch__intr_simd_reg_mask();
> >> +       return arch__simd_reg_bitmap_qwords(reg, qwords, true);
> >> +}
> >> +
> >> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords)
> >> +{
> >> +       if (!x86_user_simd_updated)
> >> +               arch__user_simd_reg_mask();
> >> +       return arch__simd_reg_bitmap_qwords(reg, qwords, false);
> >> +}
> >> +
> >> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords)
> >> +{
> >> +       if (!x86_intr_pred_updated)
> >> +               arch__intr_pred_reg_mask();
> >> +       return arch__pred_reg_bitmap_qwords(reg, qwords, true);
> >> +}
> >> +
> >> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords)
> >> +{
> >> +       if (!x86_user_pred_updated)
> >> +               arch__user_pred_reg_mask();
> >> +       return arch__pred_reg_bitmap_qwords(reg, qwords, false);
> >> +}
> >> +
> >>  const struct sample_reg *arch__sample_reg_masks(void)
> >>  {
> >> +       if (has_cap_simd_regs())
> >> +               return sample_reg_masks_ext;
> >>         return sample_reg_masks;
> >>  }
> >>
> >> -uint64_t arch__intr_reg_mask(void)
> >> +static uint64_t __arch__reg_mask(u64 sample_type, u64 mask, bool has_simd_regs)
> >>  {
> >>         struct perf_event_attr attr = {
> >> -               .type                   = PERF_TYPE_HARDWARE,
> >> -               .config                 = PERF_COUNT_HW_CPU_CYCLES,
> >> -               .sample_type            = PERF_SAMPLE_REGS_INTR,
> >> -               .sample_regs_intr       = PERF_REG_EXTENDED_MASK,
> >> -               .precise_ip             = 1,
> >> -               .disabled               = 1,
> >> -               .exclude_kernel         = 1,
> >> +               .type                           = PERF_TYPE_HARDWARE,
> >> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
> >> +               .sample_type                    = sample_type,
> >> +               .precise_ip                     = 1,
> >> +               .disabled                       = 1,
> >> +               .exclude_kernel                 = 1,
> >> +               .sample_simd_regs_enabled       = has_simd_regs,
> >>         };
> >>         int fd;
> >>         /*
> >>          * In an unnamed union, init it here to build on older gcc versions
> >>          */
> >>         attr.sample_period = 1;
> >> +       if (sample_type == PERF_SAMPLE_REGS_INTR)
> >> +               attr.sample_regs_intr = mask;
> >> +       else
> >> +               attr.sample_regs_user = mask;
> >>
> >>         if (perf_pmus__num_core_pmus() > 1) {
> >>                 struct perf_pmu *pmu = NULL;
> >> @@ -318,13 +738,41 @@ uint64_t arch__intr_reg_mask(void)
> >>         fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
> >>         if (fd != -1) {
> >>                 close(fd);
> >> -               return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
> >> +               return mask;
> >>         }
> >>
> >> -       return PERF_REGS_MASK;
> >> +       return 0;
> >> +}
> >> +
> >> +uint64_t arch__intr_reg_mask(void)
> >> +{
> >> +       uint64_t mask = PERF_REGS_MASK;
> >> +
> >> +       if (has_cap_simd_regs()) {
> >> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
> >> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
> >> +                                        true);
> >> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
> >> +                                        BIT_ULL(PERF_REG_X86_SSP),
> >> +                                        true);
> >> +       } else {
> >> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR, PERF_REG_EXTENDED_MASK, false);
> >> +       }
> >> +
> >> +       return mask;
> >>  }
> >>
> >>  uint64_t arch__user_reg_mask(void)
> >>  {
> >> -       return PERF_REGS_MASK;
> >> +       uint64_t mask = PERF_REGS_MASK;
> >> +
> >> +       if (has_cap_simd_regs()) {
> >> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
> >> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
> >> +                                        true);
> >> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
> >> +                                        BIT_ULL(PERF_REG_X86_SSP),
> >> +                                        true);
> >> +       }
> >> +
> >> +       return mask;
> >>  }
> >> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> >> index 56ebefd075f2..5d1d90cf9488 100644
> >> --- a/tools/perf/util/evsel.c
> >> +++ b/tools/perf/util/evsel.c
> >> @@ -1461,12 +1461,39 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
> >>         if (opts->sample_intr_regs && !evsel->no_aux_samples &&
> >>             !evsel__is_dummy_event(evsel)) {
> >>                 attr->sample_regs_intr = opts->sample_intr_regs;
> >> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_intr);
> >> +               evsel__set_sample_bit(evsel, REGS_INTR);
> >> +       }
> >> +
> >> +       if ((opts->sample_intr_vec_regs || opts->sample_intr_pred_regs) &&
> >> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
> >> +               /* A non-zero pred qwords implies the set of SIMD registers is used */
> >> +               if (opts->sample_pred_regs_qwords)
> >> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
> >> +               else
> >> +                       attr->sample_simd_pred_reg_qwords = 1;
> >> +               attr->sample_simd_vec_reg_intr = opts->sample_intr_vec_regs;
> >> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
> >> +               attr->sample_simd_pred_reg_intr = opts->sample_intr_pred_regs;
> >>                 evsel__set_sample_bit(evsel, REGS_INTR);
> >>         }
> >>
> >>         if (opts->sample_user_regs && !evsel->no_aux_samples &&
> >>             !evsel__is_dummy_event(evsel)) {
> >>                 attr->sample_regs_user |= opts->sample_user_regs;
> >> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_user);
> >> +               evsel__set_sample_bit(evsel, REGS_USER);
> >> +       }
> >> +
> >> +       if ((opts->sample_user_vec_regs || opts->sample_user_pred_regs) &&
> >> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
> >> +               if (opts->sample_pred_regs_qwords)
> >> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
> >> +               else
> >> +                       attr->sample_simd_pred_reg_qwords = 1;
> >> +               attr->sample_simd_vec_reg_user = opts->sample_user_vec_regs;
> >> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
> >> +               attr->sample_simd_pred_reg_user = opts->sample_user_pred_regs;
> >>                 evsel__set_sample_bit(evsel, REGS_USER);
> >>         }
> >>
> >> diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
> >> index cda1c620968e..0bd100392889 100644
> >> --- a/tools/perf/util/parse-regs-options.c
> >> +++ b/tools/perf/util/parse-regs-options.c
> >> @@ -4,19 +4,139 @@
> >>  #include <stdint.h>
> >>  #include <string.h>
> >>  #include <stdio.h>
> >> +#include <linux/bitops.h>
> >>  #include "util/debug.h"
> >>  #include <subcmd/parse-options.h>
> >>  #include "util/perf_regs.h"
> >>  #include "util/parse-regs-options.h"
> >> +#include "record.h"
> >> +
> >> +static void __print_simd_regs(bool intr, uint64_t simd_mask)
> >> +{
> >> +       const struct sample_reg *r = NULL;
> >> +       uint64_t bitmap = 0;
> >> +       u16 qwords = 0;
> >> +       int reg_idx;
> >> +
> >> +       if (!simd_mask)
> >> +               return;
> >> +
> >> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> >> +               if (!(r->mask & simd_mask))
> >> +                       continue;
> >> +               reg_idx = fls64(r->mask) - 1;
> >> +               if (intr)
> >> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >> +               else
> >> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >> +               if (bitmap)
> >> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
> >> +       }
> >> +}
> >> +
> >> +static void __print_pred_regs(bool intr, uint64_t pred_mask)
> >> +{
> >> +       const struct sample_reg *r = NULL;
> >> +       uint64_t bitmap = 0;
> >> +       u16 qwords = 0;
> >> +       int reg_idx;
> >> +
> >> +       if (!pred_mask)
> >> +               return;
> >> +
> >> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> >> +               if (!(r->mask & pred_mask))
> >> +                       continue;
> >> +               reg_idx = fls64(r->mask) - 1;
> >> +               if (intr)
> >> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >> +               else
> >> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >> +               if (bitmap)
> >> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
> >> +       }
> >> +}
> >> +
> >> +static bool __parse_simd_regs(struct record_opts *opts, char *s, bool intr)
> >> +{
> >> +       const struct sample_reg *r = NULL;
> >> +       bool matched = false;
> >> +       uint64_t bitmap = 0;
> >> +       u16 qwords = 0;
> >> +       int reg_idx;
> >> +
> >> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> >> +               if (strcasecmp(s, r->name))
> >> +                       continue;
> >> +               if (!fls64(r->mask))
> >> +                       continue;
> >> +               reg_idx = fls64(r->mask) - 1;
> >> +               if (intr)
> >> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >> +               else
> >> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >> +               matched = true;
> >> +               break;
> >> +       }
> >> +
> >> +       /* Just need the highest qwords */
> >> +       if (qwords > opts->sample_vec_regs_qwords) {
> >> +               opts->sample_vec_regs_qwords = qwords;
> >> +               if (intr)
> >> +                       opts->sample_intr_vec_regs = bitmap;
> >> +               else
> >> +                       opts->sample_user_vec_regs = bitmap;
> >> +       }
> >> +
> >> +       return matched;
> >> +}
> >> +
> >> +static bool __parse_pred_regs(struct record_opts *opts, char *s, bool intr)
> >> +{
> >> +       const struct sample_reg *r = NULL;
> >> +       bool matched = false;
> >> +       uint64_t bitmap = 0;
> >> +       u16 qwords = 0;
> >> +       int reg_idx;
> >> +
> >> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> >> +               if (strcasecmp(s, r->name))
> >> +                       continue;
> >> +               if (!fls64(r->mask))
> >> +                       continue;
> >> +               reg_idx = fls64(r->mask) - 1;
> >> +               if (intr)
> >> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >> +               else
> >> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >> +               matched = true;
> >> +               break;
> >> +       }
> >> +
> >> +       /* Just need the highest qwords */
> >> +       if (qwords > opts->sample_pred_regs_qwords) {
> >> +               opts->sample_pred_regs_qwords = qwords;
> >> +               if (intr)
> >> +                       opts->sample_intr_pred_regs = bitmap;
> >> +               else
> >> +                       opts->sample_user_pred_regs = bitmap;
> >> +       }
> >> +
> >> +       return matched;
> >> +}
> >>
> >>  static int
> >>  __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>  {
> >>         uint64_t *mode = (uint64_t *)opt->value;
> >>         const struct sample_reg *r = NULL;
> >> +       struct record_opts *opts;
> >>         char *s, *os = NULL, *p;
> >> -       int ret = -1;
> >> +       bool has_simd_regs = false;
> >>         uint64_t mask;
> >> +       uint64_t simd_mask;
> >> +       uint64_t pred_mask;
> >> +       int ret = -1;
> >>
> >>         if (unset)
> >>                 return 0;
> >> @@ -27,10 +147,17 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>         if (*mode)
> >>                 return -1;
> >>
> >> -       if (intr)
> >> +       if (intr) {
> >> +               opts = container_of(opt->value, struct record_opts, sample_intr_regs);
> >>                 mask = arch__intr_reg_mask();
> >> -       else
> >> +               simd_mask = arch__intr_simd_reg_mask();
> >> +               pred_mask = arch__intr_pred_reg_mask();
> >> +       } else {
> >> +               opts = container_of(opt->value, struct record_opts, sample_user_regs);
> >>                 mask = arch__user_reg_mask();
> >> +               simd_mask = arch__user_simd_reg_mask();
> >> +               pred_mask = arch__user_pred_reg_mask();
> >> +       }
> >>
> >>         /* str may be NULL in case no arg is passed to -I */
> >>         if (str) {
> >> @@ -50,10 +177,24 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>                                         if (r->mask & mask)
> >>                                                 fprintf(stderr, "%s ", r->name);
> >>                                 }
> >> +                               __print_simd_regs(intr, simd_mask);
> >> +                               __print_pred_regs(intr, pred_mask);
> >>                                 fputc('\n', stderr);
> >>                                 /* just printing available regs */
> >>                                 goto error;
> >>                         }
> >> +
> >> +                       if (simd_mask) {
> >> +                               has_simd_regs = __parse_simd_regs(opts, s, intr);
> >> +                               if (has_simd_regs)
> >> +                                       goto next;
> >> +                       }
> >> +                       if (pred_mask) {
> >> +                               has_simd_regs = __parse_pred_regs(opts, s, intr);
> >> +                               if (has_simd_regs)
> >> +                                       goto next;
> >> +                       }
> >> +
> >>                         for (r = arch__sample_reg_masks(); r->name; r++) {
> >>                                 if ((r->mask & mask) && !strcasecmp(s, r->name))
> >>                                         break;
> >> @@ -65,7 +206,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>                         }
> >>
> >>                         *mode |= r->mask;
> >> -
> >> +next:
> >>                         if (!p)
> >>                                 break;
> >>
> >> @@ -75,7 +216,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>         ret = 0;
> >>
> >>         /* default to all possible regs */
> >> -       if (*mode == 0)
> >> +       if (*mode == 0 && !has_simd_regs)
> >>                 *mode = mask;
> >>  error:
> >>         free(os);
> >> diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
> >> index 66b666d9ce64..fb0366d050cf 100644
> >> --- a/tools/perf/util/perf_event_attr_fprintf.c
> >> +++ b/tools/perf/util/perf_event_attr_fprintf.c
> >> @@ -360,6 +360,12 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
> >>         PRINT_ATTRf(aux_start_paused, p_unsigned);
> >>         PRINT_ATTRf(aux_pause, p_unsigned);
> >>         PRINT_ATTRf(aux_resume, p_unsigned);
> >> +       PRINT_ATTRf(sample_simd_pred_reg_qwords, p_unsigned);
> >> +       PRINT_ATTRf(sample_simd_pred_reg_intr, p_hex);
> >> +       PRINT_ATTRf(sample_simd_pred_reg_user, p_hex);
> >> +       PRINT_ATTRf(sample_simd_vec_reg_qwords, p_unsigned);
> >> +       PRINT_ATTRf(sample_simd_vec_reg_intr, p_hex);
> >> +       PRINT_ATTRf(sample_simd_vec_reg_user, p_hex);
> >>
> >>         return ret;
> >>  }
> >> diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
> >> index 44b90bbf2d07..e8a9fabc92e6 100644
> >> --- a/tools/perf/util/perf_regs.c
> >> +++ b/tools/perf/util/perf_regs.c
> >> @@ -11,6 +11,11 @@ int __weak arch_sdt_arg_parse_op(char *old_op __maybe_unused,
> >>         return SDT_ARG_SKIP;
> >>  }
> >>
> >> +bool __weak arch_has_simd_regs(u64 mask __maybe_unused)
> >> +{
> >> +       return false;
> >> +}
> >> +
> >>  uint64_t __weak arch__intr_reg_mask(void)
> >>  {
> >>         return 0;
> >> @@ -21,6 +26,50 @@ uint64_t __weak arch__user_reg_mask(void)
> >>         return 0;
> >>  }
> >>
> >> +uint64_t __weak arch__intr_simd_reg_mask(void)
> >> +{
> >> +       return 0;
> >> +}
> >> +
> >> +uint64_t __weak arch__user_simd_reg_mask(void)
> >> +{
> >> +       return 0;
> >> +}
> >> +
> >> +uint64_t __weak arch__intr_pred_reg_mask(void)
> >> +{
> >> +       return 0;
> >> +}
> >> +
> >> +uint64_t __weak arch__user_pred_reg_mask(void)
> >> +{
> >> +       return 0;
> >> +}
> >> +
> >> +uint64_t __weak arch__intr_simd_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
> >> +{
> >> +       *qwords = 0;
> >> +       return 0;
> >> +}
> >> +
> >> +uint64_t __weak arch__user_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> >> +{
> >> +       *qwords = 0;
> >> +       return 0;
> >> +}
> >> +
> >> +uint64_t __weak arch__intr_pred_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
> >> +{
> >> +       *qwords = 0;
> >> +       return 0;
> >> +}
> >> +
> >> +uint64_t __weak arch__user_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> >> +{
> >> +       *qwords = 0;
> >> +       return 0;
> >> +}
> >> +
> >>  static const struct sample_reg sample_reg_masks[] = {
> >>         SMPL_REG_END
> >>  };
> >> @@ -30,6 +79,16 @@ const struct sample_reg * __weak arch__sample_reg_masks(void)
> >>         return sample_reg_masks;
> >>  }
> >>
> >> +const struct sample_reg * __weak arch__sample_simd_reg_masks(void)
> >> +{
> >> +       return sample_reg_masks;
> >> +}
> >> +
> >> +const struct sample_reg * __weak arch__sample_pred_reg_masks(void)
> >> +{
> >> +       return sample_reg_masks;
> >> +}
> >> +
> >>  const char *perf_reg_name(int id, const char *arch)
> >>  {
> >>         const char *reg_name = NULL;
> >> diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
> >> index f2d0736d65cc..bce9c4cfd1bf 100644
> >> --- a/tools/perf/util/perf_regs.h
> >> +++ b/tools/perf/util/perf_regs.h
> >> @@ -24,9 +24,20 @@ enum {
> >>  };
> >>
> >>  int arch_sdt_arg_parse_op(char *old_op, char **new_op);
> >> +bool arch_has_simd_regs(u64 mask);
> >>  uint64_t arch__intr_reg_mask(void);
> >>  uint64_t arch__user_reg_mask(void);
> >>  const struct sample_reg *arch__sample_reg_masks(void);
> >> +const struct sample_reg *arch__sample_simd_reg_masks(void);
> >> +const struct sample_reg *arch__sample_pred_reg_masks(void);
> > I wonder we can remove these functions. perf_reg_name(int id, uint16_t
> > e_machine) maps a perf register number and e_machine to a string. So
> > the sample_reg array could be replaced with:
> > ```
> > for (int perf_reg = 0; perf_reg < 64; perf_reg++) {
> >   uint64_t mask = 1LL << perf_reg;
> >   const char *name = perf_reg_name(perf_reg, EM_HOST);
> >   if (name == NULL)
> >     break;
> >   // use mask and name
> > }
> > ```
> > To make it work for SIMD and PRED then I guess we need to iterate
> > through the ABIs of enum perf_sample_regs_abi.
>
> Suppose so.
>
>
> >
> >> +uint64_t arch__intr_simd_reg_mask(void);
> >> +uint64_t arch__user_simd_reg_mask(void);
> >> +uint64_t arch__intr_pred_reg_mask(void);
> >> +uint64_t arch__user_pred_reg_mask(void);
> > I think some comments would be useful here like:
> > ```
> > /* Perf register bit map with valid bits for
> > perf_event_attr.sample_regs_user. */
> > uint64_t arch__intr_reg_mask(void);
> > /* Perf register bit map with valid bits for
> > perf_event_attr.sample_regs_intr. */
> > uint64_t arch__user_reg_mask(void);
> > /* Perf register bit map with valid bits for
> > perf_event_attr.sample_simd_vec_reg_intr. */
> > uint64_t arch__intr_simd_reg_mask(void);
> > /* Perf register bit map with valid bits for
> > perf_event_attr.sample_simd_vec_reg_user. */
> > uint64_t arch__user_simd_reg_mask(void);
> > /* Perf register bit map with valid bits for
> > perf_event_attr.sample_simd_pred_reg_intr. */
> > uint64_t arch__intr_pred_reg_mask(void);
> > /* Perf register bit map with valid bits for
> > perf_event_attr.sample_simd_pred_reg_user. */
> > uint64_t arch__user_pred_reg_mask(void);
>
> Sure. Thanks.
>
>
> > ```
> >
> > Why does arch__user_pred_reg_mask return a uint64_t when the
> > perf_event_attr variable is a __u32?
>
> Suppose it's a bug. :)
>
>
> >
> >> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> > I don't understand these functions. The qwords value is specific to a
> > perf_event_attr. We could have an evlist with an evsel set up to
> > sample say XMM registers and another evsel set up to sample ZMM
> > registers. Are the qwords here always for the ZMM case, or is XMM,
> > YMM, ZMM depending on architecture support? Why does it vary per
> > register? The surrounding code uses the term mask but here bitmap is
> > used, is the inconsistency deliberate? Why are there user and intr
> > functions when in the perf_event_attr there are only
> > sample_simd_pred_reg_qwords and sample_simd_vec_reg_qwords variables?
>
> These 4 functions are designed to get the bitmask and qwords length for a
> specific class of SIMD registers. E.g., for XMM on x86 platforms, the
> returned bitmask is 0xffff (xmm0 ~ xmm15) and the qwords length is 2 (128
> bits). For ZMM on x86 platforms, if the platform only supports 16 ZMM
> registers, the returned bitmask is 0xffff (zmm0 ~ zmm15) and the qwords
> length is 8 (512 bits). If the platform supports 32 ZMM registers, the
> returned bitmask is 0xffffffff (zmm0 ~ zmm31) and the qwords length is 8
> (512 bits).

What is the meaning of reg? In this file it is normally the integer
index for a bit in the sample_regs_user mask, but for x86 I don't see
enum perf_event_x86_regs having differing XMM, YMM and ZMM encodings.
Similarly, qwords appears to be an out argument, but then you also have
the bitmap. It looks like the code is caching values, but that assumes a
single qwords length for all events.

> Since the qwords length is always fixed for any given SIMD register,
> regardless of intr or user, there is only one
> sample_simd_pred_reg_qwords or sample_simd_vec_reg_qwords variable.

Ok.  2 variables, but 4 functions here. I think there should just be 2
because of this.

Thanks,
Ian

> >
> > Perhaps these functions should be something more like:
> > ```
> > /* Maximum value that can be assigned to
> > perf_event_atttr.sample_simd_pred_reg_qwords. */
> > uint16_t arch__simd_pred_reg_qwords_max(void);
> > /* Maximum value that can be assigned to
> > perf_event_atttr.sample_simd_vec_reg_qwords. */
> > uint16_t arch__simd_vec_reg_qwords_max(void);
> > ```
> > Then the bitmap computation logic can all be moved into parse-regs-options.c.
> >
> > Thanks,
> > Ian
> >
> >>  const char *perf_reg_name(int id, const char *arch);
> >>  int perf_reg_value(u64 *valp, struct regs_dump *regs, int id);
> >> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
> >> index ea3a6c4657ee..825ffb4cc53f 100644
> >> --- a/tools/perf/util/record.h
> >> +++ b/tools/perf/util/record.h
> >> @@ -59,7 +59,13 @@ struct record_opts {
> >>         unsigned int  user_freq;
> >>         u64           branch_stack;
> >>         u64           sample_intr_regs;
> >> +       u64           sample_intr_vec_regs;
> >>         u64           sample_user_regs;
> >> +       u64           sample_user_vec_regs;
> >> +       u16           sample_pred_regs_qwords;
> >> +       u16           sample_vec_regs_qwords;
> >> +       u16           sample_intr_pred_regs;
> >> +       u16           sample_user_pred_regs;
> >>         u64           default_interval;
> >>         u64           user_interval;
> >>         size_t        auxtrace_snapshot_size;
> >> --
> >> 2.34.1
> >>


* Re: [Patch v5 17/19] perf headers: Sync with the kernel headers
  2026-01-20 18:11           ` Ian Rogers
@ 2026-01-21  2:03             ` Mi, Dapeng
  0 siblings, 0 replies; 86+ messages in thread
From: Mi, Dapeng @ 2026-01-21  2:03 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang


On 1/21/2026 2:11 AM, Ian Rogers wrote:
> On Tue, Jan 20, 2026 at 1:22 AM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>
>> On 1/20/2026 4:00 PM, Ian Rogers wrote:
>>> On Mon, Jan 19, 2026 at 11:43 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>>> On 1/20/2026 3:16 PM, Ian Rogers wrote:
>>>>> On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
>>>>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>>>>
>>>>>> Update include/uapi/linux/perf_event.h and
>>>>>> arch/x86/include/uapi/asm/perf_regs.h to support extended regs.
>>>>>>
>>>>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>>>>>> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>>>>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>>>>> ---
>>>>>>  tools/arch/x86/include/uapi/asm/perf_regs.h | 62 +++++++++++++++++++++
>>>>>>  tools/include/uapi/linux/perf_event.h       | 45 +++++++++++++--
>>>>>>  2 files changed, 103 insertions(+), 4 deletions(-)
>>>>>>
>>>>>> diff --git a/tools/arch/x86/include/uapi/asm/perf_regs.h b/tools/arch/x86/include/uapi/asm/perf_regs.h
>>>>>> index 7c9d2bb3833b..f3561ed10041 100644
>>>>>> --- a/tools/arch/x86/include/uapi/asm/perf_regs.h
>>>>>> +++ b/tools/arch/x86/include/uapi/asm/perf_regs.h
>>>>>> @@ -27,9 +27,34 @@ enum perf_event_x86_regs {
>>>>>>         PERF_REG_X86_R13,
>>>>>>         PERF_REG_X86_R14,
>>>>>>         PERF_REG_X86_R15,
>>>>>> +       /*
>>>>>> +        * The EGPRs/SSP and XMM have overlaps. Only one can be used
>>>>>> +        * at a time. For the ABI type PERF_SAMPLE_REGS_ABI_SIMD,
>>>>>> +        * utilize EGPRs/SSP. For the other ABI type, XMM is used.
>>>>>> +        *
>>>>>> +        * Extended GPRs (EGPRs)
>>>>>> +        */
>>>>>> +       PERF_REG_X86_R16,
>>>>>> +       PERF_REG_X86_R17,
>>>>>> +       PERF_REG_X86_R18,
>>>>>> +       PERF_REG_X86_R19,
>>>>>> +       PERF_REG_X86_R20,
>>>>>> +       PERF_REG_X86_R21,
>>>>>> +       PERF_REG_X86_R22,
>>>>>> +       PERF_REG_X86_R23,
>>>>>> +       PERF_REG_X86_R24,
>>>>>> +       PERF_REG_X86_R25,
>>>>>> +       PERF_REG_X86_R26,
>>>>>> +       PERF_REG_X86_R27,
>>>>>> +       PERF_REG_X86_R28,
>>>>>> +       PERF_REG_X86_R29,
>>>>>> +       PERF_REG_X86_R30,
>>>>>> +       PERF_REG_X86_R31,
>>>>>> +       PERF_REG_X86_SSP,
>>>>>>         /* These are the limits for the GPRs. */
>>>>>>         PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
>>>>>>         PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1,
>>>>>> +       PERF_REG_MISC_MAX = PERF_REG_X86_SSP + 1,
>>>>>>
>>>>>>         /* These all need two bits set because they are 128bit */
>>>>>>         PERF_REG_X86_XMM0  = 32,
>>>>>> @@ -54,5 +79,42 @@ enum perf_event_x86_regs {
>>>>>>  };
>>>>>>
>>>>>>  #define PERF_REG_EXTENDED_MASK (~((1ULL << PERF_REG_X86_XMM0) - 1))
>>>>>> +#define PERF_X86_EGPRS_MASK    GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16)
>>>>>> +
>>>>>> +enum {
>>>>>> +       PERF_REG_X86_XMM,
>>>>>> +       PERF_REG_X86_YMM,
>>>>>> +       PERF_REG_X86_ZMM,
>>>>>> +       PERF_REG_X86_MAX_SIMD_REGS,
>>>>>> +
>>>>>> +       PERF_REG_X86_OPMASK = 0,
>>>>>> +       PERF_REG_X86_MAX_PRED_REGS = 1,
>>>>>> +};
>>>>>> +
>>>>>> +enum {
>>>>>> +       PERF_X86_SIMD_XMM_REGS      = 16,
>>>>>> +       PERF_X86_SIMD_YMM_REGS      = 16,
>>>>>> +       PERF_X86_SIMD_ZMMH_REGS     = 16,
>>>>>> +       PERF_X86_SIMD_ZMM_REGS      = 32,
>>>>>> +       PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_ZMM_REGS,
>>>>>> +
>>>>>> +       PERF_X86_SIMD_OPMASK_REGS   = 8,
>>>>>> +       PERF_X86_SIMD_PRED_REGS_MAX = PERF_X86_SIMD_OPMASK_REGS,
>>>>>> +};
>>>>>> +
>>>>>> +#define PERF_X86_SIMD_PRED_MASK                GENMASK(PERF_X86_SIMD_PRED_REGS_MAX - 1, 0)
>>>>>> +#define PERF_X86_SIMD_VEC_MASK         GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
>>>>>> +
>>>>>> +#define PERF_X86_H16ZMM_BASE           PERF_X86_SIMD_ZMMH_REGS
>>>>>> +
>>>>>> +enum {
>>>>>> +       PERF_X86_OPMASK_QWORDS   = 1,
>>>>>> +       PERF_X86_XMM_QWORDS      = 2,
>>>>>> +       PERF_X86_YMMH_QWORDS     = 2,
>>>>>> +       PERF_X86_YMM_QWORDS      = 4,
>>>>>> +       PERF_X86_ZMMH_QWORDS     = 4,
>>>>>> +       PERF_X86_ZMM_QWORDS      = 8,
>>>>>> +       PERF_X86_SIMD_QWORDS_MAX = PERF_X86_ZMM_QWORDS,
>>>>>> +};
>>>>>>
>>>>>>  #endif /* _ASM_X86_PERF_REGS_H */
>>>>>> diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
>>>>>> index d292f96bc06f..f1474da32622 100644
>>>>>> --- a/tools/include/uapi/linux/perf_event.h
>>>>>> +++ b/tools/include/uapi/linux/perf_event.h
>>>>>> @@ -314,8 +314,9 @@ enum {
>>>>>>   */
>>>>>>  enum perf_sample_regs_abi {
>>>>>>         PERF_SAMPLE_REGS_ABI_NONE               = 0,
>>>>>> -       PERF_SAMPLE_REGS_ABI_32                 = 1,
>>>>>> -       PERF_SAMPLE_REGS_ABI_64                 = 2,
>>>>>> +       PERF_SAMPLE_REGS_ABI_32                 = (1 << 0),
>>>>>> +       PERF_SAMPLE_REGS_ABI_64                 = (1 << 1),
>>>>>> +       PERF_SAMPLE_REGS_ABI_SIMD               = (1 << 2),
>>>>>>  };
>>>>>>
>>>>>>  /*
>>>>>> @@ -382,6 +383,7 @@ enum perf_event_read_format {
>>>>>>  #define PERF_ATTR_SIZE_VER6                    120     /* Add: aux_sample_size */
>>>>>>  #define PERF_ATTR_SIZE_VER7                    128     /* Add: sig_data */
>>>>>>  #define PERF_ATTR_SIZE_VER8                    136     /* Add: config3 */
>>>>>> +#define PERF_ATTR_SIZE_VER9                    168     /* Add: sample_simd_{pred,vec}_reg_* */
>>>>>>
>>>>>>  /*
>>>>>>   * 'struct perf_event_attr' contains various attributes that define
>>>>>> @@ -545,6 +547,25 @@ struct perf_event_attr {
>>>>>>         __u64   sig_data;
>>>>>>
>>>>>>         __u64   config3; /* extension of config2 */
>>>>>> +
>>>>>> +
>>>>>> +       /*
>>>>>> +        * Defines the set of SIMD registers to dump on samples.
>>>>>> +        * A non-zero sample_simd_regs_enabled implies that the
>>>>>> +        * SIMD register set is used to configure all SIMD registers.
>>>>>> +        * If !sample_simd_regs_enabled, sample_regs_XXX may be used
>>>>>> +        * to configure some SIMD registers on x86.
>>>>>> +        */
>>>>>> +       union {
>>>>>> +               __u16 sample_simd_regs_enabled;
>>>>>> +               __u16 sample_simd_pred_reg_qwords;
>>>>>> +       };
>>>>>> +       __u32 sample_simd_pred_reg_intr;
>>>>>> +       __u32 sample_simd_pred_reg_user;
>>>>>> +       __u16 sample_simd_vec_reg_qwords;
>>>>>> +       __u64 sample_simd_vec_reg_intr;
>>>>>> +       __u64 sample_simd_vec_reg_user;
>>>>>> +       __u32 __reserved_4;
>>>>>>  };
>>>>>>
>>>>>>  /*
>>>>>> @@ -1018,7 +1039,15 @@ enum perf_event_type {
>>>>>>          *      } && PERF_SAMPLE_BRANCH_STACK
>>>>>>          *
>>>>>>          *      { u64                   abi; # enum perf_sample_regs_abi
>>>>>> -        *        u64                   regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
>>>>>> +        *        u64                   regs[weight(mask)];
>>>>>> +        *        struct {
>>>>>> +        *              u16 nr_vectors;
>>>>>> +        *              u16 vector_qwords;
>>>>>> +        *              u16 nr_pred;
>>>>>> +        *              u16 pred_qwords;
>>>>>> +        *              u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
>>>>>> +        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
>>>>> Why can't these values be taken from the perf_event_attr? The abi is
>>>>> needed as there could be both 32-bit and 64-bit samples for the same
>>>>> event - presumably x32 appears as 64-bit. If the ABI has SIMD within
>>>>> it (implied by the "} && (abi & PERF_SAMPLE_REGS_ABI_SIMD)" below)
>>>>> then why can't we just use the perf_event_attr values? For example,
>>>>> data could be "data[weight(sample_simd_vec_reg_user) *
>>>>> sample_simd_vec_reg_qwords + weight(sample_simd_pred_reg_user) *
>>>>> sample_simd_pred_reg_qwords]".
>>>> The main reason is that the sampled SIMD regs may be only a subset of the
>>>> requested SIMD regs in perf_event_attr, so we need to show the bitmask and
>>>> qwords length explicitly in the sample record.
>>> But this doesn't happen in any other register sampling, why in this case?
>>>
>>> Perhaps add comments along the lines:
>>> u16 nr_vectors;  // weight(sample_simd_vec_reg_user) except when ...
>>>
>>> My random guess as to why the value differs from the weight would be
>>> some kind of optimization around register values of 0. And even if the
>>> number of registers is reduced, why is the number of qwords being
>>> altered?
>> Yes. E.g., the user may want to sample ZMM registers (ZMM0 ~ ZMM31), but
>> only XMM registers (XMM0 ~ XMM15) may actually be sampled at some point,
>> so both the register count and the qwords length differ from the
>> perf_event_attr values in some sampling records. Thus we need to
>> indicate the sampled register count and length explicitly.
>>
>> Besides, carrying these 4 fields in the sampling records makes the records
>> easier to parse, with no need to retrieve the information from the
>> corresponding perf_event_attr. Thanks.
> Sgtm (well you still need to look at the perf_event_attr for
> regs[weight(mask)] immediately before this, but anyway). Can we add
> comments to that effect? Something like:
> ```
>        *              u16 nr_vectors;  # 0..weight(sample_simd_vec_reg_user)
>        *              u16 vector_qwords; # 0..sample_simd_vec_reg_qwords
>        *              u16 nr_pred; # 0..weight(sample_simd_pred_reg_user)
>        *              u16 pred_qwords; # 0..sample_simd_pred_reg_qwords
> ```
> At least this hints at an optimization rather than a duplication bug.

Sure. Thanks.


>
> Thanks,
> Ian
>
>>> Thanks,
>>> Ian
>>>
>>>>>> +        *      } && PERF_SAMPLE_REGS_USER
>>>>>>          *
>>>>>>          *      { u64                   size;
>>>>>>          *        char                  data[size];
>>>>>> @@ -1045,7 +1074,15 @@ enum perf_event_type {
>>>>>>          *      { u64                   data_src; } && PERF_SAMPLE_DATA_SRC
>>>>>>          *      { u64                   transaction; } && PERF_SAMPLE_TRANSACTION
>>>>>>          *      { u64                   abi; # enum perf_sample_regs_abi
>>>>>> -        *        u64                   regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
>>>>>> +        *        u64                   regs[weight(mask)];
>>>>>> +        *        struct {
>>>>>> +        *              u16 nr_vectors;
>>>>>> +        *              u16 vector_qwords;
>>>>>> +        *              u16 nr_pred;
>>>>>> +        *              u16 pred_qwords;
>>>>>> +        *              u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
>>>>>> +        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
>>>>> Same comment.
>>>>>
>>>>> Thanks,
>>>>> Ian
>>>>>
>>>>>> +        *      } && PERF_SAMPLE_REGS_INTR
>>>>>>          *      { u64                   phys_addr;} && PERF_SAMPLE_PHYS_ADDR
>>>>>>          *      { u64                   cgroup;} && PERF_SAMPLE_CGROUP
>>>>>>          *      { u64                   data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
>>>>>> --
>>>>>> 2.34.1
>>>>>>


* Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
  2026-01-20 18:20       ` Ian Rogers
@ 2026-01-21  5:17         ` Mi, Dapeng
  2026-01-21  7:09           ` Ian Rogers
  0 siblings, 1 reply; 86+ messages in thread
From: Mi, Dapeng @ 2026-01-21  5:17 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang


On 1/21/2026 2:20 AM, Ian Rogers wrote:
> On Tue, Jan 20, 2026 at 1:04 AM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>
>> On 1/20/2026 3:39 PM, Ian Rogers wrote:
>>> On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
>>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>>
>>>> This patch adds support for the newly introduced SIMD register sampling
>>>> format by adding the following functions:
>>>>
>>>> uint64_t arch__intr_simd_reg_mask(void);
>>>> uint64_t arch__user_simd_reg_mask(void);
>>>> uint64_t arch__intr_pred_reg_mask(void);
>>>> uint64_t arch__user_pred_reg_mask(void);
>>>> uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>> uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>> uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>> uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>
>>>> The arch__{intr|user}_simd_reg_mask() functions retrieve the bitmap of
>>>> supported SIMD registers, such as XMM/YMM/ZMM on x86 platforms.
>>>>
>>>> The arch__{intr|user}_pred_reg_mask() functions retrieve the bitmap of
>>>> supported PRED registers, such as OPMASK on x86 platforms.
>>>>
>>>> The arch__{intr|user}_simd_reg_bitmap_qwords() functions provide the
>>>> exact bitmap and number of qwords for a specific type of SIMD register.
>>>> For example, for XMM registers on x86 platforms, the returned bitmap is
>>>> 0xffff (XMM0 ~ XMM15) and the qwords number is 2 (128 bits for each XMM).
>>>>
>>>> The arch__{intr|user}_pred_reg_bitmap_qwords() functions provide the
>>>> exact bitmap and number of qwords for a specific type of PRED register.
>>>> For example, for OPMASK registers on x86 platforms, the returned bitmap
>>>> is 0xff (OPMASK0 ~ OPMASK7) and the qwords number is 1 (64 bits for each
>>>> OPMASK).
>>>>
>>>> Additionally, the function __parse_regs() is enhanced to support parsing
>>>> these newly introduced SIMD registers. Currently, each type of register
>>>> can only be sampled collectively; sampling a specific SIMD register is
>>>> not supported. For example, all XMM registers are sampled together rather
>>>> than sampling only XMM0.
>>>>
>>>> When multiple overlapping register types, such as XMM and YMM, are
>>>> sampled simultaneously, only the superset (YMM registers) is sampled.
>>>>
>>>> With this patch, all supported sampling registers on x86 platforms are
>>>> displayed as follows.
>>>>
>>>>  $perf record -I?
>>>>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>>>>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>>>>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>>>>
>>>>  $perf record --user-regs=?
>>>>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>>>>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>>>>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>>>>
>>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>>>> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>>> ---
>>>>  tools/perf/arch/x86/util/perf_regs.c      | 470 +++++++++++++++++++++-
>>>>  tools/perf/util/evsel.c                   |  27 ++
>>>>  tools/perf/util/parse-regs-options.c      | 151 ++++++-
>>>>  tools/perf/util/perf_event_attr_fprintf.c |   6 +
>>>>  tools/perf/util/perf_regs.c               |  59 +++
>>>>  tools/perf/util/perf_regs.h               |  11 +
>>>>  tools/perf/util/record.h                  |   6 +
>>>>  7 files changed, 714 insertions(+), 16 deletions(-)
>>>>
>>>> diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
>>>> index 12fd93f04802..db41430f3b07 100644
>>>> --- a/tools/perf/arch/x86/util/perf_regs.c
>>>> +++ b/tools/perf/arch/x86/util/perf_regs.c
>>>> @@ -13,6 +13,49 @@
>>>>  #include "../../../util/pmu.h"
>>>>  #include "../../../util/pmus.h"
>>>>
>>>> +static const struct sample_reg sample_reg_masks_ext[] = {
>>>> +       SMPL_REG(AX, PERF_REG_X86_AX),
>>>> +       SMPL_REG(BX, PERF_REG_X86_BX),
>>>> +       SMPL_REG(CX, PERF_REG_X86_CX),
>>>> +       SMPL_REG(DX, PERF_REG_X86_DX),
>>>> +       SMPL_REG(SI, PERF_REG_X86_SI),
>>>> +       SMPL_REG(DI, PERF_REG_X86_DI),
>>>> +       SMPL_REG(BP, PERF_REG_X86_BP),
>>>> +       SMPL_REG(SP, PERF_REG_X86_SP),
>>>> +       SMPL_REG(IP, PERF_REG_X86_IP),
>>>> +       SMPL_REG(FLAGS, PERF_REG_X86_FLAGS),
>>>> +       SMPL_REG(CS, PERF_REG_X86_CS),
>>>> +       SMPL_REG(SS, PERF_REG_X86_SS),
>>>> +#ifdef HAVE_ARCH_X86_64_SUPPORT
>>>> +       SMPL_REG(R8, PERF_REG_X86_R8),
>>>> +       SMPL_REG(R9, PERF_REG_X86_R9),
>>>> +       SMPL_REG(R10, PERF_REG_X86_R10),
>>>> +       SMPL_REG(R11, PERF_REG_X86_R11),
>>>> +       SMPL_REG(R12, PERF_REG_X86_R12),
>>>> +       SMPL_REG(R13, PERF_REG_X86_R13),
>>>> +       SMPL_REG(R14, PERF_REG_X86_R14),
>>>> +       SMPL_REG(R15, PERF_REG_X86_R15),
>>>> +       SMPL_REG(R16, PERF_REG_X86_R16),
>>>> +       SMPL_REG(R17, PERF_REG_X86_R17),
>>>> +       SMPL_REG(R18, PERF_REG_X86_R18),
>>>> +       SMPL_REG(R19, PERF_REG_X86_R19),
>>>> +       SMPL_REG(R20, PERF_REG_X86_R20),
>>>> +       SMPL_REG(R21, PERF_REG_X86_R21),
>>>> +       SMPL_REG(R22, PERF_REG_X86_R22),
>>>> +       SMPL_REG(R23, PERF_REG_X86_R23),
>>>> +       SMPL_REG(R24, PERF_REG_X86_R24),
>>>> +       SMPL_REG(R25, PERF_REG_X86_R25),
>>>> +       SMPL_REG(R26, PERF_REG_X86_R26),
>>>> +       SMPL_REG(R27, PERF_REG_X86_R27),
>>>> +       SMPL_REG(R28, PERF_REG_X86_R28),
>>>> +       SMPL_REG(R29, PERF_REG_X86_R29),
>>>> +       SMPL_REG(R30, PERF_REG_X86_R30),
>>>> +       SMPL_REG(R31, PERF_REG_X86_R31),
>>>> +       SMPL_REG(SSP, PERF_REG_X86_SSP),
>>>> +#endif
>>>> +       SMPL_REG_END
>>>> +};
>>>> +
>>>>  static const struct sample_reg sample_reg_masks[] = {
>>>>         SMPL_REG(AX, PERF_REG_X86_AX),
>>>>         SMPL_REG(BX, PERF_REG_X86_BX),
>>>> @@ -276,27 +319,404 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
>>>>         return SDT_ARG_VALID;
>>>>  }
>>>>
>>>> +static bool support_simd_reg(u64 sample_type, u16 qwords, u64 mask, bool pred)
>>>> +{
>>>> +       struct perf_event_attr attr = {
>>>> +               .type                           = PERF_TYPE_HARDWARE,
>>>> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
>>>> +               .sample_type                    = sample_type,
>>>> +               .disabled                       = 1,
>>>> +               .exclude_kernel                 = 1,
>>>> +               .sample_simd_regs_enabled       = 1,
>>>> +       };
>>>> +       int fd;
>>>> +
>>>> +       attr.sample_period = 1;
>>>> +
>>>> +       if (!pred) {
>>>> +               attr.sample_simd_vec_reg_qwords = qwords;
>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>> +                       attr.sample_simd_vec_reg_intr = mask;
>>>> +               else
>>>> +                       attr.sample_simd_vec_reg_user = mask;
>>>> +       } else {
>>>> +               attr.sample_simd_pred_reg_qwords = PERF_X86_OPMASK_QWORDS;
>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>> +                       attr.sample_simd_pred_reg_intr = PERF_X86_SIMD_PRED_MASK;
>>>> +               else
>>>> +                       attr.sample_simd_pred_reg_user = PERF_X86_SIMD_PRED_MASK;
>>>> +       }
>>>> +
>>>> +       if (perf_pmus__num_core_pmus() > 1) {
>>>> +               struct perf_pmu *pmu = NULL;
>>>> +               __u64 type = PERF_TYPE_RAW;
>>>> +
>>>> +               /*
>>>> +                * The same register set is supported among different hybrid PMUs.
>>>> +                * Only check the first available one.
>>>> +                */
>>>> +               while ((pmu = perf_pmus__scan_core(pmu)) != NULL) {
>>>> +                       type = pmu->type;
>>>> +                       break;
>>>> +               }
>>>> +               attr.config |= type << PERF_PMU_TYPE_SHIFT;
>>>> +       }
>>>> +
>>>> +       event_attr_init(&attr);
>>>> +
>>>> +       fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
>>>> +       if (fd != -1) {
>>>> +               close(fd);
>>>> +               return true;
>>>> +       }
>>>> +
>>>> +       return false;
>>>> +}
>>>> +
>>>> +static bool __arch_simd_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
>>>> +{
>>>> +       bool supported = false;
>>>> +       u64 bits;
>>>> +
>>>> +       *mask = 0;
>>>> +       *qwords = 0;
>>>> +
>>>> +       switch (reg) {
>>>> +       case PERF_REG_X86_XMM:
>>>> +               bits = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
>>>> +               supported = support_simd_reg(sample_type, PERF_X86_XMM_QWORDS, bits, false);
>>>> +               if (supported) {
>>>> +                       *mask = bits;
>>>> +                       *qwords = PERF_X86_XMM_QWORDS;
>>>> +               }
>>>> +               break;
>>>> +       case PERF_REG_X86_YMM:
>>>> +               bits = BIT_ULL(PERF_X86_SIMD_YMM_REGS) - 1;
>>>> +               supported = support_simd_reg(sample_type, PERF_X86_YMM_QWORDS, bits, false);
>>>> +               if (supported) {
>>>> +                       *mask = bits;
>>>> +                       *qwords = PERF_X86_YMM_QWORDS;
>>>> +               }
>>>> +               break;
>>>> +       case PERF_REG_X86_ZMM:
>>>> +               bits = BIT_ULL(PERF_X86_SIMD_ZMM_REGS) - 1;
>>>> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
>>>> +               if (supported) {
>>>> +                       *mask = bits;
>>>> +                       *qwords = PERF_X86_ZMM_QWORDS;
>>>> +                       break;
>>>> +               }
>>>> +
>>>> +               bits = BIT_ULL(PERF_X86_SIMD_ZMMH_REGS) - 1;
>>>> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
>>>> +               if (supported) {
>>>> +                       *mask = bits;
>>>> +                       *qwords = PERF_X86_ZMMH_QWORDS;
>>>> +               }
>>>> +               break;
>>>> +       default:
>>>> +               break;
>>>> +       }
>>>> +
>>>> +       return supported;
>>>> +}
>>>> +
>>>> +static bool __arch_pred_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
>>>> +{
>>>> +       bool supported = false;
>>>> +       u64 bits;
>>>> +
>>>> +       *mask = 0;
>>>> +       *qwords = 0;
>>>> +
>>>> +       switch (reg) {
>>>> +       case PERF_REG_X86_OPMASK:
>>>> +               bits = BIT_ULL(PERF_X86_SIMD_OPMASK_REGS) - 1;
>>>> +               supported = support_simd_reg(sample_type, PERF_X86_OPMASK_QWORDS, bits, true);
>>>> +               if (supported) {
>>>> +                       *mask = bits;
>>>> +                       *qwords = PERF_X86_OPMASK_QWORDS;
>>>> +               }
>>>> +               break;
>>>> +       default:
>>>> +               break;
>>>> +       }
>>>> +
>>>> +       return supported;
>>>> +}
>>>> +
>>>> +static bool has_cap_simd_regs(void)
>>>> +{
>>>> +       uint64_t mask = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
>>>> +       u16 qwords = PERF_X86_XMM_QWORDS;
>>>> +       static bool has_cap_simd_regs;
>>>> +       static bool cached;
>>>> +
>>>> +       if (cached)
>>>> +               return has_cap_simd_regs;
>>>> +
>>>> +       has_cap_simd_regs = __arch_simd_reg_mask(PERF_SAMPLE_REGS_INTR,
>>>> +                                                PERF_REG_X86_XMM, &mask, &qwords);
>>>> +       has_cap_simd_regs |= __arch_simd_reg_mask(PERF_SAMPLE_REGS_USER,
>>>> +                                                PERF_REG_X86_XMM, &mask, &qwords);
>>>> +       cached = true;
>>>> +
>>>> +       return has_cap_simd_regs;
>>>> +}
>>>> +
>>>> +bool arch_has_simd_regs(u64 mask)
>>>> +{
>>>> +       return has_cap_simd_regs() &&
>>>> +              mask & GENMASK_ULL(PERF_REG_X86_SSP, PERF_REG_X86_R16);
>>>> +}
>>>> +
>>>> +static const struct sample_reg sample_simd_reg_masks[] = {
>>>> +       SMPL_REG(XMM, PERF_REG_X86_XMM),
>>>> +       SMPL_REG(YMM, PERF_REG_X86_YMM),
>>>> +       SMPL_REG(ZMM, PERF_REG_X86_ZMM),
>>>> +       SMPL_REG_END
>>>> +};
>>>> +
>>>> +static const struct sample_reg sample_pred_reg_masks[] = {
>>>> +       SMPL_REG(OPMASK, PERF_REG_X86_OPMASK),
>>>> +       SMPL_REG_END
>>>> +};
>>>> +
>>>> +const struct sample_reg *arch__sample_simd_reg_masks(void)
>>>> +{
>>>> +       return sample_simd_reg_masks;
>>>> +}
>>>> +
>>>> +const struct sample_reg *arch__sample_pred_reg_masks(void)
>>>> +{
>>>> +       return sample_pred_reg_masks;
>>>> +}
>>>> +
>>>> +static bool x86_intr_simd_updated;
>>>> +static u64 x86_intr_simd_reg_mask;
>>>> +static u64 x86_intr_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
>>>> +static u16 x86_intr_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
>>>> +static bool x86_user_simd_updated;
>>>> +static u64 x86_user_simd_reg_mask;
>>>> +static u64 x86_user_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
>>>> +static u16 x86_user_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
>>>> +
>>>> +static bool x86_intr_pred_updated;
>>>> +static u64 x86_intr_pred_reg_mask;
>>>> +static u64 x86_intr_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
>>>> +static u16 x86_intr_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
>>>> +static bool x86_user_pred_updated;
>>>> +static u64 x86_user_pred_reg_mask;
>>>> +static u64 x86_user_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
>>>> +static u16 x86_user_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
>>>> +
>>>> +static uint64_t __arch__simd_reg_mask(u64 sample_type)
>>>> +{
>>>> +       const struct sample_reg *r = NULL;
>>>> +       bool supported;
>>>> +       u64 mask = 0;
>>>> +       int reg;
>>>> +
>>>> +       if (!has_cap_simd_regs())
>>>> +               return 0;
>>>> +
>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_simd_updated)
>>>> +               return x86_intr_simd_reg_mask;
>>>> +
>>>> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_simd_updated)
>>>> +               return x86_user_simd_reg_mask;
>>>> +
>>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>>>> +               supported = false;
>>>> +
>>>> +               if (!r->mask)
>>>> +                       continue;
>>>> +               reg = fls64(r->mask) - 1;
>>>> +
>>>> +               if (reg >= PERF_REG_X86_MAX_SIMD_REGS)
>>>> +                       break;
>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>> +                       supported = __arch_simd_reg_mask(sample_type, reg,
>>>> +                                                        &x86_intr_simd_mask[reg],
>>>> +                                                        &x86_intr_simd_qwords[reg]);
>>>> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
>>>> +                       supported = __arch_simd_reg_mask(sample_type, reg,
>>>> +                                                        &x86_user_simd_mask[reg],
>>>> +                                                        &x86_user_simd_qwords[reg]);
>>>> +               if (supported)
>>>> +                       mask |= BIT_ULL(reg);
>>>> +       }
>>>> +
>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
>>>> +               x86_intr_simd_reg_mask = mask;
>>>> +               x86_intr_simd_updated = true;
>>>> +       } else {
>>>> +               x86_user_simd_reg_mask = mask;
>>>> +               x86_user_simd_updated = true;
>>>> +       }
>>>> +
>>>> +       return mask;
>>>> +}
>>>> +
>>>> +static uint64_t __arch__pred_reg_mask(u64 sample_type)
>>>> +{
>>>> +       const struct sample_reg *r = NULL;
>>>> +       bool supported;
>>>> +       u64 mask = 0;
>>>> +       int reg;
>>>> +
>>>> +       if (!has_cap_simd_regs())
>>>> +               return 0;
>>>> +
>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_pred_updated)
>>>> +               return x86_intr_pred_reg_mask;
>>>> +
>>>> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_pred_updated)
>>>> +               return x86_user_pred_reg_mask;
>>>> +
>>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>>>> +               supported = false;
>>>> +
>>>> +               if (!r->mask)
>>>> +                       continue;
>>>> +               reg = fls64(r->mask) - 1;
>>>> +
>>>> +               if (reg >= PERF_REG_X86_MAX_PRED_REGS)
>>>> +                       break;
>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>> +                       supported = __arch_pred_reg_mask(sample_type, reg,
>>>> +                                                        &x86_intr_pred_mask[reg],
>>>> +                                                        &x86_intr_pred_qwords[reg]);
>>>> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
>>>> +                       supported = __arch_pred_reg_mask(sample_type, reg,
>>>> +                                                        &x86_user_pred_mask[reg],
>>>> +                                                        &x86_user_pred_qwords[reg]);
>>>> +               if (supported)
>>>> +                       mask |= BIT_ULL(reg);
>>>> +       }
>>>> +
>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
>>>> +               x86_intr_pred_reg_mask = mask;
>>>> +               x86_intr_pred_updated = true;
>>>> +       } else {
>>>> +               x86_user_pred_reg_mask = mask;
>>>> +               x86_user_pred_updated = true;
>>>> +       }
>>>> +
>>>> +       return mask;
>>>> +}
>>>> +
>>>> +uint64_t arch__intr_simd_reg_mask(void)
>>>> +{
>>>> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_INTR);
>>>> +}
>>>> +
>>>> +uint64_t arch__user_simd_reg_mask(void)
>>>> +{
>>>> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_USER);
>>>> +}
>>>> +
>>>> +uint64_t arch__intr_pred_reg_mask(void)
>>>> +{
>>>> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_INTR);
>>>> +}
>>>> +
>>>> +uint64_t arch__user_pred_reg_mask(void)
>>>> +{
>>>> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_USER);
>>>> +}
>>>> +
>>>> +static uint64_t arch__simd_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
>>>> +{
>>>> +       uint64_t mask = 0;
>>>> +
>>>> +       *qwords = 0;
>>>> +       if (reg < PERF_REG_X86_MAX_SIMD_REGS) {
>>>> +               if (intr) {
>>>> +                       *qwords = x86_intr_simd_qwords[reg];
>>>> +                       mask = x86_intr_simd_mask[reg];
>>>> +               } else {
>>>> +                       *qwords = x86_user_simd_qwords[reg];
>>>> +                       mask = x86_user_simd_mask[reg];
>>>> +               }
>>>> +       }
>>>> +
>>>> +       return mask;
>>>> +}
>>>> +
>>>> +static uint64_t arch__pred_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
>>>> +{
>>>> +       uint64_t mask = 0;
>>>> +
>>>> +       *qwords = 0;
>>>> +       if (reg < PERF_REG_X86_MAX_PRED_REGS) {
>>>> +               if (intr) {
>>>> +                       *qwords = x86_intr_pred_qwords[reg];
>>>> +                       mask = x86_intr_pred_mask[reg];
>>>> +               } else {
>>>> +                       *qwords = x86_user_pred_qwords[reg];
>>>> +                       mask = x86_user_pred_mask[reg];
>>>> +               }
>>>> +       }
>>>> +
>>>> +       return mask;
>>>> +}
>>>> +
>>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords)
>>>> +{
>>>> +       if (!x86_intr_simd_updated)
>>>> +               arch__intr_simd_reg_mask();
>>>> +       return arch__simd_reg_bitmap_qwords(reg, qwords, true);
>>>> +}
>>>> +
>>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords)
>>>> +{
>>>> +       if (!x86_user_simd_updated)
>>>> +               arch__user_simd_reg_mask();
>>>> +       return arch__simd_reg_bitmap_qwords(reg, qwords, false);
>>>> +}
>>>> +
>>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords)
>>>> +{
>>>> +       if (!x86_intr_pred_updated)
>>>> +               arch__intr_pred_reg_mask();
>>>> +       return arch__pred_reg_bitmap_qwords(reg, qwords, true);
>>>> +}
>>>> +
>>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords)
>>>> +{
>>>> +       if (!x86_user_pred_updated)
>>>> +               arch__user_pred_reg_mask();
>>>> +       return arch__pred_reg_bitmap_qwords(reg, qwords, false);
>>>> +}
>>>> +
>>>>  const struct sample_reg *arch__sample_reg_masks(void)
>>>>  {
>>>> +       if (has_cap_simd_regs())
>>>> +               return sample_reg_masks_ext;
>>>>         return sample_reg_masks;
>>>>  }
>>>>
>>>> -uint64_t arch__intr_reg_mask(void)
>>>> +static uint64_t __arch__reg_mask(u64 sample_type, u64 mask, bool has_simd_regs)
>>>>  {
>>>>         struct perf_event_attr attr = {
>>>> -               .type                   = PERF_TYPE_HARDWARE,
>>>> -               .config                 = PERF_COUNT_HW_CPU_CYCLES,
>>>> -               .sample_type            = PERF_SAMPLE_REGS_INTR,
>>>> -               .sample_regs_intr       = PERF_REG_EXTENDED_MASK,
>>>> -               .precise_ip             = 1,
>>>> -               .disabled               = 1,
>>>> -               .exclude_kernel         = 1,
>>>> +               .type                           = PERF_TYPE_HARDWARE,
>>>> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
>>>> +               .sample_type                    = sample_type,
>>>> +               .precise_ip                     = 1,
>>>> +               .disabled                       = 1,
>>>> +               .exclude_kernel                 = 1,
>>>> +               .sample_simd_regs_enabled       = has_simd_regs,
>>>>         };
>>>>         int fd;
>>>>         /*
>>>>          * In an unnamed union, init it here to build on older gcc versions
>>>>          */
>>>>         attr.sample_period = 1;
>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>> +               attr.sample_regs_intr = mask;
>>>> +       else
>>>> +               attr.sample_regs_user = mask;
>>>>
>>>>         if (perf_pmus__num_core_pmus() > 1) {
>>>>                 struct perf_pmu *pmu = NULL;
>>>> @@ -318,13 +738,41 @@ uint64_t arch__intr_reg_mask(void)
>>>>         fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
>>>>         if (fd != -1) {
>>>>                 close(fd);
>>>> -               return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
>>>> +               return mask;
>>>>         }
>>>>
>>>> -       return PERF_REGS_MASK;
>>>> +       return 0;
>>>> +}
>>>> +
>>>> +uint64_t arch__intr_reg_mask(void)
>>>> +{
>>>> +       uint64_t mask = PERF_REGS_MASK;
>>>> +
>>>> +       if (has_cap_simd_regs()) {
>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
>>>> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
>>>> +                                        true);
>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
>>>> +                                        BIT_ULL(PERF_REG_X86_SSP),
>>>> +                                        true);
>>>> +       } else
>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR, PERF_REG_EXTENDED_MASK, false);
>>>> +
>>>> +       return mask;
>>>>  }
>>>>
>>>>  uint64_t arch__user_reg_mask(void)
>>>>  {
>>>> -       return PERF_REGS_MASK;
>>>> +       uint64_t mask = PERF_REGS_MASK;
>>>> +
>>>> +       if (has_cap_simd_regs()) {
>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
>>>> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
>>>> +                                        true);
>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
>>>> +                                        BIT_ULL(PERF_REG_X86_SSP),
>>>> +                                        true);
>>>> +       }
>>>> +
>>>> +       return mask;
>>>>  }
>>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>>>> index 56ebefd075f2..5d1d90cf9488 100644
>>>> --- a/tools/perf/util/evsel.c
>>>> +++ b/tools/perf/util/evsel.c
>>>> @@ -1461,12 +1461,39 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
>>>>         if (opts->sample_intr_regs && !evsel->no_aux_samples &&
>>>>             !evsel__is_dummy_event(evsel)) {
>>>>                 attr->sample_regs_intr = opts->sample_intr_regs;
>>>> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_intr);
>>>> +               evsel__set_sample_bit(evsel, REGS_INTR);
>>>> +       }
>>>> +
>>>> +       if ((opts->sample_intr_vec_regs || opts->sample_intr_pred_regs) &&
>>>> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
>>>> +               /* A nonzero pred qwords implies the SIMD register set format is in use */
>>>> +               if (opts->sample_pred_regs_qwords)
>>>> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
>>>> +               else
>>>> +                       attr->sample_simd_pred_reg_qwords = 1;
>>>> +               attr->sample_simd_vec_reg_intr = opts->sample_intr_vec_regs;
>>>> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
>>>> +               attr->sample_simd_pred_reg_intr = opts->sample_intr_pred_regs;
>>>>                 evsel__set_sample_bit(evsel, REGS_INTR);
>>>>         }
>>>>
>>>>         if (opts->sample_user_regs && !evsel->no_aux_samples &&
>>>>             !evsel__is_dummy_event(evsel)) {
>>>>                 attr->sample_regs_user |= opts->sample_user_regs;
>>>> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_user);
>>>> +               evsel__set_sample_bit(evsel, REGS_USER);
>>>> +       }
>>>> +
>>>> +       if ((opts->sample_user_vec_regs || opts->sample_user_pred_regs) &&
>>>> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
>>>> +               if (opts->sample_pred_regs_qwords)
>>>> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
>>>> +               else
>>>> +                       attr->sample_simd_pred_reg_qwords = 1;
>>>> +               attr->sample_simd_vec_reg_user = opts->sample_user_vec_regs;
>>>> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
>>>> +               attr->sample_simd_pred_reg_user = opts->sample_user_pred_regs;
>>>>                 evsel__set_sample_bit(evsel, REGS_USER);
>>>>         }
>>>>
>>>> diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
>>>> index cda1c620968e..0bd100392889 100644
>>>> --- a/tools/perf/util/parse-regs-options.c
>>>> +++ b/tools/perf/util/parse-regs-options.c
>>>> @@ -4,19 +4,139 @@
>>>>  #include <stdint.h>
>>>>  #include <string.h>
>>>>  #include <stdio.h>
>>>> +#include <linux/bitops.h>
>>>>  #include "util/debug.h"
>>>>  #include <subcmd/parse-options.h>
>>>>  #include "util/perf_regs.h"
>>>>  #include "util/parse-regs-options.h"
>>>> +#include "record.h"
>>>> +
>>>> +static void __print_simd_regs(bool intr, uint64_t simd_mask)
>>>> +{
>>>> +       const struct sample_reg *r = NULL;
>>>> +       uint64_t bitmap = 0;
>>>> +       u16 qwords = 0;
>>>> +       int reg_idx;
>>>> +
>>>> +       if (!simd_mask)
>>>> +               return;
>>>> +
>>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>>>> +               if (!(r->mask & simd_mask))
>>>> +                       continue;
>>>> +               reg_idx = fls64(r->mask) - 1;
>>>> +               if (intr)
>>>> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>> +               else
>>>> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>> +               if (bitmap)
>>>> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
>>>> +       }
>>>> +}
>>>> +
>>>> +static void __print_pred_regs(bool intr, uint64_t pred_mask)
>>>> +{
>>>> +       const struct sample_reg *r = NULL;
>>>> +       uint64_t bitmap = 0;
>>>> +       u16 qwords = 0;
>>>> +       int reg_idx;
>>>> +
>>>> +       if (!pred_mask)
>>>> +               return;
>>>> +
>>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>>>> +               if (!(r->mask & pred_mask))
>>>> +                       continue;
>>>> +               reg_idx = fls64(r->mask) - 1;
>>>> +               if (intr)
>>>> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>> +               else
>>>> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>> +               if (bitmap)
>>>> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
>>>> +       }
>>>> +}
>>>> +
>>>> +static bool __parse_simd_regs(struct record_opts *opts, char *s, bool intr)
>>>> +{
>>>> +       const struct sample_reg *r = NULL;
>>>> +       bool matched = false;
>>>> +       uint64_t bitmap = 0;
>>>> +       u16 qwords = 0;
>>>> +       int reg_idx;
>>>> +
>>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>>>> +               if (strcasecmp(s, r->name))
>>>> +                       continue;
>>>> +               if (!fls64(r->mask))
>>>> +                       continue;
>>>> +               reg_idx = fls64(r->mask) - 1;
>>>> +               if (intr)
>>>> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>> +               else
>>>> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>> +               matched = true;
>>>> +               break;
>>>> +       }
>>>> +
>>>> +       /* Just need the highest qwords */
>>>> +       if (qwords > opts->sample_vec_regs_qwords) {
>>>> +               opts->sample_vec_regs_qwords = qwords;
>>>> +               if (intr)
>>>> +                       opts->sample_intr_vec_regs = bitmap;
>>>> +               else
>>>> +                       opts->sample_user_vec_regs = bitmap;
>>>> +       }
>>>> +
>>>> +       return matched;
>>>> +}
>>>> +
>>>> +static bool __parse_pred_regs(struct record_opts *opts, char *s, bool intr)
>>>> +{
>>>> +       const struct sample_reg *r = NULL;
>>>> +       bool matched = false;
>>>> +       uint64_t bitmap = 0;
>>>> +       u16 qwords = 0;
>>>> +       int reg_idx;
>>>> +
>>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>>>> +               if (strcasecmp(s, r->name))
>>>> +                       continue;
>>>> +               if (!fls64(r->mask))
>>>> +                       continue;
>>>> +               reg_idx = fls64(r->mask) - 1;
>>>> +               if (intr)
>>>> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>> +               else
>>>> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>> +               matched = true;
>>>> +               break;
>>>> +       }
>>>> +
>>>> +       /* Just need the highest qwords */
>>>> +       if (qwords > opts->sample_pred_regs_qwords) {
>>>> +               opts->sample_pred_regs_qwords = qwords;
>>>> +               if (intr)
>>>> +                       opts->sample_intr_pred_regs = bitmap;
>>>> +               else
>>>> +                       opts->sample_user_pred_regs = bitmap;
>>>> +       }
>>>> +
>>>> +       return matched;
>>>> +}
>>>>
>>>>  static int
>>>>  __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>  {
>>>>         uint64_t *mode = (uint64_t *)opt->value;
>>>>         const struct sample_reg *r = NULL;
>>>> +       struct record_opts *opts;
>>>>         char *s, *os = NULL, *p;
>>>> -       int ret = -1;
>>>> +       bool has_simd_regs = false;
>>>>         uint64_t mask;
>>>> +       uint64_t simd_mask;
>>>> +       uint64_t pred_mask;
>>>> +       int ret = -1;
>>>>
>>>>         if (unset)
>>>>                 return 0;
>>>> @@ -27,10 +147,17 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>         if (*mode)
>>>>                 return -1;
>>>>
>>>> -       if (intr)
>>>> +       if (intr) {
>>>> +               opts = container_of(opt->value, struct record_opts, sample_intr_regs);
>>>>                 mask = arch__intr_reg_mask();
>>>> -       else
>>>> +               simd_mask = arch__intr_simd_reg_mask();
>>>> +               pred_mask = arch__intr_pred_reg_mask();
>>>> +       } else {
>>>> +               opts = container_of(opt->value, struct record_opts, sample_user_regs);
>>>>                 mask = arch__user_reg_mask();
>>>> +               simd_mask = arch__user_simd_reg_mask();
>>>> +               pred_mask = arch__user_pred_reg_mask();
>>>> +       }
>>>>
>>>>         /* str may be NULL in case no arg is passed to -I */
>>>>         if (str) {
>>>> @@ -50,10 +177,24 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>                                         if (r->mask & mask)
>>>>                                                 fprintf(stderr, "%s ", r->name);
>>>>                                 }
>>>> +                               __print_simd_regs(intr, simd_mask);
>>>> +                               __print_pred_regs(intr, pred_mask);
>>>>                                 fputc('\n', stderr);
>>>>                                 /* just printing available regs */
>>>>                                 goto error;
>>>>                         }
>>>> +
>>>> +                       if (simd_mask) {
>>>> +                               has_simd_regs = __parse_simd_regs(opts, s, intr);
>>>> +                               if (has_simd_regs)
>>>> +                                       goto next;
>>>> +                       }
>>>> +                       if (pred_mask) {
>>>> +                               has_simd_regs = __parse_pred_regs(opts, s, intr);
>>>> +                               if (has_simd_regs)
>>>> +                                       goto next;
>>>> +                       }
>>>> +
>>>>                         for (r = arch__sample_reg_masks(); r->name; r++) {
>>>>                                 if ((r->mask & mask) && !strcasecmp(s, r->name))
>>>>                                         break;
>>>> @@ -65,7 +206,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>                         }
>>>>
>>>>                         *mode |= r->mask;
>>>> -
>>>> +next:
>>>>                         if (!p)
>>>>                                 break;
>>>>
>>>> @@ -75,7 +216,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>         ret = 0;
>>>>
>>>>         /* default to all possible regs */
>>>> -       if (*mode == 0)
>>>> +       if (*mode == 0 && !has_simd_regs)
>>>>                 *mode = mask;
>>>>  error:
>>>>         free(os);
>>>> diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
>>>> index 66b666d9ce64..fb0366d050cf 100644
>>>> --- a/tools/perf/util/perf_event_attr_fprintf.c
>>>> +++ b/tools/perf/util/perf_event_attr_fprintf.c
>>>> @@ -360,6 +360,12 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
>>>>         PRINT_ATTRf(aux_start_paused, p_unsigned);
>>>>         PRINT_ATTRf(aux_pause, p_unsigned);
>>>>         PRINT_ATTRf(aux_resume, p_unsigned);
>>>> +       PRINT_ATTRf(sample_simd_pred_reg_qwords, p_unsigned);
>>>> +       PRINT_ATTRf(sample_simd_pred_reg_intr, p_hex);
>>>> +       PRINT_ATTRf(sample_simd_pred_reg_user, p_hex);
>>>> +       PRINT_ATTRf(sample_simd_vec_reg_qwords, p_unsigned);
>>>> +       PRINT_ATTRf(sample_simd_vec_reg_intr, p_hex);
>>>> +       PRINT_ATTRf(sample_simd_vec_reg_user, p_hex);
>>>>
>>>>         return ret;
>>>>  }
>>>> diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
>>>> index 44b90bbf2d07..e8a9fabc92e6 100644
>>>> --- a/tools/perf/util/perf_regs.c
>>>> +++ b/tools/perf/util/perf_regs.c
>>>> @@ -11,6 +11,11 @@ int __weak arch_sdt_arg_parse_op(char *old_op __maybe_unused,
>>>>         return SDT_ARG_SKIP;
>>>>  }
>>>>
>>>> +bool __weak arch_has_simd_regs(u64 mask __maybe_unused)
>>>> +{
>>>> +       return false;
>>>> +}
>>>> +
>>>>  uint64_t __weak arch__intr_reg_mask(void)
>>>>  {
>>>>         return 0;
>>>> @@ -21,6 +26,50 @@ uint64_t __weak arch__user_reg_mask(void)
>>>>         return 0;
>>>>  }
>>>>
>>>> +uint64_t __weak arch__intr_simd_reg_mask(void)
>>>> +{
>>>> +       return 0;
>>>> +}
>>>> +
>>>> +uint64_t __weak arch__user_simd_reg_mask(void)
>>>> +{
>>>> +       return 0;
>>>> +}
>>>> +
>>>> +uint64_t __weak arch__intr_pred_reg_mask(void)
>>>> +{
>>>> +       return 0;
>>>> +}
>>>> +
>>>> +uint64_t __weak arch__user_pred_reg_mask(void)
>>>> +{
>>>> +       return 0;
>>>> +}
>>>> +
>>>> +uint64_t __weak arch__intr_simd_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
>>>> +{
>>>> +       *qwords = 0;
>>>> +       return 0;
>>>> +}
>>>> +
>>>> +uint64_t __weak arch__user_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>>>> +{
>>>> +       *qwords = 0;
>>>> +       return 0;
>>>> +}
>>>> +
>>>> +uint64_t __weak arch__intr_pred_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
>>>> +{
>>>> +       *qwords = 0;
>>>> +       return 0;
>>>> +}
>>>> +
>>>> +uint64_t __weak arch__user_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>>>> +{
>>>> +       *qwords = 0;
>>>> +       return 0;
>>>> +}
>>>> +
>>>>  static const struct sample_reg sample_reg_masks[] = {
>>>>         SMPL_REG_END
>>>>  };
>>>> @@ -30,6 +79,16 @@ const struct sample_reg * __weak arch__sample_reg_masks(void)
>>>>         return sample_reg_masks;
>>>>  }
>>>>
>>>> +const struct sample_reg * __weak arch__sample_simd_reg_masks(void)
>>>> +{
>>>> +       return sample_reg_masks;
>>>> +}
>>>> +
>>>> +const struct sample_reg * __weak arch__sample_pred_reg_masks(void)
>>>> +{
>>>> +       return sample_reg_masks;
>>>> +}
>>>> +
>>>>  const char *perf_reg_name(int id, const char *arch)
>>>>  {
>>>>         const char *reg_name = NULL;
>>>> diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
>>>> index f2d0736d65cc..bce9c4cfd1bf 100644
>>>> --- a/tools/perf/util/perf_regs.h
>>>> +++ b/tools/perf/util/perf_regs.h
>>>> @@ -24,9 +24,20 @@ enum {
>>>>  };
>>>>
>>>>  int arch_sdt_arg_parse_op(char *old_op, char **new_op);
>>>> +bool arch_has_simd_regs(u64 mask);
>>>>  uint64_t arch__intr_reg_mask(void);
>>>>  uint64_t arch__user_reg_mask(void);
>>>>  const struct sample_reg *arch__sample_reg_masks(void);
>>>> +const struct sample_reg *arch__sample_simd_reg_masks(void);
>>>> +const struct sample_reg *arch__sample_pred_reg_masks(void);
>>> I wonder we can remove these functions. perf_reg_name(int id, uint16_t
>>> e_machine) maps a perf register number and e_machine to a string. So
>>> the sample_reg array could be replaced with:
>>> ```
>>> for (int perf_reg = 0; perf_reg < 64; perf_reg++) {
>>>   uint64_t mask = 1LL << perf_reg;
>>>   const char *name = perf_reg_name(perf_reg, EM_HOST);
>>>   if (name == NULL)
>>>     break;
>>>   // use mask and name
>>> ```
>>> To make it work for SIMD and PRED then I guess we need to iterate
>>> through the ABIs of enum perf_sample_regs_abi.
>> Suppose so.
>>
>>
>>>> +uint64_t arch__intr_simd_reg_mask(void);
>>>> +uint64_t arch__user_simd_reg_mask(void);
>>>> +uint64_t arch__intr_pred_reg_mask(void);
>>>> +uint64_t arch__user_pred_reg_mask(void);
>>> I think some comments would be useful here like:
>>> ```
>>> /* Perf register bit map with valid bits for
>>> perf_event_attr.sample_regs_user. */
>>> uint64_t arch__intr_reg_mask(void);
>>> /* Perf register bit map with valid bits for
>>> perf_event_attr.sample_regs_intr. */
>>> uint64_t arch__user_reg_mask(void);
>>> /* Perf register bit map with valid bits for
>>> perf_event_attr.sample_simd_vec_reg_intr. */
>>> uint64_t arch__intr_simd_reg_mask(void);
>>> /* Perf register bit map with valid bits for
>>> perf_event_attr.sample_simd_vec_reg_user. */
>>> uint64_t arch__user_simd_reg_mask(void);
>>> /* Perf register bit map with valid bits for
>>> perf_event_attr.sample_simd_pred_reg_intr. */
>>> uint64_t arch__intr_pred_reg_mask(void);
>>> /* Perf register bit map with valid bits for
>>> perf_event_attr.sample_simd_pred_reg_user. */
>>> uint64_t arch__user_pred_reg_mask(void);
>> Sure. Thanks.
>>
>>
>>> ```
>>>
>>> Why do the arch__user_pred_reg_mask return a uint64_t when the
>>> perf_event_attr variable is a __u32?
>> Suppose it's a bug. :)
>>
>>
>>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>> I don't understand this function. The qwords is specific to a
>>> perf_event_attr. We could have an evlist with an evsel set up to
>>> sample say XMM registers and another evsel set up to sample ZMM
>>> registers. Are the qwords here always for the ZMM case, or is XMM,
>>> YMM, ZMM depending on architecture support? Why does it vary per
>>> register? The surrounding code uses the term mask but here bitmap is
>>> used, is the inconsistency deliberate? Why are there user and intr
>>> functions when in the perf_event_attr there are only
>>> sample_simd_pred_reg_qwords and sample_simd_vec_reg_qwords variables?
>> These 4 functions are designed to get the bitmask and qwords length for a
>> specific SIMD register. E.g., for XMM on x86 platforms, the returned
>> bitmask is 0xffff (xmm0 ~ xmm15) and the qwords length is 2 (128 bits). For
>> ZMM on x86 platforms, if the platform only supports 16 ZMM registers, the
>> returned bitmask is 0xffff (zmm0 ~ zmm15) and the qwords length is 8 (512
>> bits). If the platform supports 32 ZMM registers, the returned bitmask
>> is 0xffffffff (zmm0 ~ zmm31) and the qwords length is 8 (512 bits).
> What is the meaning of reg? In this file it is normally the integer
> index for a bit in the sample_regs_user mask, but for x86 I don't see
> enum perf_event_x86_regs having differing XMM, YMM and ZMM encodings.
> Similarly, is qwords an out argument, but then you also have the
> bitmap. It looks like the code is caching values but that assumes a
> single qword length for all events.

Yes, the "reg" argument indicates the SIMD register index. Strictly
speaking, on x86 the qwords length is fixed for a specific SIMD register
and only the register number can vary, e.g., some platforms support only
16 ZMM registers while others support 32. But since this is a generic
function for all kinds of archs, we can't assume a fixed qwords length for
a specific SIMD register on every arch, so I introduced the "qwords"
argument to increase the flexibility.

No, qwords is assigned the true register length if the register exists on
the platform, e.g., xmm = 2, ymm = 4 and zmm = 8. If the register is not
supported on the platform, qwords is set to 0.
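The convention above can be sketched as a tiny self-contained example
(hypothetical helper and enum names, not from the patch; the bitmaps and
qwords values are the x86 ones quoted in this thread):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: each vector register class reports the bitmap of
 * available registers plus its per-register length in 64-bit qwords;
 * a class not present on the platform reports 0 for both. */
enum simd_reg_kind { REG_XMM, REG_YMM, REG_ZMM, REG_UNSUPPORTED };

static uint64_t simd_reg_bitmap_qwords(enum simd_reg_kind kind,
					uint16_t *qwords)
{
	switch (kind) {
	case REG_XMM:
		*qwords = 2;			/* 128 bits per reg */
		return (1ULL << 16) - 1;	/* xmm0 ~ xmm15     */
	case REG_YMM:
		*qwords = 4;			/* 256 bits per reg */
		return (1ULL << 16) - 1;	/* ymm0 ~ ymm15     */
	case REG_ZMM:
		*qwords = 8;			/* 512 bits per reg */
		return (1ULL << 32) - 1;	/* zmm0 ~ zmm31     */
	default:
		*qwords = 0;			/* class not present */
		return 0;
	}
}
```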


>
>>> Since the qwords length is always fixed for a given SIMD register
>>> regardless of intr or user, there is only one
>>> sample_simd_pred_reg_qwords or sample_simd_vec_reg_qwords variable.
> Ok.  2 variables, but 4 functions here. I think there should just be 2
> because of this.

Yes, the user and intr variants would be merged into only one.
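For what it's worth, the "just need the highest qwords" selection that
__parse_simd_regs() performs can be shown in isolation (invented struct
and function names, not the patch's):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch: when the user requests several overlapping
 * vector classes, e.g. XMM and YMM, keep only the widest one, i.e.
 * the class with the largest per-register qwords length. */
struct vec_sel {
	uint64_t regs;		/* bitmap of the chosen class    */
	uint16_t qwords;	/* per-register length in qwords */
};

static void keep_widest(struct vec_sel *sel, uint64_t bitmap,
			uint16_t qwords)
{
	/* A wider class supersedes a narrower one. */
	if (qwords > sel->qwords) {
		sel->qwords = qwords;
		sel->regs = bitmap;
	}
}
```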


>
> Thanks,
> Ian
>
>>> Perhaps these functions should be something more like:
>>> ```
>>> /* Maximum value that can be assigned to
>>> perf_event_atttr.sample_simd_pred_reg_qwords. */
>>> uint16_t arch__simd_pred_reg_qwords_max(void);
>>> /* Maximum value that can be assigned to
>>> perf_event_atttr.sample_simd_vec_reg_qwords. */
>>> uint16_t arch__simd_vec_reg_qwords_max(void);
>>> ```
>>> Then the bitmap computation logic can all be moved into parse-regs-options.c.
>>>
>>> Thanks,
>>> Ian
>>>
>>>>  const char *perf_reg_name(int id, const char *arch);
>>>>  int perf_reg_value(u64 *valp, struct regs_dump *regs, int id);
>>>> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
>>>> index ea3a6c4657ee..825ffb4cc53f 100644
>>>> --- a/tools/perf/util/record.h
>>>> +++ b/tools/perf/util/record.h
>>>> @@ -59,7 +59,13 @@ struct record_opts {
>>>>         unsigned int  user_freq;
>>>>         u64           branch_stack;
>>>>         u64           sample_intr_regs;
>>>> +       u64           sample_intr_vec_regs;
>>>>         u64           sample_user_regs;
>>>> +       u64           sample_user_vec_regs;
>>>> +       u16           sample_pred_regs_qwords;
>>>> +       u16           sample_vec_regs_qwords;
>>>> +       u16           sample_intr_pred_regs;
>>>> +       u16           sample_user_pred_regs;
>>>>         u64           default_interval;
>>>>         u64           user_interval;
>>>>         size_t        auxtrace_snapshot_size;
>>>> --
>>>> 2.34.1
>>>>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
  2026-01-21  5:17         ` Mi, Dapeng
@ 2026-01-21  7:09           ` Ian Rogers
  2026-01-21  7:52             ` Mi, Dapeng
  0 siblings, 1 reply; 86+ messages in thread
From: Ian Rogers @ 2026-01-21  7:09 UTC (permalink / raw)
  To: Mi, Dapeng
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang

On Tue, Jan 20, 2026 at 9:17 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>
>
> On 1/21/2026 2:20 AM, Ian Rogers wrote:
> > On Tue, Jan 20, 2026 at 1:04 AM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> >>
> >> On 1/20/2026 3:39 PM, Ian Rogers wrote:
> >>> On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
> >>>> From: Kan Liang <kan.liang@linux.intel.com>
> >>>>
> >>>> This patch adds support for the newly introduced SIMD register sampling
> >>>> format by adding the following functions:
> >>>>
> >>>> uint64_t arch__intr_simd_reg_mask(void);
> >>>> uint64_t arch__user_simd_reg_mask(void);
> >>>> uint64_t arch__intr_pred_reg_mask(void);
> >>>> uint64_t arch__user_pred_reg_mask(void);
> >>>> uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>> uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>> uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>> uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>
> >>>> The arch__{intr|user}_simd_reg_mask() functions retrieve the bitmap of
> >>>> supported SIMD registers, such as XMM/YMM/ZMM on x86 platforms.
> >>>>
> >>>> The arch__{intr|user}_pred_reg_mask() functions retrieve the bitmap of
> >>>> supported PRED registers, such as OPMASK on x86 platforms.
> >>>>
> >>>> The arch__{intr|user}_simd_reg_bitmap_qwords() functions provide the
> >>>> exact bitmap and number of qwords for a specific type of SIMD register.
> >>>> For example, for XMM registers on x86 platforms, the returned bitmap is
> >>>> 0xffff (XMM0 ~ XMM15) and the qwords number is 2 (128 bits for each XMM).
> >>>>
> >>>> The arch__{intr|user}_pred_reg_bitmap_qwords() functions provide the
> >>>> exact bitmap and number of qwords for a specific type of PRED register.
> >>>> For example, for OPMASK registers on x86 platforms, the returned bitmap
> >>>> is 0xff (OPMASK0 ~ OPMASK7) and the qwords number is 1 (64 bits for each
> >>>> OPMASK).
> >>>>
> >>>> Additionally, the function __parse_regs() is enhanced to support parsing
> >>>> these newly introduced SIMD registers. Currently, each type of register
> >>>> can only be sampled collectively; sampling a specific SIMD register is
> >>>> not supported. For example, all XMM registers are sampled together rather
> >>>> than sampling only XMM0.
> >>>>
> >>>> When multiple overlapping register types, such as XMM and YMM, are
> >>>> sampled simultaneously, only the superset (YMM registers) is sampled.
> >>>>
> >>>> With this patch, all supported sampling registers on x86 platforms are
> >>>> displayed as follows.
> >>>>
> >>>>  $perf record -I?
> >>>>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
> >>>>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
> >>>>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
> >>>>
> >>>>  $perf record --user-regs=?
> >>>>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
> >>>>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
> >>>>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
> >>>>
> >>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> >>>> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >>>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >>>> ---
> >>>>  tools/perf/arch/x86/util/perf_regs.c      | 470 +++++++++++++++++++++-
> >>>>  tools/perf/util/evsel.c                   |  27 ++
> >>>>  tools/perf/util/parse-regs-options.c      | 151 ++++++-
> >>>>  tools/perf/util/perf_event_attr_fprintf.c |   6 +
> >>>>  tools/perf/util/perf_regs.c               |  59 +++
> >>>>  tools/perf/util/perf_regs.h               |  11 +
> >>>>  tools/perf/util/record.h                  |   6 +
> >>>>  7 files changed, 714 insertions(+), 16 deletions(-)
> >>>>
> >>>> diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
> >>>> index 12fd93f04802..db41430f3b07 100644
> >>>> --- a/tools/perf/arch/x86/util/perf_regs.c
> >>>> +++ b/tools/perf/arch/x86/util/perf_regs.c
> >>>> @@ -13,6 +13,49 @@
> >>>>  #include "../../../util/pmu.h"
> >>>>  #include "../../../util/pmus.h"
> >>>>
> >>>> +static const struct sample_reg sample_reg_masks_ext[] = {
> >>>> +       SMPL_REG(AX, PERF_REG_X86_AX),
> >>>> +       SMPL_REG(BX, PERF_REG_X86_BX),
> >>>> +       SMPL_REG(CX, PERF_REG_X86_CX),
> >>>> +       SMPL_REG(DX, PERF_REG_X86_DX),
> >>>> +       SMPL_REG(SI, PERF_REG_X86_SI),
> >>>> +       SMPL_REG(DI, PERF_REG_X86_DI),
> >>>> +       SMPL_REG(BP, PERF_REG_X86_BP),
> >>>> +       SMPL_REG(SP, PERF_REG_X86_SP),
> >>>> +       SMPL_REG(IP, PERF_REG_X86_IP),
> >>>> +       SMPL_REG(FLAGS, PERF_REG_X86_FLAGS),
> >>>> +       SMPL_REG(CS, PERF_REG_X86_CS),
> >>>> +       SMPL_REG(SS, PERF_REG_X86_SS),
> >>>> +#ifdef HAVE_ARCH_X86_64_SUPPORT
> >>>> +       SMPL_REG(R8, PERF_REG_X86_R8),
> >>>> +       SMPL_REG(R9, PERF_REG_X86_R9),
> >>>> +       SMPL_REG(R10, PERF_REG_X86_R10),
> >>>> +       SMPL_REG(R11, PERF_REG_X86_R11),
> >>>> +       SMPL_REG(R12, PERF_REG_X86_R12),
> >>>> +       SMPL_REG(R13, PERF_REG_X86_R13),
> >>>> +       SMPL_REG(R14, PERF_REG_X86_R14),
> >>>> +       SMPL_REG(R15, PERF_REG_X86_R15),
> >>>> +       SMPL_REG(R16, PERF_REG_X86_R16),
> >>>> +       SMPL_REG(R17, PERF_REG_X86_R17),
> >>>> +       SMPL_REG(R18, PERF_REG_X86_R18),
> >>>> +       SMPL_REG(R19, PERF_REG_X86_R19),
> >>>> +       SMPL_REG(R20, PERF_REG_X86_R20),
> >>>> +       SMPL_REG(R21, PERF_REG_X86_R21),
> >>>> +       SMPL_REG(R22, PERF_REG_X86_R22),
> >>>> +       SMPL_REG(R23, PERF_REG_X86_R23),
> >>>> +       SMPL_REG(R24, PERF_REG_X86_R24),
> >>>> +       SMPL_REG(R25, PERF_REG_X86_R25),
> >>>> +       SMPL_REG(R26, PERF_REG_X86_R26),
> >>>> +       SMPL_REG(R27, PERF_REG_X86_R27),
> >>>> +       SMPL_REG(R28, PERF_REG_X86_R28),
> >>>> +       SMPL_REG(R29, PERF_REG_X86_R29),
> >>>> +       SMPL_REG(R30, PERF_REG_X86_R30),
> >>>> +       SMPL_REG(R31, PERF_REG_X86_R31),
> >>>> +       SMPL_REG(SSP, PERF_REG_X86_SSP),
> >>>> +#endif
> >>>> +       SMPL_REG_END
> >>>> +};
> >>>> +
> >>>>  static const struct sample_reg sample_reg_masks[] = {
> >>>>         SMPL_REG(AX, PERF_REG_X86_AX),
> >>>>         SMPL_REG(BX, PERF_REG_X86_BX),
> >>>> @@ -276,27 +319,404 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
> >>>>         return SDT_ARG_VALID;
> >>>>  }
> >>>>
> >>>> +static bool support_simd_reg(u64 sample_type, u16 qwords, u64 mask, bool pred)
> >>>> +{
> >>>> +       struct perf_event_attr attr = {
> >>>> +               .type                           = PERF_TYPE_HARDWARE,
> >>>> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
> >>>> +               .sample_type                    = sample_type,
> >>>> +               .disabled                       = 1,
> >>>> +               .exclude_kernel                 = 1,
> >>>> +               .sample_simd_regs_enabled       = 1,
> >>>> +       };
> >>>> +       int fd;
> >>>> +
> >>>> +       attr.sample_period = 1;
> >>>> +
> >>>> +       if (!pred) {
> >>>> +               attr.sample_simd_vec_reg_qwords = qwords;
> >>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>> +                       attr.sample_simd_vec_reg_intr = mask;
> >>>> +               else
> >>>> +                       attr.sample_simd_vec_reg_user = mask;
> >>>> +       } else {
> >>>> +               attr.sample_simd_pred_reg_qwords = PERF_X86_OPMASK_QWORDS;
> >>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>> +                       attr.sample_simd_pred_reg_intr = PERF_X86_SIMD_PRED_MASK;
> >>>> +               else
> >>>> +                       attr.sample_simd_pred_reg_user = PERF_X86_SIMD_PRED_MASK;
> >>>> +       }
> >>>> +
> >>>> +       if (perf_pmus__num_core_pmus() > 1) {
> >>>> +               struct perf_pmu *pmu = NULL;
> >>>> +               __u64 type = PERF_TYPE_RAW;
> >>>> +
> >>>> +               /*
> >>>> +                * The same register set is supported among different hybrid PMUs.
> >>>> +                * Only check the first available one.
> >>>> +                */
> >>>> +               while ((pmu = perf_pmus__scan_core(pmu)) != NULL) {
> >>>> +                       type = pmu->type;
> >>>> +                       break;
> >>>> +               }
> >>>> +               attr.config |= type << PERF_PMU_TYPE_SHIFT;
> >>>> +       }
> >>>> +
> >>>> +       event_attr_init(&attr);
> >>>> +
> >>>> +       fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
> >>>> +       if (fd != -1) {
> >>>> +               close(fd);
> >>>> +               return true;
> >>>> +       }
> >>>> +
> >>>> +       return false;
> >>>> +}
> >>>> +
> >>>> +static bool __arch_simd_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
> >>>> +{
> >>>> +       bool supported = false;
> >>>> +       u64 bits;
> >>>> +
> >>>> +       *mask = 0;
> >>>> +       *qwords = 0;
> >>>> +
> >>>> +       switch (reg) {
> >>>> +       case PERF_REG_X86_XMM:
> >>>> +               bits = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
> >>>> +               supported = support_simd_reg(sample_type, PERF_X86_XMM_QWORDS, bits, false);
> >>>> +               if (supported) {
> >>>> +                       *mask = bits;
> >>>> +                       *qwords = PERF_X86_XMM_QWORDS;
> >>>> +               }
> >>>> +               break;
> >>>> +       case PERF_REG_X86_YMM:
> >>>> +               bits = BIT_ULL(PERF_X86_SIMD_YMM_REGS) - 1;
> >>>> +               supported = support_simd_reg(sample_type, PERF_X86_YMM_QWORDS, bits, false);
> >>>> +               if (supported) {
> >>>> +                       *mask = bits;
> >>>> +                       *qwords = PERF_X86_YMM_QWORDS;
> >>>> +               }
> >>>> +               break;
> >>>> +       case PERF_REG_X86_ZMM:
> >>>> +               bits = BIT_ULL(PERF_X86_SIMD_ZMM_REGS) - 1;
> >>>> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
> >>>> +               if (supported) {
> >>>> +                       *mask = bits;
> >>>> +                       *qwords = PERF_X86_ZMM_QWORDS;
> >>>> +                       break;
> >>>> +               }
> >>>> +
> >>>> +               bits = BIT_ULL(PERF_X86_SIMD_ZMMH_REGS) - 1;
> >>>> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
> >>>> +               if (supported) {
> >>>> +                       *mask = bits;
> >>>> +                       *qwords = PERF_X86_ZMMH_QWORDS;
> >>>> +               }
> >>>> +               break;
> >>>> +       default:
> >>>> +               break;
> >>>> +       }
> >>>> +
> >>>> +       return supported;
> >>>> +}
> >>>> +
> >>>> +static bool __arch_pred_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
> >>>> +{
> >>>> +       bool supported = false;
> >>>> +       u64 bits;
> >>>> +
> >>>> +       *mask = 0;
> >>>> +       *qwords = 0;
> >>>> +
> >>>> +       switch (reg) {
> >>>> +       case PERF_REG_X86_OPMASK:
> >>>> +               bits = BIT_ULL(PERF_X86_SIMD_OPMASK_REGS) - 1;
> >>>> +               supported = support_simd_reg(sample_type, PERF_X86_OPMASK_QWORDS, bits, true);
> >>>> +               if (supported) {
> >>>> +                       *mask = bits;
> >>>> +                       *qwords = PERF_X86_OPMASK_QWORDS;
> >>>> +               }
> >>>> +               break;
> >>>> +       default:
> >>>> +               break;
> >>>> +       }
> >>>> +
> >>>> +       return supported;
> >>>> +}
> >>>> +
> >>>> +static bool has_cap_simd_regs(void)
> >>>> +{
> >>>> +       uint64_t mask = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
> >>>> +       u16 qwords = PERF_X86_XMM_QWORDS;
> >>>> +       static bool has_cap_simd_regs;
> >>>> +       static bool cached;
> >>>> +
> >>>> +       if (cached)
> >>>> +               return has_cap_simd_regs;
> >>>> +
> >>>> +       has_cap_simd_regs = __arch_simd_reg_mask(PERF_SAMPLE_REGS_INTR,
> >>>> +                                                PERF_REG_X86_XMM, &mask, &qwords);
> >>>> +       has_cap_simd_regs |= __arch_simd_reg_mask(PERF_SAMPLE_REGS_USER,
> >>>> +                                                PERF_REG_X86_XMM, &mask, &qwords);
> >>>> +       cached = true;
> >>>> +
> >>>> +       return has_cap_simd_regs;
> >>>> +}
> >>>> +
> >>>> +bool arch_has_simd_regs(u64 mask)
> >>>> +{
> >>>> +       return has_cap_simd_regs() &&
> >>>> +              mask & GENMASK_ULL(PERF_REG_X86_SSP, PERF_REG_X86_R16);
> >>>> +}
> >>>> +
> >>>> +static const struct sample_reg sample_simd_reg_masks[] = {
> >>>> +       SMPL_REG(XMM, PERF_REG_X86_XMM),
> >>>> +       SMPL_REG(YMM, PERF_REG_X86_YMM),
> >>>> +       SMPL_REG(ZMM, PERF_REG_X86_ZMM),
> >>>> +       SMPL_REG_END
> >>>> +};
> >>>> +
> >>>> +static const struct sample_reg sample_pred_reg_masks[] = {
> >>>> +       SMPL_REG(OPMASK, PERF_REG_X86_OPMASK),
> >>>> +       SMPL_REG_END
> >>>> +};
> >>>> +
> >>>> +const struct sample_reg *arch__sample_simd_reg_masks(void)
> >>>> +{
> >>>> +       return sample_simd_reg_masks;
> >>>> +}
> >>>> +
> >>>> +const struct sample_reg *arch__sample_pred_reg_masks(void)
> >>>> +{
> >>>> +       return sample_pred_reg_masks;
> >>>> +}
> >>>> +
> >>>> +static bool x86_intr_simd_updated;
> >>>> +static u64 x86_intr_simd_reg_mask;
> >>>> +static u64 x86_intr_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
> >>>> +static u16 x86_intr_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
> >>>> +static bool x86_user_simd_updated;
> >>>> +static u64 x86_user_simd_reg_mask;
> >>>> +static u64 x86_user_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
> >>>> +static u16 x86_user_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
> >>>> +
> >>>> +static bool x86_intr_pred_updated;
> >>>> +static u64 x86_intr_pred_reg_mask;
> >>>> +static u64 x86_intr_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
> >>>> +static u16 x86_intr_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
> >>>> +static bool x86_user_pred_updated;
> >>>> +static u64 x86_user_pred_reg_mask;
> >>>> +static u64 x86_user_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
> >>>> +static u16 x86_user_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
> >>>> +
> >>>> +static uint64_t __arch__simd_reg_mask(u64 sample_type)
> >>>> +{
> >>>> +       const struct sample_reg *r = NULL;
> >>>> +       bool supported;
> >>>> +       u64 mask = 0;
> >>>> +       int reg;
> >>>> +
> >>>> +       if (!has_cap_simd_regs())
> >>>> +               return 0;
> >>>> +
> >>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_simd_updated)
> >>>> +               return x86_intr_simd_reg_mask;
> >>>> +
> >>>> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_simd_updated)
> >>>> +               return x86_user_simd_reg_mask;
> >>>> +
> >>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> >>>> +               supported = false;
> >>>> +
> >>>> +               if (!r->mask)
> >>>> +                       continue;
> >>>> +               reg = fls64(r->mask) - 1;
> >>>> +
> >>>> +               if (reg >= PERF_REG_X86_MAX_SIMD_REGS)
> >>>> +                       break;
> >>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>> +                       supported = __arch_simd_reg_mask(sample_type, reg,
> >>>> +                                                        &x86_intr_simd_mask[reg],
> >>>> +                                                        &x86_intr_simd_qwords[reg]);
> >>>> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
> >>>> +                       supported = __arch_simd_reg_mask(sample_type, reg,
> >>>> +                                                        &x86_user_simd_mask[reg],
> >>>> +                                                        &x86_user_simd_qwords[reg]);
> >>>> +               if (supported)
> >>>> +                       mask |= BIT_ULL(reg);
> >>>> +       }
> >>>> +
> >>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
> >>>> +               x86_intr_simd_reg_mask = mask;
> >>>> +               x86_intr_simd_updated = true;
> >>>> +       } else {
> >>>> +               x86_user_simd_reg_mask = mask;
> >>>> +               x86_user_simd_updated = true;
> >>>> +       }
> >>>> +
> >>>> +       return mask;
> >>>> +}
> >>>> +
> >>>> +static uint64_t __arch__pred_reg_mask(u64 sample_type)
> >>>> +{
> >>>> +       const struct sample_reg *r = NULL;
> >>>> +       bool supported;
> >>>> +       u64 mask = 0;
> >>>> +       int reg;
> >>>> +
> >>>> +       if (!has_cap_simd_regs())
> >>>> +               return 0;
> >>>> +
> >>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_pred_updated)
> >>>> +               return x86_intr_pred_reg_mask;
> >>>> +
> >>>> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_pred_updated)
> >>>> +               return x86_user_pred_reg_mask;
> >>>> +
> >>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> >>>> +               supported = false;
> >>>> +
> >>>> +               if (!r->mask)
> >>>> +                       continue;
> >>>> +               reg = fls64(r->mask) - 1;
> >>>> +
> >>>> +               if (reg >= PERF_REG_X86_MAX_PRED_REGS)
> >>>> +                       break;
> >>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>> +                       supported = __arch_pred_reg_mask(sample_type, reg,
> >>>> +                                                        &x86_intr_pred_mask[reg],
> >>>> +                                                        &x86_intr_pred_qwords[reg]);
> >>>> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
> >>>> +                       supported = __arch_pred_reg_mask(sample_type, reg,
> >>>> +                                                        &x86_user_pred_mask[reg],
> >>>> +                                                        &x86_user_pred_qwords[reg]);
> >>>> +               if (supported)
> >>>> +                       mask |= BIT_ULL(reg);
> >>>> +       }
> >>>> +
> >>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
> >>>> +               x86_intr_pred_reg_mask = mask;
> >>>> +               x86_intr_pred_updated = true;
> >>>> +       } else {
> >>>> +               x86_user_pred_reg_mask = mask;
> >>>> +               x86_user_pred_updated = true;
> >>>> +       }
> >>>> +
> >>>> +       return mask;
> >>>> +}
> >>>> +
> >>>> +uint64_t arch__intr_simd_reg_mask(void)
> >>>> +{
> >>>> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_INTR);
> >>>> +}
> >>>> +
> >>>> +uint64_t arch__user_simd_reg_mask(void)
> >>>> +{
> >>>> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_USER);
> >>>> +}
> >>>> +
> >>>> +uint64_t arch__intr_pred_reg_mask(void)
> >>>> +{
> >>>> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_INTR);
> >>>> +}
> >>>> +
> >>>> +uint64_t arch__user_pred_reg_mask(void)
> >>>> +{
> >>>> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_USER);
> >>>> +}
> >>>> +
> >>>> +static uint64_t arch__simd_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
> >>>> +{
> >>>> +       uint64_t mask = 0;
> >>>> +
> >>>> +       *qwords = 0;
> >>>> +       if (reg < PERF_REG_X86_MAX_SIMD_REGS) {
> >>>> +               if (intr) {
> >>>> +                       *qwords = x86_intr_simd_qwords[reg];
> >>>> +                       mask = x86_intr_simd_mask[reg];
> >>>> +               } else {
> >>>> +                       *qwords = x86_user_simd_qwords[reg];
> >>>> +                       mask = x86_user_simd_mask[reg];
> >>>> +               }
> >>>> +       }
> >>>> +
> >>>> +       return mask;
> >>>> +}
> >>>> +
> >>>> +static uint64_t arch__pred_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
> >>>> +{
> >>>> +       uint64_t mask = 0;
> >>>> +
> >>>> +       *qwords = 0;
> >>>> +       if (reg < PERF_REG_X86_MAX_PRED_REGS) {
> >>>> +               if (intr) {
> >>>> +                       *qwords = x86_intr_pred_qwords[reg];
> >>>> +                       mask = x86_intr_pred_mask[reg];
> >>>> +               } else {
> >>>> +                       *qwords = x86_user_pred_qwords[reg];
> >>>> +                       mask = x86_user_pred_mask[reg];
> >>>> +               }
> >>>> +       }
> >>>> +
> >>>> +       return mask;
> >>>> +}
> >>>> +
> >>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords)
> >>>> +{
> >>>> +       if (!x86_intr_simd_updated)
> >>>> +               arch__intr_simd_reg_mask();
> >>>> +       return arch__simd_reg_bitmap_qwords(reg, qwords, true);
> >>>> +}
> >>>> +
> >>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords)
> >>>> +{
> >>>> +       if (!x86_user_simd_updated)
> >>>> +               arch__user_simd_reg_mask();
> >>>> +       return arch__simd_reg_bitmap_qwords(reg, qwords, false);
> >>>> +}
> >>>> +
> >>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords)
> >>>> +{
> >>>> +       if (!x86_intr_pred_updated)
> >>>> +               arch__intr_pred_reg_mask();
> >>>> +       return arch__pred_reg_bitmap_qwords(reg, qwords, true);
> >>>> +}
> >>>> +
> >>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords)
> >>>> +{
> >>>> +       if (!x86_user_pred_updated)
> >>>> +               arch__user_pred_reg_mask();
> >>>> +       return arch__pred_reg_bitmap_qwords(reg, qwords, false);
> >>>> +}
> >>>> +
> >>>>  const struct sample_reg *arch__sample_reg_masks(void)
> >>>>  {
> >>>> +       if (has_cap_simd_regs())
> >>>> +               return sample_reg_masks_ext;
> >>>>         return sample_reg_masks;
> >>>>  }
> >>>>
> >>>> -uint64_t arch__intr_reg_mask(void)
> >>>> +static uint64_t __arch__reg_mask(u64 sample_type, u64 mask, bool has_simd_regs)
> >>>>  {
> >>>>         struct perf_event_attr attr = {
> >>>> -               .type                   = PERF_TYPE_HARDWARE,
> >>>> -               .config                 = PERF_COUNT_HW_CPU_CYCLES,
> >>>> -               .sample_type            = PERF_SAMPLE_REGS_INTR,
> >>>> -               .sample_regs_intr       = PERF_REG_EXTENDED_MASK,
> >>>> -               .precise_ip             = 1,
> >>>> -               .disabled               = 1,
> >>>> -               .exclude_kernel         = 1,
> >>>> +               .type                           = PERF_TYPE_HARDWARE,
> >>>> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
> >>>> +               .sample_type                    = sample_type,
> >>>> +               .precise_ip                     = 1,
> >>>> +               .disabled                       = 1,
> >>>> +               .exclude_kernel                 = 1,
> >>>> +               .sample_simd_regs_enabled       = has_simd_regs,
> >>>>         };
> >>>>         int fd;
> >>>>         /*
> >>>>          * In an unnamed union, init it here to build on older gcc versions
> >>>>          */
> >>>>         attr.sample_period = 1;
> >>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>> +               attr.sample_regs_intr = mask;
> >>>> +       else
> >>>> +               attr.sample_regs_user = mask;
> >>>>
> >>>>         if (perf_pmus__num_core_pmus() > 1) {
> >>>>                 struct perf_pmu *pmu = NULL;
> >>>> @@ -318,13 +738,41 @@ uint64_t arch__intr_reg_mask(void)
> >>>>         fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
> >>>>         if (fd != -1) {
> >>>>                 close(fd);
> >>>> -               return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
> >>>> +               return mask;
> >>>>         }
> >>>>
> >>>> -       return PERF_REGS_MASK;
> >>>> +       return 0;
> >>>> +}
> >>>> +
> >>>> +uint64_t arch__intr_reg_mask(void)
> >>>> +{
> >>>> +       uint64_t mask = PERF_REGS_MASK;
> >>>> +
> >>>> +       if (has_cap_simd_regs()) {
> >>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
> >>>> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
> >>>> +                                        true);
> >>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
> >>>> +                                        BIT_ULL(PERF_REG_X86_SSP),
> >>>> +                                        true);
> >>>> +       } else
> >>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR, PERF_REG_EXTENDED_MASK, false);
> >>>> +
> >>>> +       return mask;
> >>>>  }
> >>>>
> >>>>  uint64_t arch__user_reg_mask(void)
> >>>>  {
> >>>> -       return PERF_REGS_MASK;
> >>>> +       uint64_t mask = PERF_REGS_MASK;
> >>>> +
> >>>> +       if (has_cap_simd_regs()) {
> >>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
> >>>> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
> >>>> +                                        true);
> >>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
> >>>> +                                        BIT_ULL(PERF_REG_X86_SSP),
> >>>> +                                        true);
> >>>> +       }
> >>>> +
> >>>> +       return mask;
> >>>>  }
> >>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> >>>> index 56ebefd075f2..5d1d90cf9488 100644
> >>>> --- a/tools/perf/util/evsel.c
> >>>> +++ b/tools/perf/util/evsel.c
> >>>> @@ -1461,12 +1461,39 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
> >>>>         if (opts->sample_intr_regs && !evsel->no_aux_samples &&
> >>>>             !evsel__is_dummy_event(evsel)) {
> >>>>                 attr->sample_regs_intr = opts->sample_intr_regs;
> >>>> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_intr);
> >>>> +               evsel__set_sample_bit(evsel, REGS_INTR);
> >>>> +       }
> >>>> +
> >>>> +       if ((opts->sample_intr_vec_regs || opts->sample_intr_pred_regs) &&
> >>>> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
> >>>> +               /* A non-zero pred qwords value implies the SIMD register set is used */
> >>>> +               if (opts->sample_pred_regs_qwords)
> >>>> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
> >>>> +               else
> >>>> +                       attr->sample_simd_pred_reg_qwords = 1;
> >>>> +               attr->sample_simd_vec_reg_intr = opts->sample_intr_vec_regs;
> >>>> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
> >>>> +               attr->sample_simd_pred_reg_intr = opts->sample_intr_pred_regs;
> >>>>                 evsel__set_sample_bit(evsel, REGS_INTR);
> >>>>         }
> >>>>
> >>>>         if (opts->sample_user_regs && !evsel->no_aux_samples &&
> >>>>             !evsel__is_dummy_event(evsel)) {
> >>>>                 attr->sample_regs_user |= opts->sample_user_regs;
> >>>> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_user);
> >>>> +               evsel__set_sample_bit(evsel, REGS_USER);
> >>>> +       }
> >>>> +
> >>>> +       if ((opts->sample_user_vec_regs || opts->sample_user_pred_regs) &&
> >>>> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
> >>>> +               if (opts->sample_pred_regs_qwords)
> >>>> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
> >>>> +               else
> >>>> +                       attr->sample_simd_pred_reg_qwords = 1;
> >>>> +               attr->sample_simd_vec_reg_user = opts->sample_user_vec_regs;
> >>>> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
> >>>> +               attr->sample_simd_pred_reg_user = opts->sample_user_pred_regs;
> >>>>                 evsel__set_sample_bit(evsel, REGS_USER);
> >>>>         }
> >>>>
> >>>> diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
> >>>> index cda1c620968e..0bd100392889 100644
> >>>> --- a/tools/perf/util/parse-regs-options.c
> >>>> +++ b/tools/perf/util/parse-regs-options.c
> >>>> @@ -4,19 +4,139 @@
> >>>>  #include <stdint.h>
> >>>>  #include <string.h>
> >>>>  #include <stdio.h>
> >>>> +#include <linux/bitops.h>
> >>>>  #include "util/debug.h"
> >>>>  #include <subcmd/parse-options.h>
> >>>>  #include "util/perf_regs.h"
> >>>>  #include "util/parse-regs-options.h"
> >>>> +#include "record.h"
> >>>> +
> >>>> +static void __print_simd_regs(bool intr, uint64_t simd_mask)
> >>>> +{
> >>>> +       const struct sample_reg *r = NULL;
> >>>> +       uint64_t bitmap = 0;
> >>>> +       u16 qwords = 0;
> >>>> +       int reg_idx;
> >>>> +
> >>>> +       if (!simd_mask)
> >>>> +               return;
> >>>> +
> >>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> >>>> +               if (!(r->mask & simd_mask))
> >>>> +                       continue;
> >>>> +               reg_idx = fls64(r->mask) - 1;
> >>>> +               if (intr)
> >>>> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >>>> +               else
> >>>> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >>>> +               if (bitmap)
> >>>> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
> >>>> +       }
> >>>> +}
> >>>> +
> >>>> +static void __print_pred_regs(bool intr, uint64_t pred_mask)
> >>>> +{
> >>>> +       const struct sample_reg *r = NULL;
> >>>> +       uint64_t bitmap = 0;
> >>>> +       u16 qwords = 0;
> >>>> +       int reg_idx;
> >>>> +
> >>>> +       if (!pred_mask)
> >>>> +               return;
> >>>> +
> >>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> >>>> +               if (!(r->mask & pred_mask))
> >>>> +                       continue;
> >>>> +               reg_idx = fls64(r->mask) - 1;
> >>>> +               if (intr)
> >>>> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >>>> +               else
> >>>> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >>>> +               if (bitmap)
> >>>> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
> >>>> +       }
> >>>> +}
> >>>> +
> >>>> +static bool __parse_simd_regs(struct record_opts *opts, char *s, bool intr)
> >>>> +{
> >>>> +       const struct sample_reg *r = NULL;
> >>>> +       bool matched = false;
> >>>> +       uint64_t bitmap = 0;
> >>>> +       u16 qwords = 0;
> >>>> +       int reg_idx;
> >>>> +
> >>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> >>>> +               if (strcasecmp(s, r->name))
> >>>> +                       continue;
> >>>> +               if (!fls64(r->mask))
> >>>> +                       continue;
> >>>> +               reg_idx = fls64(r->mask) - 1;
> >>>> +               if (intr)
> >>>> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >>>> +               else
> >>>> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >>>> +               matched = true;
> >>>> +               break;
> >>>> +       }
> >>>> +
> >>>> +       /* Just need the highest qwords */
> >>>> +       if (qwords > opts->sample_vec_regs_qwords) {
> >>>> +               opts->sample_vec_regs_qwords = qwords;
> >>>> +               if (intr)
> >>>> +                       opts->sample_intr_vec_regs = bitmap;
> >>>> +               else
> >>>> +                       opts->sample_user_vec_regs = bitmap;
> >>>> +       }
> >>>> +
> >>>> +       return matched;
> >>>> +}
> >>>> +
> >>>> +static bool __parse_pred_regs(struct record_opts *opts, char *s, bool intr)
> >>>> +{
> >>>> +       const struct sample_reg *r = NULL;
> >>>> +       bool matched = false;
> >>>> +       uint64_t bitmap = 0;
> >>>> +       u16 qwords = 0;
> >>>> +       int reg_idx;
> >>>> +
> >>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> >>>> +               if (strcasecmp(s, r->name))
> >>>> +                       continue;
> >>>> +               if (!fls64(r->mask))
> >>>> +                       continue;
> >>>> +               reg_idx = fls64(r->mask) - 1;
> >>>> +               if (intr)
> >>>> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >>>> +               else
> >>>> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >>>> +               matched = true;
> >>>> +               break;
> >>>> +       }
> >>>> +
> >>>> +       /* Just need the highest qwords */
> >>>> +       if (qwords > opts->sample_pred_regs_qwords) {
> >>>> +               opts->sample_pred_regs_qwords = qwords;
> >>>> +               if (intr)
> >>>> +                       opts->sample_intr_pred_regs = bitmap;
> >>>> +               else
> >>>> +                       opts->sample_user_pred_regs = bitmap;
> >>>> +       }
> >>>> +
> >>>> +       return matched;
> >>>> +}
> >>>>
> >>>>  static int
> >>>>  __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>  {
> >>>>         uint64_t *mode = (uint64_t *)opt->value;
> >>>>         const struct sample_reg *r = NULL;
> >>>> +       struct record_opts *opts;
> >>>>         char *s, *os = NULL, *p;
> >>>> -       int ret = -1;
> >>>> +       bool has_simd_regs = false;
> >>>>         uint64_t mask;
> >>>> +       uint64_t simd_mask;
> >>>> +       uint64_t pred_mask;
> >>>> +       int ret = -1;
> >>>>
> >>>>         if (unset)
> >>>>                 return 0;
> >>>> @@ -27,10 +147,17 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>         if (*mode)
> >>>>                 return -1;
> >>>>
> >>>> -       if (intr)
> >>>> +       if (intr) {
> >>>> +               opts = container_of(opt->value, struct record_opts, sample_intr_regs);
> >>>>                 mask = arch__intr_reg_mask();
> >>>> -       else
> >>>> +               simd_mask = arch__intr_simd_reg_mask();
> >>>> +               pred_mask = arch__intr_pred_reg_mask();
> >>>> +       } else {
> >>>> +               opts = container_of(opt->value, struct record_opts, sample_user_regs);
> >>>>                 mask = arch__user_reg_mask();
> >>>> +               simd_mask = arch__user_simd_reg_mask();
> >>>> +               pred_mask = arch__user_pred_reg_mask();
> >>>> +       }
> >>>>
> >>>>         /* str may be NULL in case no arg is passed to -I */
> >>>>         if (str) {
> >>>> @@ -50,10 +177,24 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>                                         if (r->mask & mask)
> >>>>                                                 fprintf(stderr, "%s ", r->name);
> >>>>                                 }
> >>>> +                               __print_simd_regs(intr, simd_mask);
> >>>> +                               __print_pred_regs(intr, pred_mask);
> >>>>                                 fputc('\n', stderr);
> >>>>                                 /* just printing available regs */
> >>>>                                 goto error;
> >>>>                         }
> >>>> +
> >>>> +                       if (simd_mask) {
> >>>> +                               has_simd_regs = __parse_simd_regs(opts, s, intr);
> >>>> +                               if (has_simd_regs)
> >>>> +                                       goto next;
> >>>> +                       }
> >>>> +                       if (pred_mask) {
> >>>> +                               has_simd_regs = __parse_pred_regs(opts, s, intr);
> >>>> +                               if (has_simd_regs)
> >>>> +                                       goto next;
> >>>> +                       }
> >>>> +
> >>>>                         for (r = arch__sample_reg_masks(); r->name; r++) {
> >>>>                                 if ((r->mask & mask) && !strcasecmp(s, r->name))
> >>>>                                         break;
> >>>> @@ -65,7 +206,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>                         }
> >>>>
> >>>>                         *mode |= r->mask;
> >>>> -
> >>>> +next:
> >>>>                         if (!p)
> >>>>                                 break;
> >>>>
> >>>> @@ -75,7 +216,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>         ret = 0;
> >>>>
> >>>>         /* default to all possible regs */
> >>>> -       if (*mode == 0)
> >>>> +       if (*mode == 0 && !has_simd_regs)
> >>>>                 *mode = mask;
> >>>>  error:
> >>>>         free(os);
> >>>> diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
> >>>> index 66b666d9ce64..fb0366d050cf 100644
> >>>> --- a/tools/perf/util/perf_event_attr_fprintf.c
> >>>> +++ b/tools/perf/util/perf_event_attr_fprintf.c
> >>>> @@ -360,6 +360,12 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
> >>>>         PRINT_ATTRf(aux_start_paused, p_unsigned);
> >>>>         PRINT_ATTRf(aux_pause, p_unsigned);
> >>>>         PRINT_ATTRf(aux_resume, p_unsigned);
> >>>> +       PRINT_ATTRf(sample_simd_pred_reg_qwords, p_unsigned);
> >>>> +       PRINT_ATTRf(sample_simd_pred_reg_intr, p_hex);
> >>>> +       PRINT_ATTRf(sample_simd_pred_reg_user, p_hex);
> >>>> +       PRINT_ATTRf(sample_simd_vec_reg_qwords, p_unsigned);
> >>>> +       PRINT_ATTRf(sample_simd_vec_reg_intr, p_hex);
> >>>> +       PRINT_ATTRf(sample_simd_vec_reg_user, p_hex);
> >>>>
> >>>>         return ret;
> >>>>  }
> >>>> diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
> >>>> index 44b90bbf2d07..e8a9fabc92e6 100644
> >>>> --- a/tools/perf/util/perf_regs.c
> >>>> +++ b/tools/perf/util/perf_regs.c
> >>>> @@ -11,6 +11,11 @@ int __weak arch_sdt_arg_parse_op(char *old_op __maybe_unused,
> >>>>         return SDT_ARG_SKIP;
> >>>>  }
> >>>>
> >>>> +bool __weak arch_has_simd_regs(u64 mask __maybe_unused)
> >>>> +{
> >>>> +       return false;
> >>>> +}
> >>>> +
> >>>>  uint64_t __weak arch__intr_reg_mask(void)
> >>>>  {
> >>>>         return 0;
> >>>> @@ -21,6 +26,50 @@ uint64_t __weak arch__user_reg_mask(void)
> >>>>         return 0;
> >>>>  }
> >>>>
> >>>> +uint64_t __weak arch__intr_simd_reg_mask(void)
> >>>> +{
> >>>> +       return 0;
> >>>> +}
> >>>> +
> >>>> +uint64_t __weak arch__user_simd_reg_mask(void)
> >>>> +{
> >>>> +       return 0;
> >>>> +}
> >>>> +
> >>>> +uint64_t __weak arch__intr_pred_reg_mask(void)
> >>>> +{
> >>>> +       return 0;
> >>>> +}
> >>>> +
> >>>> +uint64_t __weak arch__user_pred_reg_mask(void)
> >>>> +{
> >>>> +       return 0;
> >>>> +}
> >>>> +
> >>>> +uint64_t __weak arch__intr_simd_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
> >>>> +{
> >>>> +       *qwords = 0;
> >>>> +       return 0;
> >>>> +}
> >>>> +
> >>>> +uint64_t __weak arch__user_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> >>>> +{
> >>>> +       *qwords = 0;
> >>>> +       return 0;
> >>>> +}
> >>>> +
> >>>> +uint64_t __weak arch__intr_pred_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
> >>>> +{
> >>>> +       *qwords = 0;
> >>>> +       return 0;
> >>>> +}
> >>>> +
> >>>> +uint64_t __weak arch__user_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> >>>> +{
> >>>> +       *qwords = 0;
> >>>> +       return 0;
> >>>> +}
> >>>> +
> >>>>  static const struct sample_reg sample_reg_masks[] = {
> >>>>         SMPL_REG_END
> >>>>  };
> >>>> @@ -30,6 +79,16 @@ const struct sample_reg * __weak arch__sample_reg_masks(void)
> >>>>         return sample_reg_masks;
> >>>>  }
> >>>>
> >>>> +const struct sample_reg * __weak arch__sample_simd_reg_masks(void)
> >>>> +{
> >>>> +       return sample_reg_masks;
> >>>> +}
> >>>> +
> >>>> +const struct sample_reg * __weak arch__sample_pred_reg_masks(void)
> >>>> +{
> >>>> +       return sample_reg_masks;
> >>>> +}
> >>>> +
> >>>>  const char *perf_reg_name(int id, const char *arch)
> >>>>  {
> >>>>         const char *reg_name = NULL;
> >>>> diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
> >>>> index f2d0736d65cc..bce9c4cfd1bf 100644
> >>>> --- a/tools/perf/util/perf_regs.h
> >>>> +++ b/tools/perf/util/perf_regs.h
> >>>> @@ -24,9 +24,20 @@ enum {
> >>>>  };
> >>>>
> >>>>  int arch_sdt_arg_parse_op(char *old_op, char **new_op);
> >>>> +bool arch_has_simd_regs(u64 mask);
> >>>>  uint64_t arch__intr_reg_mask(void);
> >>>>  uint64_t arch__user_reg_mask(void);
> >>>>  const struct sample_reg *arch__sample_reg_masks(void);
> >>>> +const struct sample_reg *arch__sample_simd_reg_masks(void);
> >>>> +const struct sample_reg *arch__sample_pred_reg_masks(void);
> >>> I wonder we can remove these functions. perf_reg_name(int id, uint16_t
> >>> e_machine) maps a perf register number and e_machine to a string. So
> >>> the sample_reg array could be replaced with:
> >>> ```
> >>> for (int perf_reg = 0; perf_reg < 64; perf_reg++) {
> >>>   uint64_t mask = 1LL << perf_reg;
> >>>   const char *name = perf_reg_name(perf_reg, EM_HOST);
> >>>   if (name == NULL)
> >>>     break;
> >>>   // use mask and name
> >>> ```
> >>> ```
> >>> To make it work for SIMD and PRED then I guess we need to iterate
> >>> through the ABIs of enum perf_sample_regs_abi.
> >> Suppose so.
> >>
> >>
> >>>> +uint64_t arch__intr_simd_reg_mask(void);
> >>>> +uint64_t arch__user_simd_reg_mask(void);
> >>>> +uint64_t arch__intr_pred_reg_mask(void);
> >>>> +uint64_t arch__user_pred_reg_mask(void);
> >>> I think some comments would be useful here like:
> >>> ```
> >>> /* Perf register bit map with valid bits for
> >>> perf_event_attr.sample_regs_user. */
> >>> uint64_t arch__intr_reg_mask(void);
> >>> /* Perf register bit map with valid bits for
> >>> perf_event_attr.sample_regs_intr. */
> >>> uint64_t arch__user_reg_mask(void);
> >>> /* Perf register bit map with valid bits for
> >>> perf_event_attr.sample_simd_vec_reg_intr. */
> >>> uint64_t arch__intr_simd_reg_mask(void);
> >>> /* Perf register bit map with valid bits for
> >>> perf_event_attr.sample_simd_vec_reg_user. */
> >>> uint64_t arch__user_simd_reg_mask(void);
> >>> /* Perf register bit map with valid bits for
> >>> perf_event_attr.sample_simd_pred_reg_intr. */
> >>> uint64_t arch__intr_pred_reg_mask(void);
> >>> /* Perf register bit map with valid bits for
> >>> perf_event_attr.sample_simd_pred_reg_user. */
> >>> uint64_t arch__user_pred_reg_mask(void);
> >> Sure. Thanks.
> >>
> >>
> >>> ```
> >>>
> >>> Why do the arch__user_pred_reg_mask return a uint64_t when the
> >>> perf_event_attr variable is a __u32?
> >> Suppose it's a bug. :)
> >>
> >>
> >>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>> I don't understand this function. The qwords value is specific to a
> >>> perf_event_attr. We could have an evlist with one evsel set up to
> >>> sample, say, XMM registers and another evsel set up to sample ZMM
> >>> registers. Are the qwords here always for the ZMM case, or XMM,
> >>> YMM, or ZMM depending on architecture support? Why does it vary per
> >>> register? The surrounding code uses the term mask but here bitmap is
> >>> used; is the inconsistency deliberate? Why are there user and intr
> >>> functions when the perf_event_attr has only the
> >>> sample_simd_pred_reg_qwords and sample_simd_vec_reg_qwords variables?
> >> These 4 functions are designed to get the bitmask and qwords length for a
> >> specific type of SIMD register. E.g., for XMM on x86 platforms, the returned
> >> bitmask is 0xffff (xmm0 ~ xmm15) and the qwords length is 2 (128 bits). For
> >> ZMM on x86 platforms, if the platform only supports 16 ZMM registers, the
> >> returned bitmask is 0xffff (zmm0 ~ zmm15) and the qwords length is 8 (512
> >> bits). If the platform supports 32 ZMM registers, the returned bitmask
> >> is 0xffffffff (zmm0 ~ zmm31) and the qwords length is 8 (512 bits).
> > What is the meaning of reg? In this file it is normally the integer
> > index for a bit in the sample_regs_user mask, but for x86 I don't see
> > enum perf_event_x86_regs having differing XMM, YMM and ZMM encodings.
> > Similarly, qwords looks like an out argument, but then you also have the
> > bitmap. It looks like the code is caching values, but that assumes a
> > single qword length for all events.
>
> Yes, the "reg" argument indicates the SIMD register index. Strictly
> speaking for x86 platform, the qwords length is fixed for a specific SIMD
> register and only the register number could vary, e.g., some platforms
> could only support 16 ZMM registers, but some other platforms could support
> 32 ZMM registers. But considering this is a generic function for all kinds
> of archs, we can't ensure there are fixed length for a specific SIMD
> register on any arch, so I introduce  the "qwords" argument to increase the
> flexibility.

I'm still not understanding this :-) What is a "SIMD register
index"? The file is for perf registers, naturally enum
perf_event_x86_regs on x86, but that doesn't encode YMM and ZMM
registers. Perhaps you can give some examples?

How does the generic differing-qwords-per-register case get encoded
into a perf_event_attr? If it can't be, then this seems like
functionality with no benefit. I also don't understand how the data in
the PERF_SAMPLE_REGS_USER part of a sample could be decoded, as that
assumes a constant qword number.
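To make the decoding concern concrete, here is a minimal sketch (not the actual perf code; the flat payload layout, the function name decode_vec_regs and its parameters are all assumptions for illustration) of why a reader of the sample data needs one constant qwords value: each selected vector register is assumed to occupy exactly that many u64 slots, so a per-register qwords length would make the payload undecodable.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical decoder for a flat SIMD register payload. Each bit set
 * in reg_bitmap selects one register, and every selected register is
 * assumed to occupy exactly 'qwords' consecutive u64 slots (qwords is
 * capped at 8 here, i.e. one 512-bit register). A single constant
 * qwords value must therefore hold for the whole sample.
 */
static int decode_vec_regs(const uint64_t *payload, uint64_t reg_bitmap,
			   uint16_t qwords, uint64_t out[][8])
{
	int nr = 0;

	for (int reg = 0; reg < 64; reg++) {
		if (!(reg_bitmap & (1ULL << reg)))
			continue;
		for (uint16_t q = 0; q < qwords; q++)
			out[nr][q] = *payload++;
		nr++;
	}
	return nr; /* number of registers decoded */
}
```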

> No, the qwords would be set to the true register length if the register
> exists on the platform, e.g., xmm = 2, ymm = 4 and zmm = 8. If the
> register is not supported on the platform, the qwords would be set to 0.

So it is a max function of the vector/pred qwords supported on the architecture.
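If that reading is right, the tool-side selection reduces to taking the maximum over the requested per-set widths, since one attr field holds a single qwords value for the whole event. A hedged sketch (the constants assume the x86 widths discussed above, 128/256/512-bit; the function name is hypothetical), mirroring the "Just need the highest qwords" logic in the quoted __parse_simd_regs():

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative per-set vector widths in qwords, assuming the x86
 * register sizes mentioned in the thread (xmm=128b, ymm=256b, zmm=512b). */
enum { XMM_QWORDS = 2, YMM_QWORDS = 4, ZMM_QWORDS = 8 };

/*
 * Keep only the widest requested register set: the perf_event_attr has
 * a single sample_simd_vec_reg_qwords field, so the event must be
 * configured with the largest width any requested set needs.
 */
static uint16_t pick_vec_qwords(const uint16_t *requested, int n)
{
	uint16_t max = 0;

	for (int i = 0; i < n; i++)
		if (requested[i] > max)
			max = requested[i];
	return max;
}
```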

> >
> >> Since the qword length is always fixed for any given SIMD register
> >> regardless of intr or user, there is only one
> >> sample_simd_pred_reg_qwords or sample_simd_vec_reg_qwords variable.
> > Ok.  2 variables, but 4 functions here. I think there should just be 2
> > because of this.
>
> Yes, the user and intr variants would be merged into only one.

Thanks,
Ian

> >
> > Thanks,
> > Ian
> >
> >>> Perhaps these functions should be something more like:
> >>> ```
> >>> /* Maximum value that can be assigned to
> >>> perf_event_atttr.sample_simd_pred_reg_qwords. */
> >>> uint16_t arch__simd_pred_reg_qwords_max(void);
> >>> /* Maximum value that can be assigned to
> >>> perf_event_atttr.sample_simd_vec_reg_qwords. */
> >>> uint16_t arch__simd_vec_reg_qwords_max(void);
> >>> ```
> >>> Then the bitmap computation logic can all be moved into parse-regs-options.c.
> >>>
> >>> Thanks,
> >>> Ian
> >>>
> >>>>  const char *perf_reg_name(int id, const char *arch);
> >>>>  int perf_reg_value(u64 *valp, struct regs_dump *regs, int id);
> >>>> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
> >>>> index ea3a6c4657ee..825ffb4cc53f 100644
> >>>> --- a/tools/perf/util/record.h
> >>>> +++ b/tools/perf/util/record.h
> >>>> @@ -59,7 +59,13 @@ struct record_opts {
> >>>>         unsigned int  user_freq;
> >>>>         u64           branch_stack;
> >>>>         u64           sample_intr_regs;
> >>>> +       u64           sample_intr_vec_regs;
> >>>>         u64           sample_user_regs;
> >>>> +       u64           sample_user_vec_regs;
> >>>> +       u16           sample_pred_regs_qwords;
> >>>> +       u16           sample_vec_regs_qwords;
> >>>> +       u16           sample_intr_pred_regs;
> >>>> +       u16           sample_user_pred_regs;
> >>>>         u64           default_interval;
> >>>>         u64           user_interval;
> >>>>         size_t        auxtrace_snapshot_size;
> >>>> --
> >>>> 2.34.1
> >>>>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
  2026-01-21  7:09           ` Ian Rogers
@ 2026-01-21  7:52             ` Mi, Dapeng
  2026-01-21 14:48               ` Ian Rogers
  0 siblings, 1 reply; 86+ messages in thread
From: Mi, Dapeng @ 2026-01-21  7:52 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang


On 1/21/2026 3:09 PM, Ian Rogers wrote:
> On Tue, Jan 20, 2026 at 9:17 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>
>> On 1/21/2026 2:20 AM, Ian Rogers wrote:
>>> On Tue, Jan 20, 2026 at 1:04 AM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>>> On 1/20/2026 3:39 PM, Ian Rogers wrote:
>>>>> On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
>>>>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>>>>
>>>>>> This patch adds support for the newly introduced SIMD register sampling
>>>>>> format by adding the following functions:
>>>>>>
>>>>>> uint64_t arch__intr_simd_reg_mask(void);
>>>>>> uint64_t arch__user_simd_reg_mask(void);
>>>>>> uint64_t arch__intr_pred_reg_mask(void);
>>>>>> uint64_t arch__user_pred_reg_mask(void);
>>>>>> uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>> uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>> uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>> uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>
>>>>>> The arch__{intr|user}_simd_reg_mask() functions retrieve the bitmap of
>>>>>> supported SIMD registers, such as XMM/YMM/ZMM on x86 platforms.
>>>>>>
>>>>>> The arch__{intr|user}_pred_reg_mask() functions retrieve the bitmap of
>>>>>> supported PRED registers, such as OPMASK on x86 platforms.
>>>>>>
>>>>>> The arch__{intr|user}_simd_reg_bitmap_qwords() functions provide the
>>>>>> exact bitmap and number of qwords for a specific type of SIMD register.
>>>>>> For example, for XMM registers on x86 platforms, the returned bitmap is
>>>>>> 0xffff (XMM0 ~ XMM15) and the qwords number is 2 (128 bits for each XMM).
>>>>>>
>>>>>> The arch__{intr|user}_pred_reg_bitmap_qwords() functions provide the
>>>>>> exact bitmap and number of qwords for a specific type of PRED register.
>>>>>> For example, for OPMASK registers on x86 platforms, the returned bitmap
>>>>>> is 0xff (OPMASK0 ~ OPMASK7) and the qwords number is 1 (64 bits for each
>>>>>> OPMASK).
>>>>>>
>>>>>> Additionally, the function __parse_regs() is enhanced to support parsing
>>>>>> these newly introduced SIMD registers. Currently, each type of register
>>>>>> can only be sampled collectively; sampling a specific SIMD register is
>>>>>> not supported. For example, all XMM registers are sampled together rather
>>>>>> than sampling only XMM0.
>>>>>>
>>>>>> When multiple overlapping register types, such as XMM and YMM, are
>>>>>> sampled simultaneously, only the superset (YMM registers) is sampled.
>>>>>>
>>>>>> With this patch, all supported sampling registers on x86 platforms are
>>>>>> displayed as follows.
>>>>>>
>>>>>>  $perf record -I?
>>>>>>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>>>>>>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>>>>>>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>>>>>>
>>>>>>  $perf record --user-regs=?
>>>>>>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>>>>>>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>>>>>>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>>>>>>
>>>>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>>>>>> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>>>>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>>>>> ---
>>>>>>  tools/perf/arch/x86/util/perf_regs.c      | 470 +++++++++++++++++++++-
>>>>>>  tools/perf/util/evsel.c                   |  27 ++
>>>>>>  tools/perf/util/parse-regs-options.c      | 151 ++++++-
>>>>>>  tools/perf/util/perf_event_attr_fprintf.c |   6 +
>>>>>>  tools/perf/util/perf_regs.c               |  59 +++
>>>>>>  tools/perf/util/perf_regs.h               |  11 +
>>>>>>  tools/perf/util/record.h                  |   6 +
>>>>>>  7 files changed, 714 insertions(+), 16 deletions(-)
>>>>>>
>>>>>> diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
>>>>>> index 12fd93f04802..db41430f3b07 100644
>>>>>> --- a/tools/perf/arch/x86/util/perf_regs.c
>>>>>> +++ b/tools/perf/arch/x86/util/perf_regs.c
>>>>>> @@ -13,6 +13,49 @@
>>>>>>  #include "../../../util/pmu.h"
>>>>>>  #include "../../../util/pmus.h"
>>>>>>
>>>>>> +static const struct sample_reg sample_reg_masks_ext[] = {
>>>>>> +       SMPL_REG(AX, PERF_REG_X86_AX),
>>>>>> +       SMPL_REG(BX, PERF_REG_X86_BX),
>>>>>> +       SMPL_REG(CX, PERF_REG_X86_CX),
>>>>>> +       SMPL_REG(DX, PERF_REG_X86_DX),
>>>>>> +       SMPL_REG(SI, PERF_REG_X86_SI),
>>>>>> +       SMPL_REG(DI, PERF_REG_X86_DI),
>>>>>> +       SMPL_REG(BP, PERF_REG_X86_BP),
>>>>>> +       SMPL_REG(SP, PERF_REG_X86_SP),
>>>>>> +       SMPL_REG(IP, PERF_REG_X86_IP),
>>>>>> +       SMPL_REG(FLAGS, PERF_REG_X86_FLAGS),
>>>>>> +       SMPL_REG(CS, PERF_REG_X86_CS),
>>>>>> +       SMPL_REG(SS, PERF_REG_X86_SS),
>>>>>> +#ifdef HAVE_ARCH_X86_64_SUPPORT
>>>>>> +       SMPL_REG(R8, PERF_REG_X86_R8),
>>>>>> +       SMPL_REG(R9, PERF_REG_X86_R9),
>>>>>> +       SMPL_REG(R10, PERF_REG_X86_R10),
>>>>>> +       SMPL_REG(R11, PERF_REG_X86_R11),
>>>>>> +       SMPL_REG(R12, PERF_REG_X86_R12),
>>>>>> +       SMPL_REG(R13, PERF_REG_X86_R13),
>>>>>> +       SMPL_REG(R14, PERF_REG_X86_R14),
>>>>>> +       SMPL_REG(R15, PERF_REG_X86_R15),
>>>>>> +       SMPL_REG(R16, PERF_REG_X86_R16),
>>>>>> +       SMPL_REG(R17, PERF_REG_X86_R17),
>>>>>> +       SMPL_REG(R18, PERF_REG_X86_R18),
>>>>>> +       SMPL_REG(R19, PERF_REG_X86_R19),
>>>>>> +       SMPL_REG(R20, PERF_REG_X86_R20),
>>>>>> +       SMPL_REG(R21, PERF_REG_X86_R21),
>>>>>> +       SMPL_REG(R22, PERF_REG_X86_R22),
>>>>>> +       SMPL_REG(R23, PERF_REG_X86_R23),
>>>>>> +       SMPL_REG(R24, PERF_REG_X86_R24),
>>>>>> +       SMPL_REG(R25, PERF_REG_X86_R25),
>>>>>> +       SMPL_REG(R26, PERF_REG_X86_R26),
>>>>>> +       SMPL_REG(R27, PERF_REG_X86_R27),
>>>>>> +       SMPL_REG(R28, PERF_REG_X86_R28),
>>>>>> +       SMPL_REG(R29, PERF_REG_X86_R29),
>>>>>> +       SMPL_REG(R30, PERF_REG_X86_R30),
>>>>>> +       SMPL_REG(R31, PERF_REG_X86_R31),
>>>>>> +       SMPL_REG(SSP, PERF_REG_X86_SSP),
>>>>>> +#endif
>>>>>> +       SMPL_REG_END
>>>>>> +};
>>>>>> +
>>>>>>  static const struct sample_reg sample_reg_masks[] = {
>>>>>>         SMPL_REG(AX, PERF_REG_X86_AX),
>>>>>>         SMPL_REG(BX, PERF_REG_X86_BX),
>>>>>> @@ -276,27 +319,404 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
>>>>>>         return SDT_ARG_VALID;
>>>>>>  }
>>>>>>
>>>>>> +static bool support_simd_reg(u64 sample_type, u16 qwords, u64 mask, bool pred)
>>>>>> +{
>>>>>> +       struct perf_event_attr attr = {
>>>>>> +               .type                           = PERF_TYPE_HARDWARE,
>>>>>> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
>>>>>> +               .sample_type                    = sample_type,
>>>>>> +               .disabled                       = 1,
>>>>>> +               .exclude_kernel                 = 1,
>>>>>> +               .sample_simd_regs_enabled       = 1,
>>>>>> +       };
>>>>>> +       int fd;
>>>>>> +
>>>>>> +       attr.sample_period = 1;
>>>>>> +
>>>>>> +       if (!pred) {
>>>>>> +               attr.sample_simd_vec_reg_qwords = qwords;
>>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>> +                       attr.sample_simd_vec_reg_intr = mask;
>>>>>> +               else
>>>>>> +                       attr.sample_simd_vec_reg_user = mask;
>>>>>> +       } else {
>>>>>> +               attr.sample_simd_pred_reg_qwords = PERF_X86_OPMASK_QWORDS;
>>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>> +                       attr.sample_simd_pred_reg_intr = PERF_X86_SIMD_PRED_MASK;
>>>>>> +               else
>>>>>> +                       attr.sample_simd_pred_reg_user = PERF_X86_SIMD_PRED_MASK;
>>>>>> +       }
>>>>>> +
>>>>>> +       if (perf_pmus__num_core_pmus() > 1) {
>>>>>> +               struct perf_pmu *pmu = NULL;
>>>>>> +               __u64 type = PERF_TYPE_RAW;
>>>>>> +
>>>>>> +               /*
>>>>>> +                * The same register set is supported among different hybrid PMUs.
>>>>>> +                * Only check the first available one.
>>>>>> +                */
>>>>>> +               while ((pmu = perf_pmus__scan_core(pmu)) != NULL) {
>>>>>> +                       type = pmu->type;
>>>>>> +                       break;
>>>>>> +               }
>>>>>> +               attr.config |= type << PERF_PMU_TYPE_SHIFT;
>>>>>> +       }
>>>>>> +
>>>>>> +       event_attr_init(&attr);
>>>>>> +
>>>>>> +       fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
>>>>>> +       if (fd != -1) {
>>>>>> +               close(fd);
>>>>>> +               return true;
>>>>>> +       }
>>>>>> +
>>>>>> +       return false;
>>>>>> +}
>>>>>> +
>>>>>> +static bool __arch_simd_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
>>>>>> +{
>>>>>> +       bool supported = false;
>>>>>> +       u64 bits;
>>>>>> +
>>>>>> +       *mask = 0;
>>>>>> +       *qwords = 0;
>>>>>> +
>>>>>> +       switch (reg) {
>>>>>> +       case PERF_REG_X86_XMM:
>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_XMM_QWORDS, bits, false);
>>>>>> +               if (supported) {
>>>>>> +                       *mask = bits;
>>>>>> +                       *qwords = PERF_X86_XMM_QWORDS;
>>>>>> +               }
>>>>>> +               break;
>>>>>> +       case PERF_REG_X86_YMM:
>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_YMM_REGS) - 1;
>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_YMM_QWORDS, bits, false);
>>>>>> +               if (supported) {
>>>>>> +                       *mask = bits;
>>>>>> +                       *qwords = PERF_X86_YMM_QWORDS;
>>>>>> +               }
>>>>>> +               break;
>>>>>> +       case PERF_REG_X86_ZMM:
>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_ZMM_REGS) - 1;
>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
>>>>>> +               if (supported) {
>>>>>> +                       *mask = bits;
>>>>>> +                       *qwords = PERF_X86_ZMM_QWORDS;
>>>>>> +                       break;
>>>>>> +               }
>>>>>> +
>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_ZMMH_REGS) - 1;
>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
>>>>>> +               if (supported) {
>>>>>> +                       *mask = bits;
>>>>>> +                       *qwords = PERF_X86_ZMMH_QWORDS;
>>>>>> +               }
>>>>>> +               break;
>>>>>> +       default:
>>>>>> +               break;
>>>>>> +       }
>>>>>> +
>>>>>> +       return supported;
>>>>>> +}
>>>>>> +
>>>>>> +static bool __arch_pred_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
>>>>>> +{
>>>>>> +       bool supported = false;
>>>>>> +       u64 bits;
>>>>>> +
>>>>>> +       *mask = 0;
>>>>>> +       *qwords = 0;
>>>>>> +
>>>>>> +       switch (reg) {
>>>>>> +       case PERF_REG_X86_OPMASK:
>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_OPMASK_REGS) - 1;
>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_OPMASK_QWORDS, bits, true);
>>>>>> +               if (supported) {
>>>>>> +                       *mask = bits;
>>>>>> +                       *qwords = PERF_X86_OPMASK_QWORDS;
>>>>>> +               }
>>>>>> +               break;
>>>>>> +       default:
>>>>>> +               break;
>>>>>> +       }
>>>>>> +
>>>>>> +       return supported;
>>>>>> +}
>>>>>> +
>>>>>> +static bool has_cap_simd_regs(void)
>>>>>> +{
>>>>>> +       uint64_t mask = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
>>>>>> +       u16 qwords = PERF_X86_XMM_QWORDS;
>>>>>> +       static bool has_cap_simd_regs;
>>>>>> +       static bool cached;
>>>>>> +
>>>>>> +       if (cached)
>>>>>> +               return has_cap_simd_regs;
>>>>>> +
>>>>>> +       has_cap_simd_regs = __arch_simd_reg_mask(PERF_SAMPLE_REGS_INTR,
>>>>>> +                                                PERF_REG_X86_XMM, &mask, &qwords);
>>>>>> +       has_cap_simd_regs |= __arch_simd_reg_mask(PERF_SAMPLE_REGS_USER,
>>>>>> +                                                PERF_REG_X86_XMM, &mask, &qwords);
>>>>>> +       cached = true;
>>>>>> +
>>>>>> +       return has_cap_simd_regs;
>>>>>> +}
>>>>>> +
>>>>>> +bool arch_has_simd_regs(u64 mask)
>>>>>> +{
>>>>>> +       return has_cap_simd_regs() &&
>>>>>> +              mask & GENMASK_ULL(PERF_REG_X86_SSP, PERF_REG_X86_R16);
>>>>>> +}
>>>>>> +
>>>>>> +static const struct sample_reg sample_simd_reg_masks[] = {
>>>>>> +       SMPL_REG(XMM, PERF_REG_X86_XMM),
>>>>>> +       SMPL_REG(YMM, PERF_REG_X86_YMM),
>>>>>> +       SMPL_REG(ZMM, PERF_REG_X86_ZMM),
>>>>>> +       SMPL_REG_END
>>>>>> +};
>>>>>> +
>>>>>> +static const struct sample_reg sample_pred_reg_masks[] = {
>>>>>> +       SMPL_REG(OPMASK, PERF_REG_X86_OPMASK),
>>>>>> +       SMPL_REG_END
>>>>>> +};
>>>>>> +
>>>>>> +const struct sample_reg *arch__sample_simd_reg_masks(void)
>>>>>> +{
>>>>>> +       return sample_simd_reg_masks;
>>>>>> +}
>>>>>> +
>>>>>> +const struct sample_reg *arch__sample_pred_reg_masks(void)
>>>>>> +{
>>>>>> +       return sample_pred_reg_masks;
>>>>>> +}
>>>>>> +
>>>>>> +static bool x86_intr_simd_updated;
>>>>>> +static u64 x86_intr_simd_reg_mask;
>>>>>> +static u64 x86_intr_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
>>>>>> +static u16 x86_intr_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
>>>>>> +static bool x86_user_simd_updated;
>>>>>> +static u64 x86_user_simd_reg_mask;
>>>>>> +static u64 x86_user_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
>>>>>> +static u16 x86_user_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
>>>>>> +
>>>>>> +static bool x86_intr_pred_updated;
>>>>>> +static u64 x86_intr_pred_reg_mask;
>>>>>> +static u64 x86_intr_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
>>>>>> +static u16 x86_intr_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
>>>>>> +static bool x86_user_pred_updated;
>>>>>> +static u64 x86_user_pred_reg_mask;
>>>>>> +static u64 x86_user_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
>>>>>> +static u16 x86_user_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
>>>>>> +
>>>>>> +static uint64_t __arch__simd_reg_mask(u64 sample_type)
>>>>>> +{
>>>>>> +       const struct sample_reg *r = NULL;
>>>>>> +       bool supported;
>>>>>> +       u64 mask = 0;
>>>>>> +       int reg;
>>>>>> +
>>>>>> +       if (!has_cap_simd_regs())
>>>>>> +               return 0;
>>>>>> +
>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_simd_updated)
>>>>>> +               return x86_intr_simd_reg_mask;
>>>>>> +
>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_simd_updated)
>>>>>> +               return x86_user_simd_reg_mask;
>>>>>> +
>>>>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>>>>>> +               supported = false;
>>>>>> +
>>>>>> +               if (!r->mask)
>>>>>> +                       continue;
>>>>>> +               reg = fls64(r->mask) - 1;
>>>>>> +
>>>>>> +               if (reg >= PERF_REG_X86_MAX_SIMD_REGS)
>>>>>> +                       break;
>>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>> +                       supported = __arch_simd_reg_mask(sample_type, reg,
>>>>>> +                                                        &x86_intr_simd_mask[reg],
>>>>>> +                                                        &x86_intr_simd_qwords[reg]);
>>>>>> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
>>>>>> +                       supported = __arch_simd_reg_mask(sample_type, reg,
>>>>>> +                                                        &x86_user_simd_mask[reg],
>>>>>> +                                                        &x86_user_simd_qwords[reg]);
>>>>>> +               if (supported)
>>>>>> +                       mask |= BIT_ULL(reg);
>>>>>> +       }
>>>>>> +
>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
>>>>>> +               x86_intr_simd_reg_mask = mask;
>>>>>> +               x86_intr_simd_updated = true;
>>>>>> +       } else {
>>>>>> +               x86_user_simd_reg_mask = mask;
>>>>>> +               x86_user_simd_updated = true;
>>>>>> +       }
>>>>>> +
>>>>>> +       return mask;
>>>>>> +}
>>>>>> +
>>>>>> +static uint64_t __arch__pred_reg_mask(u64 sample_type)
>>>>>> +{
>>>>>> +       const struct sample_reg *r = NULL;
>>>>>> +       bool supported;
>>>>>> +       u64 mask = 0;
>>>>>> +       int reg;
>>>>>> +
>>>>>> +       if (!has_cap_simd_regs())
>>>>>> +               return 0;
>>>>>> +
>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_pred_updated)
>>>>>> +               return x86_intr_pred_reg_mask;
>>>>>> +
>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_pred_updated)
>>>>>> +               return x86_user_pred_reg_mask;
>>>>>> +
>>>>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>>>>>> +               supported = false;
>>>>>> +
>>>>>> +               if (!r->mask)
>>>>>> +                       continue;
>>>>>> +               reg = fls64(r->mask) - 1;
>>>>>> +
>>>>>> +               if (reg >= PERF_REG_X86_MAX_PRED_REGS)
>>>>>> +                       break;
>>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>> +                       supported = __arch_pred_reg_mask(sample_type, reg,
>>>>>> +                                                        &x86_intr_pred_mask[reg],
>>>>>> +                                                        &x86_intr_pred_qwords[reg]);
>>>>>> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
>>>>>> +                       supported = __arch_pred_reg_mask(sample_type, reg,
>>>>>> +                                                        &x86_user_pred_mask[reg],
>>>>>> +                                                        &x86_user_pred_qwords[reg]);
>>>>>> +               if (supported)
>>>>>> +                       mask |= BIT_ULL(reg);
>>>>>> +       }
>>>>>> +
>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
>>>>>> +               x86_intr_pred_reg_mask = mask;
>>>>>> +               x86_intr_pred_updated = true;
>>>>>> +       } else {
>>>>>> +               x86_user_pred_reg_mask = mask;
>>>>>> +               x86_user_pred_updated = true;
>>>>>> +       }
>>>>>> +
>>>>>> +       return mask;
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t arch__intr_simd_reg_mask(void)
>>>>>> +{
>>>>>> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_INTR);
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t arch__user_simd_reg_mask(void)
>>>>>> +{
>>>>>> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_USER);
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t arch__intr_pred_reg_mask(void)
>>>>>> +{
>>>>>> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_INTR);
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t arch__user_pred_reg_mask(void)
>>>>>> +{
>>>>>> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_USER);
>>>>>> +}
>>>>>> +
>>>>>> +static uint64_t arch__simd_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
>>>>>> +{
>>>>>> +       uint64_t mask = 0;
>>>>>> +
>>>>>> +       *qwords = 0;
>>>>>> +       if (reg < PERF_REG_X86_MAX_SIMD_REGS) {
>>>>>> +               if (intr) {
>>>>>> +                       *qwords = x86_intr_simd_qwords[reg];
>>>>>> +                       mask = x86_intr_simd_mask[reg];
>>>>>> +               } else {
>>>>>> +                       *qwords = x86_user_simd_qwords[reg];
>>>>>> +                       mask = x86_user_simd_mask[reg];
>>>>>> +               }
>>>>>> +       }
>>>>>> +
>>>>>> +       return mask;
>>>>>> +}
>>>>>> +
>>>>>> +static uint64_t arch__pred_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
>>>>>> +{
>>>>>> +       uint64_t mask = 0;
>>>>>> +
>>>>>> +       *qwords = 0;
>>>>>> +       if (reg < PERF_REG_X86_MAX_PRED_REGS) {
>>>>>> +               if (intr) {
>>>>>> +                       *qwords = x86_intr_pred_qwords[reg];
>>>>>> +                       mask = x86_intr_pred_mask[reg];
>>>>>> +               } else {
>>>>>> +                       *qwords = x86_user_pred_qwords[reg];
>>>>>> +                       mask = x86_user_pred_mask[reg];
>>>>>> +               }
>>>>>> +       }
>>>>>> +
>>>>>> +       return mask;
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>> +{
>>>>>> +       if (!x86_intr_simd_updated)
>>>>>> +               arch__intr_simd_reg_mask();
>>>>>> +       return arch__simd_reg_bitmap_qwords(reg, qwords, true);
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>> +{
>>>>>> +       if (!x86_user_simd_updated)
>>>>>> +               arch__user_simd_reg_mask();
>>>>>> +       return arch__simd_reg_bitmap_qwords(reg, qwords, false);
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>> +{
>>>>>> +       if (!x86_intr_pred_updated)
>>>>>> +               arch__intr_pred_reg_mask();
>>>>>> +       return arch__pred_reg_bitmap_qwords(reg, qwords, true);
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>> +{
>>>>>> +       if (!x86_user_pred_updated)
>>>>>> +               arch__user_pred_reg_mask();
>>>>>> +       return arch__pred_reg_bitmap_qwords(reg, qwords, false);
>>>>>> +}
>>>>>> +
>>>>>>  const struct sample_reg *arch__sample_reg_masks(void)
>>>>>>  {
>>>>>> +       if (has_cap_simd_regs())
>>>>>> +               return sample_reg_masks_ext;
>>>>>>         return sample_reg_masks;
>>>>>>  }
>>>>>>
>>>>>> -uint64_t arch__intr_reg_mask(void)
>>>>>> +static uint64_t __arch__reg_mask(u64 sample_type, u64 mask, bool has_simd_regs)
>>>>>>  {
>>>>>>         struct perf_event_attr attr = {
>>>>>> -               .type                   = PERF_TYPE_HARDWARE,
>>>>>> -               .config                 = PERF_COUNT_HW_CPU_CYCLES,
>>>>>> -               .sample_type            = PERF_SAMPLE_REGS_INTR,
>>>>>> -               .sample_regs_intr       = PERF_REG_EXTENDED_MASK,
>>>>>> -               .precise_ip             = 1,
>>>>>> -               .disabled               = 1,
>>>>>> -               .exclude_kernel         = 1,
>>>>>> +               .type                           = PERF_TYPE_HARDWARE,
>>>>>> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
>>>>>> +               .sample_type                    = sample_type,
>>>>>> +               .precise_ip                     = 1,
>>>>>> +               .disabled                       = 1,
>>>>>> +               .exclude_kernel                 = 1,
>>>>>> +               .sample_simd_regs_enabled       = has_simd_regs,
>>>>>>         };
>>>>>>         int fd;
>>>>>>         /*
>>>>>>          * In an unnamed union, init it here to build on older gcc versions
>>>>>>          */
>>>>>>         attr.sample_period = 1;
>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>> +               attr.sample_regs_intr = mask;
>>>>>> +       else
>>>>>> +               attr.sample_regs_user = mask;
>>>>>>
>>>>>>         if (perf_pmus__num_core_pmus() > 1) {
>>>>>>                 struct perf_pmu *pmu = NULL;
>>>>>> @@ -318,13 +738,41 @@ uint64_t arch__intr_reg_mask(void)
>>>>>>         fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
>>>>>>         if (fd != -1) {
>>>>>>                 close(fd);
>>>>>> -               return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
>>>>>> +               return mask;
>>>>>>         }
>>>>>>
>>>>>> -       return PERF_REGS_MASK;
>>>>>> +       return 0;
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t arch__intr_reg_mask(void)
>>>>>> +{
>>>>>> +       uint64_t mask = PERF_REGS_MASK;
>>>>>> +
>>>>>> +       if (has_cap_simd_regs()) {
>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
>>>>>> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
>>>>>> +                                        true);
>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
>>>>>> +                                        BIT_ULL(PERF_REG_X86_SSP),
>>>>>> +                                        true);
>>>>>> +       } else
>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR, PERF_REG_EXTENDED_MASK, false);
>>>>>> +
>>>>>> +       return mask;
>>>>>>  }
>>>>>>
>>>>>>  uint64_t arch__user_reg_mask(void)
>>>>>>  {
>>>>>> -       return PERF_REGS_MASK;
>>>>>> +       uint64_t mask = PERF_REGS_MASK;
>>>>>> +
>>>>>> +       if (has_cap_simd_regs()) {
>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
>>>>>> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
>>>>>> +                                        true);
>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
>>>>>> +                                        BIT_ULL(PERF_REG_X86_SSP),
>>>>>> +                                        true);
>>>>>> +       }
>>>>>> +
>>>>>> +       return mask;
>>>>>>  }
>>>>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>>>>>> index 56ebefd075f2..5d1d90cf9488 100644
>>>>>> --- a/tools/perf/util/evsel.c
>>>>>> +++ b/tools/perf/util/evsel.c
>>>>>> @@ -1461,12 +1461,39 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
>>>>>>         if (opts->sample_intr_regs && !evsel->no_aux_samples &&
>>>>>>             !evsel__is_dummy_event(evsel)) {
>>>>>>                 attr->sample_regs_intr = opts->sample_intr_regs;
>>>>>> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_intr);
>>>>>> +               evsel__set_sample_bit(evsel, REGS_INTR);
>>>>>> +       }
>>>>>> +
>>>>>> +       if ((opts->sample_intr_vec_regs || opts->sample_intr_pred_regs) &&
>>>>>> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
> >>>>>> +               /* A non-zero pred qwords implies the SIMD register set is used */
>>>>>> +               if (opts->sample_pred_regs_qwords)
>>>>>> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
>>>>>> +               else
>>>>>> +                       attr->sample_simd_pred_reg_qwords = 1;
>>>>>> +               attr->sample_simd_vec_reg_intr = opts->sample_intr_vec_regs;
>>>>>> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
>>>>>> +               attr->sample_simd_pred_reg_intr = opts->sample_intr_pred_regs;
>>>>>>                 evsel__set_sample_bit(evsel, REGS_INTR);
>>>>>>         }
>>>>>>
>>>>>>         if (opts->sample_user_regs && !evsel->no_aux_samples &&
>>>>>>             !evsel__is_dummy_event(evsel)) {
>>>>>>                 attr->sample_regs_user |= opts->sample_user_regs;
>>>>>> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_user);
>>>>>> +               evsel__set_sample_bit(evsel, REGS_USER);
>>>>>> +       }
>>>>>> +
>>>>>> +       if ((opts->sample_user_vec_regs || opts->sample_user_pred_regs) &&
>>>>>> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
>>>>>> +               if (opts->sample_pred_regs_qwords)
>>>>>> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
>>>>>> +               else
>>>>>> +                       attr->sample_simd_pred_reg_qwords = 1;
>>>>>> +               attr->sample_simd_vec_reg_user = opts->sample_user_vec_regs;
>>>>>> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
>>>>>> +               attr->sample_simd_pred_reg_user = opts->sample_user_pred_regs;
>>>>>>                 evsel__set_sample_bit(evsel, REGS_USER);
>>>>>>         }
>>>>>>
>>>>>> diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
>>>>>> index cda1c620968e..0bd100392889 100644
>>>>>> --- a/tools/perf/util/parse-regs-options.c
>>>>>> +++ b/tools/perf/util/parse-regs-options.c
>>>>>> @@ -4,19 +4,139 @@
>>>>>>  #include <stdint.h>
>>>>>>  #include <string.h>
>>>>>>  #include <stdio.h>
>>>>>> +#include <linux/bitops.h>
>>>>>>  #include "util/debug.h"
>>>>>>  #include <subcmd/parse-options.h>
>>>>>>  #include "util/perf_regs.h"
>>>>>>  #include "util/parse-regs-options.h"
>>>>>> +#include "record.h"
>>>>>> +
>>>>>> +static void __print_simd_regs(bool intr, uint64_t simd_mask)
>>>>>> +{
>>>>>> +       const struct sample_reg *r = NULL;
>>>>>> +       uint64_t bitmap = 0;
>>>>>> +       u16 qwords = 0;
>>>>>> +       int reg_idx;
>>>>>> +
>>>>>> +       if (!simd_mask)
>>>>>> +               return;
>>>>>> +
>>>>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>>>>>> +               if (!(r->mask & simd_mask))
>>>>>> +                       continue;
>>>>>> +               reg_idx = fls64(r->mask) - 1;
>>>>>> +               if (intr)
>>>>>> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>> +               else
>>>>>> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>> +               if (bitmap)
>>>>>> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
>>>>>> +       }
>>>>>> +}
>>>>>> +
>>>>>> +static void __print_pred_regs(bool intr, uint64_t pred_mask)
>>>>>> +{
>>>>>> +       const struct sample_reg *r = NULL;
>>>>>> +       uint64_t bitmap = 0;
>>>>>> +       u16 qwords = 0;
>>>>>> +       int reg_idx;
>>>>>> +
>>>>>> +       if (!pred_mask)
>>>>>> +               return;
>>>>>> +
>>>>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>>>>>> +               if (!(r->mask & pred_mask))
>>>>>> +                       continue;
>>>>>> +               reg_idx = fls64(r->mask) - 1;
>>>>>> +               if (intr)
>>>>>> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>> +               else
>>>>>> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>> +               if (bitmap)
>>>>>> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
>>>>>> +       }
>>>>>> +}
>>>>>> +
>>>>>> +static bool __parse_simd_regs(struct record_opts *opts, char *s, bool intr)
>>>>>> +{
>>>>>> +       const struct sample_reg *r = NULL;
>>>>>> +       bool matched = false;
>>>>>> +       uint64_t bitmap = 0;
>>>>>> +       u16 qwords = 0;
>>>>>> +       int reg_idx;
>>>>>> +
>>>>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>>>>>> +               if (strcasecmp(s, r->name))
>>>>>> +                       continue;
>>>>>> +               if (!fls64(r->mask))
>>>>>> +                       continue;
>>>>>> +               reg_idx = fls64(r->mask) - 1;
>>>>>> +               if (intr)
>>>>>> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>> +               else
>>>>>> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>> +               matched = true;
>>>>>> +               break;
>>>>>> +       }
>>>>>> +
>>>>>> +       /* Just need the highest qwords */
>>>>>> +       if (qwords > opts->sample_vec_regs_qwords) {
>>>>>> +               opts->sample_vec_regs_qwords = qwords;
>>>>>> +               if (intr)
>>>>>> +                       opts->sample_intr_vec_regs = bitmap;
>>>>>> +               else
>>>>>> +                       opts->sample_user_vec_regs = bitmap;
>>>>>> +       }
>>>>>> +
>>>>>> +       return matched;
>>>>>> +}
>>>>>> +
>>>>>> +static bool __parse_pred_regs(struct record_opts *opts, char *s, bool intr)
>>>>>> +{
>>>>>> +       const struct sample_reg *r = NULL;
>>>>>> +       bool matched = false;
>>>>>> +       uint64_t bitmap = 0;
>>>>>> +       u16 qwords = 0;
>>>>>> +       int reg_idx;
>>>>>> +
>>>>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>>>>>> +               if (strcasecmp(s, r->name))
>>>>>> +                       continue;
>>>>>> +               if (!fls64(r->mask))
>>>>>> +                       continue;
>>>>>> +               reg_idx = fls64(r->mask) - 1;
>>>>>> +               if (intr)
>>>>>> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>> +               else
>>>>>> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>> +               matched = true;
>>>>>> +               break;
>>>>>> +       }
>>>>>> +
>>>>>> +       /* Just need the highest qwords */
>>>>>> +       if (qwords > opts->sample_pred_regs_qwords) {
>>>>>> +               opts->sample_pred_regs_qwords = qwords;
>>>>>> +               if (intr)
>>>>>> +                       opts->sample_intr_pred_regs = bitmap;
>>>>>> +               else
>>>>>> +                       opts->sample_user_pred_regs = bitmap;
>>>>>> +       }
>>>>>> +
>>>>>> +       return matched;
>>>>>> +}
>>>>>>
>>>>>>  static int
>>>>>>  __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>  {
>>>>>>         uint64_t *mode = (uint64_t *)opt->value;
>>>>>>         const struct sample_reg *r = NULL;
>>>>>> +       struct record_opts *opts;
>>>>>>         char *s, *os = NULL, *p;
>>>>>> -       int ret = -1;
>>>>>> +       bool has_simd_regs = false;
>>>>>>         uint64_t mask;
>>>>>> +       uint64_t simd_mask;
>>>>>> +       uint64_t pred_mask;
>>>>>> +       int ret = -1;
>>>>>>
>>>>>>         if (unset)
>>>>>>                 return 0;
>>>>>> @@ -27,10 +147,17 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>         if (*mode)
>>>>>>                 return -1;
>>>>>>
>>>>>> -       if (intr)
>>>>>> +       if (intr) {
>>>>>> +               opts = container_of(opt->value, struct record_opts, sample_intr_regs);
>>>>>>                 mask = arch__intr_reg_mask();
>>>>>> -       else
>>>>>> +               simd_mask = arch__intr_simd_reg_mask();
>>>>>> +               pred_mask = arch__intr_pred_reg_mask();
>>>>>> +       } else {
>>>>>> +               opts = container_of(opt->value, struct record_opts, sample_user_regs);
>>>>>>                 mask = arch__user_reg_mask();
>>>>>> +               simd_mask = arch__user_simd_reg_mask();
>>>>>> +               pred_mask = arch__user_pred_reg_mask();
>>>>>> +       }
>>>>>>
>>>>>>         /* str may be NULL in case no arg is passed to -I */
>>>>>>         if (str) {
>>>>>> @@ -50,10 +177,24 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>                                         if (r->mask & mask)
>>>>>>                                                 fprintf(stderr, "%s ", r->name);
>>>>>>                                 }
>>>>>> +                               __print_simd_regs(intr, simd_mask);
>>>>>> +                               __print_pred_regs(intr, pred_mask);
>>>>>>                                 fputc('\n', stderr);
>>>>>>                                 /* just printing available regs */
>>>>>>                                 goto error;
>>>>>>                         }
>>>>>> +
>>>>>> +                       if (simd_mask) {
>>>>>> +                               has_simd_regs = __parse_simd_regs(opts, s, intr);
>>>>>> +                               if (has_simd_regs)
>>>>>> +                                       goto next;
>>>>>> +                       }
>>>>>> +                       if (pred_mask) {
>>>>>> +                               has_simd_regs = __parse_pred_regs(opts, s, intr);
>>>>>> +                               if (has_simd_regs)
>>>>>> +                                       goto next;
>>>>>> +                       }
>>>>>> +
>>>>>>                         for (r = arch__sample_reg_masks(); r->name; r++) {
>>>>>>                                 if ((r->mask & mask) && !strcasecmp(s, r->name))
>>>>>>                                         break;
>>>>>> @@ -65,7 +206,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>                         }
>>>>>>
>>>>>>                         *mode |= r->mask;
>>>>>> -
>>>>>> +next:
>>>>>>                         if (!p)
>>>>>>                                 break;
>>>>>>
>>>>>> @@ -75,7 +216,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>         ret = 0;
>>>>>>
>>>>>>         /* default to all possible regs */
>>>>>> -       if (*mode == 0)
>>>>>> +       if (*mode == 0 && !has_simd_regs)
>>>>>>                 *mode = mask;
>>>>>>  error:
>>>>>>         free(os);
>>>>>> diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
>>>>>> index 66b666d9ce64..fb0366d050cf 100644
>>>>>> --- a/tools/perf/util/perf_event_attr_fprintf.c
>>>>>> +++ b/tools/perf/util/perf_event_attr_fprintf.c
>>>>>> @@ -360,6 +360,12 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
>>>>>>         PRINT_ATTRf(aux_start_paused, p_unsigned);
>>>>>>         PRINT_ATTRf(aux_pause, p_unsigned);
>>>>>>         PRINT_ATTRf(aux_resume, p_unsigned);
>>>>>> +       PRINT_ATTRf(sample_simd_pred_reg_qwords, p_unsigned);
>>>>>> +       PRINT_ATTRf(sample_simd_pred_reg_intr, p_hex);
>>>>>> +       PRINT_ATTRf(sample_simd_pred_reg_user, p_hex);
>>>>>> +       PRINT_ATTRf(sample_simd_vec_reg_qwords, p_unsigned);
>>>>>> +       PRINT_ATTRf(sample_simd_vec_reg_intr, p_hex);
>>>>>> +       PRINT_ATTRf(sample_simd_vec_reg_user, p_hex);
>>>>>>
>>>>>>         return ret;
>>>>>>  }
>>>>>> diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
>>>>>> index 44b90bbf2d07..e8a9fabc92e6 100644
>>>>>> --- a/tools/perf/util/perf_regs.c
>>>>>> +++ b/tools/perf/util/perf_regs.c
>>>>>> @@ -11,6 +11,11 @@ int __weak arch_sdt_arg_parse_op(char *old_op __maybe_unused,
>>>>>>         return SDT_ARG_SKIP;
>>>>>>  }
>>>>>>
>>>>>> +bool __weak arch_has_simd_regs(u64 mask __maybe_unused)
>>>>>> +{
>>>>>> +       return false;
>>>>>> +}
>>>>>> +
>>>>>>  uint64_t __weak arch__intr_reg_mask(void)
>>>>>>  {
>>>>>>         return 0;
>>>>>> @@ -21,6 +26,50 @@ uint64_t __weak arch__user_reg_mask(void)
>>>>>>         return 0;
>>>>>>  }
>>>>>>
>>>>>> +uint64_t __weak arch__intr_simd_reg_mask(void)
>>>>>> +{
>>>>>> +       return 0;
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t __weak arch__user_simd_reg_mask(void)
>>>>>> +{
>>>>>> +       return 0;
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t __weak arch__intr_pred_reg_mask(void)
>>>>>> +{
>>>>>> +       return 0;
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t __weak arch__user_pred_reg_mask(void)
>>>>>> +{
>>>>>> +       return 0;
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t __weak arch__intr_simd_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
>>>>>> +{
>>>>>> +       *qwords = 0;
>>>>>> +       return 0;
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t __weak arch__user_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>>>>>> +{
>>>>>> +       *qwords = 0;
>>>>>> +       return 0;
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t __weak arch__intr_pred_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
>>>>>> +{
>>>>>> +       *qwords = 0;
>>>>>> +       return 0;
>>>>>> +}
>>>>>> +
>>>>>> +uint64_t __weak arch__user_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>>>>>> +{
>>>>>> +       *qwords = 0;
>>>>>> +       return 0;
>>>>>> +}
>>>>>> +
>>>>>>  static const struct sample_reg sample_reg_masks[] = {
>>>>>>         SMPL_REG_END
>>>>>>  };
>>>>>> @@ -30,6 +79,16 @@ const struct sample_reg * __weak arch__sample_reg_masks(void)
>>>>>>         return sample_reg_masks;
>>>>>>  }
>>>>>>
>>>>>> +const struct sample_reg * __weak arch__sample_simd_reg_masks(void)
>>>>>> +{
>>>>>> +       return sample_reg_masks;
>>>>>> +}
>>>>>> +
>>>>>> +const struct sample_reg * __weak arch__sample_pred_reg_masks(void)
>>>>>> +{
>>>>>> +       return sample_reg_masks;
>>>>>> +}
>>>>>> +
>>>>>>  const char *perf_reg_name(int id, const char *arch)
>>>>>>  {
>>>>>>         const char *reg_name = NULL;
>>>>>> diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
>>>>>> index f2d0736d65cc..bce9c4cfd1bf 100644
>>>>>> --- a/tools/perf/util/perf_regs.h
>>>>>> +++ b/tools/perf/util/perf_regs.h
>>>>>> @@ -24,9 +24,20 @@ enum {
>>>>>>  };
>>>>>>
>>>>>>  int arch_sdt_arg_parse_op(char *old_op, char **new_op);
>>>>>> +bool arch_has_simd_regs(u64 mask);
>>>>>>  uint64_t arch__intr_reg_mask(void);
>>>>>>  uint64_t arch__user_reg_mask(void);
>>>>>>  const struct sample_reg *arch__sample_reg_masks(void);
>>>>>> +const struct sample_reg *arch__sample_simd_reg_masks(void);
>>>>>> +const struct sample_reg *arch__sample_pred_reg_masks(void);
>>>>> I wonder we can remove these functions. perf_reg_name(int id, uint16_t
>>>>> e_machine) maps a perf register number and e_machine to a string. So
>>>>> the sample_reg array could be replaced with:
>>>>> ```
>>>>> for (int perf_reg = 0; perf_reg < 64; perf_reg++) {
>>>>>   uint64_t mask = 1LL << perf_reg;
>>>>>   const char *name = perf_reg_name(perf_reg, EM_HOST);
>>>>>   if (name == NULL)
>>>>>     break;
>>>>>   // use mask and name
>>>>> ```
>>>>> To make it work for SIMD and PRED then I guess we need to iterate
>>>>> through the ABIs of enum perf_sample_regs_abi.
>>>> Suppose so.
>>>>
>>>>
>>>>>> +uint64_t arch__intr_simd_reg_mask(void);
>>>>>> +uint64_t arch__user_simd_reg_mask(void);
>>>>>> +uint64_t arch__intr_pred_reg_mask(void);
>>>>>> +uint64_t arch__user_pred_reg_mask(void);
>>>>> I think some comments would be useful here like:
>>>>> ```
>>>>> /* Perf register bit map with valid bits for
>>>>> perf_event_attr.sample_regs_user. */
>>>>> uint64_t arch__intr_reg_mask(void);
>>>>> /* Perf register bit map with valid bits for
>>>>> perf_event_attr.sample_regs_intr. */
>>>>> uint64_t arch__user_reg_mask(void);
>>>>> /* Perf register bit map with valid bits for
>>>>> perf_event_attr.sample_simd_vec_reg_intr. */
>>>>> uint64_t arch__intr_simd_reg_mask(void);
>>>>> /* Perf register bit map with valid bits for
>>>>> perf_event_attr.sample_simd_vec_reg_user. */
>>>>> uint64_t arch__user_simd_reg_mask(void);
>>>>> /* Perf register bit map with valid bits for
>>>>> perf_event_attr.sample_simd_pred_reg_intr. */
>>>>> uint64_t arch__intr_pred_reg_mask(void);
>>>>> /* Perf register bit map with valid bits for
>>>>> perf_event_attr.sample_simd_pred_reg_user. */
>>>>> uint64_t arch__user_pred_reg_mask(void);
>>>> Sure. Thanks.
>>>>
>>>>
>>>>> ```
>>>>>
>>>>> Why do the arch__user_pred_reg_mask return a uint64_t when the
>>>>> perf_event_attr variable is a __u32?
>>>> Suppose it's a bug. :)
>>>>
>>>>
>>>>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>> I don't understand this function. The qwords is specific to a
>>>>> perf_event_attr. We could have an evlist with an evsel set up to
>>>>> sample say XMM registers and another evsel set up to sample ZMM
>>>>> registers. Are the qwords here always for the ZMM case, or is XMM,
>>>>> YMM, ZMM depending on architecture support? Why does it vary per
>>>>> register? The surrounding code uses the term mask but here bitmap is
>>>>> used, is the inconsistency deliberate? Why are there user and intr
>>>>> functions when in the perf_event_attr there are only
>>>>> sample_simd_pred_reg_qwords and sample_simd_vec_reg_qwords variables?
>>>> These 4 functions are designed to get the bitmask and qwords length for a
>>>> specific kind of SIMD register. E.g., for XMM on x86 platforms, the returned
>>>> bitmask is 0xffff (xmm0 ~ xmm15) and the qwords length is 2 (128 bits). For
>>>> ZMM on x86 platforms, if the platform only supports 16 ZMM registers, then
>>>> the returned bitmask is 0xffff (zmm0 ~ zmm15) and the qwords length is 8 (512
>>>> bits). If the platform supports 32 ZMM registers, then the returned bitmask
>>>> is 0xffffffff (zmm0 ~ zmm31) and the qwords length is 8 (512 bits).
>>> What is the meaning of reg? In this file it is normally the integer
>>> index for a bit in the sample_regs_user mask, but for x86 I don't see
>>> enum perf_event_x86_regs having differing XMM, YMM and ZMM encodings.
>>> Similarly, is qwords an out argument, but then you also have the
>>> bitmap. It looks like the code is caching values but that assumes a
>>> single qword length for all events.
>> Yes, the "reg" argument indicates the SIMD register index. Strictly
>> speaking, for the x86 platform the qwords length is fixed for a specific
>> SIMD register and only the register number could vary; e.g., some
>> platforms could support only 16 ZMM registers, while others could
>> support 32 ZMM registers. But considering this is a generic function for
>> all kinds of archs, we can't ensure there is a fixed length for a
>> specific SIMD register on every arch, so I introduced the "qwords"
>> argument to increase the flexibility.
> I'm still not understanding this :-) What is a "SIMD register
> index", the file is for perf registers and naturally enum
> perf_event_x86_regs on x86, but that doesn't encode YMM and ZMM
> registers. Perhaps you can give some examples?

Yes, it's something just like the register index in the enum
perf_event_x86_regs, e.g., the index of the AX register is PERF_REG_X86_AX,
the index of BX is PERF_REG_X86_BX, and so on.

But the difference is that each index in perf_event_x86_regs can only
represent a single u64 word. Assume we still want to represent the SIMD
registers with the perf_event_x86_regs enum; then each XMM register would
need 2 indexes, each YMM register 4 indexes and each ZMM register 8
indexes. Considering there are 16 XMM registers, 16 YMM registers and 32
ZMM registers, the enum perf_event_x86_regs would become quite large, and
correspondingly the sample_regs_intr/sample_regs_user fields in
perf_event_attr would have to grow far beyond their 64 bits, consuming
much more memory.

So that's why we introduce the new attributes below.

+	union {
+		__u16	sample_simd_regs_enabled;
+		__u16	sample_simd_pred_reg_qwords;
+	};
+	__u32	sample_simd_pred_reg_intr;
+	__u32	sample_simd_pred_reg_user;
+	__u16	sample_simd_vec_reg_qwords;
+	__u64	sample_simd_vec_reg_intr;
+	__u64	sample_simd_vec_reg_user;
+	__u32	__reserved_4;

For SIMD registers, each kind of SIMD register would be treated as a
whole. The sample_simd_vec_reg_qwords field is used to identify the
length of a SIMD register; it simultaneously hints at which kind of SIMD
register it is, since the length of each kind of SIMD register is
different. E.g., suppose we want to sample the XMM registers: we know
there are 16 XMM registers on the x86 platform and the qwords length of
an XMM register is 2. So user space needs to set the attributes like this:

sample_simd_vec_reg_intr = 0xffff;

sample_simd_vec_reg_qwords = 2;

Coming back to the "reg" argument: there could be multiple kinds of SIMD
registers supported on a given arch, e.g., x86 supports the XMM, YMM, ZMM
and OPMASK SIMD registers. As each kind of SIMD register is always sampled
as a whole, we don't need to represent each individual SIMD register, like
XMM0 or XMM1, but we do need to distinguish the different kinds of SIMD
registers, like XMM vs. YMM, since they differ in register length and
number.

That's why we define the index for each kind of SIMD register, like below,

+enum {
+	PERF_REG_X86_XMM,
+	PERF_REG_X86_YMM,
+	PERF_REG_X86_ZMM,
+	PERF_REG_X86_MAX_SIMD_REGS,
+
+	PERF_REG_X86_OPMASK = 0,
+	PERF_REG_X86_MAX_PRED_REGS = 1,
+};

It's similar to perf_event_x86_regs, but each index represents a kind of
SIMD register instead of a specific SIMD register.


>
> How does the generic differing qword per register case get encoded
> into a perf_event_attr? If it can't be then this seems like
> functionality for no benefit. I also don't understand how the data in
> the PERF_SAMPLE_REGS_USER part of a sample could be decoded as that is
> assuming a constant qword number.
>
>> No, qwords would be assigned the true register length if the register
>> exists on the platform, e.g., xmm = 2, ymm = 4 and zmm = 8. If the
>> register is not supported on the platform, qwords would be set to 0.
> So it is a max function of the vector/pred qwords supported on the architecture.

Strictly speaking, it's not a "max" function of the vector/pred qwords;
it's just a function to get the exact vector/pred qwords supported on the
architecture, since the qwords length won't vary for a fixed kind of SIMD
register.


>
>>>> Since the qword length is always fixed for any given SIMD register,
>>>> regardless of intr or user, there is only one
>>>> sample_simd_pred_reg_qwords or sample_simd_vec_reg_qwords variable.
>>> Ok.  2 variables, but 4 functions here. I think there should just be 2
>>> because of this.
>> Yes, the user and intr variants would be merged into only one.
> Thanks,
> Ian
>
>>> Thanks,
>>> Ian
>>>
>>>>> Perhaps these functions should be something more like:
>>>>> ```
>>>>> /* Maximum value that can be assigned to
>>>>> perf_event_atttr.sample_simd_pred_reg_qwords. */
>>>>> uint16_t arch__simd_pred_reg_qwords_max(void);
>>>>> /* Maximum value that can be assigned to
>>>>> perf_event_atttr.sample_simd_vec_reg_qwords. */
>>>>> uint16_t arch__simd_vec_reg_qwords_max(void);
>>>>> ```
>>>>> Then the bitmap computation logic can all be moved into parse-regs-options.c.
>>>>>
>>>>> Thanks,
>>>>> Ian
>>>>>
>>>>>>  const char *perf_reg_name(int id, const char *arch);
>>>>>>  int perf_reg_value(u64 *valp, struct regs_dump *regs, int id);
>>>>>> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
>>>>>> index ea3a6c4657ee..825ffb4cc53f 100644
>>>>>> --- a/tools/perf/util/record.h
>>>>>> +++ b/tools/perf/util/record.h
>>>>>> @@ -59,7 +59,13 @@ struct record_opts {
>>>>>>         unsigned int  user_freq;
>>>>>>         u64           branch_stack;
>>>>>>         u64           sample_intr_regs;
>>>>>> +       u64           sample_intr_vec_regs;
>>>>>>         u64           sample_user_regs;
>>>>>> +       u64           sample_user_vec_regs;
>>>>>> +       u16           sample_pred_regs_qwords;
>>>>>> +       u16           sample_vec_regs_qwords;
>>>>>> +       u16           sample_intr_pred_regs;
>>>>>> +       u16           sample_user_pred_regs;
>>>>>>         u64           default_interval;
>>>>>>         u64           user_interval;
>>>>>>         size_t        auxtrace_snapshot_size;
>>>>>> --
>>>>>> 2.34.1
>>>>>>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
  2026-01-21  7:52             ` Mi, Dapeng
@ 2026-01-21 14:48               ` Ian Rogers
  2026-01-22  1:49                 ` Mi, Dapeng
  0 siblings, 1 reply; 86+ messages in thread
From: Ian Rogers @ 2026-01-21 14:48 UTC (permalink / raw)
  To: Mi, Dapeng
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang

On Tue, Jan 20, 2026 at 11:52 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>
>
> On 1/21/2026 3:09 PM, Ian Rogers wrote:
> > On Tue, Jan 20, 2026 at 9:17 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> >>
> >> On 1/21/2026 2:20 AM, Ian Rogers wrote:
> >>> On Tue, Jan 20, 2026 at 1:04 AM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> >>>> On 1/20/2026 3:39 PM, Ian Rogers wrote:
> >>>>> On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
> >>>>>> From: Kan Liang <kan.liang@linux.intel.com>
> >>>>>>
> >>>>>> This patch adds support for the newly introduced SIMD register sampling
> >>>>>> format by adding the following functions:
> >>>>>>
> >>>>>> uint64_t arch__intr_simd_reg_mask(void);
> >>>>>> uint64_t arch__user_simd_reg_mask(void);
> >>>>>> uint64_t arch__intr_pred_reg_mask(void);
> >>>>>> uint64_t arch__user_pred_reg_mask(void);
> >>>>>> uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>> uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>> uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>> uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>>
> >>>>>> The arch__{intr|user}_simd_reg_mask() functions retrieve the bitmap of
> >>>>>> supported SIMD registers, such as XMM/YMM/ZMM on x86 platforms.
> >>>>>>
> >>>>>> The arch__{intr|user}_pred_reg_mask() functions retrieve the bitmap of
> >>>>>> supported PRED registers, such as OPMASK on x86 platforms.
> >>>>>>
> >>>>>> The arch__{intr|user}_simd_reg_bitmap_qwords() functions provide the
> >>>>>> exact bitmap and number of qwords for a specific type of SIMD register.
> >>>>>> For example, for XMM registers on x86 platforms, the returned bitmap is
> >>>>>> 0xffff (XMM0 ~ XMM15) and the qwords number is 2 (128 bits for each XMM).
> >>>>>>
> >>>>>> The arch__{intr|user}_pred_reg_bitmap_qwords() functions provide the
> >>>>>> exact bitmap and number of qwords for a specific type of PRED register.
> >>>>>> For example, for OPMASK registers on x86 platforms, the returned bitmap
> >>>>>> is 0xff (OPMASK0 ~ OPMASK7) and the qwords number is 1 (64 bits for each
> >>>>>> OPMASK).
> >>>>>>
> >>>>>> Additionally, the function __parse_regs() is enhanced to support parsing
> >>>>>> these newly introduced SIMD registers. Currently, each type of register
> >>>>>> can only be sampled collectively; sampling a specific SIMD register is
> >>>>>> not supported. For example, all XMM registers are sampled together rather
> >>>>>> than sampling only XMM0.
> >>>>>>
> >>>>>> When multiple overlapping register types, such as XMM and YMM, are
> >>>>>> sampled simultaneously, only the superset (YMM registers) is sampled.
> >>>>>>
> >>>>>> With this patch, all supported sampling registers on x86 platforms are
> >>>>>> displayed as follows.
> >>>>>>
> >>>>>>  $perf record -I?
> >>>>>>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
> >>>>>>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
> >>>>>>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
> >>>>>>
> >>>>>>  $perf record --user-regs=?
> >>>>>>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
> >>>>>>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
> >>>>>>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
> >>>>>>
> >>>>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> >>>>>> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >>>>>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >>>>>> ---
> >>>>>>  tools/perf/arch/x86/util/perf_regs.c      | 470 +++++++++++++++++++++-
> >>>>>>  tools/perf/util/evsel.c                   |  27 ++
> >>>>>>  tools/perf/util/parse-regs-options.c      | 151 ++++++-
> >>>>>>  tools/perf/util/perf_event_attr_fprintf.c |   6 +
> >>>>>>  tools/perf/util/perf_regs.c               |  59 +++
> >>>>>>  tools/perf/util/perf_regs.h               |  11 +
> >>>>>>  tools/perf/util/record.h                  |   6 +
> >>>>>>  7 files changed, 714 insertions(+), 16 deletions(-)
> >>>>>>
> >>>>>> diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
> >>>>>> index 12fd93f04802..db41430f3b07 100644
> >>>>>> --- a/tools/perf/arch/x86/util/perf_regs.c
> >>>>>> +++ b/tools/perf/arch/x86/util/perf_regs.c
> >>>>>> @@ -13,6 +13,49 @@
> >>>>>>  #include "../../../util/pmu.h"
> >>>>>>  #include "../../../util/pmus.h"
> >>>>>>
> >>>>>> +static const struct sample_reg sample_reg_masks_ext[] = {
> >>>>>> +       SMPL_REG(AX, PERF_REG_X86_AX),
> >>>>>> +       SMPL_REG(BX, PERF_REG_X86_BX),
> >>>>>> +       SMPL_REG(CX, PERF_REG_X86_CX),
> >>>>>> +       SMPL_REG(DX, PERF_REG_X86_DX),
> >>>>>> +       SMPL_REG(SI, PERF_REG_X86_SI),
> >>>>>> +       SMPL_REG(DI, PERF_REG_X86_DI),
> >>>>>> +       SMPL_REG(BP, PERF_REG_X86_BP),
> >>>>>> +       SMPL_REG(SP, PERF_REG_X86_SP),
> >>>>>> +       SMPL_REG(IP, PERF_REG_X86_IP),
> >>>>>> +       SMPL_REG(FLAGS, PERF_REG_X86_FLAGS),
> >>>>>> +       SMPL_REG(CS, PERF_REG_X86_CS),
> >>>>>> +       SMPL_REG(SS, PERF_REG_X86_SS),
> >>>>>> +#ifdef HAVE_ARCH_X86_64_SUPPORT
> >>>>>> +       SMPL_REG(R8, PERF_REG_X86_R8),
> >>>>>> +       SMPL_REG(R9, PERF_REG_X86_R9),
> >>>>>> +       SMPL_REG(R10, PERF_REG_X86_R10),
> >>>>>> +       SMPL_REG(R11, PERF_REG_X86_R11),
> >>>>>> +       SMPL_REG(R12, PERF_REG_X86_R12),
> >>>>>> +       SMPL_REG(R13, PERF_REG_X86_R13),
> >>>>>> +       SMPL_REG(R14, PERF_REG_X86_R14),
> >>>>>> +       SMPL_REG(R15, PERF_REG_X86_R15),
> >>>>>> +       SMPL_REG(R16, PERF_REG_X86_R16),
> >>>>>> +       SMPL_REG(R17, PERF_REG_X86_R17),
> >>>>>> +       SMPL_REG(R18, PERF_REG_X86_R18),
> >>>>>> +       SMPL_REG(R19, PERF_REG_X86_R19),
> >>>>>> +       SMPL_REG(R20, PERF_REG_X86_R20),
> >>>>>> +       SMPL_REG(R21, PERF_REG_X86_R21),
> >>>>>> +       SMPL_REG(R22, PERF_REG_X86_R22),
> >>>>>> +       SMPL_REG(R23, PERF_REG_X86_R23),
> >>>>>> +       SMPL_REG(R24, PERF_REG_X86_R24),
> >>>>>> +       SMPL_REG(R25, PERF_REG_X86_R25),
> >>>>>> +       SMPL_REG(R26, PERF_REG_X86_R26),
> >>>>>> +       SMPL_REG(R27, PERF_REG_X86_R27),
> >>>>>> +       SMPL_REG(R28, PERF_REG_X86_R28),
> >>>>>> +       SMPL_REG(R29, PERF_REG_X86_R29),
> >>>>>> +       SMPL_REG(R30, PERF_REG_X86_R30),
> >>>>>> +       SMPL_REG(R31, PERF_REG_X86_R31),
> >>>>>> +       SMPL_REG(SSP, PERF_REG_X86_SSP),
> >>>>>> +#endif
> >>>>>> +       SMPL_REG_END
> >>>>>> +};
> >>>>>> +
> >>>>>>  static const struct sample_reg sample_reg_masks[] = {
> >>>>>>         SMPL_REG(AX, PERF_REG_X86_AX),
> >>>>>>         SMPL_REG(BX, PERF_REG_X86_BX),
> >>>>>> @@ -276,27 +319,404 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
> >>>>>>         return SDT_ARG_VALID;
> >>>>>>  }
> >>>>>>
> >>>>>> +static bool support_simd_reg(u64 sample_type, u16 qwords, u64 mask, bool pred)
> >>>>>> +{
> >>>>>> +       struct perf_event_attr attr = {
> >>>>>> +               .type                           = PERF_TYPE_HARDWARE,
> >>>>>> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
> >>>>>> +               .sample_type                    = sample_type,
> >>>>>> +               .disabled                       = 1,
> >>>>>> +               .exclude_kernel                 = 1,
> >>>>>> +               .sample_simd_regs_enabled       = 1,
> >>>>>> +       };
> >>>>>> +       int fd;
> >>>>>> +
> >>>>>> +       attr.sample_period = 1;
> >>>>>> +
> >>>>>> +       if (!pred) {
> >>>>>> +               attr.sample_simd_vec_reg_qwords = qwords;
> >>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>>>> +                       attr.sample_simd_vec_reg_intr = mask;
> >>>>>> +               else
> >>>>>> +                       attr.sample_simd_vec_reg_user = mask;
> >>>>>> +       } else {
> >>>>>> +               attr.sample_simd_pred_reg_qwords = PERF_X86_OPMASK_QWORDS;
> >>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>>>> +                       attr.sample_simd_pred_reg_intr = PERF_X86_SIMD_PRED_MASK;
> >>>>>> +               else
> >>>>>> +                       attr.sample_simd_pred_reg_user = PERF_X86_SIMD_PRED_MASK;
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       if (perf_pmus__num_core_pmus() > 1) {
> >>>>>> +               struct perf_pmu *pmu = NULL;
> >>>>>> +               __u64 type = PERF_TYPE_RAW;
> >>>>>> +
> >>>>>> +               /*
> >>>>>> +                * The same register set is supported among different hybrid PMUs.
> >>>>>> +                * Only check the first available one.
> >>>>>> +                */
> >>>>>> +               while ((pmu = perf_pmus__scan_core(pmu)) != NULL) {
> >>>>>> +                       type = pmu->type;
> >>>>>> +                       break;
> >>>>>> +               }
> >>>>>> +               attr.config |= type << PERF_PMU_TYPE_SHIFT;
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       event_attr_init(&attr);
> >>>>>> +
> >>>>>> +       fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
> >>>>>> +       if (fd != -1) {
> >>>>>> +               close(fd);
> >>>>>> +               return true;
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       return false;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static bool __arch_simd_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
> >>>>>> +{
> >>>>>> +       bool supported = false;
> >>>>>> +       u64 bits;
> >>>>>> +
> >>>>>> +       *mask = 0;
> >>>>>> +       *qwords = 0;
> >>>>>> +
> >>>>>> +       switch (reg) {
> >>>>>> +       case PERF_REG_X86_XMM:
> >>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
> >>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_XMM_QWORDS, bits, false);
> >>>>>> +               if (supported) {
> >>>>>> +                       *mask = bits;
> >>>>>> +                       *qwords = PERF_X86_XMM_QWORDS;
> >>>>>> +               }
> >>>>>> +               break;
> >>>>>> +       case PERF_REG_X86_YMM:
> >>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_YMM_REGS) - 1;
> >>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_YMM_QWORDS, bits, false);
> >>>>>> +               if (supported) {
> >>>>>> +                       *mask = bits;
> >>>>>> +                       *qwords = PERF_X86_YMM_QWORDS;
> >>>>>> +               }
> >>>>>> +               break;
> >>>>>> +       case PERF_REG_X86_ZMM:
> >>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_ZMM_REGS) - 1;
> >>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
> >>>>>> +               if (supported) {
> >>>>>> +                       *mask = bits;
> >>>>>> +                       *qwords = PERF_X86_ZMM_QWORDS;
> >>>>>> +                       break;
> >>>>>> +               }
> >>>>>> +
> >>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_ZMMH_REGS) - 1;
> >>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
> >>>>>> +               if (supported) {
> >>>>>> +                       *mask = bits;
> >>>>>> +                       *qwords = PERF_X86_ZMMH_QWORDS;
> >>>>>> +               }
> >>>>>> +               break;
> >>>>>> +       default:
> >>>>>> +               break;
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       return supported;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static bool __arch_pred_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
> >>>>>> +{
> >>>>>> +       bool supported = false;
> >>>>>> +       u64 bits;
> >>>>>> +
> >>>>>> +       *mask = 0;
> >>>>>> +       *qwords = 0;
> >>>>>> +
> >>>>>> +       switch (reg) {
> >>>>>> +       case PERF_REG_X86_OPMASK:
> >>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_OPMASK_REGS) - 1;
> >>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_OPMASK_QWORDS, bits, true);
> >>>>>> +               if (supported) {
> >>>>>> +                       *mask = bits;
> >>>>>> +                       *qwords = PERF_X86_OPMASK_QWORDS;
> >>>>>> +               }
> >>>>>> +               break;
> >>>>>> +       default:
> >>>>>> +               break;
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       return supported;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static bool has_cap_simd_regs(void)
> >>>>>> +{
> >>>>>> +       uint64_t mask = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
> >>>>>> +       u16 qwords = PERF_X86_XMM_QWORDS;
> >>>>>> +       static bool has_cap_simd_regs;
> >>>>>> +       static bool cached;
> >>>>>> +
> >>>>>> +       if (cached)
> >>>>>> +               return has_cap_simd_regs;
> >>>>>> +
> >>>>>> +       has_cap_simd_regs = __arch_simd_reg_mask(PERF_SAMPLE_REGS_INTR,
> >>>>>> +                                                PERF_REG_X86_XMM, &mask, &qwords);
> >>>>>> +       has_cap_simd_regs |= __arch_simd_reg_mask(PERF_SAMPLE_REGS_USER,
> >>>>>> +                                                PERF_REG_X86_XMM, &mask, &qwords);
> >>>>>> +       cached = true;
> >>>>>> +
> >>>>>> +       return has_cap_simd_regs;
> >>>>>> +}
> >>>>>> +
> >>>>>> +bool arch_has_simd_regs(u64 mask)
> >>>>>> +{
> >>>>>> +       return has_cap_simd_regs() &&
> >>>>>> +              mask & GENMASK_ULL(PERF_REG_X86_SSP, PERF_REG_X86_R16);
> >>>>>> +}
> >>>>>> +
> >>>>>> +static const struct sample_reg sample_simd_reg_masks[] = {
> >>>>>> +       SMPL_REG(XMM, PERF_REG_X86_XMM),
> >>>>>> +       SMPL_REG(YMM, PERF_REG_X86_YMM),
> >>>>>> +       SMPL_REG(ZMM, PERF_REG_X86_ZMM),
> >>>>>> +       SMPL_REG_END
> >>>>>> +};
> >>>>>> +
> >>>>>> +static const struct sample_reg sample_pred_reg_masks[] = {
> >>>>>> +       SMPL_REG(OPMASK, PERF_REG_X86_OPMASK),
> >>>>>> +       SMPL_REG_END
> >>>>>> +};
> >>>>>> +
> >>>>>> +const struct sample_reg *arch__sample_simd_reg_masks(void)
> >>>>>> +{
> >>>>>> +       return sample_simd_reg_masks;
> >>>>>> +}
> >>>>>> +
> >>>>>> +const struct sample_reg *arch__sample_pred_reg_masks(void)
> >>>>>> +{
> >>>>>> +       return sample_pred_reg_masks;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static bool x86_intr_simd_updated;
> >>>>>> +static u64 x86_intr_simd_reg_mask;
> >>>>>> +static u64 x86_intr_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
> >>>>>> +static u16 x86_intr_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
> >>>>>> +static bool x86_user_simd_updated;
> >>>>>> +static u64 x86_user_simd_reg_mask;
> >>>>>> +static u64 x86_user_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
> >>>>>> +static u16 x86_user_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
> >>>>>> +
> >>>>>> +static bool x86_intr_pred_updated;
> >>>>>> +static u64 x86_intr_pred_reg_mask;
> >>>>>> +static u64 x86_intr_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
> >>>>>> +static u16 x86_intr_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
> >>>>>> +static bool x86_user_pred_updated;
> >>>>>> +static u64 x86_user_pred_reg_mask;
> >>>>>> +static u64 x86_user_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
> >>>>>> +static u16 x86_user_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
> >>>>>> +
> >>>>>> +static uint64_t __arch__simd_reg_mask(u64 sample_type)
> >>>>>> +{
> >>>>>> +       const struct sample_reg *r = NULL;
> >>>>>> +       bool supported;
> >>>>>> +       u64 mask = 0;
> >>>>>> +       int reg;
> >>>>>> +
> >>>>>> +       if (!has_cap_simd_regs())
> >>>>>> +               return 0;
> >>>>>> +
> >>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_simd_updated)
> >>>>>> +               return x86_intr_simd_reg_mask;
> >>>>>> +
> >>>>>> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_simd_updated)
> >>>>>> +               return x86_user_simd_reg_mask;
> >>>>>> +
> >>>>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> >>>>>> +               supported = false;
> >>>>>> +
> >>>>>> +               if (!r->mask)
> >>>>>> +                       continue;
> >>>>>> +               reg = fls64(r->mask) - 1;
> >>>>>> +
> >>>>>> +               if (reg >= PERF_REG_X86_MAX_SIMD_REGS)
> >>>>>> +                       break;
> >>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>>>> +                       supported = __arch_simd_reg_mask(sample_type, reg,
> >>>>>> +                                                        &x86_intr_simd_mask[reg],
> >>>>>> +                                                        &x86_intr_simd_qwords[reg]);
> >>>>>> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
> >>>>>> +                       supported = __arch_simd_reg_mask(sample_type, reg,
> >>>>>> +                                                        &x86_user_simd_mask[reg],
> >>>>>> +                                                        &x86_user_simd_qwords[reg]);
> >>>>>> +               if (supported)
> >>>>>> +                       mask |= BIT_ULL(reg);
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
> >>>>>> +               x86_intr_simd_reg_mask = mask;
> >>>>>> +               x86_intr_simd_updated = true;
> >>>>>> +       } else {
> >>>>>> +               x86_user_simd_reg_mask = mask;
> >>>>>> +               x86_user_simd_updated = true;
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       return mask;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static uint64_t __arch__pred_reg_mask(u64 sample_type)
> >>>>>> +{
> >>>>>> +       const struct sample_reg *r = NULL;
> >>>>>> +       bool supported;
> >>>>>> +       u64 mask = 0;
> >>>>>> +       int reg;
> >>>>>> +
> >>>>>> +       if (!has_cap_simd_regs())
> >>>>>> +               return 0;
> >>>>>> +
> >>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_pred_updated)
> >>>>>> +               return x86_intr_pred_reg_mask;
> >>>>>> +
> >>>>>> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_pred_updated)
> >>>>>> +               return x86_user_pred_reg_mask;
> >>>>>> +
> >>>>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> >>>>>> +               supported = false;
> >>>>>> +
> >>>>>> +               if (!r->mask)
> >>>>>> +                       continue;
> >>>>>> +               reg = fls64(r->mask) - 1;
> >>>>>> +
> >>>>>> +               if (reg >= PERF_REG_X86_MAX_PRED_REGS)
> >>>>>> +                       break;
> >>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>>>> +                       supported = __arch_pred_reg_mask(sample_type, reg,
> >>>>>> +                                                        &x86_intr_pred_mask[reg],
> >>>>>> +                                                        &x86_intr_pred_qwords[reg]);
> >>>>>> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
> >>>>>> +                       supported = __arch_pred_reg_mask(sample_type, reg,
> >>>>>> +                                                        &x86_user_pred_mask[reg],
> >>>>>> +                                                        &x86_user_pred_qwords[reg]);
> >>>>>> +               if (supported)
> >>>>>> +                       mask |= BIT_ULL(reg);
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
> >>>>>> +               x86_intr_pred_reg_mask = mask;
> >>>>>> +               x86_intr_pred_updated = true;
> >>>>>> +       } else {
> >>>>>> +               x86_user_pred_reg_mask = mask;
> >>>>>> +               x86_user_pred_updated = true;
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       return mask;
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t arch__intr_simd_reg_mask(void)
> >>>>>> +{
> >>>>>> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_INTR);
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t arch__user_simd_reg_mask(void)
> >>>>>> +{
> >>>>>> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_USER);
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t arch__intr_pred_reg_mask(void)
> >>>>>> +{
> >>>>>> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_INTR);
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t arch__user_pred_reg_mask(void)
> >>>>>> +{
> >>>>>> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_USER);
> >>>>>> +}
> >>>>>> +
> >>>>>> +static uint64_t arch__simd_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
> >>>>>> +{
> >>>>>> +       uint64_t mask = 0;
> >>>>>> +
> >>>>>> +       *qwords = 0;
> >>>>>> +       if (reg < PERF_REG_X86_MAX_SIMD_REGS) {
> >>>>>> +               if (intr) {
> >>>>>> +                       *qwords = x86_intr_simd_qwords[reg];
> >>>>>> +                       mask = x86_intr_simd_mask[reg];
> >>>>>> +               } else {
> >>>>>> +                       *qwords = x86_user_simd_qwords[reg];
> >>>>>> +                       mask = x86_user_simd_mask[reg];
> >>>>>> +               }
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       return mask;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static uint64_t arch__pred_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
> >>>>>> +{
> >>>>>> +       uint64_t mask = 0;
> >>>>>> +
> >>>>>> +       *qwords = 0;
> >>>>>> +       if (reg < PERF_REG_X86_MAX_PRED_REGS) {
> >>>>>> +               if (intr) {
> >>>>>> +                       *qwords = x86_intr_pred_qwords[reg];
> >>>>>> +                       mask = x86_intr_pred_mask[reg];
> >>>>>> +               } else {
> >>>>>> +                       *qwords = x86_user_pred_qwords[reg];
> >>>>>> +                       mask = x86_user_pred_mask[reg];
> >>>>>> +               }
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       return mask;
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords)
> >>>>>> +{
> >>>>>> +       if (!x86_intr_simd_updated)
> >>>>>> +               arch__intr_simd_reg_mask();
> >>>>>> +       return arch__simd_reg_bitmap_qwords(reg, qwords, true);
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords)
> >>>>>> +{
> >>>>>> +       if (!x86_user_simd_updated)
> >>>>>> +               arch__user_simd_reg_mask();
> >>>>>> +       return arch__simd_reg_bitmap_qwords(reg, qwords, false);
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords)
> >>>>>> +{
> >>>>>> +       if (!x86_intr_pred_updated)
> >>>>>> +               arch__intr_pred_reg_mask();
> >>>>>> +       return arch__pred_reg_bitmap_qwords(reg, qwords, true);
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords)
> >>>>>> +{
> >>>>>> +       if (!x86_user_pred_updated)
> >>>>>> +               arch__user_pred_reg_mask();
> >>>>>> +       return arch__pred_reg_bitmap_qwords(reg, qwords, false);
> >>>>>> +}
> >>>>>> +
> >>>>>>  const struct sample_reg *arch__sample_reg_masks(void)
> >>>>>>  {
> >>>>>> +       if (has_cap_simd_regs())
> >>>>>> +               return sample_reg_masks_ext;
> >>>>>>         return sample_reg_masks;
> >>>>>>  }
> >>>>>>
> >>>>>> -uint64_t arch__intr_reg_mask(void)
> >>>>>> +static uint64_t __arch__reg_mask(u64 sample_type, u64 mask, bool has_simd_regs)
> >>>>>>  {
> >>>>>>         struct perf_event_attr attr = {
> >>>>>> -               .type                   = PERF_TYPE_HARDWARE,
> >>>>>> -               .config                 = PERF_COUNT_HW_CPU_CYCLES,
> >>>>>> -               .sample_type            = PERF_SAMPLE_REGS_INTR,
> >>>>>> -               .sample_regs_intr       = PERF_REG_EXTENDED_MASK,
> >>>>>> -               .precise_ip             = 1,
> >>>>>> -               .disabled               = 1,
> >>>>>> -               .exclude_kernel         = 1,
> >>>>>> +               .type                           = PERF_TYPE_HARDWARE,
> >>>>>> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
> >>>>>> +               .sample_type                    = sample_type,
> >>>>>> +               .precise_ip                     = 1,
> >>>>>> +               .disabled                       = 1,
> >>>>>> +               .exclude_kernel                 = 1,
> >>>>>> +               .sample_simd_regs_enabled       = has_simd_regs,
> >>>>>>         };
> >>>>>>         int fd;
> >>>>>>         /*
> >>>>>>          * In an unnamed union, init it here to build on older gcc versions
> >>>>>>          */
> >>>>>>         attr.sample_period = 1;
> >>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>>>> +               attr.sample_regs_intr = mask;
> >>>>>> +       else
> >>>>>> +               attr.sample_regs_user = mask;
> >>>>>>
> >>>>>>         if (perf_pmus__num_core_pmus() > 1) {
> >>>>>>                 struct perf_pmu *pmu = NULL;
> >>>>>> @@ -318,13 +738,41 @@ uint64_t arch__intr_reg_mask(void)
> >>>>>>         fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
> >>>>>>         if (fd != -1) {
> >>>>>>                 close(fd);
> >>>>>> -               return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
> >>>>>> +               return mask;
> >>>>>>         }
> >>>>>>
> >>>>>> -       return PERF_REGS_MASK;
> >>>>>> +       return 0;
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t arch__intr_reg_mask(void)
> >>>>>> +{
> >>>>>> +       uint64_t mask = PERF_REGS_MASK;
> >>>>>> +
> >>>>>> +       if (has_cap_simd_regs()) {
> >>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
> >>>>>> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
> >>>>>> +                                        true);
> >>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
> >>>>>> +                                        BIT_ULL(PERF_REG_X86_SSP),
> >>>>>> +                                        true);
> >>>>>> +       } else
> >>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR, PERF_REG_EXTENDED_MASK, false);
> >>>>>> +
> >>>>>> +       return mask;
> >>>>>>  }
> >>>>>>
> >>>>>>  uint64_t arch__user_reg_mask(void)
> >>>>>>  {
> >>>>>> -       return PERF_REGS_MASK;
> >>>>>> +       uint64_t mask = PERF_REGS_MASK;
> >>>>>> +
> >>>>>> +       if (has_cap_simd_regs()) {
> >>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
> >>>>>> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
> >>>>>> +                                        true);
> >>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
> >>>>>> +                                        BIT_ULL(PERF_REG_X86_SSP),
> >>>>>> +                                        true);
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       return mask;
> >>>>>>  }
> >>>>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> >>>>>> index 56ebefd075f2..5d1d90cf9488 100644
> >>>>>> --- a/tools/perf/util/evsel.c
> >>>>>> +++ b/tools/perf/util/evsel.c
> >>>>>> @@ -1461,12 +1461,39 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
> >>>>>>         if (opts->sample_intr_regs && !evsel->no_aux_samples &&
> >>>>>>             !evsel__is_dummy_event(evsel)) {
> >>>>>>                 attr->sample_regs_intr = opts->sample_intr_regs;
> >>>>>> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_intr);
> >>>>>> +               evsel__set_sample_bit(evsel, REGS_INTR);
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       if ((opts->sample_intr_vec_regs || opts->sample_intr_pred_regs) &&
> >>>>>> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
> >>>>>> +               /* A non-zero pred qwords implies the SIMD register set is in use */
> >>>>>> +               if (opts->sample_pred_regs_qwords)
> >>>>>> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
> >>>>>> +               else
> >>>>>> +                       attr->sample_simd_pred_reg_qwords = 1;
> >>>>>> +               attr->sample_simd_vec_reg_intr = opts->sample_intr_vec_regs;
> >>>>>> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
> >>>>>> +               attr->sample_simd_pred_reg_intr = opts->sample_intr_pred_regs;
> >>>>>>                 evsel__set_sample_bit(evsel, REGS_INTR);
> >>>>>>         }
> >>>>>>
> >>>>>>         if (opts->sample_user_regs && !evsel->no_aux_samples &&
> >>>>>>             !evsel__is_dummy_event(evsel)) {
> >>>>>>                 attr->sample_regs_user |= opts->sample_user_regs;
> >>>>>> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_user);
> >>>>>> +               evsel__set_sample_bit(evsel, REGS_USER);
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       if ((opts->sample_user_vec_regs || opts->sample_user_pred_regs) &&
> >>>>>> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
> >>>>>> +               if (opts->sample_pred_regs_qwords)
> >>>>>> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
> >>>>>> +               else
> >>>>>> +                       attr->sample_simd_pred_reg_qwords = 1;
> >>>>>> +               attr->sample_simd_vec_reg_user = opts->sample_user_vec_regs;
> >>>>>> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
> >>>>>> +               attr->sample_simd_pred_reg_user = opts->sample_user_pred_regs;
> >>>>>>                 evsel__set_sample_bit(evsel, REGS_USER);
> >>>>>>         }
> >>>>>>
> >>>>>> diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
> >>>>>> index cda1c620968e..0bd100392889 100644
> >>>>>> --- a/tools/perf/util/parse-regs-options.c
> >>>>>> +++ b/tools/perf/util/parse-regs-options.c
> >>>>>> @@ -4,19 +4,139 @@
> >>>>>>  #include <stdint.h>
> >>>>>>  #include <string.h>
> >>>>>>  #include <stdio.h>
> >>>>>> +#include <linux/bitops.h>
> >>>>>>  #include "util/debug.h"
> >>>>>>  #include <subcmd/parse-options.h>
> >>>>>>  #include "util/perf_regs.h"
> >>>>>>  #include "util/parse-regs-options.h"
> >>>>>> +#include "record.h"
> >>>>>> +
> >>>>>> +static void __print_simd_regs(bool intr, uint64_t simd_mask)
> >>>>>> +{
> >>>>>> +       const struct sample_reg *r = NULL;
> >>>>>> +       uint64_t bitmap = 0;
> >>>>>> +       u16 qwords = 0;
> >>>>>> +       int reg_idx;
> >>>>>> +
> >>>>>> +       if (!simd_mask)
> >>>>>> +               return;
> >>>>>> +
> >>>>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> >>>>>> +               if (!(r->mask & simd_mask))
> >>>>>> +                       continue;
> >>>>>> +               reg_idx = fls64(r->mask) - 1;
> >>>>>> +               if (intr)
> >>>>>> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>> +               else
> >>>>>> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>> +               if (bitmap)
> >>>>>> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
> >>>>>> +       }
> >>>>>> +}
> >>>>>> +
> >>>>>> +static void __print_pred_regs(bool intr, uint64_t pred_mask)
> >>>>>> +{
> >>>>>> +       const struct sample_reg *r = NULL;
> >>>>>> +       uint64_t bitmap = 0;
> >>>>>> +       u16 qwords = 0;
> >>>>>> +       int reg_idx;
> >>>>>> +
> >>>>>> +       if (!pred_mask)
> >>>>>> +               return;
> >>>>>> +
> >>>>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> >>>>>> +               if (!(r->mask & pred_mask))
> >>>>>> +                       continue;
> >>>>>> +               reg_idx = fls64(r->mask) - 1;
> >>>>>> +               if (intr)
> >>>>>> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>> +               else
> >>>>>> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>> +               if (bitmap)
> >>>>>> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
> >>>>>> +       }
> >>>>>> +}
> >>>>>> +
> >>>>>> +static bool __parse_simd_regs(struct record_opts *opts, char *s, bool intr)
> >>>>>> +{
> >>>>>> +       const struct sample_reg *r = NULL;
> >>>>>> +       bool matched = false;
> >>>>>> +       uint64_t bitmap = 0;
> >>>>>> +       u16 qwords = 0;
> >>>>>> +       int reg_idx;
> >>>>>> +
> >>>>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> >>>>>> +               if (strcasecmp(s, r->name))
> >>>>>> +                       continue;
> >>>>>> +               if (!fls64(r->mask))
> >>>>>> +                       continue;
> >>>>>> +               reg_idx = fls64(r->mask) - 1;
> >>>>>> +               if (intr)
> >>>>>> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>> +               else
> >>>>>> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>> +               matched = true;
> >>>>>> +               break;
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       /* Just need the highest qwords */
> >>>>>> +       if (qwords > opts->sample_vec_regs_qwords) {
> >>>>>> +               opts->sample_vec_regs_qwords = qwords;
> >>>>>> +               if (intr)
> >>>>>> +                       opts->sample_intr_vec_regs = bitmap;
> >>>>>> +               else
> >>>>>> +                       opts->sample_user_vec_regs = bitmap;
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       return matched;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static bool __parse_pred_regs(struct record_opts *opts, char *s, bool intr)
> >>>>>> +{
> >>>>>> +       const struct sample_reg *r = NULL;
> >>>>>> +       bool matched = false;
> >>>>>> +       uint64_t bitmap = 0;
> >>>>>> +       u16 qwords = 0;
> >>>>>> +       int reg_idx;
> >>>>>> +
> >>>>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> >>>>>> +               if (strcasecmp(s, r->name))
> >>>>>> +                       continue;
> >>>>>> +               if (!fls64(r->mask))
> >>>>>> +                       continue;
> >>>>>> +               reg_idx = fls64(r->mask) - 1;
> >>>>>> +               if (intr)
> >>>>>> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>> +               else
> >>>>>> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>> +               matched = true;
> >>>>>> +               break;
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       /* Just need the highest qwords */
> >>>>>> +       if (qwords > opts->sample_pred_regs_qwords) {
> >>>>>> +               opts->sample_pred_regs_qwords = qwords;
> >>>>>> +               if (intr)
> >>>>>> +                       opts->sample_intr_pred_regs = bitmap;
> >>>>>> +               else
> >>>>>> +                       opts->sample_user_pred_regs = bitmap;
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       return matched;
> >>>>>> +}
> >>>>>>
> >>>>>>  static int
> >>>>>>  __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>>>  {
> >>>>>>         uint64_t *mode = (uint64_t *)opt->value;
> >>>>>>         const struct sample_reg *r = NULL;
> >>>>>> +       struct record_opts *opts;
> >>>>>>         char *s, *os = NULL, *p;
> >>>>>> -       int ret = -1;
> >>>>>> +       bool has_simd_regs = false;
> >>>>>>         uint64_t mask;
> >>>>>> +       uint64_t simd_mask;
> >>>>>> +       uint64_t pred_mask;
> >>>>>> +       int ret = -1;
> >>>>>>
> >>>>>>         if (unset)
> >>>>>>                 return 0;
> >>>>>> @@ -27,10 +147,17 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>>>         if (*mode)
> >>>>>>                 return -1;
> >>>>>>
> >>>>>> -       if (intr)
> >>>>>> +       if (intr) {
> >>>>>> +               opts = container_of(opt->value, struct record_opts, sample_intr_regs);
> >>>>>>                 mask = arch__intr_reg_mask();
> >>>>>> -       else
> >>>>>> +               simd_mask = arch__intr_simd_reg_mask();
> >>>>>> +               pred_mask = arch__intr_pred_reg_mask();
> >>>>>> +       } else {
> >>>>>> +               opts = container_of(opt->value, struct record_opts, sample_user_regs);
> >>>>>>                 mask = arch__user_reg_mask();
> >>>>>> +               simd_mask = arch__user_simd_reg_mask();
> >>>>>> +               pred_mask = arch__user_pred_reg_mask();
> >>>>>> +       }
> >>>>>>
> >>>>>>         /* str may be NULL in case no arg is passed to -I */
> >>>>>>         if (str) {
> >>>>>> @@ -50,10 +177,24 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>>>                                         if (r->mask & mask)
> >>>>>>                                                 fprintf(stderr, "%s ", r->name);
> >>>>>>                                 }
> >>>>>> +                               __print_simd_regs(intr, simd_mask);
> >>>>>> +                               __print_pred_regs(intr, pred_mask);
> >>>>>>                                 fputc('\n', stderr);
> >>>>>>                                 /* just printing available regs */
> >>>>>>                                 goto error;
> >>>>>>                         }
> >>>>>> +
> >>>>>> +                       if (simd_mask) {
> >>>>>> +                               has_simd_regs = __parse_simd_regs(opts, s, intr);
> >>>>>> +                               if (has_simd_regs)
> >>>>>> +                                       goto next;
> >>>>>> +                       }
> >>>>>> +                       if (pred_mask) {
> >>>>>> +                               has_simd_regs = __parse_pred_regs(opts, s, intr);
> >>>>>> +                               if (has_simd_regs)
> >>>>>> +                                       goto next;
> >>>>>> +                       }
> >>>>>> +
> >>>>>>                         for (r = arch__sample_reg_masks(); r->name; r++) {
> >>>>>>                                 if ((r->mask & mask) && !strcasecmp(s, r->name))
> >>>>>>                                         break;
> >>>>>> @@ -65,7 +206,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>>>                         }
> >>>>>>
> >>>>>>                         *mode |= r->mask;
> >>>>>> -
> >>>>>> +next:
> >>>>>>                         if (!p)
> >>>>>>                                 break;
> >>>>>>
> >>>>>> @@ -75,7 +216,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>>>         ret = 0;
> >>>>>>
> >>>>>>         /* default to all possible regs */
> >>>>>> -       if (*mode == 0)
> >>>>>> +       if (*mode == 0 && !has_simd_regs)
> >>>>>>                 *mode = mask;
> >>>>>>  error:
> >>>>>>         free(os);
> >>>>>> diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
> >>>>>> index 66b666d9ce64..fb0366d050cf 100644
> >>>>>> --- a/tools/perf/util/perf_event_attr_fprintf.c
> >>>>>> +++ b/tools/perf/util/perf_event_attr_fprintf.c
> >>>>>> @@ -360,6 +360,12 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
> >>>>>>         PRINT_ATTRf(aux_start_paused, p_unsigned);
> >>>>>>         PRINT_ATTRf(aux_pause, p_unsigned);
> >>>>>>         PRINT_ATTRf(aux_resume, p_unsigned);
> >>>>>> +       PRINT_ATTRf(sample_simd_pred_reg_qwords, p_unsigned);
> >>>>>> +       PRINT_ATTRf(sample_simd_pred_reg_intr, p_hex);
> >>>>>> +       PRINT_ATTRf(sample_simd_pred_reg_user, p_hex);
> >>>>>> +       PRINT_ATTRf(sample_simd_vec_reg_qwords, p_unsigned);
> >>>>>> +       PRINT_ATTRf(sample_simd_vec_reg_intr, p_hex);
> >>>>>> +       PRINT_ATTRf(sample_simd_vec_reg_user, p_hex);
> >>>>>>
> >>>>>>         return ret;
> >>>>>>  }
> >>>>>> diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
> >>>>>> index 44b90bbf2d07..e8a9fabc92e6 100644
> >>>>>> --- a/tools/perf/util/perf_regs.c
> >>>>>> +++ b/tools/perf/util/perf_regs.c
> >>>>>> @@ -11,6 +11,11 @@ int __weak arch_sdt_arg_parse_op(char *old_op __maybe_unused,
> >>>>>>         return SDT_ARG_SKIP;
> >>>>>>  }
> >>>>>>
> >>>>>> +bool __weak arch_has_simd_regs(u64 mask __maybe_unused)
> >>>>>> +{
> >>>>>> +       return false;
> >>>>>> +}
> >>>>>> +
> >>>>>>  uint64_t __weak arch__intr_reg_mask(void)
> >>>>>>  {
> >>>>>>         return 0;
> >>>>>> @@ -21,6 +26,50 @@ uint64_t __weak arch__user_reg_mask(void)
> >>>>>>         return 0;
> >>>>>>  }
> >>>>>>
> >>>>>> +uint64_t __weak arch__intr_simd_reg_mask(void)
> >>>>>> +{
> >>>>>> +       return 0;
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t __weak arch__user_simd_reg_mask(void)
> >>>>>> +{
> >>>>>> +       return 0;
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t __weak arch__intr_pred_reg_mask(void)
> >>>>>> +{
> >>>>>> +       return 0;
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t __weak arch__user_pred_reg_mask(void)
> >>>>>> +{
> >>>>>> +       return 0;
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t __weak arch__intr_simd_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
> >>>>>> +{
> >>>>>> +       *qwords = 0;
> >>>>>> +       return 0;
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t __weak arch__user_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> >>>>>> +{
> >>>>>> +       *qwords = 0;
> >>>>>> +       return 0;
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t __weak arch__intr_pred_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
> >>>>>> +{
> >>>>>> +       *qwords = 0;
> >>>>>> +       return 0;
> >>>>>> +}
> >>>>>> +
> >>>>>> +uint64_t __weak arch__user_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> >>>>>> +{
> >>>>>> +       *qwords = 0;
> >>>>>> +       return 0;
> >>>>>> +}
> >>>>>> +
> >>>>>>  static const struct sample_reg sample_reg_masks[] = {
> >>>>>>         SMPL_REG_END
> >>>>>>  };
> >>>>>> @@ -30,6 +79,16 @@ const struct sample_reg * __weak arch__sample_reg_masks(void)
> >>>>>>         return sample_reg_masks;
> >>>>>>  }
> >>>>>>
> >>>>>> +const struct sample_reg * __weak arch__sample_simd_reg_masks(void)
> >>>>>> +{
> >>>>>> +       return sample_reg_masks;
> >>>>>> +}
> >>>>>> +
> >>>>>> +const struct sample_reg * __weak arch__sample_pred_reg_masks(void)
> >>>>>> +{
> >>>>>> +       return sample_reg_masks;
> >>>>>> +}
> >>>>>> +
> >>>>>>  const char *perf_reg_name(int id, const char *arch)
> >>>>>>  {
> >>>>>>         const char *reg_name = NULL;
> >>>>>> diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
> >>>>>> index f2d0736d65cc..bce9c4cfd1bf 100644
> >>>>>> --- a/tools/perf/util/perf_regs.h
> >>>>>> +++ b/tools/perf/util/perf_regs.h
> >>>>>> @@ -24,9 +24,20 @@ enum {
> >>>>>>  };
> >>>>>>
> >>>>>>  int arch_sdt_arg_parse_op(char *old_op, char **new_op);
> >>>>>> +bool arch_has_simd_regs(u64 mask);
> >>>>>>  uint64_t arch__intr_reg_mask(void);
> >>>>>>  uint64_t arch__user_reg_mask(void);
> >>>>>>  const struct sample_reg *arch__sample_reg_masks(void);
> >>>>>> +const struct sample_reg *arch__sample_simd_reg_masks(void);
> >>>>>> +const struct sample_reg *arch__sample_pred_reg_masks(void);
> >>>>> I wonder if we can remove these functions. perf_reg_name(int id, uint16_t
> >>>>> e_machine) maps a perf register number and e_machine to a string. So
> >>>>> the sample_reg array could be replaced with:
> >>>>> ```
> >>>>> for (int perf_reg = 0; perf_reg < 64; perf_reg++) {
> >>>>>   uint64_t mask = 1LL << perf_reg;
> >>>>>   const char *name = perf_reg_name(perf_reg, EM_HOST);
> >>>>>   if (name == NULL)
> >>>>>     break;
> >>>>>   // use mask and name
> >>>>> }
> >>>>> ```
> >>>>> To make it work for SIMD and PRED then I guess we need to iterate
> >>>>> through the ABIs of enum perf_sample_regs_abi.
> >>>> Suppose so.
> >>>>
> >>>>
> >>>>>> +uint64_t arch__intr_simd_reg_mask(void);
> >>>>>> +uint64_t arch__user_simd_reg_mask(void);
> >>>>>> +uint64_t arch__intr_pred_reg_mask(void);
> >>>>>> +uint64_t arch__user_pred_reg_mask(void);
> >>>>> I think some comments would be useful here like:
> >>>>> ```
> >>>>> /* Perf register bit map with valid bits for
> >>>>> perf_event_attr.sample_regs_intr. */
> >>>>> uint64_t arch__intr_reg_mask(void);
> >>>>> /* Perf register bit map with valid bits for
> >>>>> perf_event_attr.sample_regs_user. */
> >>>>> uint64_t arch__user_reg_mask(void);
> >>>>> /* Perf register bit map with valid bits for
> >>>>> perf_event_attr.sample_simd_vec_reg_intr. */
> >>>>> uint64_t arch__intr_simd_reg_mask(void);
> >>>>> /* Perf register bit map with valid bits for
> >>>>> perf_event_attr.sample_simd_vec_reg_user. */
> >>>>> uint64_t arch__user_simd_reg_mask(void);
> >>>>> /* Perf register bit map with valid bits for
> >>>>> perf_event_attr.sample_simd_pred_reg_intr. */
> >>>>> uint64_t arch__intr_pred_reg_mask(void);
> >>>>> /* Perf register bit map with valid bits for
> >>>>> perf_event_attr.sample_simd_pred_reg_user. */
> >>>>> uint64_t arch__user_pred_reg_mask(void);
> >>>> Sure. Thanks.
> >>>>
> >>>>
> >>>>> ```
> >>>>>
> >>>>> Why do the arch__user_pred_reg_mask return a uint64_t when the
> >>>>> perf_event_attr variable is a __u32?
> >>>> Suppose it's a bug. :)
> >>>>
> >>>>
> >>>>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>> I don't understand this function. The qwords value is specific to a
> >>>>> perf_event_attr. We could have an evlist with an evsel set up to
> >>>>> sample say XMM registers and another evsel set up to sample ZMM
> >>>>> registers. Are the qwords here always for the ZMM case, or is XMM,
> >>>>> YMM, ZMM depending on architecture support? Why does it vary per
> >>>>> register? The surrounding code uses the term mask but here bitmap is
> >>>>> used; is the inconsistency deliberate? Why are there user and intr
> >>>>> functions when in the perf_event_attr there are only
> >>>>> sample_simd_pred_reg_qwords and sample_simd_vec_reg_qwords variables?
> >>>> These 4 functions are designed to get the bitmask and qwords length for a
> >>>> specific kind of SIMD register. E.g., for XMM on x86 platforms, the
> >>>> returned bitmask is 0xffff (xmm0 ~ xmm15) and the qwords length is 2 (128
> >>>> bits). For ZMM on x86 platforms, if the platform only supports 16 ZMM
> >>>> registers, the returned bitmask is 0xffff (zmm0 ~ zmm15) and the qwords
> >>>> length is 8 (512 bits). If the platform supports 32 ZMM registers, the
> >>>> returned bitmask is 0xffffffff (zmm0 ~ zmm31) and the qwords length is
> >>>> still 8 (512 bits).
> >>> What is the meaning of reg? In this file it is normally the integer
> >>> index for a bit in the sample_regs_user mask, but for x86 I don't see
> >>> enum perf_event_x86_regs having differing XMM, YMM and ZMM encodings.
> >>> Similarly, is qwords an out argument? But then you also have the
> >>> bitmap. It looks like the code is caching values, but that assumes a
> >>> single qword length for all events.
> >> Yes, the "reg" argument indicates the SIMD register index. Strictly
> >> speaking for x86 platform, the qwords length is fixed for a specific SIMD
> >> register and only the register number could vary, e.g., some platforms
> >> could only support 16 ZMM registers, but some other platforms could support
> >> 32 ZMM registers. But considering this is a generic function for all kinds
> >> of archs, we can't ensure there are fixed length for a specific SIMD
> >> register on any arch, so I introduce  the "qwords" argument to increase the
> >> flexibility.
> > I'm still not understanding this :-) What is a "SIMD register
> > index"? The file is for perf registers, and naturally enum
> > perf_event_x86_regs on x86, but that doesn't encode YMM and ZMM
> > registers. Perhaps you can give some examples?
>
> Yes, it's something just like the register index in the enum
> perf_event_x86_regs, e.g. the index of AX register is PERF_REG_X86_AX, the
> index of BX is PERF_REG_X86_BX, and so on.
>
> But the difference is that each index in perf_event_x86_regs can only
> represent one u64 word. If we still wanted to represent the SIMD registers
> with the perf_event_x86_regs enum, each XMM register would need 2 indexes,
> each YMM register 4 indexes and each ZMM register 8 indexes. Considering
> there are 16 XMM registers, 16 YMM registers and 32 ZMM registers,
> representing all of these would make the enum perf_event_x86_regs quite
> large, and correspondingly the sample_regs_intr/sample_regs_user fields in
> the perf_event_attr would have to inflate a lot. That would consume much
> more memory.
>
> So that's why we introduce the new below attributes.
>
> +	union {
> +		__u16 sample_simd_regs_enabled;
> +		__u16 sample_simd_pred_reg_qwords;
> +	};
> +	__u32 sample_simd_pred_reg_intr;
> +	__u32 sample_simd_pred_reg_user;
> +	__u16 sample_simd_vec_reg_qwords;
> +	__u64 sample_simd_vec_reg_intr;
> +	__u64 sample_simd_vec_reg_user;
> +	__u32 __reserved_4;
>
> For SIMD registers, each kind of SIMD register is treated as a whole. The
> sample_simd_vec_reg_qwords field identifies the length of the SIMD
> register; at the same time it also hints which kind of SIMD register it
> is, since each kind of SIMD register has a different length. E.g., to
> sample XMM registers: we know there are 16 XMM registers on the x86
> platform and the qwords length of an XMM register is 2, so user space
> needs to set the attributes like this,
>
> sample_simd_vec_reg_intr = 0xffff;
> sample_simd_vec_reg_qwords = 2;
>
> Coming back to the "reg" argument: an arch may support multiple kinds of
> SIMD registers, e.g., x86 supports XMM, YMM, ZMM and OPMASK SIMD
> registers. As each kind of SIMD register is always sampled as a whole, we
> don't need to represent each individual SIMD register, like XMM0 or XMM1,
> but we do need to distinguish the different kinds of SIMD register, like
> XMM vs. YMM, since they differ in register length and number.
>
> That's why we define the index for each kind of SIMD register, like below,
>
> +enum {
> +	PERF_REG_X86_XMM,
> +	PERF_REG_X86_YMM,
> +	PERF_REG_X86_ZMM,
> +	PERF_REG_X86_MAX_SIMD_REGS,
> +
> +	PERF_REG_X86_OPMASK = 0,
> +	PERF_REG_X86_MAX_PRED_REGS = 1,
> +};
>
> It's similar to perf_event_x86_regs, but each index represents a kind of
> SIMD register instead of a specific SIMD register.

Could you give me an example call to say
arch__intr_simd_reg_bitmap_qwords where you say what the value of reg
is, what the expected value of qwords is and what the result will be?
Could you do it for say a model without AVX, a model with AVX, a model
with AVX512 and a model with APX.

I have looked at the code and read the changes to perf_event_attr,
which is why I was confused by your saying that ZMM could be passed in
as a perf register number. I am confused as to why, when the
perf_event_attr has 2 qword-length related variables, this code seems
to be setting things up so that every register can have a qword
length. I'm also confused about what is happening with the return
value of this function. As values are being stored into global
variables, and you are saying they aren't a max value, how does this
impact the setting up of multiple register sampling events?

Thanks,
Ian

> >
> > How does the generic differing qword per register case get encoded
> > into a perf_event_attr? If it can't be then this seems like
> > functionality for no benefit. I also don't understand how the data in
> > the PERF_SAMPLE_REGS_USER part of a sample could be decoded as that is
> > assuming a constant qword number.
> >
> >> No, the qwords would be assigned the true register length if the register
> >> exists on the platform, e.g., xmm = 2, ymm = 4 and zmm = 8. If the
> >> register is not supported on the platform, the qwords would be set to 0.
> > So it is a max function of the vector/pred qwords supported on the architecture.
>
> Strictly speaking, it's not a "max" function of the vector/pred qwords;
> it's just a function to get the exact vector/pred qwords supported on the
> architecture, since the qwords length won't vary for a fixed kind of SIMD
> register.
>
>
> >
> >>>> Since the qword length is always fixed for any given SIMD register
> >>>> regardless of intr or user, there is only one
> >>>> sample_simd_pred_reg_qwords or sample_simd_vec_reg_qwords variable.
> >>> Ok.  2 variables, but 4 functions here. I think there should just be 2
> >>> because of this.
> >> Yes, the user and intr variants would be merged into only one.
> > Thanks,
> > Ian
> >
> >>> Thanks,
> >>> Ian
> >>>
> >>>>> Perhaps these functions should be something more like:
> >>>>> ```
> >>>>> /* Maximum value that can be assigned to
> >>>>> perf_event_atttr.sample_simd_pred_reg_qwords. */
> >>>>> uint16_t arch__simd_pred_reg_qwords_max(void);
> >>>>> /* Maximum value that can be assigned to
> >>>>> perf_event_atttr.sample_simd_vec_reg_qwords. */
> >>>>> uint16_t arch__simd_vec_reg_qwords_max(void);
> >>>>> ```
> >>>>> Then the bitmap computation logic can all be moved into parse-regs-options.c.
> >>>>>
> >>>>> Thanks,
> >>>>> Ian
> >>>>>
> >>>>>>  const char *perf_reg_name(int id, const char *arch);
> >>>>>>  int perf_reg_value(u64 *valp, struct regs_dump *regs, int id);
> >>>>>> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
> >>>>>> index ea3a6c4657ee..825ffb4cc53f 100644
> >>>>>> --- a/tools/perf/util/record.h
> >>>>>> +++ b/tools/perf/util/record.h
> >>>>>> @@ -59,7 +59,13 @@ struct record_opts {
> >>>>>>         unsigned int  user_freq;
> >>>>>>         u64           branch_stack;
> >>>>>>         u64           sample_intr_regs;
> >>>>>> +       u64           sample_intr_vec_regs;
> >>>>>>         u64           sample_user_regs;
> >>>>>> +       u64           sample_user_vec_regs;
> >>>>>> +       u16           sample_pred_regs_qwords;
> >>>>>> +       u16           sample_vec_regs_qwords;
> >>>>>> +       u16           sample_intr_pred_regs;
> >>>>>> +       u16           sample_user_pred_regs;
> >>>>>>         u64           default_interval;
> >>>>>>         u64           user_interval;
> >>>>>>         size_t        auxtrace_snapshot_size;
> >>>>>> --
> >>>>>> 2.34.1
> >>>>>>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
  2026-01-21 14:48               ` Ian Rogers
@ 2026-01-22  1:49                 ` Mi, Dapeng
  2026-01-22  7:27                   ` Ian Rogers
  0 siblings, 1 reply; 86+ messages in thread
From: Mi, Dapeng @ 2026-01-22  1:49 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang


On 1/21/2026 10:48 PM, Ian Rogers wrote:
> On Tue, Jan 20, 2026 at 11:52 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>
>> On 1/21/2026 3:09 PM, Ian Rogers wrote:
>>> On Tue, Jan 20, 2026 at 9:17 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>>> On 1/21/2026 2:20 AM, Ian Rogers wrote:
>>>>> On Tue, Jan 20, 2026 at 1:04 AM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>>>>> On 1/20/2026 3:39 PM, Ian Rogers wrote:
>>>>>>> On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
>>>>>>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>>>>>>
>>>>>>>> This patch adds support for the newly introduced SIMD register sampling
>>>>>>>> format by adding the following functions:
>>>>>>>>
>>>>>>>> uint64_t arch__intr_simd_reg_mask(void);
>>>>>>>> uint64_t arch__user_simd_reg_mask(void);
>>>>>>>> uint64_t arch__intr_pred_reg_mask(void);
>>>>>>>> uint64_t arch__user_pred_reg_mask(void);
>>>>>>>> uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>> uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>> uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>> uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>>
>>>>>>>> The arch__{intr|user}_simd_reg_mask() functions retrieve the bitmap of
>>>>>>>> supported SIMD registers, such as XMM/YMM/ZMM on x86 platforms.
>>>>>>>>
>>>>>>>> The arch__{intr|user}_pred_reg_mask() functions retrieve the bitmap of
>>>>>>>> supported PRED registers, such as OPMASK on x86 platforms.
>>>>>>>>
>>>>>>>> The arch__{intr|user}_simd_reg_bitmap_qwords() functions provide the
>>>>>>>> exact bitmap and number of qwords for a specific type of SIMD register.
>>>>>>>> For example, for XMM registers on x86 platforms, the returned bitmap is
>>>>>>>> 0xffff (XMM0 ~ XMM15) and the qwords number is 2 (128 bits for each XMM).
>>>>>>>>
>>>>>>>> The arch__{intr|user}_pred_reg_bitmap_qwords() functions provide the
>>>>>>>> exact bitmap and number of qwords for a specific type of PRED register.
>>>>>>>> For example, for OPMASK registers on x86 platforms, the returned bitmap
>>>>>>>> is 0xff (OPMASK0 ~ OPMASK7) and the qwords number is 1 (64 bits for each
>>>>>>>> OPMASK).
>>>>>>>>
>>>>>>>> Additionally, the function __parse_regs() is enhanced to support parsing
>>>>>>>> these newly introduced SIMD registers. Currently, each type of register
>>>>>>>> can only be sampled collectively; sampling a specific SIMD register is
>>>>>>>> not supported. For example, all XMM registers are sampled together rather
>>>>>>>> than sampling only XMM0.
>>>>>>>>
>>>>>>>> When multiple overlapping register types, such as XMM and YMM, are
>>>>>>>> sampled simultaneously, only the superset (YMM registers) is sampled.
>>>>>>>>
>>>>>>>> With this patch, all supported sampling registers on x86 platforms are
>>>>>>>> displayed as follows.
>>>>>>>>
>>>>>>>>  $perf record -I?
>>>>>>>>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>>>>>>>>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>>>>>>>>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>>>>>>>>
>>>>>>>>  $perf record --user-regs=?
>>>>>>>>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>>>>>>>>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>>>>>>>>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>>>>>>>>
>>>>>>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>>>>>>>> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>>>>>>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>>>>>>> ---
>>>>>>>>  tools/perf/arch/x86/util/perf_regs.c      | 470 +++++++++++++++++++++-
>>>>>>>>  tools/perf/util/evsel.c                   |  27 ++
>>>>>>>>  tools/perf/util/parse-regs-options.c      | 151 ++++++-
>>>>>>>>  tools/perf/util/perf_event_attr_fprintf.c |   6 +
>>>>>>>>  tools/perf/util/perf_regs.c               |  59 +++
>>>>>>>>  tools/perf/util/perf_regs.h               |  11 +
>>>>>>>>  tools/perf/util/record.h                  |   6 +
>>>>>>>>  7 files changed, 714 insertions(+), 16 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
>>>>>>>> index 12fd93f04802..db41430f3b07 100644
>>>>>>>> --- a/tools/perf/arch/x86/util/perf_regs.c
>>>>>>>> +++ b/tools/perf/arch/x86/util/perf_regs.c
>>>>>>>> @@ -13,6 +13,49 @@
>>>>>>>>  #include "../../../util/pmu.h"
>>>>>>>>  #include "../../../util/pmus.h"
>>>>>>>>
>>>>>>>> +static const struct sample_reg sample_reg_masks_ext[] = {
>>>>>>>> +       SMPL_REG(AX, PERF_REG_X86_AX),
>>>>>>>> +       SMPL_REG(BX, PERF_REG_X86_BX),
>>>>>>>> +       SMPL_REG(CX, PERF_REG_X86_CX),
>>>>>>>> +       SMPL_REG(DX, PERF_REG_X86_DX),
>>>>>>>> +       SMPL_REG(SI, PERF_REG_X86_SI),
>>>>>>>> +       SMPL_REG(DI, PERF_REG_X86_DI),
>>>>>>>> +       SMPL_REG(BP, PERF_REG_X86_BP),
>>>>>>>> +       SMPL_REG(SP, PERF_REG_X86_SP),
>>>>>>>> +       SMPL_REG(IP, PERF_REG_X86_IP),
>>>>>>>> +       SMPL_REG(FLAGS, PERF_REG_X86_FLAGS),
>>>>>>>> +       SMPL_REG(CS, PERF_REG_X86_CS),
>>>>>>>> +       SMPL_REG(SS, PERF_REG_X86_SS),
>>>>>>>> +#ifdef HAVE_ARCH_X86_64_SUPPORT
>>>>>>>> +       SMPL_REG(R8, PERF_REG_X86_R8),
>>>>>>>> +       SMPL_REG(R9, PERF_REG_X86_R9),
>>>>>>>> +       SMPL_REG(R10, PERF_REG_X86_R10),
>>>>>>>> +       SMPL_REG(R11, PERF_REG_X86_R11),
>>>>>>>> +       SMPL_REG(R12, PERF_REG_X86_R12),
>>>>>>>> +       SMPL_REG(R13, PERF_REG_X86_R13),
>>>>>>>> +       SMPL_REG(R14, PERF_REG_X86_R14),
>>>>>>>> +       SMPL_REG(R15, PERF_REG_X86_R15),
>>>>>>>> +       SMPL_REG(R16, PERF_REG_X86_R16),
>>>>>>>> +       SMPL_REG(R17, PERF_REG_X86_R17),
>>>>>>>> +       SMPL_REG(R18, PERF_REG_X86_R18),
>>>>>>>> +       SMPL_REG(R19, PERF_REG_X86_R19),
>>>>>>>> +       SMPL_REG(R20, PERF_REG_X86_R20),
>>>>>>>> +       SMPL_REG(R21, PERF_REG_X86_R21),
>>>>>>>> +       SMPL_REG(R22, PERF_REG_X86_R22),
>>>>>>>> +       SMPL_REG(R23, PERF_REG_X86_R23),
>>>>>>>> +       SMPL_REG(R24, PERF_REG_X86_R24),
>>>>>>>> +       SMPL_REG(R25, PERF_REG_X86_R25),
>>>>>>>> +       SMPL_REG(R26, PERF_REG_X86_R26),
>>>>>>>> +       SMPL_REG(R27, PERF_REG_X86_R27),
>>>>>>>> +       SMPL_REG(R28, PERF_REG_X86_R28),
>>>>>>>> +       SMPL_REG(R29, PERF_REG_X86_R29),
>>>>>>>> +       SMPL_REG(R30, PERF_REG_X86_R30),
>>>>>>>> +       SMPL_REG(R31, PERF_REG_X86_R31),
>>>>>>>> +       SMPL_REG(SSP, PERF_REG_X86_SSP),
>>>>>>>> +#endif
>>>>>>>> +       SMPL_REG_END
>>>>>>>> +};
>>>>>>>> +
>>>>>>>>  static const struct sample_reg sample_reg_masks[] = {
>>>>>>>>         SMPL_REG(AX, PERF_REG_X86_AX),
>>>>>>>>         SMPL_REG(BX, PERF_REG_X86_BX),
>>>>>>>> @@ -276,27 +319,404 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
>>>>>>>>         return SDT_ARG_VALID;
>>>>>>>>  }
>>>>>>>>
>>>>>>>> +static bool support_simd_reg(u64 sample_type, u16 qwords, u64 mask, bool pred)
>>>>>>>> +{
>>>>>>>> +       struct perf_event_attr attr = {
>>>>>>>> +               .type                           = PERF_TYPE_HARDWARE,
>>>>>>>> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
>>>>>>>> +               .sample_type                    = sample_type,
>>>>>>>> +               .disabled                       = 1,
>>>>>>>> +               .exclude_kernel                 = 1,
>>>>>>>> +               .sample_simd_regs_enabled       = 1,
>>>>>>>> +       };
>>>>>>>> +       int fd;
>>>>>>>> +
>>>>>>>> +       attr.sample_period = 1;
>>>>>>>> +
>>>>>>>> +       if (!pred) {
>>>>>>>> +               attr.sample_simd_vec_reg_qwords = qwords;
>>>>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>>>> +                       attr.sample_simd_vec_reg_intr = mask;
>>>>>>>> +               else
>>>>>>>> +                       attr.sample_simd_vec_reg_user = mask;
>>>>>>>> +       } else {
>>>>>>>> +               attr.sample_simd_pred_reg_qwords = PERF_X86_OPMASK_QWORDS;
>>>>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>>>> +                       attr.sample_simd_pred_reg_intr = PERF_X86_SIMD_PRED_MASK;
>>>>>>>> +               else
>>>>>>>> +                       attr.sample_simd_pred_reg_user = PERF_X86_SIMD_PRED_MASK;
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       if (perf_pmus__num_core_pmus() > 1) {
>>>>>>>> +               struct perf_pmu *pmu = NULL;
>>>>>>>> +               __u64 type = PERF_TYPE_RAW;
>>>>>>>> +
>>>>>>>> +               /*
>>>>>>>> +                * The same register set is supported among different hybrid PMUs.
>>>>>>>> +                * Only check the first available one.
>>>>>>>> +                */
>>>>>>>> +               while ((pmu = perf_pmus__scan_core(pmu)) != NULL) {
>>>>>>>> +                       type = pmu->type;
>>>>>>>> +                       break;
>>>>>>>> +               }
>>>>>>>> +               attr.config |= type << PERF_PMU_TYPE_SHIFT;
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       event_attr_init(&attr);
>>>>>>>> +
>>>>>>>> +       fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
>>>>>>>> +       if (fd != -1) {
>>>>>>>> +               close(fd);
>>>>>>>> +               return true;
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       return false;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static bool __arch_simd_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
>>>>>>>> +{
>>>>>>>> +       bool supported = false;
>>>>>>>> +       u64 bits;
>>>>>>>> +
>>>>>>>> +       *mask = 0;
>>>>>>>> +       *qwords = 0;
>>>>>>>> +
>>>>>>>> +       switch (reg) {
>>>>>>>> +       case PERF_REG_X86_XMM:
>>>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
>>>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_XMM_QWORDS, bits, false);
>>>>>>>> +               if (supported) {
>>>>>>>> +                       *mask = bits;
>>>>>>>> +                       *qwords = PERF_X86_XMM_QWORDS;
>>>>>>>> +               }
>>>>>>>> +               break;
>>>>>>>> +       case PERF_REG_X86_YMM:
>>>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_YMM_REGS) - 1;
>>>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_YMM_QWORDS, bits, false);
>>>>>>>> +               if (supported) {
>>>>>>>> +                       *mask = bits;
>>>>>>>> +                       *qwords = PERF_X86_YMM_QWORDS;
>>>>>>>> +               }
>>>>>>>> +               break;
>>>>>>>> +       case PERF_REG_X86_ZMM:
>>>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_ZMM_REGS) - 1;
>>>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
>>>>>>>> +               if (supported) {
>>>>>>>> +                       *mask = bits;
>>>>>>>> +                       *qwords = PERF_X86_ZMM_QWORDS;
>>>>>>>> +                       break;
>>>>>>>> +               }
>>>>>>>> +
>>>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_ZMMH_REGS) - 1;
>>>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
>>>>>>>> +               if (supported) {
>>>>>>>> +                       *mask = bits;
>>>>>>>> +                       *qwords = PERF_X86_ZMMH_QWORDS;
>>>>>>>> +               }
>>>>>>>> +               break;
>>>>>>>> +       default:
>>>>>>>> +               break;
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       return supported;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static bool __arch_pred_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
>>>>>>>> +{
>>>>>>>> +       bool supported = false;
>>>>>>>> +       u64 bits;
>>>>>>>> +
>>>>>>>> +       *mask = 0;
>>>>>>>> +       *qwords = 0;
>>>>>>>> +
>>>>>>>> +       switch (reg) {
>>>>>>>> +       case PERF_REG_X86_OPMASK:
>>>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_OPMASK_REGS) - 1;
>>>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_OPMASK_QWORDS, bits, true);
>>>>>>>> +               if (supported) {
>>>>>>>> +                       *mask = bits;
>>>>>>>> +                       *qwords = PERF_X86_OPMASK_QWORDS;
>>>>>>>> +               }
>>>>>>>> +               break;
>>>>>>>> +       default:
>>>>>>>> +               break;
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       return supported;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static bool has_cap_simd_regs(void)
>>>>>>>> +{
>>>>>>>> +       uint64_t mask = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
>>>>>>>> +       u16 qwords = PERF_X86_XMM_QWORDS;
>>>>>>>> +       static bool has_cap_simd_regs;
>>>>>>>> +       static bool cached;
>>>>>>>> +
>>>>>>>> +       if (cached)
>>>>>>>> +               return has_cap_simd_regs;
>>>>>>>> +
>>>>>>>> +       has_cap_simd_regs = __arch_simd_reg_mask(PERF_SAMPLE_REGS_INTR,
>>>>>>>> +                                                PERF_REG_X86_XMM, &mask, &qwords);
>>>>>>>> +       has_cap_simd_regs |= __arch_simd_reg_mask(PERF_SAMPLE_REGS_USER,
>>>>>>>> +                                                PERF_REG_X86_XMM, &mask, &qwords);
>>>>>>>> +       cached = true;
>>>>>>>> +
>>>>>>>> +       return has_cap_simd_regs;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +bool arch_has_simd_regs(u64 mask)
>>>>>>>> +{
>>>>>>>> +       return has_cap_simd_regs() &&
>>>>>>>> +              mask & GENMASK_ULL(PERF_REG_X86_SSP, PERF_REG_X86_R16);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static const struct sample_reg sample_simd_reg_masks[] = {
>>>>>>>> +       SMPL_REG(XMM, PERF_REG_X86_XMM),
>>>>>>>> +       SMPL_REG(YMM, PERF_REG_X86_YMM),
>>>>>>>> +       SMPL_REG(ZMM, PERF_REG_X86_ZMM),
>>>>>>>> +       SMPL_REG_END
>>>>>>>> +};
>>>>>>>> +
>>>>>>>> +static const struct sample_reg sample_pred_reg_masks[] = {
>>>>>>>> +       SMPL_REG(OPMASK, PERF_REG_X86_OPMASK),
>>>>>>>> +       SMPL_REG_END
>>>>>>>> +};
>>>>>>>> +
>>>>>>>> +const struct sample_reg *arch__sample_simd_reg_masks(void)
>>>>>>>> +{
>>>>>>>> +       return sample_simd_reg_masks;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +const struct sample_reg *arch__sample_pred_reg_masks(void)
>>>>>>>> +{
>>>>>>>> +       return sample_pred_reg_masks;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static bool x86_intr_simd_updated;
>>>>>>>> +static u64 x86_intr_simd_reg_mask;
>>>>>>>> +static u64 x86_intr_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
>>>>>>>> +static u16 x86_intr_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
>>>>>>>> +static bool x86_user_simd_updated;
>>>>>>>> +static u64 x86_user_simd_reg_mask;
>>>>>>>> +static u64 x86_user_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
>>>>>>>> +static u16 x86_user_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
>>>>>>>> +
>>>>>>>> +static bool x86_intr_pred_updated;
>>>>>>>> +static u64 x86_intr_pred_reg_mask;
>>>>>>>> +static u64 x86_intr_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
>>>>>>>> +static u16 x86_intr_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
>>>>>>>> +static bool x86_user_pred_updated;
>>>>>>>> +static u64 x86_user_pred_reg_mask;
>>>>>>>> +static u64 x86_user_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
>>>>>>>> +static u16 x86_user_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
>>>>>>>> +
>>>>>>>> +static uint64_t __arch__simd_reg_mask(u64 sample_type)
>>>>>>>> +{
>>>>>>>> +       const struct sample_reg *r = NULL;
>>>>>>>> +       bool supported;
>>>>>>>> +       u64 mask = 0;
>>>>>>>> +       int reg;
>>>>>>>> +
>>>>>>>> +       if (!has_cap_simd_regs())
>>>>>>>> +               return 0;
>>>>>>>> +
>>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_simd_updated)
>>>>>>>> +               return x86_intr_simd_reg_mask;
>>>>>>>> +
>>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_simd_updated)
>>>>>>>> +               return x86_user_simd_reg_mask;
>>>>>>>> +
>>>>>>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>>>>>>>> +               supported = false;
>>>>>>>> +
>>>>>>>> +               if (!r->mask)
>>>>>>>> +                       continue;
>>>>>>>> +               reg = fls64(r->mask) - 1;
>>>>>>>> +
>>>>>>>> +               if (reg >= PERF_REG_X86_MAX_SIMD_REGS)
>>>>>>>> +                       break;
>>>>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>>>> +                       supported = __arch_simd_reg_mask(sample_type, reg,
>>>>>>>> +                                                        &x86_intr_simd_mask[reg],
>>>>>>>> +                                                        &x86_intr_simd_qwords[reg]);
>>>>>>>> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
>>>>>>>> +                       supported = __arch_simd_reg_mask(sample_type, reg,
>>>>>>>> +                                                        &x86_user_simd_mask[reg],
>>>>>>>> +                                                        &x86_user_simd_qwords[reg]);
>>>>>>>> +               if (supported)
>>>>>>>> +                       mask |= BIT_ULL(reg);
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
>>>>>>>> +               x86_intr_simd_reg_mask = mask;
>>>>>>>> +               x86_intr_simd_updated = true;
>>>>>>>> +       } else {
>>>>>>>> +               x86_user_simd_reg_mask = mask;
>>>>>>>> +               x86_user_simd_updated = true;
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       return mask;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static uint64_t __arch__pred_reg_mask(u64 sample_type)
>>>>>>>> +{
>>>>>>>> +       const struct sample_reg *r = NULL;
>>>>>>>> +       bool supported;
>>>>>>>> +       u64 mask = 0;
>>>>>>>> +       int reg;
>>>>>>>> +
>>>>>>>> +       if (!has_cap_simd_regs())
>>>>>>>> +               return 0;
>>>>>>>> +
>>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_pred_updated)
>>>>>>>> +               return x86_intr_pred_reg_mask;
>>>>>>>> +
>>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_pred_updated)
>>>>>>>> +               return x86_user_pred_reg_mask;
>>>>>>>> +
>>>>>>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>>>>>>>> +               supported = false;
>>>>>>>> +
>>>>>>>> +               if (!r->mask)
>>>>>>>> +                       continue;
>>>>>>>> +               reg = fls64(r->mask) - 1;
>>>>>>>> +
>>>>>>>> +               if (reg >= PERF_REG_X86_MAX_PRED_REGS)
>>>>>>>> +                       break;
>>>>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>>>> +                       supported = __arch_pred_reg_mask(sample_type, reg,
>>>>>>>> +                                                        &x86_intr_pred_mask[reg],
>>>>>>>> +                                                        &x86_intr_pred_qwords[reg]);
>>>>>>>> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
>>>>>>>> +                       supported = __arch_pred_reg_mask(sample_type, reg,
>>>>>>>> +                                                        &x86_user_pred_mask[reg],
>>>>>>>> +                                                        &x86_user_pred_qwords[reg]);
>>>>>>>> +               if (supported)
>>>>>>>> +                       mask |= BIT_ULL(reg);
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
>>>>>>>> +               x86_intr_pred_reg_mask = mask;
>>>>>>>> +               x86_intr_pred_updated = true;
>>>>>>>> +       } else {
>>>>>>>> +               x86_user_pred_reg_mask = mask;
>>>>>>>> +               x86_user_pred_updated = true;
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       return mask;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t arch__intr_simd_reg_mask(void)
>>>>>>>> +{
>>>>>>>> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_INTR);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t arch__user_simd_reg_mask(void)
>>>>>>>> +{
>>>>>>>> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_USER);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t arch__intr_pred_reg_mask(void)
>>>>>>>> +{
>>>>>>>> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_INTR);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t arch__user_pred_reg_mask(void)
>>>>>>>> +{
>>>>>>>> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_USER);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static uint64_t arch__simd_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
>>>>>>>> +{
>>>>>>>> +       uint64_t mask = 0;
>>>>>>>> +
>>>>>>>> +       *qwords = 0;
>>>>>>>> +       if (reg < PERF_REG_X86_MAX_SIMD_REGS) {
>>>>>>>> +               if (intr) {
>>>>>>>> +                       *qwords = x86_intr_simd_qwords[reg];
>>>>>>>> +                       mask = x86_intr_simd_mask[reg];
>>>>>>>> +               } else {
>>>>>>>> +                       *qwords = x86_user_simd_qwords[reg];
>>>>>>>> +                       mask = x86_user_simd_mask[reg];
>>>>>>>> +               }
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       return mask;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static uint64_t arch__pred_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
>>>>>>>> +{
>>>>>>>> +       uint64_t mask = 0;
>>>>>>>> +
>>>>>>>> +       *qwords = 0;
>>>>>>>> +       if (reg < PERF_REG_X86_MAX_PRED_REGS) {
>>>>>>>> +               if (intr) {
>>>>>>>> +                       *qwords = x86_intr_pred_qwords[reg];
>>>>>>>> +                       mask = x86_intr_pred_mask[reg];
>>>>>>>> +               } else {
>>>>>>>> +                       *qwords = x86_user_pred_qwords[reg];
>>>>>>>> +                       mask = x86_user_pred_mask[reg];
>>>>>>>> +               }
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       return mask;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>>>> +{
>>>>>>>> +       if (!x86_intr_simd_updated)
>>>>>>>> +               arch__intr_simd_reg_mask();
>>>>>>>> +       return arch__simd_reg_bitmap_qwords(reg, qwords, true);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>>>> +{
>>>>>>>> +       if (!x86_user_simd_updated)
>>>>>>>> +               arch__user_simd_reg_mask();
>>>>>>>> +       return arch__simd_reg_bitmap_qwords(reg, qwords, false);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>>>> +{
>>>>>>>> +       if (!x86_intr_pred_updated)
>>>>>>>> +               arch__intr_pred_reg_mask();
>>>>>>>> +       return arch__pred_reg_bitmap_qwords(reg, qwords, true);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>>>> +{
>>>>>>>> +       if (!x86_user_pred_updated)
>>>>>>>> +               arch__user_pred_reg_mask();
>>>>>>>> +       return arch__pred_reg_bitmap_qwords(reg, qwords, false);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>  const struct sample_reg *arch__sample_reg_masks(void)
>>>>>>>>  {
>>>>>>>> +       if (has_cap_simd_regs())
>>>>>>>> +               return sample_reg_masks_ext;
>>>>>>>>         return sample_reg_masks;
>>>>>>>>  }
>>>>>>>>
>>>>>>>> -uint64_t arch__intr_reg_mask(void)
>>>>>>>> +static uint64_t __arch__reg_mask(u64 sample_type, u64 mask, bool has_simd_regs)
>>>>>>>>  {
>>>>>>>>         struct perf_event_attr attr = {
>>>>>>>> -               .type                   = PERF_TYPE_HARDWARE,
>>>>>>>> -               .config                 = PERF_COUNT_HW_CPU_CYCLES,
>>>>>>>> -               .sample_type            = PERF_SAMPLE_REGS_INTR,
>>>>>>>> -               .sample_regs_intr       = PERF_REG_EXTENDED_MASK,
>>>>>>>> -               .precise_ip             = 1,
>>>>>>>> -               .disabled               = 1,
>>>>>>>> -               .exclude_kernel         = 1,
>>>>>>>> +               .type                           = PERF_TYPE_HARDWARE,
>>>>>>>> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
>>>>>>>> +               .sample_type                    = sample_type,
>>>>>>>> +               .precise_ip                     = 1,
>>>>>>>> +               .disabled                       = 1,
>>>>>>>> +               .exclude_kernel                 = 1,
>>>>>>>> +               .sample_simd_regs_enabled       = has_simd_regs,
>>>>>>>>         };
>>>>>>>>         int fd;
>>>>>>>>         /*
>>>>>>>>          * In an unnamed union, init it here to build on older gcc versions
>>>>>>>>          */
>>>>>>>>         attr.sample_period = 1;
>>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>>>> +               attr.sample_regs_intr = mask;
>>>>>>>> +       else
>>>>>>>> +               attr.sample_regs_user = mask;
>>>>>>>>
>>>>>>>>         if (perf_pmus__num_core_pmus() > 1) {
>>>>>>>>                 struct perf_pmu *pmu = NULL;
>>>>>>>> @@ -318,13 +738,41 @@ uint64_t arch__intr_reg_mask(void)
>>>>>>>>         fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
>>>>>>>>         if (fd != -1) {
>>>>>>>>                 close(fd);
>>>>>>>> -               return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
>>>>>>>> +               return mask;
>>>>>>>>         }
>>>>>>>>
>>>>>>>> -       return PERF_REGS_MASK;
>>>>>>>> +       return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t arch__intr_reg_mask(void)
>>>>>>>> +{
>>>>>>>> +       uint64_t mask = PERF_REGS_MASK;
>>>>>>>> +
>>>>>>>> +       if (has_cap_simd_regs()) {
>>>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
>>>>>>>> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
>>>>>>>> +                                        true);
>>>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
>>>>>>>> +                                        BIT_ULL(PERF_REG_X86_SSP),
>>>>>>>> +                                        true);
>>>>>>>> +       } else
>>>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR, PERF_REG_EXTENDED_MASK, false);
>>>>>>>> +
>>>>>>>> +       return mask;
>>>>>>>>  }
>>>>>>>>
>>>>>>>>  uint64_t arch__user_reg_mask(void)
>>>>>>>>  {
>>>>>>>> -       return PERF_REGS_MASK;
>>>>>>>> +       uint64_t mask = PERF_REGS_MASK;
>>>>>>>> +
>>>>>>>> +       if (has_cap_simd_regs()) {
>>>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
>>>>>>>> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
>>>>>>>> +                                        true);
>>>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
>>>>>>>> +                                        BIT_ULL(PERF_REG_X86_SSP),
>>>>>>>> +                                        true);
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       return mask;
>>>>>>>>  }
>>>>>>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>>>>>>>> index 56ebefd075f2..5d1d90cf9488 100644
>>>>>>>> --- a/tools/perf/util/evsel.c
>>>>>>>> +++ b/tools/perf/util/evsel.c
>>>>>>>> @@ -1461,12 +1461,39 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
>>>>>>>>         if (opts->sample_intr_regs && !evsel->no_aux_samples &&
>>>>>>>>             !evsel__is_dummy_event(evsel)) {
>>>>>>>>                 attr->sample_regs_intr = opts->sample_intr_regs;
>>>>>>>> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_intr);
>>>>>>>> +               evsel__set_sample_bit(evsel, REGS_INTR);
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       if ((opts->sample_intr_vec_regs || opts->sample_intr_pred_regs) &&
>>>>>>>> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
>>>>>>>> +               /* A non-zero pred qwords implies that the SIMD register set is used */
>>>>>>>> +               if (opts->sample_pred_regs_qwords)
>>>>>>>> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
>>>>>>>> +               else
>>>>>>>> +                       attr->sample_simd_pred_reg_qwords = 1;
>>>>>>>> +               attr->sample_simd_vec_reg_intr = opts->sample_intr_vec_regs;
>>>>>>>> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
>>>>>>>> +               attr->sample_simd_pred_reg_intr = opts->sample_intr_pred_regs;
>>>>>>>>                 evsel__set_sample_bit(evsel, REGS_INTR);
>>>>>>>>         }
>>>>>>>>
>>>>>>>>         if (opts->sample_user_regs && !evsel->no_aux_samples &&
>>>>>>>>             !evsel__is_dummy_event(evsel)) {
>>>>>>>>                 attr->sample_regs_user |= opts->sample_user_regs;
>>>>>>>> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_user);
>>>>>>>> +               evsel__set_sample_bit(evsel, REGS_USER);
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       if ((opts->sample_user_vec_regs || opts->sample_user_pred_regs) &&
>>>>>>>> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
>>>>>>>> +               if (opts->sample_pred_regs_qwords)
>>>>>>>> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
>>>>>>>> +               else
>>>>>>>> +                       attr->sample_simd_pred_reg_qwords = 1;
>>>>>>>> +               attr->sample_simd_vec_reg_user = opts->sample_user_vec_regs;
>>>>>>>> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
>>>>>>>> +               attr->sample_simd_pred_reg_user = opts->sample_user_pred_regs;
>>>>>>>>                 evsel__set_sample_bit(evsel, REGS_USER);
>>>>>>>>         }
>>>>>>>>
>>>>>>>> diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
>>>>>>>> index cda1c620968e..0bd100392889 100644
>>>>>>>> --- a/tools/perf/util/parse-regs-options.c
>>>>>>>> +++ b/tools/perf/util/parse-regs-options.c
>>>>>>>> @@ -4,19 +4,139 @@
>>>>>>>>  #include <stdint.h>
>>>>>>>>  #include <string.h>
>>>>>>>>  #include <stdio.h>
>>>>>>>> +#include <linux/bitops.h>
>>>>>>>>  #include "util/debug.h"
>>>>>>>>  #include <subcmd/parse-options.h>
>>>>>>>>  #include "util/perf_regs.h"
>>>>>>>>  #include "util/parse-regs-options.h"
>>>>>>>> +#include "record.h"
>>>>>>>> +
>>>>>>>> +static void __print_simd_regs(bool intr, uint64_t simd_mask)
>>>>>>>> +{
>>>>>>>> +       const struct sample_reg *r = NULL;
>>>>>>>> +       uint64_t bitmap = 0;
>>>>>>>> +       u16 qwords = 0;
>>>>>>>> +       int reg_idx;
>>>>>>>> +
>>>>>>>> +       if (!simd_mask)
>>>>>>>> +               return;
>>>>>>>> +
>>>>>>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>>>>>>>> +               if (!(r->mask & simd_mask))
>>>>>>>> +                       continue;
>>>>>>>> +               reg_idx = fls64(r->mask) - 1;
>>>>>>>> +               if (intr)
>>>>>>>> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>> +               else
>>>>>>>> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>> +               if (bitmap)
>>>>>>>> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
>>>>>>>> +       }
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static void __print_pred_regs(bool intr, uint64_t pred_mask)
>>>>>>>> +{
>>>>>>>> +       const struct sample_reg *r = NULL;
>>>>>>>> +       uint64_t bitmap = 0;
>>>>>>>> +       u16 qwords = 0;
>>>>>>>> +       int reg_idx;
>>>>>>>> +
>>>>>>>> +       if (!pred_mask)
>>>>>>>> +               return;
>>>>>>>> +
>>>>>>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>>>>>>>> +               if (!(r->mask & pred_mask))
>>>>>>>> +                       continue;
>>>>>>>> +               reg_idx = fls64(r->mask) - 1;
>>>>>>>> +               if (intr)
>>>>>>>> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>> +               else
>>>>>>>> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>> +               if (bitmap)
>>>>>>>> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
>>>>>>>> +       }
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static bool __parse_simd_regs(struct record_opts *opts, char *s, bool intr)
>>>>>>>> +{
>>>>>>>> +       const struct sample_reg *r = NULL;
>>>>>>>> +       bool matched = false;
>>>>>>>> +       uint64_t bitmap = 0;
>>>>>>>> +       u16 qwords = 0;
>>>>>>>> +       int reg_idx;
>>>>>>>> +
>>>>>>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>>>>>>>> +               if (strcasecmp(s, r->name))
>>>>>>>> +                       continue;
>>>>>>>> +               if (!fls64(r->mask))
>>>>>>>> +                       continue;
>>>>>>>> +               reg_idx = fls64(r->mask) - 1;
>>>>>>>> +               if (intr)
>>>>>>>> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>> +               else
>>>>>>>> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>> +               matched = true;
>>>>>>>> +               break;
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       /* Just need the highest qwords */
>>>>>>>> +       if (qwords > opts->sample_vec_regs_qwords) {
>>>>>>>> +               opts->sample_vec_regs_qwords = qwords;
>>>>>>>> +               if (intr)
>>>>>>>> +                       opts->sample_intr_vec_regs = bitmap;
>>>>>>>> +               else
>>>>>>>> +                       opts->sample_user_vec_regs = bitmap;
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       return matched;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static bool __parse_pred_regs(struct record_opts *opts, char *s, bool intr)
>>>>>>>> +{
>>>>>>>> +       const struct sample_reg *r = NULL;
>>>>>>>> +       bool matched = false;
>>>>>>>> +       uint64_t bitmap = 0;
>>>>>>>> +       u16 qwords = 0;
>>>>>>>> +       int reg_idx;
>>>>>>>> +
>>>>>>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>>>>>>>> +               if (strcasecmp(s, r->name))
>>>>>>>> +                       continue;
>>>>>>>> +               if (!fls64(r->mask))
>>>>>>>> +                       continue;
>>>>>>>> +               reg_idx = fls64(r->mask) - 1;
>>>>>>>> +               if (intr)
>>>>>>>> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>> +               else
>>>>>>>> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>> +               matched = true;
>>>>>>>> +               break;
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       /* Just need the highest qwords */
>>>>>>>> +       if (qwords > opts->sample_pred_regs_qwords) {
>>>>>>>> +               opts->sample_pred_regs_qwords = qwords;
>>>>>>>> +               if (intr)
>>>>>>>> +                       opts->sample_intr_pred_regs = bitmap;
>>>>>>>> +               else
>>>>>>>> +                       opts->sample_user_pred_regs = bitmap;
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       return matched;
>>>>>>>> +}
>>>>>>>>
>>>>>>>>  static int
>>>>>>>>  __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>>>  {
>>>>>>>>         uint64_t *mode = (uint64_t *)opt->value;
>>>>>>>>         const struct sample_reg *r = NULL;
>>>>>>>> +       struct record_opts *opts;
>>>>>>>>         char *s, *os = NULL, *p;
>>>>>>>> -       int ret = -1;
>>>>>>>> +       bool has_simd_regs = false;
>>>>>>>>         uint64_t mask;
>>>>>>>> +       uint64_t simd_mask;
>>>>>>>> +       uint64_t pred_mask;
>>>>>>>> +       int ret = -1;
>>>>>>>>
>>>>>>>>         if (unset)
>>>>>>>>                 return 0;
>>>>>>>> @@ -27,10 +147,17 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>>>         if (*mode)
>>>>>>>>                 return -1;
>>>>>>>>
>>>>>>>> -       if (intr)
>>>>>>>> +       if (intr) {
>>>>>>>> +               opts = container_of(opt->value, struct record_opts, sample_intr_regs);
>>>>>>>>                 mask = arch__intr_reg_mask();
>>>>>>>> -       else
>>>>>>>> +               simd_mask = arch__intr_simd_reg_mask();
>>>>>>>> +               pred_mask = arch__intr_pred_reg_mask();
>>>>>>>> +       } else {
>>>>>>>> +               opts = container_of(opt->value, struct record_opts, sample_user_regs);
>>>>>>>>                 mask = arch__user_reg_mask();
>>>>>>>> +               simd_mask = arch__user_simd_reg_mask();
>>>>>>>> +               pred_mask = arch__user_pred_reg_mask();
>>>>>>>> +       }
>>>>>>>>
>>>>>>>>         /* str may be NULL in case no arg is passed to -I */
>>>>>>>>         if (str) {
>>>>>>>> @@ -50,10 +177,24 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>>>                                         if (r->mask & mask)
>>>>>>>>                                                 fprintf(stderr, "%s ", r->name);
>>>>>>>>                                 }
>>>>>>>> +                               __print_simd_regs(intr, simd_mask);
>>>>>>>> +                               __print_pred_regs(intr, pred_mask);
>>>>>>>>                                 fputc('\n', stderr);
>>>>>>>>                                 /* just printing available regs */
>>>>>>>>                                 goto error;
>>>>>>>>                         }
>>>>>>>> +
>>>>>>>> +                       if (simd_mask) {
>>>>>>>> +                               has_simd_regs = __parse_simd_regs(opts, s, intr);
>>>>>>>> +                               if (has_simd_regs)
>>>>>>>> +                                       goto next;
>>>>>>>> +                       }
>>>>>>>> +                       if (pred_mask) {
>>>>>>>> +                               has_simd_regs = __parse_pred_regs(opts, s, intr);
>>>>>>>> +                               if (has_simd_regs)
>>>>>>>> +                                       goto next;
>>>>>>>> +                       }
>>>>>>>> +
>>>>>>>>                         for (r = arch__sample_reg_masks(); r->name; r++) {
>>>>>>>>                                 if ((r->mask & mask) && !strcasecmp(s, r->name))
>>>>>>>>                                         break;
>>>>>>>> @@ -65,7 +206,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>>>                         }
>>>>>>>>
>>>>>>>>                         *mode |= r->mask;
>>>>>>>> -
>>>>>>>> +next:
>>>>>>>>                         if (!p)
>>>>>>>>                                 break;
>>>>>>>>
>>>>>>>> @@ -75,7 +216,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>>>         ret = 0;
>>>>>>>>
>>>>>>>>         /* default to all possible regs */
>>>>>>>> -       if (*mode == 0)
>>>>>>>> +       if (*mode == 0 && !has_simd_regs)
>>>>>>>>                 *mode = mask;
>>>>>>>>  error:
>>>>>>>>         free(os);
>>>>>>>> diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
>>>>>>>> index 66b666d9ce64..fb0366d050cf 100644
>>>>>>>> --- a/tools/perf/util/perf_event_attr_fprintf.c
>>>>>>>> +++ b/tools/perf/util/perf_event_attr_fprintf.c
>>>>>>>> @@ -360,6 +360,12 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
>>>>>>>>         PRINT_ATTRf(aux_start_paused, p_unsigned);
>>>>>>>>         PRINT_ATTRf(aux_pause, p_unsigned);
>>>>>>>>         PRINT_ATTRf(aux_resume, p_unsigned);
>>>>>>>> +       PRINT_ATTRf(sample_simd_pred_reg_qwords, p_unsigned);
>>>>>>>> +       PRINT_ATTRf(sample_simd_pred_reg_intr, p_hex);
>>>>>>>> +       PRINT_ATTRf(sample_simd_pred_reg_user, p_hex);
>>>>>>>> +       PRINT_ATTRf(sample_simd_vec_reg_qwords, p_unsigned);
>>>>>>>> +       PRINT_ATTRf(sample_simd_vec_reg_intr, p_hex);
>>>>>>>> +       PRINT_ATTRf(sample_simd_vec_reg_user, p_hex);
>>>>>>>>
>>>>>>>>         return ret;
>>>>>>>>  }
>>>>>>>> diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
>>>>>>>> index 44b90bbf2d07..e8a9fabc92e6 100644
>>>>>>>> --- a/tools/perf/util/perf_regs.c
>>>>>>>> +++ b/tools/perf/util/perf_regs.c
>>>>>>>> @@ -11,6 +11,11 @@ int __weak arch_sdt_arg_parse_op(char *old_op __maybe_unused,
>>>>>>>>         return SDT_ARG_SKIP;
>>>>>>>>  }
>>>>>>>>
>>>>>>>> +bool __weak arch_has_simd_regs(u64 mask __maybe_unused)
>>>>>>>> +{
>>>>>>>> +       return false;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>  uint64_t __weak arch__intr_reg_mask(void)
>>>>>>>>  {
>>>>>>>>         return 0;
>>>>>>>> @@ -21,6 +26,50 @@ uint64_t __weak arch__user_reg_mask(void)
>>>>>>>>         return 0;
>>>>>>>>  }
>>>>>>>>
>>>>>>>> +uint64_t __weak arch__intr_simd_reg_mask(void)
>>>>>>>> +{
>>>>>>>> +       return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t __weak arch__user_simd_reg_mask(void)
>>>>>>>> +{
>>>>>>>> +       return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t __weak arch__intr_pred_reg_mask(void)
>>>>>>>> +{
>>>>>>>> +       return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t __weak arch__user_pred_reg_mask(void)
>>>>>>>> +{
>>>>>>>> +       return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t __weak arch__intr_simd_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
>>>>>>>> +{
>>>>>>>> +       *qwords = 0;
>>>>>>>> +       return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t __weak arch__user_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>>>>>>>> +{
>>>>>>>> +       *qwords = 0;
>>>>>>>> +       return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t __weak arch__intr_pred_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
>>>>>>>> +{
>>>>>>>> +       *qwords = 0;
>>>>>>>> +       return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +uint64_t __weak arch__user_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>>>>>>>> +{
>>>>>>>> +       *qwords = 0;
>>>>>>>> +       return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>  static const struct sample_reg sample_reg_masks[] = {
>>>>>>>>         SMPL_REG_END
>>>>>>>>  };
>>>>>>>> @@ -30,6 +79,16 @@ const struct sample_reg * __weak arch__sample_reg_masks(void)
>>>>>>>>         return sample_reg_masks;
>>>>>>>>  }
>>>>>>>>
>>>>>>>> +const struct sample_reg * __weak arch__sample_simd_reg_masks(void)
>>>>>>>> +{
>>>>>>>> +       return sample_reg_masks;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +const struct sample_reg * __weak arch__sample_pred_reg_masks(void)
>>>>>>>> +{
>>>>>>>> +       return sample_reg_masks;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>  const char *perf_reg_name(int id, const char *arch)
>>>>>>>>  {
>>>>>>>>         const char *reg_name = NULL;
>>>>>>>> diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
>>>>>>>> index f2d0736d65cc..bce9c4cfd1bf 100644
>>>>>>>> --- a/tools/perf/util/perf_regs.h
>>>>>>>> +++ b/tools/perf/util/perf_regs.h
>>>>>>>> @@ -24,9 +24,20 @@ enum {
>>>>>>>>  };
>>>>>>>>
>>>>>>>>  int arch_sdt_arg_parse_op(char *old_op, char **new_op);
>>>>>>>> +bool arch_has_simd_regs(u64 mask);
>>>>>>>>  uint64_t arch__intr_reg_mask(void);
>>>>>>>>  uint64_t arch__user_reg_mask(void);
>>>>>>>>  const struct sample_reg *arch__sample_reg_masks(void);
>>>>>>>> +const struct sample_reg *arch__sample_simd_reg_masks(void);
>>>>>>>> +const struct sample_reg *arch__sample_pred_reg_masks(void);
>>>>>>> I wonder we can remove these functions. perf_reg_name(int id, uint16_t
>>>>>>> e_machine) maps a perf register number and e_machine to a string. So
>>>>>>> the sample_reg array could be replaced with:
>>>>>>> ```
>>>>>>> for (int perf_reg = 0; perf_reg < 64; perf_reg++) {
>>>>>>>   uint64_t mask = 1LL << perf_reg;
>>>>>>>   const char *name = perf_reg_name(perf_reg, EM_HOST);
>>>>>>>   if (name == NULL)
>>>>>>>     break;
>>>>>>>   // use mask and name
>>>>>>> ```
>>>>>>> To make it work for SIMD and PRED then I guess we need to iterate
>>>>>>> through the ABIs of enum perf_sample_regs_abi.
>>>>>> Suppose so.
>>>>>>
>>>>>>
>>>>>>>> +uint64_t arch__intr_simd_reg_mask(void);
>>>>>>>> +uint64_t arch__user_simd_reg_mask(void);
>>>>>>>> +uint64_t arch__intr_pred_reg_mask(void);
>>>>>>>> +uint64_t arch__user_pred_reg_mask(void);
>>>>>>> I think some comments would be useful here like:
>>>>>>> ```
>>>>>>> /* Perf register bit map with valid bits for
>>>>>>> perf_event_attr.sample_regs_user. */
>>>>>>> uint64_t arch__intr_reg_mask(void);
>>>>>>> /* Perf register bit map with valid bits for
>>>>>>> perf_event_attr.sample_regs_intr. */
>>>>>>> uint64_t arch__user_reg_mask(void);
>>>>>>> /* Perf register bit map with valid bits for
>>>>>>> perf_event_attr.sample_simd_vec_reg_intr. */
>>>>>>> uint64_t arch__intr_simd_reg_mask(void);
>>>>>>> /* Perf register bit map with valid bits for
>>>>>>> perf_event_attr.sample_simd_vec_reg_user. */
>>>>>>> uint64_t arch__user_simd_reg_mask(void);
>>>>>>> /* Perf register bit map with valid bits for
>>>>>>> perf_event_attr.sample_simd_pred_reg_intr. */
>>>>>>> uint64_t arch__intr_pred_reg_mask(void);
>>>>>>> /* Perf register bit map with valid bits for
>>>>>>> perf_event_attr.sample_simd_pred_reg_user. */
>>>>>>> uint64_t arch__user_pred_reg_mask(void);
>>>>>> Sure. Thanks.
>>>>>>
>>>>>>
>>>>>>> ```
>>>>>>>
>>>>>>> Why do the arch__user_pred_reg_mask return a uint64_t when the
>>>>>>> perf_event_attr variable is a __u32?
>>>>>> Suppose it's a bug. :)
>>>>>>
>>>>>>
>>>>>>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>> I don't understand this function. The qwords is specific to a
>>>>>>> perf_event_attr. We could have an evlist with an evsel set up to
>>>>>>> sample say XMM registers and another evsel set up to sample ZMM
>>>>>>> registers. Are the qwords here always for the ZMM case, or is XMM,
>>>>>>> YMM, ZMM depending on architecture support? Why does it vary per
>>>>>>> register? The surrounding code uses the term mask but here bitmap is
>>>>>>> used, is the inconsistency deliberate? Why are there user and intr
>>>>>>> functions when in the perf_event_attr there are only
>>>>>>> sample_simd_pred_reg_qwords and sample_simd_vec_reg_qwords variables?
>>>>>> These 4 functions are designed to get the bitmask and qwords length for a
>>>>>> specific kind of SIMD register. E.g., for XMM on x86 platforms, the
>>>>>> returned bitmask is 0xffff (xmm0 ~ xmm15) and the qwords length is 2 (128
>>>>>> bits). For ZMM on x86 platforms, if the platform only supports 16 ZMM
>>>>>> registers, the returned bitmask is 0xffff (zmm0 ~ zmm15) and the qwords
>>>>>> length is 8 (512 bits). If the platform supports 32 ZMM registers, the
>>>>>> returned bitmask is 0xffffffff (zmm0 ~ zmm31) and the qwords length is
>>>>>> still 8.
>>>>> What is the meaning of reg? In this file it is normally the integer
>>>>> index for a bit in the sample_regs_user mask, but for x86 I don't see
>>>>> enum perf_event_x86_regs having differing XMM, YMM and ZMM encodings.
>>>>> Similarly, is qwords an out argument, but then you also have the
>>>>> bitmap. It looks like the code is caching values but that assumes a
>>>>> single qword length for all events.
>>>> Yes, the "reg" argument indicates the SIMD register index. Strictly
>>>> speaking, on x86 platforms the qwords length is fixed for a specific SIMD
>>>> register and only the register count can vary, e.g., some platforms
>>>> support only 16 ZMM registers while other platforms support 32 ZMM
>>>> registers. But since this is a generic function for all kinds of archs,
>>>> we can't assume there is a fixed length for a specific SIMD register on
>>>> every arch, so I introduced the "qwords" argument to increase the
>>>> flexibility.
>>> I'm still not understanding this still :-) What is a "SIMD register
>>> index", the file is for perf registers and naturally enum
>>> perf_event_x86_regs on x86, but that doesn't encode YMM and ZMM
>>> registers. Perhaps you can give some examples?
>> Yes, it's something just like the register index in the enum
>> perf_event_x86_regs, e.g. the index of AX register is PERF_REG_X86_AX, the
>> index of BX is PERF_REG_X86_BX, and so on.
>>
>> But the difference is that each index in perf_event_x86_regs can only
>> represent a u64 word. Assume we still wanted to represent the SIMD
>> registers with the perf_event_x86_regs enum; then each XMM register would
>> need 2 indexes, each YMM register 4 indexes and each ZMM register 8
>> indexes. Considering there are 16 XMM registers, 16 YMM registers and 32
>> ZMM registers, the enum perf_event_x86_regs would become quite large, and
>> correspondingly the sample_regs_intr/sample_regs_user fields in
>> perf_event_attr would have to inflate a lot, which would consume much
>> more memory.
>>
>> So that's why we introduce the new below attributes.
>>
>> +	union {
>> +		__u16	sample_simd_regs_enabled;
>> +		__u16	sample_simd_pred_reg_qwords;
>> +	};
>> +	__u32	sample_simd_pred_reg_intr;
>> +	__u32	sample_simd_pred_reg_user;
>> +	__u16	sample_simd_vec_reg_qwords;
>> +	__u64	sample_simd_vec_reg_intr;
>> +	__u64	sample_simd_vec_reg_user;
>> +	__u32	__reserved_4;
>>
>> For SIMD registers, each kind of SIMD register is treated as a whole.
>> sample_simd_vec_reg_qwords identifies the length of the SIMD register;
>> at the same time it also hints at which kind of SIMD register it is,
>> since each kind of SIMD register has a different length. E.g., suppose we
>> want to sample XMM registers. There are 16 XMM registers on the x86
>> platform and the qwords length of an XMM register is 2, so user space
>> needs to set the attributes like this,
>>
>> sample_simd_vec_reg_intr = 0xffff;
>>
>> sample_simd_vec_reg_qwords = 2;
>>
>> Coming back to the "reg" argument: multiple kinds of SIMD registers can
>> be supported on a given arch, e.g., x86 supports the XMM, YMM, ZMM and
>> OPMASK SIMD registers. As each kind of SIMD register is always sampled as
>> a whole, we don't need to represent each individual SIMD register, like
>> XMM0 or XMM1, but we do need to distinguish the different kinds of SIMD
>> registers, like XMM vs. YMM, since they differ in register length and
>> count.
>>
>> That's why we define an index for each kind of SIMD register, like below:
>>
>> +enum {
>> +	PERF_REG_X86_XMM,
>> +	PERF_REG_X86_YMM,
>> +	PERF_REG_X86_ZMM,
>> +	PERF_REG_X86_MAX_SIMD_REGS,
>> +
>> +	PERF_REG_X86_OPMASK = 0,
>> +	PERF_REG_X86_MAX_PRED_REGS = 1,
>> +};
>>
>> It's similar to perf_event_x86_regs, but each index represents a kind of
>> SIMD register instead of a specific SIMD register.
> Could you give me an example call to say
> arch__intr_simd_reg_bitmap_qwords where you say what the value of reg
> is, what the expected value of qwords is and what the result will be?
> Could you do it for say a model without AVX, a model with AVX, a model
> with AVX512 and a model with APX.

Assume we are on an x86 platform that only supports XMM registers (SSE only,
no AVX) and call arch__intr_simd_reg_bitmap_qwords() with a SIMD register index:

1. reg = PERF_REG_X86_XMM

The return value (XMM registers bitmask) = 0xffff and the qwords = 2 (128 bits).

2. reg = PERF_REG_X86_YMM

The return value (YMM registers bitmask) = 0 and the qwords = 0 since YMM registers are not supported.

3. reg = PERF_REG_X86_ZMM

The return value (ZMM registers bitmask) = 0 and the qwords = 0 since ZMM registers are not supported.

Assume we are on an x86 platform that supports XMM/YMM/ZMM registers (AVX-512) and call arch__intr_simd_reg_bitmap_qwords() with a SIMD register index:

1. reg = PERF_REG_X86_XMM

The return value (XMM registers bitmask) = 0xffff and the qwords = 2 (128 bits).

2. reg = PERF_REG_X86_YMM

The return value (YMM registers bitmask) = 0xffff and the qwords = 4 (256 bits).

3. reg = PERF_REG_X86_ZMM

The return value (ZMM registers bitmask) = 0xffffffff and the qwords = 8 (512 bits). We assume this platform supports 32 ZMM registers (ZMM0 ~ ZMM31).

As for APX, it has nothing to do with these 4 functions; whether it's supported
is determined by the helpers arch__intr_reg_mask()/arch__user_reg_mask(), e.g.,

```
	if (has_cap_simd_regs()) {
		mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
					 GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
					 true);
		mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
					 BIT_ULL(PERF_REG_X86_SSP),
					 true); 
```
If the platform supports APX eGPRs, then the mask returned from arch__intr_reg_mask()/arch__user_reg_mask() contains the eGPRs bits; otherwise it does not.



>
> I have looked at the code and read the changes to perf_event_attr
> which is why I was confused by your saying that ZMM could be passed in
> as a perf register number. I am confused as why when the
> perf_event_attr has 2 qword length related variables this code seems
> to be setting things up so that every register can have a qword
> length. I'm confused what is happening with the return value of this
> function. As values are being stored into global variables, and you
> are saying they aren't a max value, then how does this impact the
> setting up of multiple register sampling events?

There are 2 qwords lengths: sample_simd_pred_reg_qwords stores the PRED
register length, and sample_simd_vec_reg_qwords stores the SIMD (vector)
register length. Since a SIMD/PRED register with a larger length contains
the SIMD/PRED registers with shorter lengths, only the largest length is
set in the sample_simd_pred_reg_qwords/sample_simd_vec_reg_qwords
variables.

E.g.,

perf record -e cycles:p -Ixmm,ymm,zmm -c 10000 -- sleep 1

Here sample_simd_vec_reg_qwords is set to 8 to represent the largest
length (ZMM), and the kernel directly samples the ZMM registers since they
fully contain the YMM and XMM registers.

The reason for caching the bitmask and qwords is that both are fixed for a
specific SIMD/PRED register on a given x86 platform, e.g., the qwords
length of an XMM register is always 2, a YMM register 4, etc.

The bitmask and qwords values are retrieved from the kernel via the
perf_event_open() syscall, which is quite expensive; if it were called
frequently, it would hurt performance heavily.


>
> Thanks,
> Ian
>
>>> How does the generic differing qword per register case get encoded
>>> into a perf_event_attr? If it can't be then this seems like
>>> functionality for no benefit. I also don't understand how the data in
>>> the PERF_SAMPLE_REGS_USER part of a sample could be decoded as that is
>>> assuming a constant qword number.
>>>
>>>> No, the qwords is assigned the true register length if the register
>>>> exists on the platform, e.g., xmm = 2, ymm = 4 and zmm = 8. If the
>>>> register is not supported on the platform, the qwords is set to 0.
>>> So it is a max function of the vector/pred qwords supported on the architecture.
>> Strictly speaking, it's not a "max" function of the vector/pred qwords;
>> it's just a function to get the exact vector/pred qwords supported on the
>> architecture, since the qwords length won't vary for a fixed kind of SIMD
>> register.
>>
>>
>>>>>> Since the qword length is always fixed for any given SIMD register,
>>>>>> regardless of intr or user, there is only one
>>>>>> sample_simd_pred_reg_qwords or sample_simd_vec_reg_qwords variable.
>>>>> Ok.  2 variables, but 4 functions here. I think there should just be 2
>>>>> because of this.
>>>> Yes, the user and intr variants would be merged into only one.
>>> Thanks,
>>> Ian
>>>
>>>>> Thanks,
>>>>> Ian
>>>>>
>>>>>>> Perhaps these functions should be something more like:
>>>>>>> ```
>>>>>>> /* Maximum value that can be assigned to
>>>>>>> perf_event_attr.sample_simd_pred_reg_qwords. */
>>>>>>> uint16_t arch__simd_pred_reg_qwords_max(void);
>>>>>>> /* Maximum value that can be assigned to
>>>>>>> perf_event_attr.sample_simd_vec_reg_qwords. */
>>>>>>> uint16_t arch__simd_vec_reg_qwords_max(void);
>>>>>>> ```
>>>>>>> Then the bitmap computation logic can all be moved into parse-regs-options.c.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ian
>>>>>>>
>>>>>>>>  const char *perf_reg_name(int id, const char *arch);
>>>>>>>>  int perf_reg_value(u64 *valp, struct regs_dump *regs, int id);
>>>>>>>> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
>>>>>>>> index ea3a6c4657ee..825ffb4cc53f 100644
>>>>>>>> --- a/tools/perf/util/record.h
>>>>>>>> +++ b/tools/perf/util/record.h
>>>>>>>> @@ -59,7 +59,13 @@ struct record_opts {
>>>>>>>>         unsigned int  user_freq;
>>>>>>>>         u64           branch_stack;
>>>>>>>>         u64           sample_intr_regs;
>>>>>>>> +       u64           sample_intr_vec_regs;
>>>>>>>>         u64           sample_user_regs;
>>>>>>>> +       u64           sample_user_vec_regs;
>>>>>>>> +       u16           sample_pred_regs_qwords;
>>>>>>>> +       u16           sample_vec_regs_qwords;
>>>>>>>> +       u16           sample_intr_pred_regs;
>>>>>>>> +       u16           sample_user_pred_regs;
>>>>>>>>         u64           default_interval;
>>>>>>>>         u64           user_interval;
>>>>>>>>         size_t        auxtrace_snapshot_size;
>>>>>>>> --
>>>>>>>> 2.34.1
>>>>>>>>


* Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
  2026-01-22  1:49                 ` Mi, Dapeng
@ 2026-01-22  7:27                   ` Ian Rogers
  2026-01-22  8:29                     ` Mi, Dapeng
  0 siblings, 1 reply; 86+ messages in thread
From: Ian Rogers @ 2026-01-22  7:27 UTC (permalink / raw)
  To: Mi, Dapeng
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang

On Wed, Jan 21, 2026 at 5:49 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>
>
> On 1/21/2026 10:48 PM, Ian Rogers wrote:
> > On Tue, Jan 20, 2026 at 11:52 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> >>
> >> On 1/21/2026 3:09 PM, Ian Rogers wrote:
> >>> On Tue, Jan 20, 2026 at 9:17 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> >>>> On 1/21/2026 2:20 AM, Ian Rogers wrote:
> >>>>> On Tue, Jan 20, 2026 at 1:04 AM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
> >>>>>> On 1/20/2026 3:39 PM, Ian Rogers wrote:
> >>>>>>> On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
> >>>>>>>> From: Kan Liang <kan.liang@linux.intel.com>
> >>>>>>>>
> >>>>>>>> This patch adds support for the newly introduced SIMD register sampling
> >>>>>>>> format by adding the following functions:
> >>>>>>>>
> >>>>>>>> uint64_t arch__intr_simd_reg_mask(void);
> >>>>>>>> uint64_t arch__user_simd_reg_mask(void);
> >>>>>>>> uint64_t arch__intr_pred_reg_mask(void);
> >>>>>>>> uint64_t arch__user_pred_reg_mask(void);
> >>>>>>>> uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>>>> uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>>>> uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>>>> uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>>>>
> >>>>>>>> The arch__{intr|user}_simd_reg_mask() functions retrieve the bitmap of
> >>>>>>>> supported SIMD registers, such as XMM/YMM/ZMM on x86 platforms.
> >>>>>>>>
> >>>>>>>> The arch__{intr|user}_pred_reg_mask() functions retrieve the bitmap of
> >>>>>>>> supported PRED registers, such as OPMASK on x86 platforms.
> >>>>>>>>
> >>>>>>>> The arch__{intr|user}_simd_reg_bitmap_qwords() functions provide the
> >>>>>>>> exact bitmap and number of qwords for a specific type of SIMD register.
> >>>>>>>> For example, for XMM registers on x86 platforms, the returned bitmap is
> >>>>>>>> 0xffff (XMM0 ~ XMM15) and the qwords number is 2 (128 bits for each XMM).
> >>>>>>>>
> >>>>>>>> The arch__{intr|user}_pred_reg_bitmap_qwords() functions provide the
> >>>>>>>> exact bitmap and number of qwords for a specific type of PRED register.
> >>>>>>>> For example, for OPMASK registers on x86 platforms, the returned bitmap
> >>>>>>>> is 0xff (OPMASK0 ~ OPMASK7) and the qwords number is 1 (64 bits for each
> >>>>>>>> OPMASK).
> >>>>>>>>
> >>>>>>>> Additionally, the function __parse_regs() is enhanced to support parsing
> >>>>>>>> these newly introduced SIMD registers. Currently, each type of register
> >>>>>>>> can only be sampled collectively; sampling a specific SIMD register is
> >>>>>>>> not supported. For example, all XMM registers are sampled together rather
> >>>>>>>> than sampling only XMM0.
> >>>>>>>>
> >>>>>>>> When multiple overlapping register types, such as XMM and YMM, are
> >>>>>>>> sampled simultaneously, only the superset (YMM registers) is sampled.
> >>>>>>>>
> >>>>>>>> With this patch, all supported sampling registers on x86 platforms are
> >>>>>>>> displayed as follows.
> >>>>>>>>
> >>>>>>>>  $perf record -I?
> >>>>>>>>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
> >>>>>>>>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
> >>>>>>>>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
> >>>>>>>>
> >>>>>>>>  $perf record --user-regs=?
> >>>>>>>>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
> >>>>>>>>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
> >>>>>>>>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
> >>>>>>>>
> >>>>>>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> >>>>>>>> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >>>>>>>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >>>>>>>> ---
> >>>>>>>>  tools/perf/arch/x86/util/perf_regs.c      | 470 +++++++++++++++++++++-
> >>>>>>>>  tools/perf/util/evsel.c                   |  27 ++
> >>>>>>>>  tools/perf/util/parse-regs-options.c      | 151 ++++++-
> >>>>>>>>  tools/perf/util/perf_event_attr_fprintf.c |   6 +
> >>>>>>>>  tools/perf/util/perf_regs.c               |  59 +++
> >>>>>>>>  tools/perf/util/perf_regs.h               |  11 +
> >>>>>>>>  tools/perf/util/record.h                  |   6 +
> >>>>>>>>  7 files changed, 714 insertions(+), 16 deletions(-)
> >>>>>>>>
> >>>>>>>> diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
> >>>>>>>> index 12fd93f04802..db41430f3b07 100644
> >>>>>>>> --- a/tools/perf/arch/x86/util/perf_regs.c
> >>>>>>>> +++ b/tools/perf/arch/x86/util/perf_regs.c
> >>>>>>>> @@ -13,6 +13,49 @@
> >>>>>>>>  #include "../../../util/pmu.h"
> >>>>>>>>  #include "../../../util/pmus.h"
> >>>>>>>>
> >>>>>>>> +static const struct sample_reg sample_reg_masks_ext[] = {
> >>>>>>>> +       SMPL_REG(AX, PERF_REG_X86_AX),
> >>>>>>>> +       SMPL_REG(BX, PERF_REG_X86_BX),
> >>>>>>>> +       SMPL_REG(CX, PERF_REG_X86_CX),
> >>>>>>>> +       SMPL_REG(DX, PERF_REG_X86_DX),
> >>>>>>>> +       SMPL_REG(SI, PERF_REG_X86_SI),
> >>>>>>>> +       SMPL_REG(DI, PERF_REG_X86_DI),
> >>>>>>>> +       SMPL_REG(BP, PERF_REG_X86_BP),
> >>>>>>>> +       SMPL_REG(SP, PERF_REG_X86_SP),
> >>>>>>>> +       SMPL_REG(IP, PERF_REG_X86_IP),
> >>>>>>>> +       SMPL_REG(FLAGS, PERF_REG_X86_FLAGS),
> >>>>>>>> +       SMPL_REG(CS, PERF_REG_X86_CS),
> >>>>>>>> +       SMPL_REG(SS, PERF_REG_X86_SS),
> >>>>>>>> +#ifdef HAVE_ARCH_X86_64_SUPPORT
> >>>>>>>> +       SMPL_REG(R8, PERF_REG_X86_R8),
> >>>>>>>> +       SMPL_REG(R9, PERF_REG_X86_R9),
> >>>>>>>> +       SMPL_REG(R10, PERF_REG_X86_R10),
> >>>>>>>> +       SMPL_REG(R11, PERF_REG_X86_R11),
> >>>>>>>> +       SMPL_REG(R12, PERF_REG_X86_R12),
> >>>>>>>> +       SMPL_REG(R13, PERF_REG_X86_R13),
> >>>>>>>> +       SMPL_REG(R14, PERF_REG_X86_R14),
> >>>>>>>> +       SMPL_REG(R15, PERF_REG_X86_R15),
> >>>>>>>> +       SMPL_REG(R16, PERF_REG_X86_R16),
> >>>>>>>> +       SMPL_REG(R17, PERF_REG_X86_R17),
> >>>>>>>> +       SMPL_REG(R18, PERF_REG_X86_R18),
> >>>>>>>> +       SMPL_REG(R19, PERF_REG_X86_R19),
> >>>>>>>> +       SMPL_REG(R20, PERF_REG_X86_R20),
> >>>>>>>> +       SMPL_REG(R21, PERF_REG_X86_R21),
> >>>>>>>> +       SMPL_REG(R22, PERF_REG_X86_R22),
> >>>>>>>> +       SMPL_REG(R23, PERF_REG_X86_R23),
> >>>>>>>> +       SMPL_REG(R24, PERF_REG_X86_R24),
> >>>>>>>> +       SMPL_REG(R25, PERF_REG_X86_R25),
> >>>>>>>> +       SMPL_REG(R26, PERF_REG_X86_R26),
> >>>>>>>> +       SMPL_REG(R27, PERF_REG_X86_R27),
> >>>>>>>> +       SMPL_REG(R28, PERF_REG_X86_R28),
> >>>>>>>> +       SMPL_REG(R29, PERF_REG_X86_R29),
> >>>>>>>> +       SMPL_REG(R30, PERF_REG_X86_R30),
> >>>>>>>> +       SMPL_REG(R31, PERF_REG_X86_R31),
> >>>>>>>> +       SMPL_REG(SSP, PERF_REG_X86_SSP),
> >>>>>>>> +#endif
> >>>>>>>> +       SMPL_REG_END
> >>>>>>>> +};
> >>>>>>>> +
> >>>>>>>>  static const struct sample_reg sample_reg_masks[] = {
> >>>>>>>>         SMPL_REG(AX, PERF_REG_X86_AX),
> >>>>>>>>         SMPL_REG(BX, PERF_REG_X86_BX),
> >>>>>>>> @@ -276,27 +319,404 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
> >>>>>>>>         return SDT_ARG_VALID;
> >>>>>>>>  }
> >>>>>>>>
> >>>>>>>> +static bool support_simd_reg(u64 sample_type, u16 qwords, u64 mask, bool pred)
> >>>>>>>> +{
> >>>>>>>> +       struct perf_event_attr attr = {
> >>>>>>>> +               .type                           = PERF_TYPE_HARDWARE,
> >>>>>>>> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
> >>>>>>>> +               .sample_type                    = sample_type,
> >>>>>>>> +               .disabled                       = 1,
> >>>>>>>> +               .exclude_kernel                 = 1,
> >>>>>>>> +               .sample_simd_regs_enabled       = 1,
> >>>>>>>> +       };
> >>>>>>>> +       int fd;
> >>>>>>>> +
> >>>>>>>> +       attr.sample_period = 1;
> >>>>>>>> +
> >>>>>>>> +       if (!pred) {
> >>>>>>>> +               attr.sample_simd_vec_reg_qwords = qwords;
> >>>>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>>>>>> +                       attr.sample_simd_vec_reg_intr = mask;
> >>>>>>>> +               else
> >>>>>>>> +                       attr.sample_simd_vec_reg_user = mask;
> >>>>>>>> +       } else {
> >>>>>>>> +               attr.sample_simd_pred_reg_qwords = PERF_X86_OPMASK_QWORDS;
> >>>>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>>>>>> +                       attr.sample_simd_pred_reg_intr = PERF_X86_SIMD_PRED_MASK;
> >>>>>>>> +               else
> >>>>>>>> +                       attr.sample_simd_pred_reg_user = PERF_X86_SIMD_PRED_MASK;
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       if (perf_pmus__num_core_pmus() > 1) {
> >>>>>>>> +               struct perf_pmu *pmu = NULL;
> >>>>>>>> +               __u64 type = PERF_TYPE_RAW;
> >>>>>>>> +
> >>>>>>>> +               /*
> >>>>>>>> +                * The same register set is supported among different hybrid PMUs.
> >>>>>>>> +                * Only check the first available one.
> >>>>>>>> +                */
> >>>>>>>> +               while ((pmu = perf_pmus__scan_core(pmu)) != NULL) {
> >>>>>>>> +                       type = pmu->type;
> >>>>>>>> +                       break;
> >>>>>>>> +               }
> >>>>>>>> +               attr.config |= type << PERF_PMU_TYPE_SHIFT;
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       event_attr_init(&attr);
> >>>>>>>> +
> >>>>>>>> +       fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
> >>>>>>>> +       if (fd != -1) {
> >>>>>>>> +               close(fd);
> >>>>>>>> +               return true;
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       return false;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static bool __arch_simd_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
> >>>>>>>> +{
> >>>>>>>> +       bool supported = false;
> >>>>>>>> +       u64 bits;
> >>>>>>>> +
> >>>>>>>> +       *mask = 0;
> >>>>>>>> +       *qwords = 0;
> >>>>>>>> +
> >>>>>>>> +       switch (reg) {
> >>>>>>>> +       case PERF_REG_X86_XMM:
> >>>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
> >>>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_XMM_QWORDS, bits, false);
> >>>>>>>> +               if (supported) {
> >>>>>>>> +                       *mask = bits;
> >>>>>>>> +                       *qwords = PERF_X86_XMM_QWORDS;
> >>>>>>>> +               }
> >>>>>>>> +               break;
> >>>>>>>> +       case PERF_REG_X86_YMM:
> >>>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_YMM_REGS) - 1;
> >>>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_YMM_QWORDS, bits, false);
> >>>>>>>> +               if (supported) {
> >>>>>>>> +                       *mask = bits;
> >>>>>>>> +                       *qwords = PERF_X86_YMM_QWORDS;
> >>>>>>>> +               }
> >>>>>>>> +               break;
> >>>>>>>> +       case PERF_REG_X86_ZMM:
> >>>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_ZMM_REGS) - 1;
> >>>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
> >>>>>>>> +               if (supported) {
> >>>>>>>> +                       *mask = bits;
> >>>>>>>> +                       *qwords = PERF_X86_ZMM_QWORDS;
> >>>>>>>> +                       break;
> >>>>>>>> +               }
> >>>>>>>> +
> >>>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_ZMMH_REGS) - 1;
> >>>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
> >>>>>>>> +               if (supported) {
> >>>>>>>> +                       *mask = bits;
> >>>>>>>> +                       *qwords = PERF_X86_ZMMH_QWORDS;
> >>>>>>>> +               }
> >>>>>>>> +               break;
> >>>>>>>> +       default:
> >>>>>>>> +               break;
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       return supported;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static bool __arch_pred_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
> >>>>>>>> +{
> >>>>>>>> +       bool supported = false;
> >>>>>>>> +       u64 bits;
> >>>>>>>> +
> >>>>>>>> +       *mask = 0;
> >>>>>>>> +       *qwords = 0;
> >>>>>>>> +
> >>>>>>>> +       switch (reg) {
> >>>>>>>> +       case PERF_REG_X86_OPMASK:
> >>>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_OPMASK_REGS) - 1;
> >>>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_OPMASK_QWORDS, bits, true);
> >>>>>>>> +               if (supported) {
> >>>>>>>> +                       *mask = bits;
> >>>>>>>> +                       *qwords = PERF_X86_OPMASK_QWORDS;
> >>>>>>>> +               }
> >>>>>>>> +               break;
> >>>>>>>> +       default:
> >>>>>>>> +               break;
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       return supported;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static bool has_cap_simd_regs(void)
> >>>>>>>> +{
> >>>>>>>> +       uint64_t mask = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
> >>>>>>>> +       u16 qwords = PERF_X86_XMM_QWORDS;
> >>>>>>>> +       static bool has_cap_simd_regs;
> >>>>>>>> +       static bool cached;
> >>>>>>>> +
> >>>>>>>> +       if (cached)
> >>>>>>>> +               return has_cap_simd_regs;
> >>>>>>>> +
> >>>>>>>> +       has_cap_simd_regs = __arch_simd_reg_mask(PERF_SAMPLE_REGS_INTR,
> >>>>>>>> +                                                PERF_REG_X86_XMM, &mask, &qwords);
> >>>>>>>> +       has_cap_simd_regs |= __arch_simd_reg_mask(PERF_SAMPLE_REGS_USER,
> >>>>>>>> +                                                PERF_REG_X86_XMM, &mask, &qwords);
> >>>>>>>> +       cached = true;
> >>>>>>>> +
> >>>>>>>> +       return has_cap_simd_regs;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +bool arch_has_simd_regs(u64 mask)
> >>>>>>>> +{
> >>>>>>>> +       return has_cap_simd_regs() &&
> >>>>>>>> +              mask & GENMASK_ULL(PERF_REG_X86_SSP, PERF_REG_X86_R16);
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static const struct sample_reg sample_simd_reg_masks[] = {
> >>>>>>>> +       SMPL_REG(XMM, PERF_REG_X86_XMM),
> >>>>>>>> +       SMPL_REG(YMM, PERF_REG_X86_YMM),
> >>>>>>>> +       SMPL_REG(ZMM, PERF_REG_X86_ZMM),
> >>>>>>>> +       SMPL_REG_END
> >>>>>>>> +};
> >>>>>>>> +
> >>>>>>>> +static const struct sample_reg sample_pred_reg_masks[] = {
> >>>>>>>> +       SMPL_REG(OPMASK, PERF_REG_X86_OPMASK),
> >>>>>>>> +       SMPL_REG_END
> >>>>>>>> +};
> >>>>>>>> +
> >>>>>>>> +const struct sample_reg *arch__sample_simd_reg_masks(void)
> >>>>>>>> +{
> >>>>>>>> +       return sample_simd_reg_masks;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +const struct sample_reg *arch__sample_pred_reg_masks(void)
> >>>>>>>> +{
> >>>>>>>> +       return sample_pred_reg_masks;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static bool x86_intr_simd_updated;
> >>>>>>>> +static u64 x86_intr_simd_reg_mask;
> >>>>>>>> +static u64 x86_intr_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
> >>>>>>>> +static u16 x86_intr_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
> >>>>>>>> +static bool x86_user_simd_updated;
> >>>>>>>> +static u64 x86_user_simd_reg_mask;
> >>>>>>>> +static u64 x86_user_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
> >>>>>>>> +static u16 x86_user_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
> >>>>>>>> +
> >>>>>>>> +static bool x86_intr_pred_updated;
> >>>>>>>> +static u64 x86_intr_pred_reg_mask;
> >>>>>>>> +static u64 x86_intr_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
> >>>>>>>> +static u16 x86_intr_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
> >>>>>>>> +static bool x86_user_pred_updated;
> >>>>>>>> +static u64 x86_user_pred_reg_mask;
> >>>>>>>> +static u64 x86_user_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
> >>>>>>>> +static u16 x86_user_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
> >>>>>>>> +
> >>>>>>>> +static uint64_t __arch__simd_reg_mask(u64 sample_type)
> >>>>>>>> +{
> >>>>>>>> +       const struct sample_reg *r = NULL;
> >>>>>>>> +       bool supported;
> >>>>>>>> +       u64 mask = 0;
> >>>>>>>> +       int reg;
> >>>>>>>> +
> >>>>>>>> +       if (!has_cap_simd_regs())
> >>>>>>>> +               return 0;
> >>>>>>>> +
> >>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_simd_updated)
> >>>>>>>> +               return x86_intr_simd_reg_mask;
> >>>>>>>> +
> >>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_simd_updated)
> >>>>>>>> +               return x86_user_simd_reg_mask;
> >>>>>>>> +
> >>>>>>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> >>>>>>>> +               supported = false;
> >>>>>>>> +
> >>>>>>>> +               if (!r->mask)
> >>>>>>>> +                       continue;
> >>>>>>>> +               reg = fls64(r->mask) - 1;
> >>>>>>>> +
> >>>>>>>> +               if (reg >= PERF_REG_X86_MAX_SIMD_REGS)
> >>>>>>>> +                       break;
> >>>>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>>>>>> +                       supported = __arch_simd_reg_mask(sample_type, reg,
> >>>>>>>> +                                                        &x86_intr_simd_mask[reg],
> >>>>>>>> +                                                        &x86_intr_simd_qwords[reg]);
> >>>>>>>> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
> >>>>>>>> +                       supported = __arch_simd_reg_mask(sample_type, reg,
> >>>>>>>> +                                                        &x86_user_simd_mask[reg],
> >>>>>>>> +                                                        &x86_user_simd_qwords[reg]);
> >>>>>>>> +               if (supported)
> >>>>>>>> +                       mask |= BIT_ULL(reg);
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
> >>>>>>>> +               x86_intr_simd_reg_mask = mask;
> >>>>>>>> +               x86_intr_simd_updated = true;
> >>>>>>>> +       } else {
> >>>>>>>> +               x86_user_simd_reg_mask = mask;
> >>>>>>>> +               x86_user_simd_updated = true;
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       return mask;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static uint64_t __arch__pred_reg_mask(u64 sample_type)
> >>>>>>>> +{
> >>>>>>>> +       const struct sample_reg *r = NULL;
> >>>>>>>> +       bool supported;
> >>>>>>>> +       u64 mask = 0;
> >>>>>>>> +       int reg;
> >>>>>>>> +
> >>>>>>>> +       if (!has_cap_simd_regs())
> >>>>>>>> +               return 0;
> >>>>>>>> +
> >>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_pred_updated)
> >>>>>>>> +               return x86_intr_pred_reg_mask;
> >>>>>>>> +
> >>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_pred_updated)
> >>>>>>>> +               return x86_user_pred_reg_mask;
> >>>>>>>> +
> >>>>>>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> >>>>>>>> +               supported = false;
> >>>>>>>> +
> >>>>>>>> +               if (!r->mask)
> >>>>>>>> +                       continue;
> >>>>>>>> +               reg = fls64(r->mask) - 1;
> >>>>>>>> +
> >>>>>>>> +               if (reg >= PERF_REG_X86_MAX_PRED_REGS)
> >>>>>>>> +                       break;
> >>>>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>>>>>> +                       supported = __arch_pred_reg_mask(sample_type, reg,
> >>>>>>>> +                                                        &x86_intr_pred_mask[reg],
> >>>>>>>> +                                                        &x86_intr_pred_qwords[reg]);
> >>>>>>>> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
> >>>>>>>> +                       supported = __arch_pred_reg_mask(sample_type, reg,
> >>>>>>>> +                                                        &x86_user_pred_mask[reg],
> >>>>>>>> +                                                        &x86_user_pred_qwords[reg]);
> >>>>>>>> +               if (supported)
> >>>>>>>> +                       mask |= BIT_ULL(reg);
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
> >>>>>>>> +               x86_intr_pred_reg_mask = mask;
> >>>>>>>> +               x86_intr_pred_updated = true;
> >>>>>>>> +       } else {
> >>>>>>>> +               x86_user_pred_reg_mask = mask;
> >>>>>>>> +               x86_user_pred_updated = true;
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       return mask;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t arch__intr_simd_reg_mask(void)
> >>>>>>>> +{
> >>>>>>>> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_INTR);
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t arch__user_simd_reg_mask(void)
> >>>>>>>> +{
> >>>>>>>> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_USER);
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t arch__intr_pred_reg_mask(void)
> >>>>>>>> +{
> >>>>>>>> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_INTR);
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t arch__user_pred_reg_mask(void)
> >>>>>>>> +{
> >>>>>>>> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_USER);
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static uint64_t arch__simd_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
> >>>>>>>> +{
> >>>>>>>> +       uint64_t mask = 0;
> >>>>>>>> +
> >>>>>>>> +       *qwords = 0;
> >>>>>>>> +       if (reg < PERF_REG_X86_MAX_SIMD_REGS) {
> >>>>>>>> +               if (intr) {
> >>>>>>>> +                       *qwords = x86_intr_simd_qwords[reg];
> >>>>>>>> +                       mask = x86_intr_simd_mask[reg];
> >>>>>>>> +               } else {
> >>>>>>>> +                       *qwords = x86_user_simd_qwords[reg];
> >>>>>>>> +                       mask = x86_user_simd_mask[reg];
> >>>>>>>> +               }
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       return mask;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static uint64_t arch__pred_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
> >>>>>>>> +{
> >>>>>>>> +       uint64_t mask = 0;
> >>>>>>>> +
> >>>>>>>> +       *qwords = 0;
> >>>>>>>> +       if (reg < PERF_REG_X86_MAX_PRED_REGS) {
> >>>>>>>> +               if (intr) {
> >>>>>>>> +                       *qwords = x86_intr_pred_qwords[reg];
> >>>>>>>> +                       mask = x86_intr_pred_mask[reg];
> >>>>>>>> +               } else {
> >>>>>>>> +                       *qwords = x86_user_pred_qwords[reg];
> >>>>>>>> +                       mask = x86_user_pred_mask[reg];
> >>>>>>>> +               }
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       return mask;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords)
> >>>>>>>> +{
> >>>>>>>> +       if (!x86_intr_simd_updated)
> >>>>>>>> +               arch__intr_simd_reg_mask();
> >>>>>>>> +       return arch__simd_reg_bitmap_qwords(reg, qwords, true);
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords)
> >>>>>>>> +{
> >>>>>>>> +       if (!x86_user_simd_updated)
> >>>>>>>> +               arch__user_simd_reg_mask();
> >>>>>>>> +       return arch__simd_reg_bitmap_qwords(reg, qwords, false);
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords)
> >>>>>>>> +{
> >>>>>>>> +       if (!x86_intr_pred_updated)
> >>>>>>>> +               arch__intr_pred_reg_mask();
> >>>>>>>> +       return arch__pred_reg_bitmap_qwords(reg, qwords, true);
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords)
> >>>>>>>> +{
> >>>>>>>> +       if (!x86_user_pred_updated)
> >>>>>>>> +               arch__user_pred_reg_mask();
> >>>>>>>> +       return arch__pred_reg_bitmap_qwords(reg, qwords, false);
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>>  const struct sample_reg *arch__sample_reg_masks(void)
> >>>>>>>>  {
> >>>>>>>> +       if (has_cap_simd_regs())
> >>>>>>>> +               return sample_reg_masks_ext;
> >>>>>>>>         return sample_reg_masks;
> >>>>>>>>  }
> >>>>>>>>
> >>>>>>>> -uint64_t arch__intr_reg_mask(void)
> >>>>>>>> +static uint64_t __arch__reg_mask(u64 sample_type, u64 mask, bool has_simd_regs)
> >>>>>>>>  {
> >>>>>>>>         struct perf_event_attr attr = {
> >>>>>>>> -               .type                   = PERF_TYPE_HARDWARE,
> >>>>>>>> -               .config                 = PERF_COUNT_HW_CPU_CYCLES,
> >>>>>>>> -               .sample_type            = PERF_SAMPLE_REGS_INTR,
> >>>>>>>> -               .sample_regs_intr       = PERF_REG_EXTENDED_MASK,
> >>>>>>>> -               .precise_ip             = 1,
> >>>>>>>> -               .disabled               = 1,
> >>>>>>>> -               .exclude_kernel         = 1,
> >>>>>>>> +               .type                           = PERF_TYPE_HARDWARE,
> >>>>>>>> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
> >>>>>>>> +               .sample_type                    = sample_type,
> >>>>>>>> +               .precise_ip                     = 1,
> >>>>>>>> +               .disabled                       = 1,
> >>>>>>>> +               .exclude_kernel                 = 1,
> >>>>>>>> +               .sample_simd_regs_enabled       = has_simd_regs,
> >>>>>>>>         };
> >>>>>>>>         int fd;
> >>>>>>>>         /*
> >>>>>>>>          * In an unnamed union, init it here to build on older gcc versions
> >>>>>>>>          */
> >>>>>>>>         attr.sample_period = 1;
> >>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR)
> >>>>>>>> +               attr.sample_regs_intr = mask;
> >>>>>>>> +       else
> >>>>>>>> +               attr.sample_regs_user = mask;
> >>>>>>>>
> >>>>>>>>         if (perf_pmus__num_core_pmus() > 1) {
> >>>>>>>>                 struct perf_pmu *pmu = NULL;
> >>>>>>>> @@ -318,13 +738,41 @@ uint64_t arch__intr_reg_mask(void)
> >>>>>>>>         fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
> >>>>>>>>         if (fd != -1) {
> >>>>>>>>                 close(fd);
> >>>>>>>> -               return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
> >>>>>>>> +               return mask;
> >>>>>>>>         }
> >>>>>>>>
> >>>>>>>> -       return PERF_REGS_MASK;
> >>>>>>>> +       return 0;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t arch__intr_reg_mask(void)
> >>>>>>>> +{
> >>>>>>>> +       uint64_t mask = PERF_REGS_MASK;
> >>>>>>>> +
> >>>>>>>> +       if (has_cap_simd_regs()) {
> >>>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
> >>>>>>>> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
> >>>>>>>> +                                        true);
> >>>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
> >>>>>>>> +                                        BIT_ULL(PERF_REG_X86_SSP),
> >>>>>>>> +                                        true);
> >>>>>>>> +       } else
> >>>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR, PERF_REG_EXTENDED_MASK, false);
> >>>>>>>> +
> >>>>>>>> +       return mask;
> >>>>>>>>  }
> >>>>>>>>
> >>>>>>>>  uint64_t arch__user_reg_mask(void)
> >>>>>>>>  {
> >>>>>>>> -       return PERF_REGS_MASK;
> >>>>>>>> +       uint64_t mask = PERF_REGS_MASK;
> >>>>>>>> +
> >>>>>>>> +       if (has_cap_simd_regs()) {
> >>>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
> >>>>>>>> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
> >>>>>>>> +                                        true);
> >>>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
> >>>>>>>> +                                        BIT_ULL(PERF_REG_X86_SSP),
> >>>>>>>> +                                        true);
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       return mask;
> >>>>>>>>  }
> >>>>>>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> >>>>>>>> index 56ebefd075f2..5d1d90cf9488 100644
> >>>>>>>> --- a/tools/perf/util/evsel.c
> >>>>>>>> +++ b/tools/perf/util/evsel.c
> >>>>>>>> @@ -1461,12 +1461,39 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
> >>>>>>>>         if (opts->sample_intr_regs && !evsel->no_aux_samples &&
> >>>>>>>>             !evsel__is_dummy_event(evsel)) {
> >>>>>>>>                 attr->sample_regs_intr = opts->sample_intr_regs;
> >>>>>>>> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_intr);
> >>>>>>>> +               evsel__set_sample_bit(evsel, REGS_INTR);
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       if ((opts->sample_intr_vec_regs || opts->sample_intr_pred_regs) &&
> >>>>>>>> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
> >>>>>>>> +               /* A nonzero pred qwords implies the set of SIMD registers is used */
> >>>>>>>> +               if (opts->sample_pred_regs_qwords)
> >>>>>>>> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
> >>>>>>>> +               else
> >>>>>>>> +                       attr->sample_simd_pred_reg_qwords = 1;
> >>>>>>>> +               attr->sample_simd_vec_reg_intr = opts->sample_intr_vec_regs;
> >>>>>>>> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
> >>>>>>>> +               attr->sample_simd_pred_reg_intr = opts->sample_intr_pred_regs;
> >>>>>>>>                 evsel__set_sample_bit(evsel, REGS_INTR);
> >>>>>>>>         }
> >>>>>>>>
> >>>>>>>>         if (opts->sample_user_regs && !evsel->no_aux_samples &&
> >>>>>>>>             !evsel__is_dummy_event(evsel)) {
> >>>>>>>>                 attr->sample_regs_user |= opts->sample_user_regs;
> >>>>>>>> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_user);
> >>>>>>>> +               evsel__set_sample_bit(evsel, REGS_USER);
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       if ((opts->sample_user_vec_regs || opts->sample_user_pred_regs) &&
> >>>>>>>> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
> >>>>>>>> +               if (opts->sample_pred_regs_qwords)
> >>>>>>>> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
> >>>>>>>> +               else
> >>>>>>>> +                       attr->sample_simd_pred_reg_qwords = 1;
> >>>>>>>> +               attr->sample_simd_vec_reg_user = opts->sample_user_vec_regs;
> >>>>>>>> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
> >>>>>>>> +               attr->sample_simd_pred_reg_user = opts->sample_user_pred_regs;
> >>>>>>>>                 evsel__set_sample_bit(evsel, REGS_USER);
> >>>>>>>>         }
> >>>>>>>>
> >>>>>>>> diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
> >>>>>>>> index cda1c620968e..0bd100392889 100644
> >>>>>>>> --- a/tools/perf/util/parse-regs-options.c
> >>>>>>>> +++ b/tools/perf/util/parse-regs-options.c
> >>>>>>>> @@ -4,19 +4,139 @@
> >>>>>>>>  #include <stdint.h>
> >>>>>>>>  #include <string.h>
> >>>>>>>>  #include <stdio.h>
> >>>>>>>> +#include <linux/bitops.h>
> >>>>>>>>  #include "util/debug.h"
> >>>>>>>>  #include <subcmd/parse-options.h>
> >>>>>>>>  #include "util/perf_regs.h"
> >>>>>>>>  #include "util/parse-regs-options.h"
> >>>>>>>> +#include "record.h"
> >>>>>>>> +
> >>>>>>>> +static void __print_simd_regs(bool intr, uint64_t simd_mask)
> >>>>>>>> +{
> >>>>>>>> +       const struct sample_reg *r = NULL;
> >>>>>>>> +       uint64_t bitmap = 0;
> >>>>>>>> +       u16 qwords = 0;
> >>>>>>>> +       int reg_idx;
> >>>>>>>> +
> >>>>>>>> +       if (!simd_mask)
> >>>>>>>> +               return;
> >>>>>>>> +
> >>>>>>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> >>>>>>>> +               if (!(r->mask & simd_mask))
> >>>>>>>> +                       continue;
> >>>>>>>> +               reg_idx = fls64(r->mask) - 1;
> >>>>>>>> +               if (intr)
> >>>>>>>> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>>>> +               else
> >>>>>>>> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>>>> +               if (bitmap)
> >>>>>>>> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
> >>>>>>>> +       }
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static void __print_pred_regs(bool intr, uint64_t pred_mask)
> >>>>>>>> +{
> >>>>>>>> +       const struct sample_reg *r = NULL;
> >>>>>>>> +       uint64_t bitmap = 0;
> >>>>>>>> +       u16 qwords = 0;
> >>>>>>>> +       int reg_idx;
> >>>>>>>> +
> >>>>>>>> +       if (!pred_mask)
> >>>>>>>> +               return;
> >>>>>>>> +
> >>>>>>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> >>>>>>>> +               if (!(r->mask & pred_mask))
> >>>>>>>> +                       continue;
> >>>>>>>> +               reg_idx = fls64(r->mask) - 1;
> >>>>>>>> +               if (intr)
> >>>>>>>> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>>>> +               else
> >>>>>>>> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>>>> +               if (bitmap)
> >>>>>>>> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
> >>>>>>>> +       }
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static bool __parse_simd_regs(struct record_opts *opts, char *s, bool intr)
> >>>>>>>> +{
> >>>>>>>> +       const struct sample_reg *r = NULL;
> >>>>>>>> +       bool matched = false;
> >>>>>>>> +       uint64_t bitmap = 0;
> >>>>>>>> +       u16 qwords = 0;
> >>>>>>>> +       int reg_idx;
> >>>>>>>> +
> >>>>>>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> >>>>>>>> +               if (strcasecmp(s, r->name))
> >>>>>>>> +                       continue;
> >>>>>>>> +               if (!fls64(r->mask))
> >>>>>>>> +                       continue;
> >>>>>>>> +               reg_idx = fls64(r->mask) - 1;
> >>>>>>>> +               if (intr)
> >>>>>>>> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>>>> +               else
> >>>>>>>> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>>>> +               matched = true;
> >>>>>>>> +               break;
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       /* Just need the highest qwords */
> >>>>>>>> +       if (qwords > opts->sample_vec_regs_qwords) {
> >>>>>>>> +               opts->sample_vec_regs_qwords = qwords;
> >>>>>>>> +               if (intr)
> >>>>>>>> +                       opts->sample_intr_vec_regs = bitmap;
> >>>>>>>> +               else
> >>>>>>>> +                       opts->sample_user_vec_regs = bitmap;
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       return matched;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static bool __parse_pred_regs(struct record_opts *opts, char *s, bool intr)
> >>>>>>>> +{
> >>>>>>>> +       const struct sample_reg *r = NULL;
> >>>>>>>> +       bool matched = false;
> >>>>>>>> +       uint64_t bitmap = 0;
> >>>>>>>> +       u16 qwords = 0;
> >>>>>>>> +       int reg_idx;
> >>>>>>>> +
> >>>>>>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
> >>>>>>>> +               if (strcasecmp(s, r->name))
> >>>>>>>> +                       continue;
> >>>>>>>> +               if (!fls64(r->mask))
> >>>>>>>> +                       continue;
> >>>>>>>> +               reg_idx = fls64(r->mask) - 1;
> >>>>>>>> +               if (intr)
> >>>>>>>> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>>>> +               else
> >>>>>>>> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
> >>>>>>>> +               matched = true;
> >>>>>>>> +               break;
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       /* Just need the highest qwords */
> >>>>>>>> +       if (qwords > opts->sample_pred_regs_qwords) {
> >>>>>>>> +               opts->sample_pred_regs_qwords = qwords;
> >>>>>>>> +               if (intr)
> >>>>>>>> +                       opts->sample_intr_pred_regs = bitmap;
> >>>>>>>> +               else
> >>>>>>>> +                       opts->sample_user_pred_regs = bitmap;
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       return matched;
> >>>>>>>> +}
> >>>>>>>>
> >>>>>>>>  static int
> >>>>>>>>  __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>>>>>  {
> >>>>>>>>         uint64_t *mode = (uint64_t *)opt->value;
> >>>>>>>>         const struct sample_reg *r = NULL;
> >>>>>>>> +       struct record_opts *opts;
> >>>>>>>>         char *s, *os = NULL, *p;
> >>>>>>>> -       int ret = -1;
> >>>>>>>> +       bool has_simd_regs = false;
> >>>>>>>>         uint64_t mask;
> >>>>>>>> +       uint64_t simd_mask;
> >>>>>>>> +       uint64_t pred_mask;
> >>>>>>>> +       int ret = -1;
> >>>>>>>>
> >>>>>>>>         if (unset)
> >>>>>>>>                 return 0;
> >>>>>>>> @@ -27,10 +147,17 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>>>>>         if (*mode)
> >>>>>>>>                 return -1;
> >>>>>>>>
> >>>>>>>> -       if (intr)
> >>>>>>>> +       if (intr) {
> >>>>>>>> +               opts = container_of(opt->value, struct record_opts, sample_intr_regs);
> >>>>>>>>                 mask = arch__intr_reg_mask();
> >>>>>>>> -       else
> >>>>>>>> +               simd_mask = arch__intr_simd_reg_mask();
> >>>>>>>> +               pred_mask = arch__intr_pred_reg_mask();
> >>>>>>>> +       } else {
> >>>>>>>> +               opts = container_of(opt->value, struct record_opts, sample_user_regs);
> >>>>>>>>                 mask = arch__user_reg_mask();
> >>>>>>>> +               simd_mask = arch__user_simd_reg_mask();
> >>>>>>>> +               pred_mask = arch__user_pred_reg_mask();
> >>>>>>>> +       }
> >>>>>>>>
> >>>>>>>>         /* str may be NULL in case no arg is passed to -I */
> >>>>>>>>         if (str) {
> >>>>>>>> @@ -50,10 +177,24 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>>>>>                                         if (r->mask & mask)
> >>>>>>>>                                                 fprintf(stderr, "%s ", r->name);
> >>>>>>>>                                 }
> >>>>>>>> +                               __print_simd_regs(intr, simd_mask);
> >>>>>>>> +                               __print_pred_regs(intr, pred_mask);
> >>>>>>>>                                 fputc('\n', stderr);
> >>>>>>>>                                 /* just printing available regs */
> >>>>>>>>                                 goto error;
> >>>>>>>>                         }
> >>>>>>>> +
> >>>>>>>> +                       if (simd_mask) {
> >>>>>>>> +                               has_simd_regs = __parse_simd_regs(opts, s, intr);
> >>>>>>>> +                               if (has_simd_regs)
> >>>>>>>> +                                       goto next;
> >>>>>>>> +                       }
> >>>>>>>> +                       if (pred_mask) {
> >>>>>>>> +                               has_simd_regs = __parse_pred_regs(opts, s, intr);
> >>>>>>>> +                               if (has_simd_regs)
> >>>>>>>> +                                       goto next;
> >>>>>>>> +                       }
> >>>>>>>> +
> >>>>>>>>                         for (r = arch__sample_reg_masks(); r->name; r++) {
> >>>>>>>>                                 if ((r->mask & mask) && !strcasecmp(s, r->name))
> >>>>>>>>                                         break;
> >>>>>>>> @@ -65,7 +206,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>>>>>                         }
> >>>>>>>>
> >>>>>>>>                         *mode |= r->mask;
> >>>>>>>> -
> >>>>>>>> +next:
> >>>>>>>>                         if (!p)
> >>>>>>>>                                 break;
> >>>>>>>>
> >>>>>>>> @@ -75,7 +216,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
> >>>>>>>>         ret = 0;
> >>>>>>>>
> >>>>>>>>         /* default to all possible regs */
> >>>>>>>> -       if (*mode == 0)
> >>>>>>>> +       if (*mode == 0 && !has_simd_regs)
> >>>>>>>>                 *mode = mask;
> >>>>>>>>  error:
> >>>>>>>>         free(os);
> >>>>>>>> diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
> >>>>>>>> index 66b666d9ce64..fb0366d050cf 100644
> >>>>>>>> --- a/tools/perf/util/perf_event_attr_fprintf.c
> >>>>>>>> +++ b/tools/perf/util/perf_event_attr_fprintf.c
> >>>>>>>> @@ -360,6 +360,12 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
> >>>>>>>>         PRINT_ATTRf(aux_start_paused, p_unsigned);
> >>>>>>>>         PRINT_ATTRf(aux_pause, p_unsigned);
> >>>>>>>>         PRINT_ATTRf(aux_resume, p_unsigned);
> >>>>>>>> +       PRINT_ATTRf(sample_simd_pred_reg_qwords, p_unsigned);
> >>>>>>>> +       PRINT_ATTRf(sample_simd_pred_reg_intr, p_hex);
> >>>>>>>> +       PRINT_ATTRf(sample_simd_pred_reg_user, p_hex);
> >>>>>>>> +       PRINT_ATTRf(sample_simd_vec_reg_qwords, p_unsigned);
> >>>>>>>> +       PRINT_ATTRf(sample_simd_vec_reg_intr, p_hex);
> >>>>>>>> +       PRINT_ATTRf(sample_simd_vec_reg_user, p_hex);
> >>>>>>>>
> >>>>>>>>         return ret;
> >>>>>>>>  }
> >>>>>>>> diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
> >>>>>>>> index 44b90bbf2d07..e8a9fabc92e6 100644
> >>>>>>>> --- a/tools/perf/util/perf_regs.c
> >>>>>>>> +++ b/tools/perf/util/perf_regs.c
> >>>>>>>> @@ -11,6 +11,11 @@ int __weak arch_sdt_arg_parse_op(char *old_op __maybe_unused,
> >>>>>>>>         return SDT_ARG_SKIP;
> >>>>>>>>  }
> >>>>>>>>
> >>>>>>>> +bool __weak arch_has_simd_regs(u64 mask __maybe_unused)
> >>>>>>>> +{
> >>>>>>>> +       return false;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>>  uint64_t __weak arch__intr_reg_mask(void)
> >>>>>>>>  {
> >>>>>>>>         return 0;
> >>>>>>>> @@ -21,6 +26,50 @@ uint64_t __weak arch__user_reg_mask(void)
> >>>>>>>>         return 0;
> >>>>>>>>  }
> >>>>>>>>
> >>>>>>>> +uint64_t __weak arch__intr_simd_reg_mask(void)
> >>>>>>>> +{
> >>>>>>>> +       return 0;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t __weak arch__user_simd_reg_mask(void)
> >>>>>>>> +{
> >>>>>>>> +       return 0;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t __weak arch__intr_pred_reg_mask(void)
> >>>>>>>> +{
> >>>>>>>> +       return 0;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t __weak arch__user_pred_reg_mask(void)
> >>>>>>>> +{
> >>>>>>>> +       return 0;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t __weak arch__intr_simd_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
> >>>>>>>> +{
> >>>>>>>> +       *qwords = 0;
> >>>>>>>> +       return 0;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t __weak arch__user_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> >>>>>>>> +{
> >>>>>>>> +       *qwords = 0;
> >>>>>>>> +       return 0;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t __weak arch__intr_pred_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
> >>>>>>>> +{
> >>>>>>>> +       *qwords = 0;
> >>>>>>>> +       return 0;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +uint64_t __weak arch__user_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
> >>>>>>>> +{
> >>>>>>>> +       *qwords = 0;
> >>>>>>>> +       return 0;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>>  static const struct sample_reg sample_reg_masks[] = {
> >>>>>>>>         SMPL_REG_END
> >>>>>>>>  };
> >>>>>>>> @@ -30,6 +79,16 @@ const struct sample_reg * __weak arch__sample_reg_masks(void)
> >>>>>>>>         return sample_reg_masks;
> >>>>>>>>  }
> >>>>>>>>
> >>>>>>>> +const struct sample_reg * __weak arch__sample_simd_reg_masks(void)
> >>>>>>>> +{
> >>>>>>>> +       return sample_reg_masks;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +const struct sample_reg * __weak arch__sample_pred_reg_masks(void)
> >>>>>>>> +{
> >>>>>>>> +       return sample_reg_masks;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>>  const char *perf_reg_name(int id, const char *arch)
> >>>>>>>>  {
> >>>>>>>>         const char *reg_name = NULL;
> >>>>>>>> diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
> >>>>>>>> index f2d0736d65cc..bce9c4cfd1bf 100644
> >>>>>>>> --- a/tools/perf/util/perf_regs.h
> >>>>>>>> +++ b/tools/perf/util/perf_regs.h
> >>>>>>>> @@ -24,9 +24,20 @@ enum {
> >>>>>>>>  };
> >>>>>>>>
> >>>>>>>>  int arch_sdt_arg_parse_op(char *old_op, char **new_op);
> >>>>>>>> +bool arch_has_simd_regs(u64 mask);
> >>>>>>>>  uint64_t arch__intr_reg_mask(void);
> >>>>>>>>  uint64_t arch__user_reg_mask(void);
> >>>>>>>>  const struct sample_reg *arch__sample_reg_masks(void);
> >>>>>>>> +const struct sample_reg *arch__sample_simd_reg_masks(void);
> >>>>>>>> +const struct sample_reg *arch__sample_pred_reg_masks(void);
> >>>>>>> I wonder we can remove these functions. perf_reg_name(int id, uint16_t
> >>>>>>> e_machine) maps a perf register number and e_machine to a string. So
> >>>>>>> the sample_reg array could be replaced with:
> >>>>>>> ```
> >>>>>>> for (int perf_reg = 0; perf_reg < 64; perf_reg++) {
> >>>>>>>   uint64_t mask = 1LL << perf_reg;
> >>>>>>>   const char *name = perf_reg_name(perf_reg, EM_HOST);
> >>>>>>>   if (name == NULL)
> >>>>>>>     break;
> >>>>>>>   // use mask and name
> >>>>>>> ```
> >>>>>>> To make it work for SIMD and PRED then I guess we need to iterate
> >>>>>>> through the ABIs of enum perf_sample_regs_abi.
> >>>>>> Suppose so.
> >>>>>>
> >>>>>>
> >>>>>>>> +uint64_t arch__intr_simd_reg_mask(void);
> >>>>>>>> +uint64_t arch__user_simd_reg_mask(void);
> >>>>>>>> +uint64_t arch__intr_pred_reg_mask(void);
> >>>>>>>> +uint64_t arch__user_pred_reg_mask(void);
> >>>>>>> I think some comments would be useful here like:
> >>>>>>> ```
> >>>>>>> /* Perf register bit map with valid bits for
> >>>>>>> perf_event_attr.sample_regs_user. */
> >>>>>>> uint64_t arch__intr_reg_mask(void);
> >>>>>>> /* Perf register bit map with valid bits for
> >>>>>>> perf_event_attr.sample_regs_intr. */
> >>>>>>> uint64_t arch__user_reg_mask(void);
> >>>>>>> /* Perf register bit map with valid bits for
> >>>>>>> perf_event_attr.sample_simd_vec_reg_intr. */
> >>>>>>> uint64_t arch__intr_simd_reg_mask(void);
> >>>>>>> /* Perf register bit map with valid bits for
> >>>>>>> perf_event_attr.sample_simd_vec_reg_user. */
> >>>>>>> uint64_t arch__user_simd_reg_mask(void);
> >>>>>>> /* Perf register bit map with valid bits for
> >>>>>>> perf_event_attr.sample_simd_pred_reg_intr. */
> >>>>>>> uint64_t arch__intr_pred_reg_mask(void);
> >>>>>>> /* Perf register bit map with valid bits for
> >>>>>>> perf_event_attr.sample_simd_pred_reg_user. */
> >>>>>>> uint64_t arch__user_pred_reg_mask(void);
> >>>>>> Sure. Thanks.
> >>>>>>
> >>>>>>
> >>>>>>> ```
> >>>>>>>
> >>>>>>> Why do the arch__user_pred_reg_mask return a uint64_t when the
> >>>>>>> perf_event_attr variable is a __u32?
> >>>>>> Suppose it's a bug. :)
> >>>>>>
> >>>>>>
> >>>>>>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
> >>>>>>> I don't understand this function. The qwords is specific to a
> >>>>>>> perf_event_attr. We could have an evlist with an evsel set up to
> >>>>>>> sample say XMM registers and another evsel set up to sample ZMM
> >>>>>>> registers. Are the qwords here always for the ZMM case, or is XMM,
> >>>>>>> YMM, ZMM depending on architecture support? Why does it vary per
> >>>>>>> register? The surrounding code uses the term mask but here bitmap is
> >>>>>>> used, is the inconsistency deliberate? Why are there user and intr
> >>>>>>> functions when in the perf_event_attr there are only
> >>>>>>> sample_simd_pred_reg_qwords and sample_simd_vec_reg_qwords variables?
> >>>>>> These 4 functions are designed to get the bitmask and qwords length for
> >>>>>> a specific kind of SIMD register. E.g., for XMM on x86 platforms, the
> >>>>>> returned bitmask is 0xffff (xmm0 ~ xmm15) and the qwords length is 2 (128
> >>>>>> bits). For ZMM on x86 platforms, if the platform only supports 16 ZMM
> >>>>>> registers, the returned bitmask is 0xffff (zmm0 ~ zmm15) and the qwords
> >>>>>> length is 8 (512 bits). If the platform supports 32 ZMM registers, the
> >>>>>> returned bitmask is 0xffffffff (zmm0 ~ zmm31) and the qwords length is
> >>>>>> still 8 (512 bits).
> >>>>> What is the meaning of reg? In this file it is normally the integer
> >>>>> index for a bit in the sample_regs_user mask, but for x86 I don't see
> >>>>> enum perf_event_x86_regs having differing XMM, YMM and ZMM encodings.
> >>>>> Similarly, is qwords an out argument, but then you also have the
> >>>>> bitmap. It looks like the code is caching values but that assumes a
> >>>>> single qword length for all events.
> >>>> Yes, the "reg" argument indicates the SIMD register index. Strictly
> >>>> speaking for x86 platform, the qwords length is fixed for a specific SIMD
> >>>> register and only the register number could vary, e.g., some platforms
> >>>> could only support 16 ZMM registers, but some other platforms could support
> >>>> 32 ZMM registers. But considering this is a generic function for all kinds
> >>>> of archs, we can't ensure there are fixed length for a specific SIMD
> >>>> register on any arch, so I introduce  the "qwords" argument to increase the
> >>>> flexibility.
> >>> I'm still not understanding this :-) What is a "SIMD register
> >>> index", the file is for perf registers and naturally enum
> >>> perf_event_x86_regs on x86, but that doesn't encode YMM and ZMM
> >>> registers. Perhaps you can give some examples?
> >> Yes, it's something just like the register index in the enum
> >> perf_event_x86_regs, e.g. the index of AX register is PERF_REG_X86_AX, the
> >> index of BX is PERF_REG_X86_BX, and so on.
> >>
> >> But the difference is that each index in perf_event_x86_regs can only
> >> represent a u64 word. Assume we still wanted to represent the SIMD
> >> registers with the perf_event_x86_regs enum; then each XMM register would
> >> need 2 indexes, each YMM register 4 indexes and each ZMM register 8
> >> indexes. Considering there are 16 XMM registers, 16 YMM registers and 32
> >> ZMM registers, to represent all these indexes the enum perf_event_x86_regs
> >> would become quite large, and correspondingly the
> >> sample_regs_intr/sample_regs_user fields in perf_event_attr would have to
> >> grow a lot. That would consume much more memory.
> >>
> >> So that's why we introduce the new attributes below.
> >>
> >> +	union {
> >> +		__u16 sample_simd_regs_enabled;
> >> +		__u16 sample_simd_pred_reg_qwords;
> >> +	};
> >> +	__u32 sample_simd_pred_reg_intr;
> >> +	__u32 sample_simd_pred_reg_user;
> >> +	__u16 sample_simd_vec_reg_qwords;
> >> +	__u64 sample_simd_vec_reg_intr;
> >> +	__u64 sample_simd_vec_reg_user;
> >> +	__u32 __reserved_4;
> >>
> >> For SIMD registers, each kind of SIMD register would be treated as a
> >> whole. The sample_simd_vec_reg_qwords would be used to identify the
> >> length of the SIMD register; simultaneously it also hints which kind of
> >> SIMD register it is, since the length of each kind of SIMD register is
> >> different. E.g., suppose we want to sample XMM registers. We know there
> >> are 16 XMM registers on the x86 platform and the qwords length of an XMM
> >> register is 2, so user space needs to set the attributes like this:
> >>
> >> sample_simd_vec_reg_intr = 0xffff;
> >>
> >> sample_simd_vec_reg_qwords = 2;
> >>
> >> Come back to "reg" argument, we know there could be multiple kinds of SIMD
> >> registers supported on some kind of arch, e.g., x86 support XMM, YMM, ZMM
> >> and OPMASK SIMD registers. As each kind of SIMD register is always sampled
> >> as a whole, we don't need to represent each of SIMD register, like XMM0,
> >> XMM1, but we indeed need to distinguish different kinds of SIMD register,
> >> like XMM and YMM registers, since they have different register length and
> >> number.
> >>
> >> That's why we define the index for each kind of SIMD register, like below,
> >>
> >> +enum {
> >> +	PERF_REG_X86_XMM,
> >> +	PERF_REG_X86_YMM,
> >> +	PERF_REG_X86_ZMM,
> >> +	PERF_REG_X86_MAX_SIMD_REGS,
> >> +
> >> +	PERF_REG_X86_OPMASK = 0,
> >> +	PERF_REG_X86_MAX_PRED_REGS = 1,
> >> +};
> >>
> >> It's similar with perf_event_x86_regs, but each index represents a kind
> >> of SIMD register instead of a specific SIMD register.
> > Could you give me an example call to say
> > arch__intr_simd_reg_bitmap_qwords where you say what the value of reg
> > is, what the expected value of qwords is and what the result will be?
> > Could you do it for say a model without AVX, a model with AVX, a model
> > with AVX512 and a model with APX.
>
> Assume we are on an x86 platform which only supports XMM registers (AVX) and
> call the function arch__intr_simd_reg_bitmap_qwords() with the SIMD register index:
>
> 1. reg = PERF_REG_X86_XMM
>
> The return value (XMM registers bitmask) = 0xffff and the qwords = 2 (128 bits).

Thanks!
Can we rename PERF_REG_X86_XMM to say PERF_REG_CLASS_X86_XMM
(similarly reg to reg_class), currently the name is very close to
PERF_REG_X86_XMM0 but that value is in a different enum.
So the bitmask is in terms of the qwords whilst the regular perf
register mask is 1 64-bit qword per bit.

> 2. reg = PERF_REG_X86_YMM
>
> The return value (YMM registers bitmask) = 0 and the qwords = 0 since YMM registers are not supported.
>
> 3. reg = PERF_REG_X86_ZMM
>
> The return value (ZMM registers bitmask) = 0 and the qwords = 0 since ZMM registers are not supported.

Ok.

> Assume we are on an x86 platform which supports XMM/YMM/ZMM registers (AVX512) and call the function arch__intr_simd_reg_bitmap_qwords() with the SIMD register index:
>
> 1. reg = PERF_REG_X86_XMM
>
> The return value (XMM registers bitmask) = 0xffff and the qwords = 2 (128 bits).
>
> 2. reg = PERF_REG_X86_YMM
>
> The return value (YMM registers bitmask) = 0xffff and the qwords = 4 (256 bits).

Ok, qwords got bigger.

> 3. reg = PERF_REG_X86_ZMM
>
> The return value (ZMM registers bitmask) = 0xffffffff and the qwords = 8 (512 bits). We assume this platform supports 32 ZMM registers (ZMM0 ~ ZMM31).

Wouldn't it then also support 32 YMM and XMM registers in the 2 cases above?

> As for APX, it has nothing to do with these 4 functions; whether it's supported is determined by the helpers arch__intr_reg_mask()/arch__user_reg_mask().
> e.g.,
>
> ```
>         if (has_cap_simd_regs()) {
>                 mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
>                                          GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
>                                          true);
>                 mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
>                                          BIT_ULL(PERF_REG_X86_SSP),
>                                          true);
> ```
> If the platform supports APX eGPRs, then the returned mask from arch__intr_reg_mask()/arch__user_reg_mask() would contain the eGPRs mask, otherwise, it would not.

Thanks for the explanation. In the perf_regs.h why do YMMH and ZMMH
also appear but not as arguments for arch__intr_simd_reg_bitmap_qwords
?

> >
> > I have looked at the code and read the changes to perf_event_attr
> > which is why I was confused by your saying that ZMM could be passed in
> > as a perf register number. I am confused as why when the
> > perf_event_attr has 2 qword length related variables this code seems
> > to be setting things up so that every register can have a qword
> > length. I'm confused what is happening with the return value of this
> > function. As values are being stored into global variables, and you
> > are saying they aren't a max value, then how does this impact the
> > setting up of multiple register sampling events?
>
> Of the 2 qwords fields, sample_simd_pred_reg_qwords stores the PRED
> register length, and sample_simd_vec_reg_qwords stores the SIMD (vector)
> register length. Since the SIMD/PRED registers with a larger length
> contain the SIMD/PRED registers with a shorter length, only the largest
> length is set into the variables
> sample_simd_vec_reg_qwords/sample_simd_pred_reg_qwords.
>
> E.g.,
>
> perf record -e cycles:p -Ixmm,ymm,zmm -c 10000 -- sleep 1
>
> The sample_simd_vec_reg_qwords would be set to 8 to represent the largest
> length (ZMM) and the kernel directly samples the ZMM registers, since the
> ZMM registers fully contain the YMM and XMM registers.
>
> The reason for caching the bitmask and qwords is that they are fixed for
> a specific SIMD/PRED register on a certain x86 platform, right? E.g., the
> qwords length of an XMM register is always 2, of a YMM register 4, etc.

Ok. My confusion is the overloaded meaning of a perf register in this
file, hence it'd be nice to make the names more distinct.

> The bitmask and qwords values are retrieved from the kernel by the
> perf_event_open() syscall, which is quite expensive; if it were called
> frequently, it would impact performance heavily.

Agreed. It is a shame the existing probing/caching aren't used.

Thanks,
Ian

> >
> > Thanks,
> > Ian
> >
> >>> How does the generic differing qword per register case get encoded
> >>> into a perf_event_attr? If it can't be then this seems like
> >>> functionality for no benefit. I also don't understand how the data in
> >>> the PERF_SAMPLE_REGS_USER part of a sample could be decoded as that is
> >>> assuming a constant qword number.
> >>>
> >>>> No, the qwords would be assigned the true register length if the register
> >>>> exists on the platform, e.g., xmm = 2, ymm = 4 and zmm = 8. If the
> >>>> register is not supported on the platform, the qwords would be set to 0.
> >>> So it is a max function of the vector/pred qwords supported on the architecture.
> >> Strictly speaking, it's not "max" function of the vector/pred qwords, it's
> >> just a function to get the exact vector/pred qwords supported on the
> >> architecture since qwords length won't vary for a fixed kind of SIMD register.
> >>
> >>
> >>>>>> Since the qword length is always fixed for any given SIMD register
> >>>>>> regardless of intr or user, there is only one
> >>>>>> sample_simd_pred_reg_qwords and one sample_simd_vec_reg_qwords variable.
> >>>>> Ok.  2 variables, but 4 functions here. I think there should just be 2
> >>>>> because of this.
> >>>> Yes, the user and intr variants would be merged into only one.
> >>> Thanks,
> >>> Ian
> >>>
> >>>>> Thanks,
> >>>>> Ian
> >>>>>
> >>>>>>> Perhaps these functions should be something more like:
> >>>>>>> ```
> >>>>>>> /* Maximum value that can be assigned to
> >>>>>>> perf_event_atttr.sample_simd_pred_reg_qwords. */
> >>>>>>> uint16_t arch__simd_pred_reg_qwords_max(void);
> >>>>>>> /* Maximum value that can be assigned to
> >>>>>>> perf_event_atttr.sample_simd_vec_reg_qwords. */
> >>>>>>> uint16_t arch__simd_vec_reg_qwords_max(void);
> >>>>>>> ```
> >>>>>>> Then the bitmap computation logic can all be moved into parse-regs-options.c.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Ian
> >>>>>>>
> >>>>>>>>  const char *perf_reg_name(int id, const char *arch);
> >>>>>>>>  int perf_reg_value(u64 *valp, struct regs_dump *regs, int id);
> >>>>>>>> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
> >>>>>>>> index ea3a6c4657ee..825ffb4cc53f 100644
> >>>>>>>> --- a/tools/perf/util/record.h
> >>>>>>>> +++ b/tools/perf/util/record.h
> >>>>>>>> @@ -59,7 +59,13 @@ struct record_opts {
> >>>>>>>>         unsigned int  user_freq;
> >>>>>>>>         u64           branch_stack;
> >>>>>>>>         u64           sample_intr_regs;
> >>>>>>>> +       u64           sample_intr_vec_regs;
> >>>>>>>>         u64           sample_user_regs;
> >>>>>>>> +       u64           sample_user_vec_regs;
> >>>>>>>> +       u16           sample_pred_regs_qwords;
> >>>>>>>> +       u16           sample_vec_regs_qwords;
> >>>>>>>> +       u16           sample_intr_pred_regs;
> >>>>>>>> +       u16           sample_user_pred_regs;
> >>>>>>>>         u64           default_interval;
> >>>>>>>>         u64           user_interval;
> >>>>>>>>         size_t        auxtrace_snapshot_size;
> >>>>>>>> --
> >>>>>>>> 2.34.1
> >>>>>>>>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
  2026-01-22  7:27                   ` Ian Rogers
@ 2026-01-22  8:29                     ` Mi, Dapeng
  0 siblings, 0 replies; 86+ messages in thread
From: Mi, Dapeng @ 2026-01-22  8:29 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Adrian Hunter,
	Jiri Olsa, Alexander Shishkin, Andi Kleen, Eranian Stephane,
	Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Zide Chen, Falcon Thomas, Dapeng Mi, Xudong Hao,
	Kan Liang


On 1/22/2026 3:27 PM, Ian Rogers wrote:
> On Wed, Jan 21, 2026 at 5:49 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>
>> On 1/21/2026 10:48 PM, Ian Rogers wrote:
>>> On Tue, Jan 20, 2026 at 11:52 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>>> On 1/21/2026 3:09 PM, Ian Rogers wrote:
>>>>> On Tue, Jan 20, 2026 at 9:17 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>>>>> On 1/21/2026 2:20 AM, Ian Rogers wrote:
>>>>>>> On Tue, Jan 20, 2026 at 1:04 AM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>>>>>>> On 1/20/2026 3:39 PM, Ian Rogers wrote:
>>>>>>>>> On Tue, Dec 2, 2025 at 10:59 PM Dapeng Mi <dapeng1.mi@linux.intel.com> wrote:
>>>>>>>>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>>>>>>>>
>>>>>>>>>> This patch adds support for the newly introduced SIMD register sampling
>>>>>>>>>> format by adding the following functions:
>>>>>>>>>>
>>>>>>>>>> uint64_t arch__intr_simd_reg_mask(void);
>>>>>>>>>> uint64_t arch__user_simd_reg_mask(void);
>>>>>>>>>> uint64_t arch__intr_pred_reg_mask(void);
>>>>>>>>>> uint64_t arch__user_pred_reg_mask(void);
>>>>>>>>>> uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>>>> uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>>>> uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>>>> uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>>>>
>>>>>>>>>> The arch__{intr|user}_simd_reg_mask() functions retrieve the bitmap of
>>>>>>>>>> supported SIMD registers, such as XMM/YMM/ZMM on x86 platforms.
>>>>>>>>>>
>>>>>>>>>> The arch__{intr|user}_pred_reg_mask() functions retrieve the bitmap of
>>>>>>>>>> supported PRED registers, such as OPMASK on x86 platforms.
>>>>>>>>>>
>>>>>>>>>> The arch__{intr|user}_simd_reg_bitmap_qwords() functions provide the
>>>>>>>>>> exact bitmap and number of qwords for a specific type of SIMD register.
>>>>>>>>>> For example, for XMM registers on x86 platforms, the returned bitmap is
>>>>>>>>>> 0xffff (XMM0 ~ XMM15) and the qwords number is 2 (128 bits for each XMM).
>>>>>>>>>>
>>>>>>>>>> The arch__{intr|user}_pred_reg_bitmap_qwords() functions provide the
>>>>>>>>>> exact bitmap and number of qwords for a specific type of PRED register.
>>>>>>>>>> For example, for OPMASK registers on x86 platforms, the returned bitmap
>>>>>>>>>> is 0xff (OPMASK0 ~ OPMASK7) and the qwords number is 1 (64 bits for each
>>>>>>>>>> OPMASK).
>>>>>>>>>>
>>>>>>>>>> Additionally, the function __parse_regs() is enhanced to support parsing
>>>>>>>>>> these newly introduced SIMD registers. Currently, each type of register
>>>>>>>>>> can only be sampled collectively; sampling a specific SIMD register is
>>>>>>>>>> not supported. For example, all XMM registers are sampled together rather
>>>>>>>>>> than sampling only XMM0.
>>>>>>>>>>
>>>>>>>>>> When multiple overlapping register types, such as XMM and YMM, are
>>>>>>>>>> sampled simultaneously, only the superset (YMM registers) is sampled.
>>>>>>>>>>
>>>>>>>>>> With this patch, all supported sampling registers on x86 platforms are
>>>>>>>>>> displayed as follows.
>>>>>>>>>>
>>>>>>>>>>  $perf record -I?
>>>>>>>>>>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>>>>>>>>>>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>>>>>>>>>>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>>>>>>>>>>
>>>>>>>>>>  $perf record --user-regs=?
>>>>>>>>>>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>>>>>>>>>>  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28
>>>>>>>>>>  R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>>>>>>>>>> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>>>>>>>>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>>>>>>>>> ---
>>>>>>>>>>  tools/perf/arch/x86/util/perf_regs.c      | 470 +++++++++++++++++++++-
>>>>>>>>>>  tools/perf/util/evsel.c                   |  27 ++
>>>>>>>>>>  tools/perf/util/parse-regs-options.c      | 151 ++++++-
>>>>>>>>>>  tools/perf/util/perf_event_attr_fprintf.c |   6 +
>>>>>>>>>>  tools/perf/util/perf_regs.c               |  59 +++
>>>>>>>>>>  tools/perf/util/perf_regs.h               |  11 +
>>>>>>>>>>  tools/perf/util/record.h                  |   6 +
>>>>>>>>>>  7 files changed, 714 insertions(+), 16 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
>>>>>>>>>> index 12fd93f04802..db41430f3b07 100644
>>>>>>>>>> --- a/tools/perf/arch/x86/util/perf_regs.c
>>>>>>>>>> +++ b/tools/perf/arch/x86/util/perf_regs.c
>>>>>>>>>> @@ -13,6 +13,49 @@
>>>>>>>>>>  #include "../../../util/pmu.h"
>>>>>>>>>>  #include "../../../util/pmus.h"
>>>>>>>>>>
>>>>>>>>>> +static const struct sample_reg sample_reg_masks_ext[] = {
>>>>>>>>>> +       SMPL_REG(AX, PERF_REG_X86_AX),
>>>>>>>>>> +       SMPL_REG(BX, PERF_REG_X86_BX),
>>>>>>>>>> +       SMPL_REG(CX, PERF_REG_X86_CX),
>>>>>>>>>> +       SMPL_REG(DX, PERF_REG_X86_DX),
>>>>>>>>>> +       SMPL_REG(SI, PERF_REG_X86_SI),
>>>>>>>>>> +       SMPL_REG(DI, PERF_REG_X86_DI),
>>>>>>>>>> +       SMPL_REG(BP, PERF_REG_X86_BP),
>>>>>>>>>> +       SMPL_REG(SP, PERF_REG_X86_SP),
>>>>>>>>>> +       SMPL_REG(IP, PERF_REG_X86_IP),
>>>>>>>>>> +       SMPL_REG(FLAGS, PERF_REG_X86_FLAGS),
>>>>>>>>>> +       SMPL_REG(CS, PERF_REG_X86_CS),
>>>>>>>>>> +       SMPL_REG(SS, PERF_REG_X86_SS),
>>>>>>>>>> +#ifdef HAVE_ARCH_X86_64_SUPPORT
>>>>>>>>>> +       SMPL_REG(R8, PERF_REG_X86_R8),
>>>>>>>>>> +       SMPL_REG(R9, PERF_REG_X86_R9),
>>>>>>>>>> +       SMPL_REG(R10, PERF_REG_X86_R10),
>>>>>>>>>> +       SMPL_REG(R11, PERF_REG_X86_R11),
>>>>>>>>>> +       SMPL_REG(R12, PERF_REG_X86_R12),
>>>>>>>>>> +       SMPL_REG(R13, PERF_REG_X86_R13),
>>>>>>>>>> +       SMPL_REG(R14, PERF_REG_X86_R14),
>>>>>>>>>> +       SMPL_REG(R15, PERF_REG_X86_R15),
>>>>>>>>>> +       SMPL_REG(R16, PERF_REG_X86_R16),
>>>>>>>>>> +       SMPL_REG(R17, PERF_REG_X86_R17),
>>>>>>>>>> +       SMPL_REG(R18, PERF_REG_X86_R18),
>>>>>>>>>> +       SMPL_REG(R19, PERF_REG_X86_R19),
>>>>>>>>>> +       SMPL_REG(R20, PERF_REG_X86_R20),
>>>>>>>>>> +       SMPL_REG(R21, PERF_REG_X86_R21),
>>>>>>>>>> +       SMPL_REG(R22, PERF_REG_X86_R22),
>>>>>>>>>> +       SMPL_REG(R23, PERF_REG_X86_R23),
>>>>>>>>>> +       SMPL_REG(R24, PERF_REG_X86_R24),
>>>>>>>>>> +       SMPL_REG(R25, PERF_REG_X86_R25),
>>>>>>>>>> +       SMPL_REG(R26, PERF_REG_X86_R26),
>>>>>>>>>> +       SMPL_REG(R27, PERF_REG_X86_R27),
>>>>>>>>>> +       SMPL_REG(R28, PERF_REG_X86_R28),
>>>>>>>>>> +       SMPL_REG(R29, PERF_REG_X86_R29),
>>>>>>>>>> +       SMPL_REG(R30, PERF_REG_X86_R30),
>>>>>>>>>> +       SMPL_REG(R31, PERF_REG_X86_R31),
>>>>>>>>>> +       SMPL_REG(SSP, PERF_REG_X86_SSP),
>>>>>>>>>> +#endif
>>>>>>>>>> +       SMPL_REG_END
>>>>>>>>>> +};
>>>>>>>>>> +
>>>>>>>>>>  static const struct sample_reg sample_reg_masks[] = {
>>>>>>>>>>         SMPL_REG(AX, PERF_REG_X86_AX),
>>>>>>>>>>         SMPL_REG(BX, PERF_REG_X86_BX),
>>>>>>>>>> @@ -276,27 +319,404 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
>>>>>>>>>>         return SDT_ARG_VALID;
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>> +static bool support_simd_reg(u64 sample_type, u16 qwords, u64 mask, bool pred)
>>>>>>>>>> +{
>>>>>>>>>> +       struct perf_event_attr attr = {
>>>>>>>>>> +               .type                           = PERF_TYPE_HARDWARE,
>>>>>>>>>> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
>>>>>>>>>> +               .sample_type                    = sample_type,
>>>>>>>>>> +               .disabled                       = 1,
>>>>>>>>>> +               .exclude_kernel                 = 1,
>>>>>>>>>> +               .sample_simd_regs_enabled       = 1,
>>>>>>>>>> +       };
>>>>>>>>>> +       int fd;
>>>>>>>>>> +
>>>>>>>>>> +       attr.sample_period = 1;
>>>>>>>>>> +
>>>>>>>>>> +       if (!pred) {
>>>>>>>>>> +               attr.sample_simd_vec_reg_qwords = qwords;
>>>>>>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>>>>>> +                       attr.sample_simd_vec_reg_intr = mask;
>>>>>>>>>> +               else
>>>>>>>>>> +                       attr.sample_simd_vec_reg_user = mask;
>>>>>>>>>> +       } else {
>>>>>>>>>> +               attr.sample_simd_pred_reg_qwords = PERF_X86_OPMASK_QWORDS;
>>>>>>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>>>>>> +                       attr.sample_simd_pred_reg_intr = PERF_X86_SIMD_PRED_MASK;
>>>>>>>>>> +               else
>>>>>>>>>> +                       attr.sample_simd_pred_reg_user = PERF_X86_SIMD_PRED_MASK;
>>>>>>>>>> +       }
>>>>>>>>>> +
>>>>>>>>>> +       if (perf_pmus__num_core_pmus() > 1) {
>>>>>>>>>> +               struct perf_pmu *pmu = NULL;
>>>>>>>>>> +               __u64 type = PERF_TYPE_RAW;
>>>>>>>>>> +
>>>>>>>>>> +               /*
>>>>>>>>>> +                * The same register set is supported among different hybrid PMUs.
>>>>>>>>>> +                * Only check the first available one.
>>>>>>>>>> +                */
>>>>>>>>>> +               while ((pmu = perf_pmus__scan_core(pmu)) != NULL) {
>>>>>>>>>> +                       type = pmu->type;
>>>>>>>>>> +                       break;
>>>>>>>>>> +               }
>>>>>>>>>> +               attr.config |= type << PERF_PMU_TYPE_SHIFT;
>>>>>>>>>> +       }
>>>>>>>>>> +
>>>>>>>>>> +       event_attr_init(&attr);
>>>>>>>>>> +
>>>>>>>>>> +       fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
>>>>>>>>>> +       if (fd != -1) {
>>>>>>>>>> +               close(fd);
>>>>>>>>>> +               return true;
>>>>>>>>>> +       }
>>>>>>>>>> +
>>>>>>>>>> +       return false;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static bool __arch_simd_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
>>>>>>>>>> +{
>>>>>>>>>> +       bool supported = false;
>>>>>>>>>> +       u64 bits;
>>>>>>>>>> +
>>>>>>>>>> +       *mask = 0;
>>>>>>>>>> +       *qwords = 0;
>>>>>>>>>> +
>>>>>>>>>> +       switch (reg) {
>>>>>>>>>> +       case PERF_REG_X86_XMM:
>>>>>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
>>>>>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_XMM_QWORDS, bits, false);
>>>>>>>>>> +               if (supported) {
>>>>>>>>>> +                       *mask = bits;
>>>>>>>>>> +                       *qwords = PERF_X86_XMM_QWORDS;
>>>>>>>>>> +               }
>>>>>>>>>> +               break;
>>>>>>>>>> +       case PERF_REG_X86_YMM:
>>>>>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_YMM_REGS) - 1;
>>>>>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_YMM_QWORDS, bits, false);
>>>>>>>>>> +               if (supported) {
>>>>>>>>>> +                       *mask = bits;
>>>>>>>>>> +                       *qwords = PERF_X86_YMM_QWORDS;
>>>>>>>>>> +               }
>>>>>>>>>> +               break;
>>>>>>>>>> +       case PERF_REG_X86_ZMM:
>>>>>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_ZMM_REGS) - 1;
>>>>>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
>>>>>>>>>> +               if (supported) {
>>>>>>>>>> +                       *mask = bits;
>>>>>>>>>> +                       *qwords = PERF_X86_ZMM_QWORDS;
>>>>>>>>>> +                       break;
>>>>>>>>>> +               }
>>>>>>>>>> +
>>>>>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_ZMMH_REGS) - 1;
>>>>>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
>>>>>>>>>> +               if (supported) {
>>>>>>>>>> +                       *mask = bits;
>>>>>>>>>> +                       *qwords = PERF_X86_ZMMH_QWORDS;
>>>>>>>>>> +               }
>>>>>>>>>> +               break;
>>>>>>>>>> +       default:
>>>>>>>>>> +               break;
>>>>>>>>>> +       }
>>>>>>>>>> +
>>>>>>>>>> +       return supported;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static bool __arch_pred_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
>>>>>>>>>> +{
>>>>>>>>>> +       bool supported = false;
>>>>>>>>>> +       u64 bits;
>>>>>>>>>> +
>>>>>>>>>> +       *mask = 0;
>>>>>>>>>> +       *qwords = 0;
>>>>>>>>>> +
>>>>>>>>>> +       switch (reg) {
>>>>>>>>>> +       case PERF_REG_X86_OPMASK:
>>>>>>>>>> +               bits = BIT_ULL(PERF_X86_SIMD_OPMASK_REGS) - 1;
>>>>>>>>>> +               supported = support_simd_reg(sample_type, PERF_X86_OPMASK_QWORDS, bits, true);
>>>>>>>>>> +               if (supported) {
>>>>>>>>>> +                       *mask = bits;
>>>>>>>>>> +                       *qwords = PERF_X86_OPMASK_QWORDS;
>>>>>>>>>> +               }
>>>>>>>>>> +               break;
>>>>>>>>>> +       default:
>>>>>>>>>> +               break;
>>>>>>>>>> +       }
>>>>>>>>>> +
>>>>>>>>>> +       return supported;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static bool has_cap_simd_regs(void)
>>>>>>>>>> +{
>>>>>>>>>> +       uint64_t mask = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
>>>>>>>>>> +       u16 qwords = PERF_X86_XMM_QWORDS;
>>>>>>>>>> +       static bool has_cap_simd_regs;
>>>>>>>>>> +       static bool cached;
>>>>>>>>>> +
>>>>>>>>>> +       if (cached)
>>>>>>>>>> +               return has_cap_simd_regs;
>>>>>>>>>> +
>>>>>>>>>> +       has_cap_simd_regs = __arch_simd_reg_mask(PERF_SAMPLE_REGS_INTR,
>>>>>>>>>> +                                                PERF_REG_X86_XMM, &mask, &qwords);
>>>>>>>>>> +       has_cap_simd_regs |= __arch_simd_reg_mask(PERF_SAMPLE_REGS_USER,
>>>>>>>>>> +                                                PERF_REG_X86_XMM, &mask, &qwords);
>>>>>>>>>> +       cached = true;
>>>>>>>>>> +
>>>>>>>>>> +       return has_cap_simd_regs;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +bool arch_has_simd_regs(u64 mask)
>>>>>>>>>> +{
>>>>>>>>>> +       return has_cap_simd_regs() &&
>>>>>>>>>> +              mask & GENMASK_ULL(PERF_REG_X86_SSP, PERF_REG_X86_R16);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static const struct sample_reg sample_simd_reg_masks[] = {
>>>>>>>>>> +       SMPL_REG(XMM, PERF_REG_X86_XMM),
>>>>>>>>>> +       SMPL_REG(YMM, PERF_REG_X86_YMM),
>>>>>>>>>> +       SMPL_REG(ZMM, PERF_REG_X86_ZMM),
>>>>>>>>>> +       SMPL_REG_END
>>>>>>>>>> +};
>>>>>>>>>> +
>>>>>>>>>> +static const struct sample_reg sample_pred_reg_masks[] = {
>>>>>>>>>> +       SMPL_REG(OPMASK, PERF_REG_X86_OPMASK),
>>>>>>>>>> +       SMPL_REG_END
>>>>>>>>>> +};
>>>>>>>>>> +
>>>>>>>>>> +const struct sample_reg *arch__sample_simd_reg_masks(void)
>>>>>>>>>> +{
>>>>>>>>>> +       return sample_simd_reg_masks;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +const struct sample_reg *arch__sample_pred_reg_masks(void)
>>>>>>>>>> +{
>>>>>>>>>> +       return sample_pred_reg_masks;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static bool x86_intr_simd_updated;
>>>>>>>>>> +static u64 x86_intr_simd_reg_mask;
>>>>>>>>>> +static u64 x86_intr_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
>>>>>>>>>> +static u16 x86_intr_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
>>>>>>>>>> +static bool x86_user_simd_updated;
>>>>>>>>>> +static u64 x86_user_simd_reg_mask;
>>>>>>>>>> +static u64 x86_user_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
>>>>>>>>>> +static u16 x86_user_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
>>>>>>>>>> +
>>>>>>>>>> +static bool x86_intr_pred_updated;
>>>>>>>>>> +static u64 x86_intr_pred_reg_mask;
>>>>>>>>>> +static u64 x86_intr_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
>>>>>>>>>> +static u16 x86_intr_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
>>>>>>>>>> +static bool x86_user_pred_updated;
>>>>>>>>>> +static u64 x86_user_pred_reg_mask;
>>>>>>>>>> +static u64 x86_user_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
>>>>>>>>>> +static u16 x86_user_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
>>>>>>>>>> +
>>>>>>>>>> +static uint64_t __arch__simd_reg_mask(u64 sample_type)
>>>>>>>>>> +{
>>>>>>>>>> +       const struct sample_reg *r = NULL;
>>>>>>>>>> +       bool supported;
>>>>>>>>>> +       u64 mask = 0;
>>>>>>>>>> +       int reg;
>>>>>>>>>> +
>>>>>>>>>> +       if (!has_cap_simd_regs())
>>>>>>>>>> +               return 0;
>>>>>>>>>> +
>>>>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_simd_updated)
>>>>>>>>>> +               return x86_intr_simd_reg_mask;
>>>>>>>>>> +
>>>>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_simd_updated)
>>>>>>>>>> +               return x86_user_simd_reg_mask;
>>>>>>>>>> +
>>>>>>>>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>>>>>>>>>> +               supported = false;
>>>>>>>>>> +
>>>>>>>>>> +               if (!r->mask)
>>>>>>>>>> +                       continue;
>>>>>>>>>> +               reg = fls64(r->mask) - 1;
>>>>>>>>>> +
>>>>>>>>>> +               if (reg >= PERF_REG_X86_MAX_SIMD_REGS)
>>>>>>>>>> +                       break;
>>>>>>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>>>>>> +                       supported = __arch_simd_reg_mask(sample_type, reg,
>>>>>>>>>> +                                                        &x86_intr_simd_mask[reg],
>>>>>>>>>> +                                                        &x86_intr_simd_qwords[reg]);
>>>>>>>>>> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
>>>>>>>>>> +                       supported = __arch_simd_reg_mask(sample_type, reg,
>>>>>>>>>> +                                                        &x86_user_simd_mask[reg],
>>>>>>>>>> +                                                        &x86_user_simd_qwords[reg]);
>>>>>>>>>> +               if (supported)
>>>>>>>>>> +                       mask |= BIT_ULL(reg);
>>>>>>>>>> +       }
>>>>>>>>>> +
>>>>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
>>>>>>>>>> +               x86_intr_simd_reg_mask = mask;
>>>>>>>>>> +               x86_intr_simd_updated = true;
>>>>>>>>>> +       } else {
>>>>>>>>>> +               x86_user_simd_reg_mask = mask;
>>>>>>>>>> +               x86_user_simd_updated = true;
>>>>>>>>>> +       }
>>>>>>>>>> +
>>>>>>>>>> +       return mask;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static uint64_t __arch__pred_reg_mask(u64 sample_type)
>>>>>>>>>> +{
>>>>>>>>>> +       const struct sample_reg *r = NULL;
>>>>>>>>>> +       bool supported;
>>>>>>>>>> +       u64 mask = 0;
>>>>>>>>>> +       int reg;
>>>>>>>>>> +
>>>>>>>>>> +       if (!has_cap_simd_regs())
>>>>>>>>>> +               return 0;
>>>>>>>>>> +
>>>>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR && x86_intr_pred_updated)
>>>>>>>>>> +               return x86_intr_pred_reg_mask;
>>>>>>>>>> +
>>>>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_USER && x86_user_pred_updated)
>>>>>>>>>> +               return x86_user_pred_reg_mask;
>>>>>>>>>> +
>>>>>>>>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>>>>>>>>>> +               supported = false;
>>>>>>>>>> +
>>>>>>>>>> +               if (!r->mask)
>>>>>>>>>> +                       continue;
>>>>>>>>>> +               reg = fls64(r->mask) - 1;
>>>>>>>>>> +
>>>>>>>>>> +               if (reg >= PERF_REG_X86_MAX_PRED_REGS)
>>>>>>>>>> +                       break;
>>>>>>>>>> +               if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>>>>>> +                       supported = __arch_pred_reg_mask(sample_type, reg,
>>>>>>>>>> +                                                        &x86_intr_pred_mask[reg],
>>>>>>>>>> +                                                        &x86_intr_pred_qwords[reg]);
>>>>>>>>>> +               else if (sample_type == PERF_SAMPLE_REGS_USER)
>>>>>>>>>> +                       supported = __arch_pred_reg_mask(sample_type, reg,
>>>>>>>>>> +                                                        &x86_user_pred_mask[reg],
>>>>>>>>>> +                                                        &x86_user_pred_qwords[reg]);
>>>>>>>>>> +               if (supported)
>>>>>>>>>> +                       mask |= BIT_ULL(reg);
>>>>>>>>>> +       }
>>>>>>>>>> +
>>>>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR) {
>>>>>>>>>> +               x86_intr_pred_reg_mask = mask;
>>>>>>>>>> +               x86_intr_pred_updated = true;
>>>>>>>>>> +       } else {
>>>>>>>>>> +               x86_user_pred_reg_mask = mask;
>>>>>>>>>> +               x86_user_pred_updated = true;
>>>>>>>>>> +       }
>>>>>>>>>> +
>>>>>>>>>> +       return mask;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t arch__intr_simd_reg_mask(void)
>>>>>>>>>> +{
>>>>>>>>>> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_INTR);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t arch__user_simd_reg_mask(void)
>>>>>>>>>> +{
>>>>>>>>>> +       return __arch__simd_reg_mask(PERF_SAMPLE_REGS_USER);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t arch__intr_pred_reg_mask(void)
>>>>>>>>>> +{
>>>>>>>>>> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_INTR);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t arch__user_pred_reg_mask(void)
>>>>>>>>>> +{
>>>>>>>>>> +       return __arch__pred_reg_mask(PERF_SAMPLE_REGS_USER);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static uint64_t arch__simd_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
>>>>>>>>>> +{
>>>>>>>>>> +       uint64_t mask = 0;
>>>>>>>>>> +
>>>>>>>>>> +       *qwords = 0;
>>>>>>>>>> +       if (reg < PERF_REG_X86_MAX_SIMD_REGS) {
>>>>>>>>>> +               if (intr) {
>>>>>>>>>> +                       *qwords = x86_intr_simd_qwords[reg];
>>>>>>>>>> +                       mask = x86_intr_simd_mask[reg];
>>>>>>>>>> +               } else {
>>>>>>>>>> +                       *qwords = x86_user_simd_qwords[reg];
>>>>>>>>>> +                       mask = x86_user_simd_mask[reg];
>>>>>>>>>> +               }
>>>>>>>>>> +       }
>>>>>>>>>> +
>>>>>>>>>> +       return mask;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static uint64_t arch__pred_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
>>>>>>>>>> +{
>>>>>>>>>> +       uint64_t mask = 0;
>>>>>>>>>> +
>>>>>>>>>> +       *qwords = 0;
>>>>>>>>>> +       if (reg < PERF_REG_X86_MAX_PRED_REGS) {
>>>>>>>>>> +               if (intr) {
>>>>>>>>>> +                       *qwords = x86_intr_pred_qwords[reg];
>>>>>>>>>> +                       mask = x86_intr_pred_mask[reg];
>>>>>>>>>> +               } else {
>>>>>>>>>> +                       *qwords = x86_user_pred_qwords[reg];
>>>>>>>>>> +                       mask = x86_user_pred_mask[reg];
>>>>>>>>>> +               }
>>>>>>>>>> +       }
>>>>>>>>>> +
>>>>>>>>>> +       return mask;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>>>>>> +{
>>>>>>>>>> +       if (!x86_intr_simd_updated)
>>>>>>>>>> +               arch__intr_simd_reg_mask();
>>>>>>>>>> +       return arch__simd_reg_bitmap_qwords(reg, qwords, true);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>>>>>> +{
>>>>>>>>>> +       if (!x86_user_simd_updated)
>>>>>>>>>> +               arch__user_simd_reg_mask();
>>>>>>>>>> +       return arch__simd_reg_bitmap_qwords(reg, qwords, false);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>>>>>> +{
>>>>>>>>>> +       if (!x86_intr_pred_updated)
>>>>>>>>>> +               arch__intr_pred_reg_mask();
>>>>>>>>>> +       return arch__pred_reg_bitmap_qwords(reg, qwords, true);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords)
>>>>>>>>>> +{
>>>>>>>>>> +       if (!x86_user_pred_updated)
>>>>>>>>>> +               arch__user_pred_reg_mask();
>>>>>>>>>> +       return arch__pred_reg_bitmap_qwords(reg, qwords, false);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>>  const struct sample_reg *arch__sample_reg_masks(void)
>>>>>>>>>>  {
>>>>>>>>>> +       if (has_cap_simd_regs())
>>>>>>>>>> +               return sample_reg_masks_ext;
>>>>>>>>>>         return sample_reg_masks;
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>> -uint64_t arch__intr_reg_mask(void)
>>>>>>>>>> +static uint64_t __arch__reg_mask(u64 sample_type, u64 mask, bool has_simd_regs)
>>>>>>>>>>  {
>>>>>>>>>>         struct perf_event_attr attr = {
>>>>>>>>>> -               .type                   = PERF_TYPE_HARDWARE,
>>>>>>>>>> -               .config                 = PERF_COUNT_HW_CPU_CYCLES,
>>>>>>>>>> -               .sample_type            = PERF_SAMPLE_REGS_INTR,
>>>>>>>>>> -               .sample_regs_intr       = PERF_REG_EXTENDED_MASK,
>>>>>>>>>> -               .precise_ip             = 1,
>>>>>>>>>> -               .disabled               = 1,
>>>>>>>>>> -               .exclude_kernel         = 1,
>>>>>>>>>> +               .type                           = PERF_TYPE_HARDWARE,
>>>>>>>>>> +               .config                         = PERF_COUNT_HW_CPU_CYCLES,
>>>>>>>>>> +               .sample_type                    = sample_type,
>>>>>>>>>> +               .precise_ip                     = 1,
>>>>>>>>>> +               .disabled                       = 1,
>>>>>>>>>> +               .exclude_kernel                 = 1,
>>>>>>>>>> +               .sample_simd_regs_enabled       = has_simd_regs,
>>>>>>>>>>         };
>>>>>>>>>>         int fd;
>>>>>>>>>>         /*
>>>>>>>>>>          * In an unnamed union, init it here to build on older gcc versions
>>>>>>>>>>          */
>>>>>>>>>>         attr.sample_period = 1;
>>>>>>>>>> +       if (sample_type == PERF_SAMPLE_REGS_INTR)
>>>>>>>>>> +               attr.sample_regs_intr = mask;
>>>>>>>>>> +       else
>>>>>>>>>> +               attr.sample_regs_user = mask;
>>>>>>>>>>
>>>>>>>>>>         if (perf_pmus__num_core_pmus() > 1) {
>>>>>>>>>>                 struct perf_pmu *pmu = NULL;
>>>>>>>>>> @@ -318,13 +738,41 @@ uint64_t arch__intr_reg_mask(void)
>>>>>>>>>>         fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
>>>>>>>>>>         if (fd != -1) {
>>>>>>>>>>                 close(fd);
>>>>>>>>>> -               return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
>>>>>>>>>> +               return mask;
>>>>>>>>>>         }
>>>>>>>>>>
>>>>>>>>>> -       return PERF_REGS_MASK;
>>>>>>>>>> +       return 0;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t arch__intr_reg_mask(void)
>>>>>>>>>> +{
>>>>>>>>>> +       uint64_t mask = PERF_REGS_MASK;
>>>>>>>>>> +
>>>>>>>>>> +       if (has_cap_simd_regs()) {
>>>>>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
>>>>>>>>>> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
>>>>>>>>>> +                                        true);
>>>>>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
>>>>>>>>>> +                                        BIT_ULL(PERF_REG_X86_SSP),
>>>>>>>>>> +                                        true);
>>>>>>>>>> +       } else
>>>>>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR, PERF_REG_EXTENDED_MASK, false);
>>>>>>>>>> +
>>>>>>>>>> +       return mask;
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>>  uint64_t arch__user_reg_mask(void)
>>>>>>>>>>  {
>>>>>>>>>> -       return PERF_REGS_MASK;
>>>>>>>>>> +       uint64_t mask = PERF_REGS_MASK;
>>>>>>>>>> +
>>>>>>>>>> +       if (has_cap_simd_regs()) {
>>>>>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
>>>>>>>>>> +                                        GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
>>>>>>>>>> +                                        true);
>>>>>>>>>> +               mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
>>>>>>>>>> +                                        BIT_ULL(PERF_REG_X86_SSP),
>>>>>>>>>> +                                        true);
>>>>>>>>>> +       }
>>>>>>>>>> +
>>>>>>>>>> +       return mask;
>>>>>>>>>>  }
>>>>>>>>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>>>>>>>>>> index 56ebefd075f2..5d1d90cf9488 100644
>>>>>>>>>> --- a/tools/perf/util/evsel.c
>>>>>>>>>> +++ b/tools/perf/util/evsel.c
>>>>>>>>>> @@ -1461,12 +1461,39 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
>>>>>>>>>>         if (opts->sample_intr_regs && !evsel->no_aux_samples &&
>>>>>>>>>>             !evsel__is_dummy_event(evsel)) {
>>>>>>>>>>                 attr->sample_regs_intr = opts->sample_intr_regs;
>>>>>>>>>> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_intr);
>>>>>>>>>> +               evsel__set_sample_bit(evsel, REGS_INTR);
>>>>>>>>>> +       }
>>>>>>>>>> +
>>>>>>>>>> +       if ((opts->sample_intr_vec_regs || opts->sample_intr_pred_regs) &&
>>>>>>>>>> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
> >>>>>>>>>> +               /* A non-zero pred qwords implies the SIMD register set is in use */
>>>>>>>>>> +               if (opts->sample_pred_regs_qwords)
>>>>>>>>>> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
>>>>>>>>>> +               else
>>>>>>>>>> +                       attr->sample_simd_pred_reg_qwords = 1;
>>>>>>>>>> +               attr->sample_simd_vec_reg_intr = opts->sample_intr_vec_regs;
>>>>>>>>>> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
>>>>>>>>>> +               attr->sample_simd_pred_reg_intr = opts->sample_intr_pred_regs;
>>>>>>>>>>                 evsel__set_sample_bit(evsel, REGS_INTR);
>>>>>>>>>>         }
>>>>>>>>>>
>>>>>>>>>>         if (opts->sample_user_regs && !evsel->no_aux_samples &&
>>>>>>>>>>             !evsel__is_dummy_event(evsel)) {
>>>>>>>>>>                 attr->sample_regs_user |= opts->sample_user_regs;
>>>>>>>>>> +               attr->sample_simd_regs_enabled = arch_has_simd_regs(attr->sample_regs_user);
>>>>>>>>>> +               evsel__set_sample_bit(evsel, REGS_USER);
>>>>>>>>>> +       }
>>>>>>>>>> +
>>>>>>>>>> +       if ((opts->sample_user_vec_regs || opts->sample_user_pred_regs) &&
>>>>>>>>>> +           !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
>>>>>>>>>> +               if (opts->sample_pred_regs_qwords)
>>>>>>>>>> +                       attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
>>>>>>>>>> +               else
>>>>>>>>>> +                       attr->sample_simd_pred_reg_qwords = 1;
>>>>>>>>>> +               attr->sample_simd_vec_reg_user = opts->sample_user_vec_regs;
>>>>>>>>>> +               attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
>>>>>>>>>> +               attr->sample_simd_pred_reg_user = opts->sample_user_pred_regs;
>>>>>>>>>>                 evsel__set_sample_bit(evsel, REGS_USER);
>>>>>>>>>>         }
>>>>>>>>>>
>>>>>>>>>> diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
>>>>>>>>>> index cda1c620968e..0bd100392889 100644
>>>>>>>>>> --- a/tools/perf/util/parse-regs-options.c
>>>>>>>>>> +++ b/tools/perf/util/parse-regs-options.c
>>>>>>>>>> @@ -4,19 +4,139 @@
>>>>>>>>>>  #include <stdint.h>
>>>>>>>>>>  #include <string.h>
>>>>>>>>>>  #include <stdio.h>
>>>>>>>>>> +#include <linux/bitops.h>
>>>>>>>>>>  #include "util/debug.h"
>>>>>>>>>>  #include <subcmd/parse-options.h>
>>>>>>>>>>  #include "util/perf_regs.h"
>>>>>>>>>>  #include "util/parse-regs-options.h"
>>>>>>>>>> +#include "record.h"
>>>>>>>>>> +
>>>>>>>>>> +static void __print_simd_regs(bool intr, uint64_t simd_mask)
>>>>>>>>>> +{
>>>>>>>>>> +       const struct sample_reg *r = NULL;
>>>>>>>>>> +       uint64_t bitmap = 0;
>>>>>>>>>> +       u16 qwords = 0;
>>>>>>>>>> +       int reg_idx;
>>>>>>>>>> +
>>>>>>>>>> +       if (!simd_mask)
>>>>>>>>>> +               return;
>>>>>>>>>> +
>>>>>>>>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>>>>>>>>>> +               if (!(r->mask & simd_mask))
>>>>>>>>>> +                       continue;
>>>>>>>>>> +               reg_idx = fls64(r->mask) - 1;
>>>>>>>>>> +               if (intr)
>>>>>>>>>> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>>>> +               else
>>>>>>>>>> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>>>> +               if (bitmap)
>>>>>>>>>> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
>>>>>>>>>> +       }
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static void __print_pred_regs(bool intr, uint64_t pred_mask)
>>>>>>>>>> +{
>>>>>>>>>> +       const struct sample_reg *r = NULL;
>>>>>>>>>> +       uint64_t bitmap = 0;
>>>>>>>>>> +       u16 qwords = 0;
>>>>>>>>>> +       int reg_idx;
>>>>>>>>>> +
>>>>>>>>>> +       if (!pred_mask)
>>>>>>>>>> +               return;
>>>>>>>>>> +
>>>>>>>>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>>>>>>>>>> +               if (!(r->mask & pred_mask))
>>>>>>>>>> +                       continue;
>>>>>>>>>> +               reg_idx = fls64(r->mask) - 1;
>>>>>>>>>> +               if (intr)
>>>>>>>>>> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>>>> +               else
>>>>>>>>>> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>>>> +               if (bitmap)
>>>>>>>>>> +                       fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
>>>>>>>>>> +       }
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static bool __parse_simd_regs(struct record_opts *opts, char *s, bool intr)
>>>>>>>>>> +{
>>>>>>>>>> +       const struct sample_reg *r = NULL;
>>>>>>>>>> +       bool matched = false;
>>>>>>>>>> +       uint64_t bitmap = 0;
>>>>>>>>>> +       u16 qwords = 0;
>>>>>>>>>> +       int reg_idx;
>>>>>>>>>> +
>>>>>>>>>> +       for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>>>>>>>>>> +               if (strcasecmp(s, r->name))
>>>>>>>>>> +                       continue;
>>>>>>>>>> +               if (!fls64(r->mask))
>>>>>>>>>> +                       continue;
>>>>>>>>>> +               reg_idx = fls64(r->mask) - 1;
>>>>>>>>>> +               if (intr)
>>>>>>>>>> +                       bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>>>> +               else
>>>>>>>>>> +                       bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>>>> +               matched = true;
>>>>>>>>>> +               break;
>>>>>>>>>> +       }
>>>>>>>>>> +
>>>>>>>>>> +       /* Just need the highest qwords */
>>>>>>>>>> +       if (qwords > opts->sample_vec_regs_qwords) {
>>>>>>>>>> +               opts->sample_vec_regs_qwords = qwords;
>>>>>>>>>> +               if (intr)
>>>>>>>>>> +                       opts->sample_intr_vec_regs = bitmap;
>>>>>>>>>> +               else
>>>>>>>>>> +                       opts->sample_user_vec_regs = bitmap;
>>>>>>>>>> +       }
>>>>>>>>>> +
>>>>>>>>>> +       return matched;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static bool __parse_pred_regs(struct record_opts *opts, char *s, bool intr)
>>>>>>>>>> +{
>>>>>>>>>> +       const struct sample_reg *r = NULL;
>>>>>>>>>> +       bool matched = false;
>>>>>>>>>> +       uint64_t bitmap = 0;
>>>>>>>>>> +       u16 qwords = 0;
>>>>>>>>>> +       int reg_idx;
>>>>>>>>>> +
>>>>>>>>>> +       for (r = arch__sample_pred_reg_masks(); r->name; r++) {
>>>>>>>>>> +               if (strcasecmp(s, r->name))
>>>>>>>>>> +                       continue;
>>>>>>>>>> +               if (!fls64(r->mask))
>>>>>>>>>> +                       continue;
>>>>>>>>>> +               reg_idx = fls64(r->mask) - 1;
>>>>>>>>>> +               if (intr)
>>>>>>>>>> +                       bitmap = arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>>>> +               else
>>>>>>>>>> +                       bitmap = arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
>>>>>>>>>> +               matched = true;
>>>>>>>>>> +               break;
>>>>>>>>>> +       }
>>>>>>>>>> +
>>>>>>>>>> +       /* Just need the highest qwords */
>>>>>>>>>> +       if (qwords > opts->sample_pred_regs_qwords) {
>>>>>>>>>> +               opts->sample_pred_regs_qwords = qwords;
>>>>>>>>>> +               if (intr)
>>>>>>>>>> +                       opts->sample_intr_pred_regs = bitmap;
>>>>>>>>>> +               else
>>>>>>>>>> +                       opts->sample_user_pred_regs = bitmap;
>>>>>>>>>> +       }
>>>>>>>>>> +
>>>>>>>>>> +       return matched;
>>>>>>>>>> +}
>>>>>>>>>>
>>>>>>>>>>  static int
>>>>>>>>>>  __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>>>>>  {
>>>>>>>>>>         uint64_t *mode = (uint64_t *)opt->value;
>>>>>>>>>>         const struct sample_reg *r = NULL;
>>>>>>>>>> +       struct record_opts *opts;
>>>>>>>>>>         char *s, *os = NULL, *p;
>>>>>>>>>> -       int ret = -1;
>>>>>>>>>> +       bool has_simd_regs = false;
>>>>>>>>>>         uint64_t mask;
>>>>>>>>>> +       uint64_t simd_mask;
>>>>>>>>>> +       uint64_t pred_mask;
>>>>>>>>>> +       int ret = -1;
>>>>>>>>>>
>>>>>>>>>>         if (unset)
>>>>>>>>>>                 return 0;
>>>>>>>>>> @@ -27,10 +147,17 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>>>>>         if (*mode)
>>>>>>>>>>                 return -1;
>>>>>>>>>>
>>>>>>>>>> -       if (intr)
>>>>>>>>>> +       if (intr) {
>>>>>>>>>> +               opts = container_of(opt->value, struct record_opts, sample_intr_regs);
>>>>>>>>>>                 mask = arch__intr_reg_mask();
>>>>>>>>>> -       else
>>>>>>>>>> +               simd_mask = arch__intr_simd_reg_mask();
>>>>>>>>>> +               pred_mask = arch__intr_pred_reg_mask();
>>>>>>>>>> +       } else {
>>>>>>>>>> +               opts = container_of(opt->value, struct record_opts, sample_user_regs);
>>>>>>>>>>                 mask = arch__user_reg_mask();
>>>>>>>>>> +               simd_mask = arch__user_simd_reg_mask();
>>>>>>>>>> +               pred_mask = arch__user_pred_reg_mask();
>>>>>>>>>> +       }
>>>>>>>>>>
>>>>>>>>>>         /* str may be NULL in case no arg is passed to -I */
>>>>>>>>>>         if (str) {
>>>>>>>>>> @@ -50,10 +177,24 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>>>>>                                         if (r->mask & mask)
>>>>>>>>>>                                                 fprintf(stderr, "%s ", r->name);
>>>>>>>>>>                                 }
>>>>>>>>>> +                               __print_simd_regs(intr, simd_mask);
>>>>>>>>>> +                               __print_pred_regs(intr, pred_mask);
>>>>>>>>>>                                 fputc('\n', stderr);
>>>>>>>>>>                                 /* just printing available regs */
>>>>>>>>>>                                 goto error;
>>>>>>>>>>                         }
>>>>>>>>>> +
>>>>>>>>>> +                       if (simd_mask) {
>>>>>>>>>> +                               has_simd_regs = __parse_simd_regs(opts, s, intr);
>>>>>>>>>> +                               if (has_simd_regs)
>>>>>>>>>> +                                       goto next;
>>>>>>>>>> +                       }
>>>>>>>>>> +                       if (pred_mask) {
>>>>>>>>>> +                               has_simd_regs = __parse_pred_regs(opts, s, intr);
>>>>>>>>>> +                               if (has_simd_regs)
>>>>>>>>>> +                                       goto next;
>>>>>>>>>> +                       }
>>>>>>>>>> +
>>>>>>>>>>                         for (r = arch__sample_reg_masks(); r->name; r++) {
>>>>>>>>>>                                 if ((r->mask & mask) && !strcasecmp(s, r->name))
>>>>>>>>>>                                         break;
>>>>>>>>>> @@ -65,7 +206,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>>>>>                         }
>>>>>>>>>>
>>>>>>>>>>                         *mode |= r->mask;
>>>>>>>>>> -
>>>>>>>>>> +next:
>>>>>>>>>>                         if (!p)
>>>>>>>>>>                                 break;
>>>>>>>>>>
>>>>>>>>>> @@ -75,7 +216,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>>>>>>>>>         ret = 0;
>>>>>>>>>>
>>>>>>>>>>         /* default to all possible regs */
>>>>>>>>>> -       if (*mode == 0)
>>>>>>>>>> +       if (*mode == 0 && !has_simd_regs)
>>>>>>>>>>                 *mode = mask;
>>>>>>>>>>  error:
>>>>>>>>>>         free(os);
>>>>>>>>>> diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
>>>>>>>>>> index 66b666d9ce64..fb0366d050cf 100644
>>>>>>>>>> --- a/tools/perf/util/perf_event_attr_fprintf.c
>>>>>>>>>> +++ b/tools/perf/util/perf_event_attr_fprintf.c
>>>>>>>>>> @@ -360,6 +360,12 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
>>>>>>>>>>         PRINT_ATTRf(aux_start_paused, p_unsigned);
>>>>>>>>>>         PRINT_ATTRf(aux_pause, p_unsigned);
>>>>>>>>>>         PRINT_ATTRf(aux_resume, p_unsigned);
>>>>>>>>>> +       PRINT_ATTRf(sample_simd_pred_reg_qwords, p_unsigned);
>>>>>>>>>> +       PRINT_ATTRf(sample_simd_pred_reg_intr, p_hex);
>>>>>>>>>> +       PRINT_ATTRf(sample_simd_pred_reg_user, p_hex);
>>>>>>>>>> +       PRINT_ATTRf(sample_simd_vec_reg_qwords, p_unsigned);
>>>>>>>>>> +       PRINT_ATTRf(sample_simd_vec_reg_intr, p_hex);
>>>>>>>>>> +       PRINT_ATTRf(sample_simd_vec_reg_user, p_hex);
>>>>>>>>>>
>>>>>>>>>>         return ret;
>>>>>>>>>>  }
>>>>>>>>>> diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
>>>>>>>>>> index 44b90bbf2d07..e8a9fabc92e6 100644
>>>>>>>>>> --- a/tools/perf/util/perf_regs.c
>>>>>>>>>> +++ b/tools/perf/util/perf_regs.c
>>>>>>>>>> @@ -11,6 +11,11 @@ int __weak arch_sdt_arg_parse_op(char *old_op __maybe_unused,
>>>>>>>>>>         return SDT_ARG_SKIP;
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>> +bool __weak arch_has_simd_regs(u64 mask __maybe_unused)
>>>>>>>>>> +{
>>>>>>>>>> +       return false;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>>  uint64_t __weak arch__intr_reg_mask(void)
>>>>>>>>>>  {
>>>>>>>>>>         return 0;
>>>>>>>>>> @@ -21,6 +26,50 @@ uint64_t __weak arch__user_reg_mask(void)
>>>>>>>>>>         return 0;
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>> +uint64_t __weak arch__intr_simd_reg_mask(void)
>>>>>>>>>> +{
>>>>>>>>>> +       return 0;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t __weak arch__user_simd_reg_mask(void)
>>>>>>>>>> +{
>>>>>>>>>> +       return 0;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t __weak arch__intr_pred_reg_mask(void)
>>>>>>>>>> +{
>>>>>>>>>> +       return 0;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t __weak arch__user_pred_reg_mask(void)
>>>>>>>>>> +{
>>>>>>>>>> +       return 0;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t __weak arch__intr_simd_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
>>>>>>>>>> +{
>>>>>>>>>> +       *qwords = 0;
>>>>>>>>>> +       return 0;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t __weak arch__user_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>>>>>>>>>> +{
>>>>>>>>>> +       *qwords = 0;
>>>>>>>>>> +       return 0;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t __weak arch__intr_pred_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
>>>>>>>>>> +{
>>>>>>>>>> +       *qwords = 0;
>>>>>>>>>> +       return 0;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +uint64_t __weak arch__user_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
>>>>>>>>>> +{
>>>>>>>>>> +       *qwords = 0;
>>>>>>>>>> +       return 0;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>>  static const struct sample_reg sample_reg_masks[] = {
>>>>>>>>>>         SMPL_REG_END
>>>>>>>>>>  };
>>>>>>>>>> @@ -30,6 +79,16 @@ const struct sample_reg * __weak arch__sample_reg_masks(void)
>>>>>>>>>>         return sample_reg_masks;
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>> +const struct sample_reg * __weak arch__sample_simd_reg_masks(void)
>>>>>>>>>> +{
>>>>>>>>>> +       return sample_reg_masks;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +const struct sample_reg * __weak arch__sample_pred_reg_masks(void)
>>>>>>>>>> +{
>>>>>>>>>> +       return sample_reg_masks;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>>  const char *perf_reg_name(int id, const char *arch)
>>>>>>>>>>  {
>>>>>>>>>>         const char *reg_name = NULL;
>>>>>>>>>> diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
>>>>>>>>>> index f2d0736d65cc..bce9c4cfd1bf 100644
>>>>>>>>>> --- a/tools/perf/util/perf_regs.h
>>>>>>>>>> +++ b/tools/perf/util/perf_regs.h
>>>>>>>>>> @@ -24,9 +24,20 @@ enum {
>>>>>>>>>>  };
>>>>>>>>>>
>>>>>>>>>>  int arch_sdt_arg_parse_op(char *old_op, char **new_op);
>>>>>>>>>> +bool arch_has_simd_regs(u64 mask);
>>>>>>>>>>  uint64_t arch__intr_reg_mask(void);
>>>>>>>>>>  uint64_t arch__user_reg_mask(void);
>>>>>>>>>>  const struct sample_reg *arch__sample_reg_masks(void);
>>>>>>>>>> +const struct sample_reg *arch__sample_simd_reg_masks(void);
>>>>>>>>>> +const struct sample_reg *arch__sample_pred_reg_masks(void);
>>>>>>>>> I wonder if we can remove these functions. perf_reg_name(int id, uint16_t
>>>>>>>>> e_machine) maps a perf register number and e_machine to a string. So
>>>>>>>>> the sample_reg array could be replaced with:
>>>>>>>>> ```
>>>>>>>>> for (int perf_reg = 0; perf_reg < 64; perf_reg++) {
>>>>>>>>>   uint64_t mask = 1LL << perf_reg;
>>>>>>>>>   const char *name = perf_reg_name(perf_reg, EM_HOST);
>>>>>>>>>   if (name == NULL)
>>>>>>>>>     break;
>>>>>>>>>   // use mask and name
>>>>>>>>> }
>>>>>>>>> ```
>>>>>>>>> To make it work for SIMD and PRED then I guess we need to iterate
>>>>>>>>> through the ABIs of enum perf_sample_regs_abi.
>>>>>>>> Suppose so.
>>>>>>>>
>>>>>>>>
>>>>>>>>>> +uint64_t arch__intr_simd_reg_mask(void);
>>>>>>>>>> +uint64_t arch__user_simd_reg_mask(void);
>>>>>>>>>> +uint64_t arch__intr_pred_reg_mask(void);
>>>>>>>>>> +uint64_t arch__user_pred_reg_mask(void);
>>>>>>>>> I think some comments would be useful here like:
>>>>>>>>> ```
>>>>>>>>> /* Perf register bit map with valid bits for
>>>>>>>>> perf_event_attr.sample_regs_user. */
>>>>>>>>> uint64_t arch__intr_reg_mask(void);
>>>>>>>>> /* Perf register bit map with valid bits for
>>>>>>>>> perf_event_attr.sample_regs_intr. */
>>>>>>>>> uint64_t arch__user_reg_mask(void);
>>>>>>>>> /* Perf register bit map with valid bits for
>>>>>>>>> perf_event_attr.sample_simd_vec_reg_intr. */
>>>>>>>>> uint64_t arch__intr_simd_reg_mask(void);
>>>>>>>>> /* Perf register bit map with valid bits for
>>>>>>>>> perf_event_attr.sample_simd_vec_reg_user. */
>>>>>>>>> uint64_t arch__user_simd_reg_mask(void);
>>>>>>>>> /* Perf register bit map with valid bits for
>>>>>>>>> perf_event_attr.sample_simd_pred_reg_intr. */
>>>>>>>>> uint64_t arch__intr_pred_reg_mask(void);
>>>>>>>>> /* Perf register bit map with valid bits for
>>>>>>>>> perf_event_attr.sample_simd_pred_reg_user. */
>>>>>>>>> uint64_t arch__user_pred_reg_mask(void);
>>>>>>>> Sure. Thanks.
>>>>>>>>
>>>>>>>>
>>>>>>>>> ```
>>>>>>>>>
>>>>>>>>> Why does arch__user_pred_reg_mask return a uint64_t when the
>>>>>>>>> perf_event_attr variable is a __u32?
>>>>>>>> Suppose it's a bug. :)
>>>>>>>>
>>>>>>>>
>>>>>>>>>> +uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>>>> +uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>>>> +uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>>>> +uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
>>>>>>>>> I don't understand this function. The qwords is specific to a
>>>>>>>>> perf_event_attr. We could have an evlist with an evsel set up to
>>>>>>>>> sample say XMM registers and another evsel set up to sample ZMM
>>>>>>>>> registers. Are the qwords here always for the ZMM case, or is XMM,
>>>>>>>>> YMM, ZMM depending on architecture support? Why does it vary per
>>>>>>>>> register? The surrounding code uses the term mask but here bitmap is
>>>>>>>>> used; is the inconsistency deliberate? Why are there user and intr
>>>>>>>>> functions when in the perf_event_attr there are only
>>>>>>>>> sample_simd_pred_reg_qwords and sample_simd_vec_reg_qwords variables?
>>>>>>>> These 4 functions are designed to get the bitmask and qwords length for a
>>>>>>>> specific SIMD register. E.g., for XMM on x86 platforms, the returned
>>>>>>>> bitmask is 0xffff (xmm0 ~ xmm15) and the qwords length is 2 (128 bits). For
>>>>>>>> ZMM on x86 platforms, if the platform only supports 16 ZMM registers, then
>>>>>>>> the returned bitmask is 0xffff (zmm0 ~ zmm15) and qwords length is 8 (512
>>>>>>>> bits). If the platform supports 32 ZMM registers, then the returned bitmask
>>>>>>>> is 0xffffffff (zmm0 ~ zmm31) and qwords length is 8 (512 bits).
>>>>>>> What is the meaning of reg? In this file it is normally the integer
>>>>>>> index for a bit in the sample_regs_user mask, but for x86 I don't see
>>>>>>> enum perf_event_x86_regs having differing XMM, YMM and ZMM encodings.
>>>>>>> Similarly, is qwords an out argument, but then you also have the
>>>>>>> bitmap. It looks like the code is caching values but that assumes a
>>>>>>> single qword length for all events.
>>>>>> Yes, the "reg" argument indicates the SIMD register index. Strictly
>>>>>> speaking, on x86 platforms the qwords length is fixed for a specific SIMD
>>>>>> register and only the register count can vary, e.g., some platforms
>>>>>> support only 16 ZMM registers, while some other platforms support 32 ZMM
>>>>>> registers. But considering this is a generic function for all kinds of
>>>>>> arches, we can't assume a fixed length for a specific SIMD register on
>>>>>> every arch, so I introduce the "qwords" argument to increase the
>>>>>> flexibility.
>>>>> I'm still not understanding this :-) What is a "SIMD register
>>>>> index"? The file is for perf registers, naturally enum
>>>>> perf_event_x86_regs on x86, but that doesn't encode YMM and ZMM
>>>>> registers. Perhaps you can give some examples?
>>>> Yes, it's something just like the register index in the enum
>>>> perf_event_x86_regs, e.g. the index of AX register is PERF_REG_X86_AX, the
>>>> index of BX is PERF_REG_X86_BX, and so on.
>>>>
>>>> But the difference is that each index in perf_event_x86_regs can only
>>>> represent a u64 word. If we still wanted to represent the SIMD registers
>>>> with the perf_event_x86_regs enum, each XMM register would need 2 indexes,
>>>> each YMM register 4 indexes and each ZMM register 8 indexes. Considering
>>>> there are 16 XMM registers, 16 YMM registers and 32 ZMM registers,
>>>> representing all these indexes would make the enum perf_event_x86_regs
>>>> quite large, and correspondingly the sample_regs_intr/sample_regs_user
>>>> fields in the perf_event_attr would have to inflate considerably. That
>>>> would consume a lot of memory.
>>>>
>>>> That's why we introduce the new attributes below.
>>>>
>>>> +	union {
>>>> +		__u16 sample_simd_regs_enabled;
>>>> +		__u16 sample_simd_pred_reg_qwords;
>>>> +	};
>>>> +	__u32 sample_simd_pred_reg_intr;
>>>> +	__u32 sample_simd_pred_reg_user;
>>>> +	__u16 sample_simd_vec_reg_qwords;
>>>> +	__u64 sample_simd_vec_reg_intr;
>>>> +	__u64 sample_simd_vec_reg_user;
>>>> +	__u32 __reserved_4;
>>>>
>>>> For SIMD registers, each kind of SIMD register is treated as a whole. The
>>>> sample_simd_vec_reg_qwords field identifies the length of the SIMD
>>>> register; it also hints at which kind of SIMD register it is, since each
>>>> kind of SIMD register has a different length. E.g., suppose we want to
>>>> sample XMM registers. We know there are 16 XMM registers on x86 platforms
>>>> and the qwords length of an XMM register is 2, so user space needs to set
>>>> the attributes like this,
>>>>
>>>> sample_simd_vec_reg_intr = 0xffff;
>>>>
>>>> sample_simd_vec_reg_qwords = 2;
>>>>
>>>> Coming back to the "reg" argument: there can be multiple kinds of SIMD
>>>> registers supported on a given arch, e.g., x86 supports XMM, YMM, ZMM
>>>> and OPMASK SIMD registers. As each kind of SIMD register is always sampled
>>>> as a whole, we don't need to represent each individual SIMD register, like
>>>> XMM0 or XMM1, but we do need to distinguish different kinds of SIMD
>>>> registers, like XMM and YMM, since they have different register lengths
>>>> and counts.
>>>>
>>>> That's why we define the index for each kind of SIMD register, like below,
>>>>
>>>> +enum {
>>>> +	PERF_REG_X86_XMM,
>>>> +	PERF_REG_X86_YMM,
>>>> +	PERF_REG_X86_ZMM,
>>>> +	PERF_REG_X86_MAX_SIMD_REGS,
>>>> +
>>>> +	PERF_REG_X86_OPMASK = 0,
>>>> +	PERF_REG_X86_MAX_PRED_REGS = 1,
>>>> +};
>>>>
>>>> It's similar to perf_event_x86_regs, but each index represents a kind of
>>>> SIMD register instead of a specific SIMD register.
>>> Could you give me an example call to say
>>> arch__intr_simd_reg_bitmap_qwords where you say what the value of reg
>>> is, what the expected value of qwords is and what the result will be?
>>> Could you do it for say a model without AVX, a model with AVX, a model
>>> with AVX512 and a model with APX.
>> Assume we are on an x86 platform which only supports XMM registers (AVX) and
>> call the function arch__intr_simd_reg_bitmap_qwords() with a SIMD register
>> index,
>>
>> 1. reg = PERF_REG_X86_XMM
>>
>> The return value (XMM registers bitmask) = 0xffff and the qwords = 2 (128 bits).
> Thanks!
> Can we rename PERF_REG_X86_XMM to say PERF_REG_CLASS_X86_XMM
> (similarly reg to reg_class), currently the name is very close to
> PERF_REG_X86_XMM0 but that value is in a different enum.
> So the bitmask is in terms of the qwords whilst the regular perf
> register mask is 1 64-bit qword per bit.

Yes, good idea.


>
>> 2. reg = PERF_REG_X86_YMM
>>
>> The return value (YMM registers bitmask) = 0 and the qwords = 0 since YMM registers are not supported.
>>
>> 3. reg = PERF_REG_X86_ZMM
>>
>> The return value (ZMM registers bitmask) = 0 and the qwords = 0 since ZMM registers are not supported.
> Ok.
>
>> Assume we are on an x86 platform which supports XMM/YMM/ZMM registers (AVX512) and call the function arch__intr_simd_reg_bitmap_qwords() with a SIMD register index,
>>
>> 1. reg = PERF_REG_X86_XMM
>>
>> The return value (XMM registers bitmask) = 0xffff and the qwords = 2 (128 bits).
>>
>> 2. reg = PERF_REG_X86_YMM
>>
>> The return value (YMM registers bitmask) = 0xffff and the qwords = 4 (256 bits).
> Ok, qwords got bigger.
>
>> 3. reg = PERF_REG_X86_ZMM
>>
>> The return value (ZMM registers bitmask) = 0xffffffff and the qwords = 8 (512 bits). We assume this platform supports 32 ZMM registers (ZMM0 ~ ZMM31).
> Wouldn't it then also support 32 YMM and XMM registers in the 2 cases above?

Actually not; current x86 platforms (at least Intel's) only support 16 XMM
and 16 YMM registers, and some platforms support only 16 ZMM registers (the
YMM register is the bottom half of the ZMM register). Some newer platforms
introduce an additional 16 ZMM registers, but these additional ZMM registers
can only be accessed as a whole; accessing only the bottom half (YMM) or
bottom quarter (XMM) is not allowed.

>
>> As for APX, it has nothing to do with these 4 functions; whether it's supported is determined by the helpers arch__intr_reg_mask()/arch__user_reg_mask().
>> e.g.,
>>
>> ```
>>         if (has_cap_simd_regs()) {
>>                 mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
>>                                          GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
>>                                          true);
>>                 mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
>>                                          BIT_ULL(PERF_REG_X86_SSP),
>>                                          true);
>> ```
>> If the platform supports APX eGPRs, then the returned mask from arch__intr_reg_mask()/arch__user_reg_mask() would contain the eGPRs mask, otherwise, it would not.
> Thanks for the explanation. In the perf_regs.h why do YMMH and ZMMH
> also appear but not as arguments for arch__intr_simd_reg_bitmap_qwords

You mean the below enum?

enum {
    PERF_X86_OPMASK_QWORDS   = 1,
    PERF_X86_XMM_QWORDS      = 2,
    PERF_X86_YMMH_QWORDS     = 2,
    PERF_X86_YMM_QWORDS      = 4,
    PERF_X86_ZMMH_QWORDS     = 4,
    PERF_X86_ZMM_QWORDS      = 8,
    PERF_X86_SIMD_QWORDS_MAX = PERF_X86_ZMM_QWORDS,
};

PERF_X86_YMMH_QWORDS and PERF_X86_ZMMH_QWORDS are only used to assemble
the whole value of the YMM/ZMM registers in the kernel. They aren't really
used by the perf tools.

Actually, Peter already suggested removing these macros from the uapi
header file. I still haven't found a good place to define them, but at
least PERF_X86_YMMH_QWORDS and PERF_X86_ZMMH_QWORDS can be removed from
this uapi header file.


> ?
>
>>> I have looked at the code and read the changes to perf_event_attr
>>> which is why I was confused by your saying that ZMM could be passed in
>>> as a perf register number. I am confused why, when the
>>> perf_event_attr has 2 qword-length related variables, this code seems
>>> to be setting things up so that every register can have a qword
>>> length. I'm confused about what is happening with the return value of
>>> this function. As values are being stored into global variables, and
>>> you are saying they aren't a max value, how does this impact the
>>> setting up of multiple register-sampling events?
>> Of the 2 qwords fields, sample_simd_pred_reg_qwords stores the PRED
>> register length, and sample_simd_vec_reg_qwords stores the SIMD
>> register length. Since a SIMD/PRED register with a larger length
>> contains the SIMD/PRED registers with shorter lengths, only the largest
>> length is set into the
>> sample_simd_pred_reg_qwords/sample_simd_vec_reg_qwords variables.
>>
>> E.g.,
>>
>> perf record -e cycles:p -Ixmm,ymm,zmm -c 10000 -- sleep 1
>>
>> The sample_simd_vec_reg_qwords would be set to 8 to represent the largest
>> length (ZMM) and the kernel directly samples the ZMM registers, since the
>> ZMM registers fully contain the YMM and XMM registers.
>>
>> The reason for caching the bitmask and qwords is that the bitmask and
>> qwords for a specific SIMD/PRED register are fixed on a certain x86
>> platform, right?  E.g., the qwords length of an XMM register is always 2,
>> YMM is 4, etc...
> Ok. My confusion is the overloaded meaning of a perf register in this
> file, hence it'd be nice to make the names more distinct.
>> The bitmask and qwords values are retrieved from kernel by
>> perf_event_open() syscall which is quite expensive, if it's called
>> frequently, it would impact the performance heavily.
> Agreed. It is a shame the existing probing/caching aren't used.
>
> Thanks,
> Ian
>
>>> Thanks,
>>> Ian
>>>
>>>>> How does the generic differing qword per register case get encoded
>>>>> into a perf_event_attr? If it can't be then this seems like
>>>>> functionality for no benefit. I also don't understand how the data in
>>>>> the PERF_SAMPLE_REGS_USER part of a sample could be decoded as that is
>>>>> assuming a constant qword number.
>>>>>
>>>>>> No, the qwords would be assigned the true register length if the register
>>>>>> exists on the platform, e.g., xmm = 2, ymm = 4 and zmm = 8. If the
>>>>>> register is not supported on the platform, the qwords would be set to 0.
>>>>> So it is a max function of the vector/pred qwords supported on the architecture.
>>>> Strictly speaking, it's not a "max" function of the vector/pred qwords;
>>>> it's just a function to get the exact vector/pred qwords supported on the
>>>> architecture, since the qwords length won't vary for a fixed kind of SIMD
>>>> register.
>>>>
>>>>
>>>>>>>> Since the qword length is always fixed for any given SIMD register
>>>>>>>> regardless of intr or user, there is only one
>>>>>>>> sample_simd_pred_reg_qwords or sample_simd_vec_reg_qwords variable.
>>>>>>> Ok.  2 variables, but 4 functions here. I think there should just be 2
>>>>>>> because of this.
>>>>>> Yes, the user and intr variants would be merged into only one.
>>>>> Thanks,
>>>>> Ian
>>>>>
>>>>>>> Thanks,
>>>>>>> Ian
>>>>>>>
>>>>>>>>> Perhaps these functions should be something more like:
>>>>>>>>> ```
>>>>>>>>> /* Maximum value that can be assigned to
>>>>>>>>> perf_event_atttr.sample_simd_pred_reg_qwords. */
>>>>>>>>> uint16_t arch__simd_pred_reg_qwords_max(void);
>>>>>>>>> /* Maximum value that can be assigned to
>>>>>>>>> perf_event_atttr.sample_simd_vec_reg_qwords. */
>>>>>>>>> uint16_t arch__simd_vec_reg_qwords_max(void);
>>>>>>>>> ```
>>>>>>>>> Then the bitmap computation logic can all be moved into parse-regs-options.c.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Ian
>>>>>>>>>
>>>>>>>>>>  const char *perf_reg_name(int id, const char *arch);
>>>>>>>>>>  int perf_reg_value(u64 *valp, struct regs_dump *regs, int id);
>>>>>>>>>> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
>>>>>>>>>> index ea3a6c4657ee..825ffb4cc53f 100644
>>>>>>>>>> --- a/tools/perf/util/record.h
>>>>>>>>>> +++ b/tools/perf/util/record.h
>>>>>>>>>> @@ -59,7 +59,13 @@ struct record_opts {
>>>>>>>>>>         unsigned int  user_freq;
>>>>>>>>>>         u64           branch_stack;
>>>>>>>>>>         u64           sample_intr_regs;
>>>>>>>>>> +       u64           sample_intr_vec_regs;
>>>>>>>>>>         u64           sample_user_regs;
>>>>>>>>>> +       u64           sample_user_vec_regs;
>>>>>>>>>> +       u16           sample_pred_regs_qwords;
>>>>>>>>>> +       u16           sample_vec_regs_qwords;
>>>>>>>>>> +       u16           sample_intr_pred_regs;
>>>>>>>>>> +       u16           sample_user_pred_regs;
>>>>>>>>>>         u64           default_interval;
>>>>>>>>>>         u64           user_interval;
>>>>>>>>>>         size_t        auxtrace_snapshot_size;
>>>>>>>>>> --
>>>>>>>>>> 2.34.1
>>>>>>>>>>


end of thread, other threads:[~2026-01-22  8:29 UTC | newest]

Thread overview: 86+ messages
2025-12-03  6:54 [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Dapeng Mi
2025-12-03  6:54 ` [Patch v5 01/19] perf: Eliminate duplicate arch-specific functions definations Dapeng Mi
2025-12-03  6:54 ` [Patch v5 02/19] perf/x86: Use x86_perf_regs in the x86 nmi handler Dapeng Mi
2025-12-03  6:54 ` [Patch v5 03/19] perf/x86: Introduce x86-specific x86_pmu_setup_regs_data() Dapeng Mi
2025-12-03  6:54 ` [Patch v5 04/19] x86/fpu/xstate: Add xsaves_nmi() helper Dapeng Mi
2025-12-03  6:54 ` [Patch v5 05/19] perf: Move and rename has_extended_regs() for ARCH-specific use Dapeng Mi
2025-12-03  6:54 ` [Patch v5 06/19] perf/x86: Add support for XMM registers in non-PEBS and REGS_USER Dapeng Mi
2025-12-04 15:17   ` Peter Zijlstra
2025-12-04 15:47     ` Peter Zijlstra
2025-12-05  6:37       ` Mi, Dapeng
2025-12-04 18:59     ` Dave Hansen
2025-12-05  8:42       ` Peter Zijlstra
2025-12-03  6:54 ` [Patch v5 07/19] perf: Add sampling support for SIMD registers Dapeng Mi
2025-12-05 11:07   ` Peter Zijlstra
2025-12-08  5:24     ` Mi, Dapeng
2025-12-05 11:40   ` Peter Zijlstra
2025-12-08  6:00     ` Mi, Dapeng
2025-12-03  6:54 ` [Patch v5 08/19] perf/x86: Enable XMM sampling using sample_simd_vec_reg_* fields Dapeng Mi
2025-12-05 11:25   ` Peter Zijlstra
2025-12-08  6:10     ` Mi, Dapeng
2025-12-03  6:54 ` [Patch v5 09/19] perf/x86: Enable YMM " Dapeng Mi
2025-12-03  6:54 ` [Patch v5 10/19] perf/x86: Enable ZMM " Dapeng Mi
2025-12-03  6:54 ` [Patch v5 11/19] perf/x86: Enable OPMASK sampling using sample_simd_pred_reg_* fields Dapeng Mi
2025-12-03  6:54 ` [Patch v5 12/19] perf/x86: Enable eGPRs sampling using sample_regs_* fields Dapeng Mi
2025-12-05 12:16   ` Peter Zijlstra
2025-12-08  6:11     ` Mi, Dapeng
2025-12-03  6:54 ` [Patch v5 13/19] perf/x86: Enable SSP " Dapeng Mi
2025-12-05 12:20   ` Peter Zijlstra
2025-12-08  6:21     ` Mi, Dapeng
2025-12-24  5:45   ` Ravi Bangoria
2025-12-24  6:26     ` Mi, Dapeng
2026-01-06  6:55       ` Mi, Dapeng
2025-12-03  6:54 ` [Patch v5 14/19] perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS capability Dapeng Mi
2025-12-03  6:54 ` [Patch v5 15/19] perf/x86/intel: Enable arch-PEBS based SIMD/eGPRs/SSP sampling Dapeng Mi
2025-12-03  6:54 ` [Patch v5 16/19] perf/x86: Activate back-to-back NMI detection for arch-PEBS induced NMIs Dapeng Mi
2025-12-05 12:39   ` Peter Zijlstra
2025-12-07 20:44     ` Andi Kleen
2025-12-08  6:46     ` Mi, Dapeng
2025-12-08  8:50       ` Peter Zijlstra
2025-12-08  8:53         ` Mi, Dapeng
2025-12-03  6:54 ` [Patch v5 17/19] perf headers: Sync with the kernel headers Dapeng Mi
2025-12-03 23:43   ` Ian Rogers
2025-12-04  1:37     ` Mi, Dapeng
2025-12-04  7:28       ` Ian Rogers
2026-01-20  7:01   ` Ian Rogers
2026-01-20  7:25     ` Mi, Dapeng
2026-01-20  7:16   ` Ian Rogers
2026-01-20  7:43     ` Mi, Dapeng
2026-01-20  8:00       ` Ian Rogers
2026-01-20  9:22         ` Mi, Dapeng
2026-01-20 18:11           ` Ian Rogers
2026-01-21  2:03             ` Mi, Dapeng
2025-12-03  6:54 ` [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format Dapeng Mi
2025-12-04  0:17   ` Ian Rogers
2025-12-04  2:58     ` Mi, Dapeng
2025-12-04  7:49       ` Ian Rogers
2025-12-04  9:20         ` Mi, Dapeng
2025-12-04 16:16           ` Ian Rogers
2025-12-05  4:00             ` Mi, Dapeng
2025-12-05  6:38               ` Ian Rogers
2025-12-05  8:10                 ` Mi, Dapeng
2025-12-05 16:35                   ` Ian Rogers
2025-12-08  4:20                     ` Mi, Dapeng
2026-01-06  7:27                       ` Mi, Dapeng
2026-01-17  5:50                         ` Ian Rogers
2026-01-19  6:55                           ` Mi, Dapeng
2026-01-19 20:25                             ` Ian Rogers
2026-01-20  3:04                               ` Mi, Dapeng
2026-01-20  5:16                                 ` Ian Rogers
2026-01-20  6:46                                   ` Mi, Dapeng
2026-01-20  6:56                                     ` Ian Rogers
2026-01-20  7:39   ` Ian Rogers
2026-01-20  9:04     ` Mi, Dapeng
2026-01-20 18:20       ` Ian Rogers
2026-01-21  5:17         ` Mi, Dapeng
2026-01-21  7:09           ` Ian Rogers
2026-01-21  7:52             ` Mi, Dapeng
2026-01-21 14:48               ` Ian Rogers
2026-01-22  1:49                 ` Mi, Dapeng
2026-01-22  7:27                   ` Ian Rogers
2026-01-22  8:29                     ` Mi, Dapeng
2025-12-03  6:55 ` [Patch v5 19/19] perf regs: Enable dumping of SIMD registers Dapeng Mi
2025-12-04  0:24 ` [Patch v5 00/19] Support SIMD/eGPRs/SSP registers sampling for perf Ian Rogers
2025-12-04  3:28   ` Mi, Dapeng
2025-12-16  4:42 ` Ravi Bangoria
2025-12-16  6:59   ` Mi, Dapeng
