* [Patch v4 00/17] Support vector and more extended registers in perf
@ 2025-09-25  6:11 Dapeng Mi
  2025-09-25  6:11 ` [Patch v4 01/17] perf/x86: Use x86_perf_regs in the x86 nmi handler Dapeng Mi
                   ` (16 more replies)
  0 siblings, 17 replies; 22+ messages in thread
From: Dapeng Mi @ 2025-09-25  6:11 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Kan Liang,
	Andi Kleen, Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Dapeng Mi, Dapeng Mi

Changes since V3:
- Drop the SIMD registers if an NMI hits kernel mode for REGS_USER.
- Only dump the available regs, rather than zeroing and dumping the
  unavailable regs. The dumped registers may be a subset of the
  requested registers.
- Some minor updates to address Dapeng's comments in V3.

Changes since V2:
- Use the FPU format for the x86_pmu.ext_regs_mask as well
- Add a check before invoking xsaves_nmi()
- Add perf_simd_reg_check() to retrieve the number of available
  registers. If the kernel fails to get the requested registers, e.g.,
  XSAVES fails, nothing is dumped to userspace (V2 dumped all 0s).
- Add POC perf tool patches

Changes since V1:
- Apply the new interfaces to configure and dump the SIMD registers
- Utilize the existing FPU functions, e.g., xstate_calculate_size() and
  get_xsave_addr().

Starting from Intel Ice Lake, the XMM registers can be collected in
a PEBS record. More registers, e.g., YMM, ZMM, OPMASK, SSP and APX, will
be added in the upcoming architectural PEBS as well. But that requires
hardware support.

The patch set provides a software solution to mitigate the hardware
requirement. It utilizes the XSAVES instruction to retrieve the
requested registers in the overflow handler. The feature is no longer
limited to PEBS events or specific platforms.
The hardware solution (if available) is still preferred, since it has
lower overhead (especially with large PEBS) and is more accurate.

In theory, the solution should work for all x86 platforms, but I only
have newer Intel platforms to test. The patch set only enables the
feature for Intel Ice Lake and later platforms.

The new registers include YMM, ZMM, OPMASK, SSP, and APX.
The sample_regs_user/intr fields have run out of space. A new field in
struct perf_event_attr is required for the new registers.

After a long discussion in V1,
https://lore.kernel.org/lkml/3f1c9a9e-cb63-47ff-a5e9-06555fa6cc9a@linux.intel.com/
the new fields look as below.
@@ -543,6 +545,25 @@ struct perf_event_attr {
        __u64   sig_data;

        __u64   config3; /* extension of config2 */
+
+
+       /*
+        * Defines set of SIMD registers to dump on samples.
+        * The sample_simd_regs_enabled !=0 implies the
+        * set of SIMD registers is used to config all SIMD registers.
+        * If !sample_simd_regs_enabled, sample_regs_XXX may be used to
+        * config some SIMD registers on X86.
+        */
+       union {
+               __u16 sample_simd_regs_enabled;
+               __u16 sample_simd_pred_reg_qwords;
+       };
+       __u32 sample_simd_pred_reg_intr;
+       __u32 sample_simd_pred_reg_user;
+       __u16 sample_simd_vec_reg_qwords;
+       __u64 sample_simd_vec_reg_intr;
+       __u64 sample_simd_vec_reg_user;
+       __u32 __reserved_4;
 };
@@ -1016,7 +1037,15 @@ enum perf_event_type {
         *      } && PERF_SAMPLE_BRANCH_STACK
         *
         *      { u64                   abi; # enum perf_sample_regs_abi
-        *        u64                   regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
+        *        u64                   regs[weight(mask)];
+        *        struct {
+        *              u16 nr_vectors;
+        *              u16 vector_qwords;
+        *              u16 nr_pred;
+        *              u16 pred_qwords;
+        *              u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
+        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
+        *      } && PERF_SAMPLE_REGS_USER
         *
         *      { u64                   size;
         *        char                  data[size];
@@ -1043,7 +1072,15 @@ enum perf_event_type {
         *      { u64                   data_src; } && PERF_SAMPLE_DATA_SRC
         *      { u64                   transaction; } && PERF_SAMPLE_TRANSACTION
         *      { u64                   abi; # enum perf_sample_regs_abi
-        *        u64                   regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
+        *        u64                   regs[weight(mask)];
+        *        struct {
+        *              u16 nr_vectors;
+        *              u16 vector_qwords;
+        *              u16 nr_pred;
+        *              u16 pred_qwords;
+        *              u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
+        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
+        *      } && PERF_SAMPLE_REGS_INTR
         *      { u64                   phys_addr;} && PERF_SAMPLE_PHYS_ADDR
         *      { u64                   cgroup;} && PERF_SAMPLE_CGROUP
         *      { u64                   data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE


Since there is only one vector qwords field, the tools should set the
qwords to the width of the widest requested vector register. For
example, if the end user wants XMM0 and YMM1, the vector qwords should
be 4 and the vector mask should be 0x3. YMM0 and YMM1 will be dumped to
userspace. It's the tool's responsibility to extract XMM0 and YMM1 from
the dumped data and present them to the end user.
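
For illustration, below is a minimal sketch (mine, not part of the patch
set) of how a tool might fill the new fields for the XMM0 + YMM1 example
above. It assumes the updated uapi header from this series; treating
sample_simd_pred_reg_qwords = 1 (the OPMASK width, which shares storage
with sample_simd_regs_enabled) as the way to enable the new scheme is an
assumption:

 #include <string.h>
 #include <linux/perf_event.h>

 static void setup_simd_attr(struct perf_event_attr *attr)
 {
 	memset(attr, 0, sizeof(*attr));
 	attr->size = PERF_ATTR_SIZE_VER9;
 	attr->sample_type = PERF_SAMPLE_REGS_INTR;
 	/*
 	 * sample_simd_regs_enabled is a union with
 	 * sample_simd_pred_reg_qwords; a non-zero value selects the new
 	 * SIMD configuration scheme. (Assumption: 1 qword, the OPMASK
 	 * width, with an empty predicate mask.)
 	 */
 	attr->sample_simd_pred_reg_qwords = 1;
 	attr->sample_simd_pred_reg_intr = 0;
 	/* qwords of the widest requested vector register: YMM = 4 qwords */
 	attr->sample_simd_vec_reg_qwords = 4;
 	/* bits 0 and 1 request vector registers 0 and 1 (XMM0 and YMM1) */
 	attr->sample_simd_vec_reg_intr = 0x3;
 }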

The POC perf tool patches for testing purposes are also attached.

Examples:
 $perf record -I?
 available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
 R11 R12 R13 R14 R15 SSP XMM0-31 YMM0-31 ZMM0-31 OPMASK0-7

 $perf record --user-regs=?
 available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
 R11 R12 R13 R14 R15 SSP XMM0-31 YMM0-31 ZMM0-31 OPMASK0-7

 $perf record -e cycles:p -IXMM,YMM,OPMASK,SSP ./test
 $perf report -D
 ... ...
 237538985992962 0x454d0 [0x480]: PERF_RECORD_SAMPLE(IP, 0x1):
 179370/179370: 0xffffffff969627fc period: 124999 addr: 0
 ... intr regs: mask 0x20000000000 ABI 64-bit
 .... SSP   0x0000000000000000
 ... SIMD ABI nr_vectors 32 vector_qwords 4 nr_pred 8 pred_qwords 1
 .... YMM  [0] 0x0000000000004000
 .... YMM  [0] 0x000055e828695270
 .... YMM  [0] 0x0000000000000000
 .... YMM  [0] 0x0000000000000000
 .... YMM  [1] 0x000055e8286990e0
 .... YMM  [1] 0x000055e828698dd0
 .... YMM  [1] 0x0000000000000000
 .... YMM  [1] 0x0000000000000000
 ... ...
 .... YMM  [31] 0x0000000000000000
 .... YMM  [31] 0x0000000000000000
 .... YMM  [31] 0x0000000000000000
 .... YMM  [31] 0x0000000000000000
 .... OPMASK[0] 0x0000000000100221
 .... OPMASK[1] 0x0000000000000020
 .... OPMASK[2] 0x000000007fffffff
 .... OPMASK[3] 0x0000000000000000
 .... OPMASK[4] 0x0000000000000000
 .... OPMASK[5] 0x0000000000000000
 .... OPMASK[6] 0x0000000000000000
 .... OPMASK[7] 0x0000000000000000
 ... ...
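
For reference, here is a rough sketch (mine, not from the patches;
assuming the sample layout quoted earlier, i.e. nr_vectors,
vector_qwords, nr_pred and pred_qwords followed by data[], and omitting
error handling) of how a consumer could walk the SIMD block that follows
the legacy regs when abi contains PERF_SAMPLE_REGS_ABI_SIMD:

 #include <stdio.h>
 #include <linux/types.h>

 struct simd_regs_hdr {
 	__u16 nr_vectors;
 	__u16 vector_qwords;
 	__u16 nr_pred;
 	__u16 pred_qwords;
 };

 /* 'p' points just past the legacy u64 regs[weight(mask)] payload */
 static const __u64 *parse_simd_block(const void *p)
 {
 	const struct simd_regs_hdr *hdr = p;
 	const __u64 *data = (const __u64 *)(hdr + 1);
 	unsigned int i, q;

 	/* the i-th vector corresponds to the i-th set bit of the requested mask */
 	for (i = 0; i < hdr->nr_vectors; i++)
 		for (q = 0; q < hdr->vector_qwords; q++)
 			printf("vec[%u] qword[%u] = 0x%016llx\n",
 			       i, q, (unsigned long long)*data++);

 	for (i = 0; i < hdr->nr_pred; i++)
 		for (q = 0; q < hdr->pred_qwords; q++)
 			printf("pred[%u] qword[%u] = 0x%016llx\n",
 			       i, q, (unsigned long long)*data++);

 	return data;	/* start of the next sample field */
 }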


History:
  v3: https://lore.kernel.org/lkml/20250815213435.1702022-1-kan.liang@linux.intel.com/
  v2: https://lore.kernel.org/lkml/20250626195610.405379-1-kan.liang@linux.intel.com/
  v1: https://lore.kernel.org/lkml/20250613134943.3186517-1-kan.liang@linux.intel.com/

Kan Liang (17):
  perf/x86: Use x86_perf_regs in the x86 nmi handler
  perf/x86: Setup the regs data
  x86/fpu/xstate: Add xsaves_nmi
  perf: Move has_extended_regs() to header file
  perf/x86: Support XMM register for non-PEBS and REGS_USER
  perf: Support SIMD registers
  perf/x86: Move XMM to sample_simd_vec_regs
  perf/x86: Add YMM into sample_simd_vec_regs
  perf/x86: Add ZMM into sample_simd_vec_regs
  perf/x86: Add OPMASK into sample_simd_pred_reg
  perf/x86: Add eGPRs into sample_regs
  perf/x86: Add SSP into sample_regs
  perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS
  perf tools: Only support legacy regs for the PT and PERF_REGS_MASK
  perf tools: headers: Sync with the kernel sources
  perf tools: parse-regs: Support the new SIMD format
  perf tools: regs: Support to dump regs for PERF_SAMPLE_REGS_ABI_SIMD

 arch/x86/events/core.c                        | 315 ++++++++++++-
 arch/x86/events/intel/core.c                  |  75 ++-
 arch/x86/events/intel/ds.c                    |  12 +-
 arch/x86/events/perf_event.h                  |  80 ++++
 arch/x86/include/asm/fpu/xstate.h             |   3 +
 arch/x86/include/asm/perf_event.h             |  30 +-
 arch/x86/include/uapi/asm/perf_regs.h         |  65 ++-
 arch/x86/kernel/fpu/xstate.c                  |  32 +-
 arch/x86/kernel/perf_regs.c                   | 139 +++++-
 include/linux/perf_event.h                    |  16 +
 include/linux/perf_regs.h                     |  26 +
 include/uapi/linux/perf_event.h               |  45 +-
 kernel/events/core.c                          | 111 ++++-
 tools/arch/x86/include/uapi/asm/perf_regs.h   |  65 ++-
 tools/include/uapi/linux/perf_event.h         |  45 +-
 tools/perf/arch/x86/include/perf_regs.h       |   2 +-
 tools/perf/arch/x86/util/perf_regs.c          | 443 +++++++++++++++++-
 tools/perf/util/evsel.c                       |  45 ++
 tools/perf/util/intel-pt.c                    |   2 +-
 tools/perf/util/parse-regs-options.c          | 133 +++++-
 .../perf/util/perf-regs-arch/perf_regs_x86.c  |  43 ++
 tools/perf/util/perf_event_attr_fprintf.c     |   6 +
 tools/perf/util/perf_regs.c                   |  54 +++
 tools/perf/util/perf_regs.h                   |  10 +
 tools/perf/util/record.h                      |   6 +
 tools/perf/util/sample.h                      |  10 +
 tools/perf/util/session.c                     |  78 ++-
 27 files changed, 1814 insertions(+), 77 deletions(-)


base-commit: 6d48436560e91be858158e227f21aab71698814e
-- 
2.34.1



* [Patch v4 01/17] perf/x86: Use x86_perf_regs in the x86 nmi handler
  2025-09-25  6:11 [Patch v4 00/17] Support vector and more extended registers in perf Dapeng Mi
@ 2025-09-25  6:11 ` Dapeng Mi
  2025-09-25  6:11 ` [Patch v4 02/17] perf/x86: Setup the regs data Dapeng Mi
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 22+ messages in thread
From: Dapeng Mi @ 2025-09-25  6:11 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Kan Liang,
	Andi Kleen, Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Dapeng Mi, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

More and more regs will be supported in the overflow handler, e.g., more
vector registers, SSP, etc. The generic pt_regs struct cannot store all
of them. Use an x86-specific struct x86_perf_regs instead.

The struct pt_regs *regs is still passed to x86_pmu_handle_irq(). There
is no functional change for the existing code.

AMD IBS's NMI handler doesn't utilize the static call
x86_pmu_handle_irq(), so the x86_perf_regs struct doesn't apply to AMD
IBS. It can be added separately later when AMD IBS supports more regs.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/core.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 745caa6c15a3..f4afef16cbab 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1752,6 +1752,7 @@ void perf_events_lapic_init(void)
 static int
 perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
 {
+	struct x86_perf_regs x86_regs;
 	u64 start_clock;
 	u64 finish_clock;
 	int ret;
@@ -1764,7 +1765,8 @@ perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
 		return NMI_DONE;
 
 	start_clock = sched_clock();
-	ret = static_call(x86_pmu_handle_irq)(regs);
+	x86_regs.regs = *regs;
+	ret = static_call(x86_pmu_handle_irq)(&x86_regs.regs);
 	finish_clock = sched_clock();
 
 	perf_sample_event_took(finish_clock - start_clock);
-- 
2.34.1



* [Patch v4 02/17] perf/x86: Setup the regs data
  2025-09-25  6:11 [Patch v4 00/17] Support vector and more extended registers in perf Dapeng Mi
  2025-09-25  6:11 ` [Patch v4 01/17] perf/x86: Use x86_perf_regs in the x86 nmi handler Dapeng Mi
@ 2025-09-25  6:11 ` Dapeng Mi
  2025-09-25  6:11 ` [Patch v4 03/17] x86/fpu/xstate: Add xsaves_nmi Dapeng Mi
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 22+ messages in thread
From: Dapeng Mi @ 2025-09-25  6:11 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Kan Liang,
	Andi Kleen, Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Dapeng Mi, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

The current code relies on the generic code to set up the regs data.
That will not work well when more regs are introduced.
Introduce an x86-specific x86_pmu_setup_regs_data().
For now, it does the same as the generic code. More x86-specific code
will be added later along with the new regs.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/core.c       | 32 ++++++++++++++++++++++++++++++++
 arch/x86/events/intel/ds.c   |  4 +++-
 arch/x86/events/perf_event.h |  4 ++++
 3 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index f4afef16cbab..92678f61f819 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1685,6 +1685,38 @@ static void x86_pmu_del(struct perf_event *event, int flags)
 	static_call_cond(x86_pmu_del)(event);
 }
 
+void x86_pmu_setup_regs_data(struct perf_event *event,
+			     struct perf_sample_data *data,
+			     struct pt_regs *regs)
+{
+	u64 sample_type = event->attr.sample_type;
+
+	if (sample_type & PERF_SAMPLE_REGS_USER) {
+		if (user_mode(regs)) {
+			data->regs_user.abi = perf_reg_abi(current);
+			data->regs_user.regs = regs;
+		} else if (!(current->flags & PF_KTHREAD)) {
+			perf_get_regs_user(&data->regs_user, regs);
+		} else {
+			data->regs_user.abi = PERF_SAMPLE_REGS_ABI_NONE;
+			data->regs_user.regs = NULL;
+		}
+		data->dyn_size += sizeof(u64);
+		if (data->regs_user.regs)
+			data->dyn_size += hweight64(event->attr.sample_regs_user) * sizeof(u64);
+		data->sample_flags |= PERF_SAMPLE_REGS_USER;
+	}
+
+	if (sample_type & PERF_SAMPLE_REGS_INTR) {
+		data->regs_intr.regs = regs;
+		data->regs_intr.abi = perf_reg_abi(current);
+		data->dyn_size += sizeof(u64);
+		if (data->regs_intr.regs)
+			data->dyn_size += hweight64(event->attr.sample_regs_intr) * sizeof(u64);
+		data->sample_flags |= PERF_SAMPLE_REGS_INTR;
+	}
+}
+
 int x86_pmu_handle_irq(struct pt_regs *regs)
 {
 	struct perf_sample_data data;
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index c0b7ac1c7594..e67d8a03ddfe 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -2126,8 +2126,10 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
 			regs->flags &= ~PERF_EFLAGS_EXACT;
 		}
 
-		if (sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER))
+		if (sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)) {
 			adaptive_pebs_save_regs(regs, gprs);
+			x86_pmu_setup_regs_data(event, data, regs);
+		}
 	}
 
 	if (format_group & PEBS_DATACFG_MEMINFO) {
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 2b969386dcdd..12682a059608 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1278,6 +1278,10 @@ void x86_pmu_enable_event(struct perf_event *event);
 
 int x86_pmu_handle_irq(struct pt_regs *regs);
 
+void x86_pmu_setup_regs_data(struct perf_event *event,
+			     struct perf_sample_data *data,
+			     struct pt_regs *regs);
+
 void x86_pmu_show_pmu_cap(struct pmu *pmu);
 
 static inline int x86_pmu_num_counters(struct pmu *pmu)
-- 
2.34.1



* [Patch v4 03/17] x86/fpu/xstate: Add xsaves_nmi
  2025-09-25  6:11 [Patch v4 00/17] Support vector and more extended registers in perf Dapeng Mi
  2025-09-25  6:11 ` [Patch v4 01/17] perf/x86: Use x86_perf_regs in the x86 nmi handler Dapeng Mi
  2025-09-25  6:11 ` [Patch v4 02/17] perf/x86: Setup the regs data Dapeng Mi
@ 2025-09-25  6:11 ` Dapeng Mi
  2025-09-25 15:07   ` Dave Hansen
  2025-09-25  6:12 ` [Patch v4 04/17] perf: Move has_extended_regs() to header file Dapeng Mi
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 22+ messages in thread
From: Dapeng Mi @ 2025-09-25  6:11 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Kan Liang,
	Andi Kleen, Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Dapeng Mi, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

There is a hardware feature (Intel PEBS XMMs group), which can handle
XSAVE "snapshots" from random code running. This just provides another
XSAVE data source at a random time.

Add an interface to retrieve the actual register contents when the NMI
hits. The interface is different from the other FPU interfaces. The
other mechanisms that deal with xstate try to get something coherent.
But this interface is *in*coherent. There's no telling what was in the
registers when an NMI hits. It writes out whatever was in the registers
when the NMI hit. It's the invoker's responsibility to make sure the
contents are properly filtered before exposing them to the end user.

Support for the supervisor state components is required and the
compacted storage format is preferred, so XSAVES is used.

Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/include/asm/fpu/xstate.h |  1 +
 arch/x86/kernel/fpu/xstate.c      | 30 ++++++++++++++++++++++++++++++
 2 files changed, 31 insertions(+)

diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 7a7dc9d56027..38fa8ff26559 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -110,6 +110,7 @@ int xfeature_size(int xfeature_nr);
 
 void xsaves(struct xregs_state *xsave, u64 mask);
 void xrstors(struct xregs_state *xsave, u64 mask);
+void xsaves_nmi(struct xregs_state *xsave, u64 mask);
 
 int xfd_enable_feature(u64 xfd_err);
 
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 12ed75c1b567..1ef62a137769 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1474,6 +1474,36 @@ void xrstors(struct xregs_state *xstate, u64 mask)
 	WARN_ON_ONCE(err);
 }
 
+/**
+ * xsaves_nmi - Save selected components to a kernel xstate buffer in NMI
+ * @xstate:	Pointer to the buffer
+ * @mask:	Feature mask to select the components to save
+ *
+ * The @xstate buffer must be 64 byte aligned.
+ *
+ * Caution: The interface is different from the other interfaces of FPU.
+ * The other mechanisms that deal with xstate try to get something coherent.
+ * But this interface is *in*coherent. There's no telling what was in the
+ * registers when a NMI hits. It writes whatever was in the registers when
+ * the NMI hit.
+ * The only user for the interface is perf_event. There is already a
+ * hardware feature (See Intel PEBS XMMs group), which can handle XSAVE
+ * "snapshots" from random code running. This just provides another XSAVE
+ * data source at a random time.
+ * This function can only be invoked in an NMI. It returns the *ACTUAL*
+ * register contents when the NMI hit.
+ */
+void xsaves_nmi(struct xregs_state *xstate, u64 mask)
+{
+	int err;
+
+	if (!in_nmi())
+		return;
+
+	XSTATE_OP(XSAVES, xstate, (u32)mask, (u32)(mask >> 32), err);
+	WARN_ON_ONCE(err);
+}
+
 #if IS_ENABLED(CONFIG_KVM)
 void fpstate_clear_xstate_component(struct fpstate *fpstate, unsigned int xfeature)
 {
-- 
2.34.1



* [Patch v4 04/17] perf: Move has_extended_regs() to header file
  2025-09-25  6:11 [Patch v4 00/17] Support vector and more extended registers in perf Dapeng Mi
                   ` (2 preceding siblings ...)
  2025-09-25  6:11 ` [Patch v4 03/17] x86/fpu/xstate: Add xsaves_nmi Dapeng Mi
@ 2025-09-25  6:12 ` Dapeng Mi
  2025-09-25  6:12 ` [Patch v4 05/17] perf/x86: Support XMM register for non-PEBS and REGS_USER Dapeng Mi
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 22+ messages in thread
From: Dapeng Mi @ 2025-09-25  6:12 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Kan Liang,
	Andi Kleen, Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Dapeng Mi, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

The function will also be used in the ARCH-specific code.

Rename it to follow the naming rule of the existing functions.

No functional change.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 include/linux/perf_event.h | 8 ++++++++
 kernel/events/core.c       | 8 +-------
 2 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index fd1d91017b99..1a647a1e6d08 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1526,6 +1526,14 @@ perf_event__output_id_sample(struct perf_event *event,
 extern void
 perf_log_lost_samples(struct perf_event *event, u64 lost);
 
+static inline bool event_has_extended_regs(struct perf_event *event)
+{
+	struct perf_event_attr *attr = &event->attr;
+
+	return (attr->sample_regs_user & PERF_REG_EXTENDED_MASK) ||
+	       (attr->sample_regs_intr & PERF_REG_EXTENDED_MASK);
+}
+
 static inline bool event_has_any_exclude_flag(struct perf_event *event)
 {
 	struct perf_event_attr *attr = &event->attr;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 28de3baff792..fe3a01cc4d92 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -12560,12 +12560,6 @@ int perf_pmu_unregister(struct pmu *pmu)
 }
 EXPORT_SYMBOL_GPL(perf_pmu_unregister);
 
-static inline bool has_extended_regs(struct perf_event *event)
-{
-	return (event->attr.sample_regs_user & PERF_REG_EXTENDED_MASK) ||
-	       (event->attr.sample_regs_intr & PERF_REG_EXTENDED_MASK);
-}
-
 static int perf_try_init_event(struct pmu *pmu, struct perf_event *event)
 {
 	struct perf_event_context *ctx = NULL;
@@ -12600,7 +12594,7 @@ static int perf_try_init_event(struct pmu *pmu, struct perf_event *event)
 		goto err_pmu;
 
 	if (!(pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS) &&
-	    has_extended_regs(event)) {
+	    event_has_extended_regs(event)) {
 		ret = -EOPNOTSUPP;
 		goto err_destroy;
 	}
-- 
2.34.1



* [Patch v4 05/17] perf/x86: Support XMM register for non-PEBS and REGS_USER
  2025-09-25  6:11 [Patch v4 00/17] Support vector and more extended registers in perf Dapeng Mi
                   ` (3 preceding siblings ...)
  2025-09-25  6:12 ` [Patch v4 04/17] perf: Move has_extended_regs() to header file Dapeng Mi
@ 2025-09-25  6:12 ` Dapeng Mi
  2025-09-25  6:12 ` [Patch v4 06/17] perf: Support SIMD registers Dapeng Mi
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 22+ messages in thread
From: Dapeng Mi @ 2025-09-25  6:12 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Kan Liang,
	Andi Kleen, Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Dapeng Mi, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

Collecting the XMM registers in a PEBS record has been supported since
Ice Lake, but non-PEBS events don't support the feature. It's possible
to retrieve the XMM registers via XSAVE for non-PEBS events. Add it to
make the feature complete.

To utilize XSAVE, a 64-byte aligned buffer is required. Add a per-CPU
ext_regs_buf to store the vector registers. The size of the buffer is
~2K. kzalloc_node() is used because there's a _guarantee_ that all
kmalloc() allocations with power-of-2 sizes are naturally aligned, and
therefore 64-byte aligned.

Extend the support to both REGS_USER and REGS_INTR. For REGS_USER,
perf_get_regs_user() returns the regs from task_pt_regs(current), which
is a struct pt_regs. They need to be copied into the per-CPU
struct x86_perf_regs x86_user_regs.
For PEBS, the HW support is still preferred. The XMM registers should be
retrieved from the PEBS records.

It's possible that a userspace application has entered kernel mode,
e.g., via a syscall, when an NMI is triggered. The pt_regs information
can still be retrieved from task_pt_regs(), but the SIMD registers
cannot. The SIMD registers are dropped in this case, because:
- To profile a userspace application which requires SIMD/eGPR regs, the
  NMI usually hits in userspace. It's not common to hit kernel mode.
- Although it's possible to retrieve the values when the
  TIF_NEED_FPU_LOAD flag is set, it's not worth introducing such
  complexity to handle an uncommon case in the critical path.
- Furthermore, only checking the TIF_NEED_FPU_LOAD flag cannot cure
  everything. Some corner cases cannot be handled either. For example,
  an NMI can happen right after the flag has been switched, while still
  in kernel mode.

There could be more vector registers supported later. Add ext_regs_mask
to track the supported vector register groups.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/core.c            | 142 ++++++++++++++++++++++++++----
 arch/x86/events/intel/core.c      |  27 ++++++
 arch/x86/events/intel/ds.c        |  10 ++-
 arch/x86/events/perf_event.h      |   9 +-
 arch/x86/include/asm/fpu/xstate.h |   2 +
 arch/x86/include/asm/perf_event.h |   5 +-
 arch/x86/kernel/fpu/xstate.c      |   2 +-
 7 files changed, 172 insertions(+), 25 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 92678f61f819..e363f5f2b37d 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -406,6 +406,61 @@ set_ext_hw_attr(struct hw_perf_event *hwc, struct perf_event *event)
 	return x86_pmu_extra_regs(val, event);
 }
 
+static DEFINE_PER_CPU(struct xregs_state *, ext_regs_buf);
+
+static void x86_pmu_get_ext_regs(struct x86_perf_regs *perf_regs, u64 mask)
+{
+	struct xregs_state *xsave = per_cpu(ext_regs_buf, smp_processor_id());
+	u64 valid_mask = x86_pmu.ext_regs_mask & mask;
+
+	if (WARN_ON_ONCE(!xsave))
+		return;
+
+	xsaves_nmi(xsave, valid_mask);
+
+	/* Filtered by what XSAVE really gives */
+	valid_mask &= xsave->header.xfeatures;
+
+	if (valid_mask & XFEATURE_MASK_SSE)
+		perf_regs->xmm_space = xsave->i387.xmm_space;
+}
+
+static void release_ext_regs_buffers(void)
+{
+	int cpu;
+
+	if (!x86_pmu.ext_regs_mask)
+		return;
+
+	for_each_possible_cpu(cpu) {
+		kfree(per_cpu(ext_regs_buf, cpu));
+		per_cpu(ext_regs_buf, cpu) = NULL;
+	}
+}
+
+static void reserve_ext_regs_buffers(void)
+{
+	unsigned int size;
+	int cpu;
+
+	if (!x86_pmu.ext_regs_mask)
+		return;
+
+	size = xstate_calculate_size(x86_pmu.ext_regs_mask, true);
+
+	for_each_possible_cpu(cpu) {
+		per_cpu(ext_regs_buf, cpu) = kzalloc_node(size, GFP_KERNEL,
+							  cpu_to_node(cpu));
+		if (!per_cpu(ext_regs_buf, cpu))
+			goto err;
+	}
+
+	return;
+
+err:
+	release_ext_regs_buffers();
+}
+
 int x86_reserve_hardware(void)
 {
 	int err = 0;
@@ -418,6 +473,7 @@ int x86_reserve_hardware(void)
 			} else {
 				reserve_ds_buffers();
 				reserve_lbr_buffers();
+				reserve_ext_regs_buffers();
 			}
 		}
 		if (!err)
@@ -434,6 +490,7 @@ void x86_release_hardware(void)
 		release_pmc_hardware();
 		release_ds_buffers();
 		release_lbr_buffers();
+		release_ext_regs_buffers();
 		mutex_unlock(&pmc_reserve_mutex);
 	}
 }
@@ -642,19 +699,17 @@ int x86_pmu_hw_config(struct perf_event *event)
 			return -EINVAL;
 	}
 
-	/* sample_regs_user never support XMM registers */
-	if (unlikely(event->attr.sample_regs_user & PERF_REG_EXTENDED_MASK))
-		return -EINVAL;
-	/*
-	 * Besides the general purpose registers, XMM registers may
-	 * be collected in PEBS on some platforms, e.g. Icelake
-	 */
-	if (unlikely(event->attr.sample_regs_intr & PERF_REG_EXTENDED_MASK)) {
-		if (!(event->pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS))
-			return -EINVAL;
-
-		if (!event->attr.precise_ip)
-			return -EINVAL;
+	if (event->attr.sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)) {
+		/*
+		 * Besides the general purpose registers, XMM registers may
+		 * be collected as well.
+		 */
+		if (event_has_extended_regs(event)) {
+			if (!(event->pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS))
+				return -EINVAL;
+			if (!event->attr.precise_ip)
+				return -EINVAL;
+		}
 	}
 
 	return x86_setup_perfctr(event);
@@ -1685,25 +1740,65 @@ static void x86_pmu_del(struct perf_event *event, int flags)
 	static_call_cond(x86_pmu_del)(event);
 }
 
+static DEFINE_PER_CPU(struct x86_perf_regs, x86_user_regs);
+
+static struct x86_perf_regs *
+x86_pmu_perf_get_regs_user(struct perf_sample_data *data,
+			   struct pt_regs *regs)
+{
+	struct x86_perf_regs *x86_regs_user = this_cpu_ptr(&x86_user_regs);
+	struct perf_regs regs_user;
+
+	perf_get_regs_user(&regs_user, regs);
+	data->regs_user.abi = regs_user.abi;
+	if (regs_user.regs) {
+		x86_regs_user->regs = *regs_user.regs;
+		data->regs_user.regs = &x86_regs_user->regs;
+	} else
+		data->regs_user.regs = NULL;
+	return x86_regs_user;
+}
+
+static bool x86_pmu_user_req_pt_regs_only(struct perf_event *event)
+{
+	return !(event->attr.sample_regs_user & PERF_REG_EXTENDED_MASK);
+}
+
 void x86_pmu_setup_regs_data(struct perf_event *event,
 			     struct perf_sample_data *data,
-			     struct pt_regs *regs)
+			     struct pt_regs *regs,
+			     u64 ignore_mask)
 {
-	u64 sample_type = event->attr.sample_type;
+	struct x86_perf_regs *perf_regs = container_of(regs, struct x86_perf_regs, regs);
+	struct perf_event_attr *attr = &event->attr;
+	u64 sample_type = attr->sample_type;
+	u64 mask = 0;
+
+	if (!(attr->sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)))
+		return;
 
 	if (sample_type & PERF_SAMPLE_REGS_USER) {
 		if (user_mode(regs)) {
 			data->regs_user.abi = perf_reg_abi(current);
 			data->regs_user.regs = regs;
-		} else if (!(current->flags & PF_KTHREAD)) {
-			perf_get_regs_user(&data->regs_user, regs);
+		} else if (!(current->flags & PF_KTHREAD) &&
+			   x86_pmu_user_req_pt_regs_only(event)) {
+			/*
+			 * It cannot guarantee that the kernel will never
+			 * touch the registers outside of the pt_regs,
+			 * especially when more and more registers
+			 * (e.g., SIMD, eGPR) are added. The live data
+			 * cannot be used.
+			 * Dump the registers when only pt_regs are required.
+			 */
+			perf_regs = x86_pmu_perf_get_regs_user(data, regs);
 		} else {
 			data->regs_user.abi = PERF_SAMPLE_REGS_ABI_NONE;
 			data->regs_user.regs = NULL;
 		}
 		data->dyn_size += sizeof(u64);
 		if (data->regs_user.regs)
-			data->dyn_size += hweight64(event->attr.sample_regs_user) * sizeof(u64);
+			data->dyn_size += hweight64(attr->sample_regs_user) * sizeof(u64);
 		data->sample_flags |= PERF_SAMPLE_REGS_USER;
 	}
 
@@ -1712,9 +1807,18 @@ void x86_pmu_setup_regs_data(struct perf_event *event,
 		data->regs_intr.abi = perf_reg_abi(current);
 		data->dyn_size += sizeof(u64);
 		if (data->regs_intr.regs)
-			data->dyn_size += hweight64(event->attr.sample_regs_intr) * sizeof(u64);
+			data->dyn_size += hweight64(attr->sample_regs_intr) * sizeof(u64);
 		data->sample_flags |= PERF_SAMPLE_REGS_INTR;
 	}
+
+	if (event_has_extended_regs(event)) {
+		perf_regs->xmm_regs = NULL;
+		mask |= XFEATURE_MASK_SSE;
+	}
+
+	mask &= ~ignore_mask;
+	if (mask)
+		x86_pmu_get_ext_regs(perf_regs, mask);
 }
 
 int x86_pmu_handle_irq(struct pt_regs *regs)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 28f5468a6ea3..2575ec0d2b77 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3281,6 +3281,8 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
 		if (has_branch_stack(event))
 			intel_pmu_lbr_save_brstack(&data, cpuc, event);
 
+		x86_pmu_setup_regs_data(event, &data, regs, 0);
+
 		perf_event_overflow(event, &data, regs);
 	}
 
@@ -5269,6 +5271,29 @@ static inline bool intel_pmu_broken_perf_cap(void)
 	return false;
 }
 
+static void intel_extended_regs_init(struct pmu *pmu)
+{
+	/*
+	 * Extend the vector registers support to non-PEBS.
+	 * The feature is limited to newer Intel machines with
+	 * PEBS V4+ or archPerfmonExt (0x23) enabled for now.
+	 * In theory, the vector registers can be retrieved as
+	 * long as the CPU supports. The support for the old
+	 * generations may be added later if there is a
+	 * requirement.
+	 * Only support the extension when XSAVES is available.
+	 */
+	if (!boot_cpu_has(X86_FEATURE_XSAVES))
+		return;
+
+	if (!boot_cpu_has(X86_FEATURE_XMM) ||
+	    !cpu_has_xfeatures(XFEATURE_MASK_SSE, NULL))
+		return;
+
+	x86_pmu.ext_regs_mask |= XFEATURE_MASK_SSE;
+	x86_get_pmu(smp_processor_id())->capabilities |= PERF_PMU_CAP_EXTENDED_REGS;
+}
+
 static void update_pmu_cap(struct pmu *pmu)
 {
 	unsigned int cntr, fixed_cntr, ecx, edx;
@@ -5303,6 +5328,8 @@ static void update_pmu_cap(struct pmu *pmu)
 		/* Perf Metric (Bit 15) and PEBS via PT (Bit 16) are hybrid enumeration */
 		rdmsrq(MSR_IA32_PERF_CAPABILITIES, hybrid(pmu, intel_cap).capabilities);
 	}
+
+	intel_extended_regs_init(pmu);
 }
 
 static void intel_pmu_check_hybrid_pmus(struct x86_hybrid_pmu *pmu)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index e67d8a03ddfe..f95dfee6adb2 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1415,8 +1415,7 @@ static u64 pebs_update_adaptive_cfg(struct perf_event *event)
 	if (gprs || (attr->precise_ip < 2) || tsx_weight)
 		pebs_data_cfg |= PEBS_DATACFG_GP;
 
-	if ((sample_type & PERF_SAMPLE_REGS_INTR) &&
-	    (attr->sample_regs_intr & PERF_REG_EXTENDED_MASK))
+	if (event_has_extended_regs(event))
 		pebs_data_cfg |= PEBS_DATACFG_XMMS;
 
 	if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
@@ -2127,8 +2126,12 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
 		}
 
 		if (sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)) {
+			u64 ignore_mask = 0;
+
+			if (format_group & PEBS_DATACFG_XMMS)
+				ignore_mask |= XFEATURE_MASK_SSE;
 			adaptive_pebs_save_regs(regs, gprs);
-			x86_pmu_setup_regs_data(event, data, regs);
+			x86_pmu_setup_regs_data(event, data, regs, ignore_mask);
 		}
 	}
 
@@ -2755,6 +2758,7 @@ void __init intel_pebs_init(void)
 				x86_pmu.flags |= PMU_FL_PEBS_ALL;
 				x86_pmu.pebs_capable = ~0ULL;
 				pebs_qual = "-baseline";
+				x86_pmu.ext_regs_mask |= XFEATURE_MASK_SSE;
 				x86_get_pmu(smp_processor_id())->capabilities |= PERF_PMU_CAP_EXTENDED_REGS;
 			} else {
 				/* Only basic record supported */
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 12682a059608..7bf24842b1dc 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -992,6 +992,12 @@ struct x86_pmu {
 	struct extra_reg *extra_regs;
 	unsigned int flags;
 
+	/*
+	 * Extended regs, e.g., vector registers
+	 * Utilize the same format as the XFEATURE_MASK_*
+	 */
+	u64		ext_regs_mask;
+
 	/*
 	 * Intel host/guest support (KVM)
 	 */
@@ -1280,7 +1286,8 @@ int x86_pmu_handle_irq(struct pt_regs *regs);
 
 void x86_pmu_setup_regs_data(struct perf_event *event,
 			     struct perf_sample_data *data,
-			     struct pt_regs *regs);
+			     struct pt_regs *regs,
+			     u64 ignore_mask);
 
 void x86_pmu_show_pmu_cap(struct pmu *pmu);
 
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 38fa8ff26559..19dec5f0b1c7 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -112,6 +112,8 @@ void xsaves(struct xregs_state *xsave, u64 mask);
 void xrstors(struct xregs_state *xsave, u64 mask);
 void xsaves_nmi(struct xregs_state *xsave, u64 mask);
 
+unsigned int xstate_calculate_size(u64 xfeatures, bool compacted);
+
 int xfd_enable_feature(u64 xfd_err);
 
 #ifdef CONFIG_X86_64
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 49a4d442f3fc..8f18903ea9d0 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -596,7 +596,10 @@ extern void perf_events_lapic_init(void);
 struct pt_regs;
 struct x86_perf_regs {
 	struct pt_regs	regs;
-	u64		*xmm_regs;
+	union {
+		u64	*xmm_regs;
+		u32	*xmm_space;	/* for xsaves */
+	};
 };
 
 extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 1ef62a137769..1b6e761dcf04 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -586,7 +586,7 @@ static bool __init check_xstate_against_struct(int nr)
 	return true;
 }
 
-static unsigned int xstate_calculate_size(u64 xfeatures, bool compacted)
+unsigned int xstate_calculate_size(u64 xfeatures, bool compacted)
 {
 	unsigned int topmost = fls64(xfeatures) -  1;
 	unsigned int offset, i;
-- 
2.34.1



* [Patch v4 06/17] perf: Support SIMD registers
  2025-09-25  6:11 [Patch v4 00/17] Support vector and more extended registers in perf Dapeng Mi
                   ` (4 preceding siblings ...)
  2025-09-25  6:12 ` [Patch v4 05/17] perf/x86: Support XMM register for non-PEBS and REGS_USER Dapeng Mi
@ 2025-09-25  6:12 ` Dapeng Mi
  2025-09-25  6:12 ` [Patch v4 07/17] perf/x86: Move XMM to sample_simd_vec_regs Dapeng Mi
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 22+ messages in thread
From: Dapeng Mi @ 2025-09-25  6:12 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Kan Liang,
	Andi Kleen, Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Dapeng Mi, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

Users may be interested in the SIMD registers in a sample while
profiling. The current sample_regs_XXX doesn't have enough space for all
the SIMD registers.

Add the sample_simd_{pred,vec}_reg_* fields to struct perf_event_attr
to define a set of SIMD registers to dump on samples.
X86 currently supports the XMM registers in sample_regs_XXX. To utilize
the new SIMD register configuration method, sample_simd_regs_enabled
should always be set. If so, the XMM space in sample_regs_XXX is
reserved for other usage.

The SIMD registers are wider than 64 bits, so a new output format is
introduced. The number and width of the SIMD registers are dumped first,
followed by the register values. For now, the number and width are the
same as the user's configuration. If, for some reason (e.g., on ARM),
they differ, an arch-specific perf_output_sample_simd_regs() can be
implemented separately later.
Add a new ABI flag, PERF_SAMPLE_REGS_ABI_SIMD, to indicate the new
format. The enum perf_sample_regs_abi now becomes a bitmap. There should
be no impact on existing tools, since the version and bitmap values are
the same for 1 and 2.

Add three new __weak functions:
perf_simd_reg_value: Retrieve the value of the requested SIMD register.
perf_simd_reg_validate: Validate the configuration of the SIMD
registers.
perf_simd_reg_check: Check and update the configuration of the requested
SIMD registers.

Add a new flag, PERF_PMU_CAP_SIMD_REGS, to indicate that the PMU has
the capability to support dumping SIMD registers. Error out if the
sample_simd_{pred,vec}_reg_* fields are mistakenly set for a PMU that
doesn't have the capability.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 include/linux/perf_event.h      |   8 +++
 include/linux/perf_regs.h       |  26 ++++++++
 include/uapi/linux/perf_event.h |  45 ++++++++++++--
 kernel/events/core.c            | 103 +++++++++++++++++++++++++++++++-
 4 files changed, 175 insertions(+), 7 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 1a647a1e6d08..8e995e6a4319 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -305,6 +305,7 @@ struct perf_event_pmu_context;
 #define PERF_PMU_CAP_EXTENDED_HW_TYPE	0x0100
 #define PERF_PMU_CAP_AUX_PAUSE		0x0200
 #define PERF_PMU_CAP_AUX_PREFER_LARGE	0x0400
+#define PERF_PMU_CAP_SIMD_REGS		0x0800
 
 /**
  * pmu::scope
@@ -1526,6 +1527,13 @@ perf_event__output_id_sample(struct perf_event *event,
 extern void
 perf_log_lost_samples(struct perf_event *event, u64 lost);
 
+static inline bool event_has_simd_regs(struct perf_event *event)
+{
+	struct perf_event_attr *attr = &event->attr;
+
+	return attr->sample_simd_regs_enabled != 0;
+}
+
 static inline bool event_has_extended_regs(struct perf_event *event)
 {
 	struct perf_event_attr *attr = &event->attr;
diff --git a/include/linux/perf_regs.h b/include/linux/perf_regs.h
index f632c5725f16..11d198cbb33a 100644
--- a/include/linux/perf_regs.h
+++ b/include/linux/perf_regs.h
@@ -9,6 +9,32 @@ struct perf_regs {
 	struct pt_regs	*regs;
 };
 
+int perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
+			   u16 pred_qwords, u32 pred_mask);
+u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
+			u16 qwords_idx, bool pred);
+/*
+ * Check and update the configuration of the requested SIMD registers
+ *
+ * regs: Used to locate the SIMD registers
+ * ignore: A mask to ignore the check of some configuration
+ * mask: The requested vector mask
+ * nr_vectors: Number of the vector registers
+ * vec_qwords: The QWORD of the vector registers
+ * pred_mask: The requested predicate mask
+ * nr_pred: Number of the predicate registers
+ * pred_qwords: The QWORD of the predicate registers
+ *
+ * It's possible (e.g., ARM) that the number and width of the dumped
+ * SIMD registers are a little different from the request.
+ * The function is to calculate the real number and width before dumping
+ * the data.
+ */
+void perf_simd_reg_check(struct pt_regs *regs, u64 ignore,
+			 u64 mask, u16 *nr_vectors, u16 *vec_qwords,
+			 u16 pred_mask, u16 *nr_pred, u16 *pred_qwords);
+
+
 #ifdef CONFIG_HAVE_PERF_REGS
 #include <asm/perf_regs.h>
 
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 78a362b80027..e69bc3b7a815 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -314,8 +314,9 @@ enum {
  */
 enum perf_sample_regs_abi {
 	PERF_SAMPLE_REGS_ABI_NONE		= 0,
-	PERF_SAMPLE_REGS_ABI_32			= 1,
-	PERF_SAMPLE_REGS_ABI_64			= 2,
+	PERF_SAMPLE_REGS_ABI_32			= (1 << 0),
+	PERF_SAMPLE_REGS_ABI_64			= (1 << 1),
+	PERF_SAMPLE_REGS_ABI_SIMD		= (1 << 2),
 };
 
 /*
@@ -382,6 +383,7 @@ enum perf_event_read_format {
 #define PERF_ATTR_SIZE_VER6			120	/* Add: aux_sample_size */
 #define PERF_ATTR_SIZE_VER7			128	/* Add: sig_data */
 #define PERF_ATTR_SIZE_VER8			136	/* Add: config3 */
+#define PERF_ATTR_SIZE_VER9			168	/* Add: sample_simd_{pred,vec}_reg_* */
 
 /*
  * 'struct perf_event_attr' contains various attributes that define
@@ -543,6 +545,25 @@ struct perf_event_attr {
 	__u64	sig_data;
 
 	__u64	config3; /* extension of config2 */
+
+
+	/*
+	 * Defines set of SIMD registers to dump on samples.
+	 * The sample_simd_regs_enabled !=0 implies the
+	 * set of SIMD registers is used to config all SIMD registers.
+	 * If !sample_simd_regs_enabled, sample_regs_XXX may be used to
+	 * config some SIMD registers on X86.
+	 */
+	union {
+		__u16 sample_simd_regs_enabled;
+		__u16 sample_simd_pred_reg_qwords;
+	};
+	__u32 sample_simd_pred_reg_intr;
+	__u32 sample_simd_pred_reg_user;
+	__u16 sample_simd_vec_reg_qwords;
+	__u64 sample_simd_vec_reg_intr;
+	__u64 sample_simd_vec_reg_user;
+	__u32 __reserved_4;
 };
 
 /*
@@ -1016,7 +1037,15 @@ enum perf_event_type {
 	 *      } && PERF_SAMPLE_BRANCH_STACK
 	 *
 	 *	{ u64			abi; # enum perf_sample_regs_abi
-	 *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
+	 *	  u64			regs[weight(mask)];
+	 *	  struct {
+	 *		u16 nr_vectors;
+	 *		u16 vector_qwords;
+	 *		u16 nr_pred;
+	 *		u16 pred_qwords;
+	 *		u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
+	 *	  } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
+	 *	} && PERF_SAMPLE_REGS_USER
 	 *
 	 *	{ u64			size;
 	 *	  char			data[size];
@@ -1043,7 +1072,15 @@ enum perf_event_type {
 	 *	{ u64			data_src; } && PERF_SAMPLE_DATA_SRC
 	 *	{ u64			transaction; } && PERF_SAMPLE_TRANSACTION
 	 *	{ u64			abi; # enum perf_sample_regs_abi
-	 *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
+	 *	  u64			regs[weight(mask)];
+	 *	  struct {
+	 *		u16 nr_vectors;
+	 *		u16 vector_qwords;
+	 *		u16 nr_pred;
+	 *		u16 pred_qwords;
+	 *		u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
+	 *	  } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
+	 *	} && PERF_SAMPLE_REGS_INTR
 	 *	{ u64			phys_addr;} && PERF_SAMPLE_PHYS_ADDR
 	 *	{ u64			cgroup;} && PERF_SAMPLE_CGROUP
 	 *	{ u64			data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
diff --git a/kernel/events/core.c b/kernel/events/core.c
index fe3a01cc4d92..e87d0429474d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7440,6 +7440,49 @@ perf_output_sample_regs(struct perf_output_handle *handle,
 	}
 }
 
+static void
+perf_output_sample_simd_regs(struct perf_output_handle *handle,
+			     struct perf_event *event,
+			     struct pt_regs *regs,
+			     u64 mask, u16 pred_mask)
+{
+	u16 pred_qwords = event->attr.sample_simd_pred_reg_qwords;
+	u16 vec_qwords = event->attr.sample_simd_vec_reg_qwords;
+	u16 nr_pred = hweight16(pred_mask);
+	u16 nr_vectors = hweight64(mask);
+	int bit;
+	u64 val;
+	u16 i;
+
+	/* Get the number of available regs */
+	perf_simd_reg_check(regs, 0, mask, &nr_vectors, &vec_qwords,
+			    pred_mask, &nr_pred, &pred_qwords);
+
+	perf_output_put(handle, nr_vectors);
+	perf_output_put(handle, vec_qwords);
+	perf_output_put(handle, nr_pred);
+	perf_output_put(handle, pred_qwords);
+
+	if (nr_vectors) {
+		for_each_set_bit(bit, (unsigned long *)&mask,
+				 sizeof(mask) * BITS_PER_BYTE) {
+			for (i = 0; i < vec_qwords; i++) {
+				val = perf_simd_reg_value(regs, bit, i, false);
+				perf_output_put(handle, val);
+			}
+		}
+	}
+	if (nr_pred) {
+		for_each_set_bit(bit, (unsigned long *)&pred_mask,
+				 sizeof(pred_mask) * BITS_PER_BYTE) {
+			for (i = 0; i < pred_qwords; i++) {
+				val = perf_simd_reg_value(regs, bit, i, true);
+				perf_output_put(handle, val);
+			}
+		}
+	}
+}
+
 static void perf_sample_regs_user(struct perf_regs *regs_user,
 				  struct pt_regs *regs)
 {
@@ -7461,6 +7504,25 @@ static void perf_sample_regs_intr(struct perf_regs *regs_intr,
 	regs_intr->abi  = perf_reg_abi(current);
 }
 
+int __weak perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
+				  u16 pred_qwords, u32 pred_mask)
+{
+	return vec_qwords || vec_mask || pred_qwords || pred_mask ? -EINVAL : 0;
+}
+
+u64 __weak perf_simd_reg_value(struct pt_regs *regs, int idx,
+			       u16 qwords_idx, bool pred)
+{
+	return 0;
+}
+
+void __weak perf_simd_reg_check(struct pt_regs *regs, u64 ignore,
+				u64 mask, u16 *nr_vectors, u16 *vec_qwords,
+				u16 pred_mask, u16 *nr_pred, u16 *pred_qwords)
+{
+	*nr_vectors = 0;
+	*nr_pred = 0;
+}
 
 /*
  * Get remaining task size from user stack pointer.
@@ -7993,10 +8055,17 @@ void perf_output_sample(struct perf_output_handle *handle,
 		perf_output_put(handle, abi);
 
 		if (abi) {
-			u64 mask = event->attr.sample_regs_user;
+			struct perf_event_attr *attr = &event->attr;
+			u64 mask = attr->sample_regs_user;
 			perf_output_sample_regs(handle,
 						data->regs_user.regs,
 						mask);
+			if (abi & PERF_SAMPLE_REGS_ABI_SIMD) {
+				perf_output_sample_simd_regs(handle, event,
+							     data->regs_user.regs,
+							     attr->sample_simd_vec_reg_user,
+							     attr->sample_simd_pred_reg_user);
+			}
 		}
 	}
 
@@ -8024,11 +8093,18 @@ void perf_output_sample(struct perf_output_handle *handle,
 		perf_output_put(handle, abi);
 
 		if (abi) {
-			u64 mask = event->attr.sample_regs_intr;
+			struct perf_event_attr *attr = &event->attr;
+			u64 mask = attr->sample_regs_intr;
 
 			perf_output_sample_regs(handle,
 						data->regs_intr.regs,
 						mask);
+			if (abi & PERF_SAMPLE_REGS_ABI_SIMD) {
+				perf_output_sample_simd_regs(handle, event,
+							     data->regs_intr.regs,
+							     attr->sample_simd_vec_reg_intr,
+							     attr->sample_simd_pred_reg_intr);
+			}
 		}
 	}
 
@@ -12593,6 +12669,12 @@ static int perf_try_init_event(struct pmu *pmu, struct perf_event *event)
 	if (ret)
 		goto err_pmu;
 
+	if (!(pmu->capabilities & PERF_PMU_CAP_SIMD_REGS) &&
+	    event_has_simd_regs(event)) {
+		ret = -EOPNOTSUPP;
+		goto err_destroy;
+	}
+
 	if (!(pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS) &&
 	    event_has_extended_regs(event)) {
 		ret = -EOPNOTSUPP;
@@ -13134,6 +13216,12 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
 		ret = perf_reg_validate(attr->sample_regs_user);
 		if (ret)
 			return ret;
+		ret = perf_simd_reg_validate(attr->sample_simd_vec_reg_qwords,
+					     attr->sample_simd_vec_reg_user,
+					     attr->sample_simd_pred_reg_qwords,
+					     attr->sample_simd_pred_reg_user);
+		if (ret)
+			return ret;
 	}
 
 	if (attr->sample_type & PERF_SAMPLE_STACK_USER) {
@@ -13154,8 +13242,17 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
 	if (!attr->sample_max_stack)
 		attr->sample_max_stack = sysctl_perf_event_max_stack;
 
-	if (attr->sample_type & PERF_SAMPLE_REGS_INTR)
+	if (attr->sample_type & PERF_SAMPLE_REGS_INTR) {
 		ret = perf_reg_validate(attr->sample_regs_intr);
+		if (ret)
+			return ret;
+		ret = perf_simd_reg_validate(attr->sample_simd_vec_reg_qwords,
+					     attr->sample_simd_vec_reg_intr,
+					     attr->sample_simd_pred_reg_qwords,
+					     attr->sample_simd_pred_reg_intr);
+		if (ret)
+			return ret;
+	}
 
 #ifndef CONFIG_CGROUP_PERF
 	if (attr->sample_type & PERF_SAMPLE_CGROUP)
-- 
2.34.1



* [Patch v4 07/17] perf/x86: Move XMM to sample_simd_vec_regs
  2025-09-25  6:11 [Patch v4 00/17] Support vector and more extended registers in perf Dapeng Mi
                   ` (5 preceding siblings ...)
  2025-09-25  6:12 ` [Patch v4 06/17] perf: Support SIMD registers Dapeng Mi
@ 2025-09-25  6:12 ` Dapeng Mi
  2025-09-25  6:12 ` [Patch v4 08/17] perf/x86: Add YMM into sample_simd_vec_regs Dapeng Mi
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 22+ messages in thread
From: Dapeng Mi @ 2025-09-25  6:12 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Kan Liang,
	Andi Kleen, Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Dapeng Mi, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

XMM0-15 are SIMD registers. Move them from sample_regs to
sample_simd_vec_regs. Reject access to the extended space of sample_regs
if the new sample_simd_vec_regs is used.

perf_reg_value() requires the ABI to understand the layout of
sample_regs. Add the ABI information to struct x86_perf_regs.

Implement the x86-specific perf_simd_reg_validate() to validate the
SIMD register configuration from the user tool. Only XMM0-15 are
supported now. More registers will be added in the following patches.
Implement the x86-specific perf_simd_reg_value() to retrieve the XMM
values.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/core.c                | 110 ++++++++++++++++++++++++--
 arch/x86/events/intel/ds.c            |   2 +-
 arch/x86/events/perf_event.h          |  12 +++
 arch/x86/include/asm/perf_event.h     |   1 +
 arch/x86/include/uapi/asm/perf_regs.h |  17 ++++
 arch/x86/kernel/perf_regs.c           |  63 ++++++++++++++-
 6 files changed, 195 insertions(+), 10 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index e363f5f2b37d..7b1b3eb80aa7 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -709,6 +709,22 @@ int x86_pmu_hw_config(struct perf_event *event)
 				return -EINVAL;
 			if (!event->attr.precise_ip)
 				return -EINVAL;
+			if (event->attr.sample_simd_regs_enabled)
+				return -EINVAL;
+		}
+
+		if (event_has_simd_regs(event)) {
+			if (!(event->pmu->capabilities & PERF_PMU_CAP_SIMD_REGS))
+				return -EINVAL;
+			/* Not require any vector registers but set width */
+			if (event->attr.sample_simd_vec_reg_qwords &&
+			    !event->attr.sample_simd_vec_reg_intr &&
+			    !event->attr.sample_simd_vec_reg_user)
+				return -EINVAL;
+			/* The vector registers set is not supported */
+			if (event_needs_xmm(event) &&
+			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_SSE))
+				return -EINVAL;
 		}
 	}
 
@@ -1749,6 +1765,7 @@ x86_pmu_perf_get_regs_user(struct perf_sample_data *data,
 	struct x86_perf_regs *x86_regs_user = this_cpu_ptr(&x86_user_regs);
 	struct perf_regs regs_user;
 
+	x86_regs_user->abi = PERF_SAMPLE_REGS_ABI_NONE;
 	perf_get_regs_user(&regs_user, regs);
 	data->regs_user.abi = regs_user.abi;
 	if (regs_user.regs) {
@@ -1761,23 +1778,43 @@ x86_pmu_perf_get_regs_user(struct perf_sample_data *data,
 
 static bool x86_pmu_user_req_pt_regs_only(struct perf_event *event)
 {
+	if (event->attr.sample_simd_regs_enabled)
+		return false;
 	return !(event->attr.sample_regs_user & PERF_REG_EXTENDED_MASK);
 }
 
-void x86_pmu_setup_regs_data(struct perf_event *event,
+static inline void
+x86_pmu_update_ext_regs_size(struct perf_event_attr *attr,
 			     struct perf_sample_data *data,
-			     struct pt_regs *regs,
-			     u64 ignore_mask)
+			     u64 ignore, struct pt_regs *regs,
+			     u64 mask, u16 pred_mask)
+{
+	u16 pred_qwords = attr->sample_simd_pred_reg_qwords;
+	u16 vec_qwords = attr->sample_simd_vec_reg_qwords;
+	u16 nr_pred = hweight16(pred_mask);
+	u16 nr_vectors = hweight64(mask);
+
+	perf_simd_reg_check(regs, ignore,
+			    mask, &nr_vectors, &vec_qwords,
+			    pred_mask, &nr_pred, &pred_qwords);
+	data->dyn_size += (nr_vectors * vec_qwords + nr_pred * pred_qwords) * sizeof(u64);
+}
+
+static void x86_pmu_setup_basic_regs_data(struct perf_event *event,
+					  struct perf_sample_data *data,
+					  struct pt_regs *regs)
 {
-	struct x86_perf_regs *perf_regs = container_of(regs, struct x86_perf_regs, regs);
 	struct perf_event_attr *attr = &event->attr;
 	u64 sample_type = attr->sample_type;
-	u64 mask = 0;
+	struct x86_perf_regs *perf_regs;
 
 	if (!(attr->sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)))
 		return;
 
 	if (sample_type & PERF_SAMPLE_REGS_USER) {
+		perf_regs = container_of(regs, struct x86_perf_regs, regs);
+		perf_regs->abi = PERF_SAMPLE_REGS_ABI_NONE;
+
 		if (user_mode(regs)) {
 			data->regs_user.abi = perf_reg_abi(current);
 			data->regs_user.regs = regs;
@@ -1799,26 +1836,83 @@ void x86_pmu_setup_regs_data(struct perf_event *event,
 		data->dyn_size += sizeof(u64);
 		if (data->regs_user.regs)
 			data->dyn_size += hweight64(attr->sample_regs_user) * sizeof(u64);
+		perf_regs->abi |= data->regs_user.abi;
 		data->sample_flags |= PERF_SAMPLE_REGS_USER;
 	}
 
 	if (sample_type & PERF_SAMPLE_REGS_INTR) {
+		perf_regs = container_of(regs, struct x86_perf_regs, regs);
+		perf_regs->abi = PERF_SAMPLE_REGS_ABI_NONE;
+
 		data->regs_intr.regs = regs;
 		data->regs_intr.abi = perf_reg_abi(current);
 		data->dyn_size += sizeof(u64);
 		if (data->regs_intr.regs)
 			data->dyn_size += hweight64(attr->sample_regs_intr) * sizeof(u64);
+		perf_regs->abi |= data->regs_intr.abi;
 		data->sample_flags |= PERF_SAMPLE_REGS_INTR;
 	}
+}
+
+static void x86_pmu_setup_extended_regs_data(struct perf_event *event,
+					     struct perf_sample_data *data,
+					     struct pt_regs *regs,
+					     u64 ignore_mask)
+{
+	struct perf_event_attr *attr = &event->attr;
+	u64 sample_type = attr->sample_type;
+	struct x86_perf_regs *perf_regs;
+	u64 mask = 0;
 
-	if (event_has_extended_regs(event)) {
-		perf_regs->xmm_regs = NULL;
+	if (!attr->sample_simd_regs_enabled)
+		return;
+
+	if (!(attr->sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)))
+		return;
+
+	perf_regs = container_of(regs, struct x86_perf_regs, regs);
+
+	perf_regs->xmm_regs = NULL;
+	if (event_needs_xmm(event))
 		mask |= XFEATURE_MASK_SSE;
-	}
 
 	mask &= ~ignore_mask;
 	if (mask)
 		x86_pmu_get_ext_regs(perf_regs, mask);
+
+	/* Update the data[] size */
+	if (sample_type & PERF_SAMPLE_REGS_USER && data->regs_user.abi) {
+		/* num and qwords of vector and pred registers */
+		data->dyn_size += sizeof(u64);
+		data->regs_user.abi |= PERF_SAMPLE_REGS_ABI_SIMD;
+		x86_pmu_update_ext_regs_size(attr, data, ignore_mask,
+					     data->regs_user.regs,
+					     attr->sample_simd_vec_reg_user,
+					     attr->sample_simd_pred_reg_user);
+	}
+
+	if (sample_type & PERF_SAMPLE_REGS_INTR && data->regs_intr.abi) {
+		/* num and qwords of vector and pred registers */
+		data->dyn_size += sizeof(u64);
+		data->regs_intr.abi |= PERF_SAMPLE_REGS_ABI_SIMD;
+		x86_pmu_update_ext_regs_size(attr, data, ignore_mask,
+					     data->regs_intr.regs,
+					     attr->sample_simd_vec_reg_intr,
+					     attr->sample_simd_pred_reg_intr);
+	}
+}
+
+void x86_pmu_setup_regs_data(struct perf_event *event,
+			     struct perf_sample_data *data,
+			     struct pt_regs *regs,
+			     u64 ignore_mask)
+{
+	x86_pmu_setup_basic_regs_data(event, data, regs);
+	/*
+	 * ignore_mask indicates the extended regs already sampled by PEBS,
+	 * which are unnecessary to sample again.
+	 */
+	x86_pmu_setup_extended_regs_data(event, data, regs, ignore_mask);
 }
 
 int x86_pmu_handle_irq(struct pt_regs *regs)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index f95dfee6adb2..59dbbc1c9968 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1415,7 +1415,7 @@ static u64 pebs_update_adaptive_cfg(struct perf_event *event)
 	if (gprs || (attr->precise_ip < 2) || tsx_weight)
 		pebs_data_cfg |= PEBS_DATACFG_GP;
 
-	if (event_has_extended_regs(event))
+	if (event_needs_xmm(event))
 		pebs_data_cfg |= PEBS_DATACFG_XMMS;
 
 	if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 7bf24842b1dc..6f22ed718a75 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -133,6 +133,18 @@ static inline bool is_acr_event_group(struct perf_event *event)
 	return check_leader_group(event->group_leader, PERF_X86_EVENT_ACR);
 }
 
+static inline bool event_needs_xmm(struct perf_event *event)
+{
+	if (event->attr.sample_simd_regs_enabled &&
+	    event->attr.sample_simd_vec_reg_qwords >= PERF_X86_XMM_QWORDS)
+		return true;
+
+	if (!event->attr.sample_simd_regs_enabled &&
+	    event_has_extended_regs(event))
+		return true;
+	return false;
+}
+
 struct amd_nb {
 	int nb_id;  /* NorthBridge id */
 	int refcnt; /* reference count */
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 8f18903ea9d0..fd4fe31e510b 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -596,6 +596,7 @@ extern void perf_events_lapic_init(void);
 struct pt_regs;
 struct x86_perf_regs {
 	struct pt_regs	regs;
+	u64		abi;
 	union {
 		u64	*xmm_regs;
 		u32	*xmm_space;	/* for xsaves */
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index 7c9d2bb3833b..c3862e5fdd6d 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -55,4 +55,21 @@ enum perf_event_x86_regs {
 
 #define PERF_REG_EXTENDED_MASK	(~((1ULL << PERF_REG_X86_XMM0) - 1))
 
+enum {
+	PERF_REG_X86_XMM,
+	PERF_REG_X86_MAX_SIMD_REGS,
+};
+
+enum {
+	PERF_X86_SIMD_XMM_REGS      = 16,
+	PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_XMM_REGS,
+};
+
+#define PERF_X86_SIMD_VEC_MASK		GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
+
+enum {
+	PERF_X86_XMM_QWORDS      = 2,
+	PERF_X86_SIMD_QWORDS_MAX = PERF_X86_XMM_QWORDS,
+};
+
 #endif /* _ASM_X86_PERF_REGS_H */
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 624703af80a1..6fd691cb7e64 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -57,12 +57,29 @@ static unsigned int pt_regs_offset[PERF_REG_X86_MAX] = {
 #endif
 };
 
+void perf_simd_reg_check(struct pt_regs *regs, u64 ignore,
+			 u64 mask, u16 *nr_vectors, u16 *vec_qwords,
+			 u16 pred_mask, u16 *nr_pred, u16 *pred_qwords)
+{
+	struct x86_perf_regs *perf_regs = container_of(regs, struct x86_perf_regs, regs);
+
+	if (!(ignore & XFEATURE_MASK_SSE) &&
+	    *vec_qwords >= PERF_X86_XMM_QWORDS &&
+	    !perf_regs->xmm_regs)
+		*nr_vectors = 0;
+
+	*nr_pred = 0;
+}
+
 u64 perf_reg_value(struct pt_regs *regs, int idx)
 {
 	struct x86_perf_regs *perf_regs;
 
 	if (idx >= PERF_REG_X86_XMM0 && idx < PERF_REG_X86_XMM_MAX) {
 		perf_regs = container_of(regs, struct x86_perf_regs, regs);
+		/* SIMD registers are moved to dedicated sample_simd_vec_reg */
+		if (perf_regs->abi & PERF_SAMPLE_REGS_ABI_SIMD)
+			return 0;
 		if (!perf_regs->xmm_regs)
 			return 0;
 		return perf_regs->xmm_regs[idx - PERF_REG_X86_XMM0];
@@ -74,6 +91,49 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 	return regs_get_register(regs, pt_regs_offset[idx]);
 }
 
+u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
+			u16 qwords_idx, bool pred)
+{
+	struct x86_perf_regs *perf_regs = container_of(regs, struct x86_perf_regs, regs);
+
+	if (pred)
+		return 0;
+
+	if (WARN_ON_ONCE(idx >= PERF_X86_SIMD_VEC_REGS_MAX ||
+			 qwords_idx >= PERF_X86_SIMD_QWORDS_MAX))
+		return 0;
+
+	if (qwords_idx < PERF_X86_XMM_QWORDS) {
+		if (!perf_regs->xmm_regs)
+			return 0;
+		return perf_regs->xmm_regs[idx * PERF_X86_XMM_QWORDS + qwords_idx];
+	}
+
+	return 0;
+}
+
+int perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
+			   u16 pred_qwords, u32 pred_mask)
+{
+	/* pred_qwords implies sample_simd_{pred,vec}_reg_* are supported */
+	if (!pred_qwords)
+		return 0;
+
+	if (!vec_qwords) {
+		if (vec_mask)
+			return -EINVAL;
+	} else {
+		if (vec_qwords != PERF_X86_XMM_QWORDS)
+			return -EINVAL;
+		if (vec_mask & ~PERF_X86_SIMD_VEC_MASK)
+			return -EINVAL;
+	}
+	if (pred_mask)
+		return -EINVAL;
+
+	return 0;
+}
+
 #define PERF_REG_X86_RESERVED	(((1ULL << PERF_REG_X86_XMM0) - 1) & \
 				 ~((1ULL << PERF_REG_X86_MAX) - 1))
 
@@ -114,7 +174,8 @@ void perf_get_regs_user(struct perf_regs *regs_user,
 
 int perf_reg_validate(u64 mask)
 {
-	if (!mask || (mask & (REG_NOSUPPORT | PERF_REG_X86_RESERVED)))
+	/* The mask could be 0 if only the SIMD registers are of interest */
+	if (mask & (REG_NOSUPPORT | PERF_REG_X86_RESERVED))
 		return -EINVAL;
 
 	return 0;
-- 
2.34.1



* [Patch v4 08/17] perf/x86: Add YMM into sample_simd_vec_regs
  2025-09-25  6:11 [Patch v4 00/17] Support vector and more extended registers in perf Dapeng Mi
                   ` (6 preceding siblings ...)
  2025-09-25  6:12 ` [Patch v4 07/17] perf/x86: Move XMM to sample_simd_vec_regs Dapeng Mi
@ 2025-09-25  6:12 ` Dapeng Mi
  2025-09-25  6:12 ` [Patch v4 09/17] perf/x86: Add ZMM " Dapeng Mi
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 22+ messages in thread
From: Dapeng Mi @ 2025-09-25  6:12 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Kan Liang,
	Andi Kleen, Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Dapeng Mi, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

YMM0-15 are composed of an XMM half and a YMMH half, so two XSAVE
components are needed to get the complete value. Internally, the XMM
and YMMH parts are stored in different structures, which follow the
XSAVE format, but the output dumps each YMM register as a whole.

A vector register width of 4 qwords implies YMM.
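
For illustration only, a minimal user-space sketch of walking the dumped
vector block, assuming the nr_vectors/vector_qwords header layout that
the uapi comment later in this series documents, that the vector data
precedes the predicate data, and that each register's qwords are dumped
contiguously; the struct and helper names are made up for the example:

#include <linux/types.h>

struct simd_regs_hdr {			/* mirrors the documented sample layout */
	__u16 nr_vectors;
	__u16 vector_qwords;
	__u16 nr_pred;
	__u16 pred_qwords;
};

/* Return qword 'q' of vector register 'reg' from the dumped data[]. */
static __u64 simd_vec_qword(const struct simd_regs_hdr *hdr,
			    const __u64 *data, unsigned int reg,
			    unsigned int q)
{
	if (reg >= hdr->nr_vectors || q >= hdr->vector_qwords)
		return 0;
	/* registers laid out back to back, vector_qwords each */
	return data[reg * hdr->vector_qwords + q];
}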

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/core.c                |  9 +++++++++
 arch/x86/events/perf_event.h          |  9 +++++++++
 arch/x86/include/asm/perf_event.h     |  4 ++++
 arch/x86/include/uapi/asm/perf_regs.h |  8 ++++++--
 arch/x86/kernel/perf_regs.c           | 14 +++++++++++++-
 5 files changed, 41 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 7b1b3eb80aa7..8543b96eeb58 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -423,6 +423,9 @@ static void x86_pmu_get_ext_regs(struct x86_perf_regs *perf_regs, u64 mask)
 
 	if (valid_mask & XFEATURE_MASK_SSE)
 		perf_regs->xmm_space = xsave->i387.xmm_space;
+
+	if (valid_mask & XFEATURE_MASK_YMM)
+		perf_regs->ymmh = get_xsave_addr(xsave, XFEATURE_YMM);
 }
 
 static void release_ext_regs_buffers(void)
@@ -725,6 +728,9 @@ int x86_pmu_hw_config(struct perf_event *event)
 			if (event_needs_xmm(event) &&
 			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_SSE))
 				return -EINVAL;
+			if (event_needs_ymm(event) &&
+			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_YMM))
+				return -EINVAL;
 		}
 	}
 
@@ -1875,6 +1881,9 @@ static void x86_pmu_setup_extended_regs_data(struct perf_event *event,
 	perf_regs->xmm_regs = NULL;
 	if (event_needs_xmm(event))
 		mask |= XFEATURE_MASK_SSE;
+	perf_regs->ymmh_regs = NULL;
+	if (event_needs_ymm(event))
+		mask |= XFEATURE_MASK_YMM;
 
 	mask &= ~ignore_mask;
 	if (mask)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 6f22ed718a75..3196191791a7 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -145,6 +145,15 @@ static inline bool event_needs_xmm(struct perf_event *event)
 	return false;
 }
 
+static inline bool event_needs_ymm(struct perf_event *event)
+{
+	if (event->attr.sample_simd_regs_enabled &&
+	    event->attr.sample_simd_vec_reg_qwords >= PERF_X86_YMM_QWORDS)
+		return true;
+
+	return false;
+}
+
 struct amd_nb {
 	int nb_id;  /* NorthBridge id */
 	int refcnt; /* reference count */
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index fd4fe31e510b..fd5338a89ba3 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -601,6 +601,10 @@ struct x86_perf_regs {
 		u64	*xmm_regs;
 		u32	*xmm_space;	/* for xsaves */
 	};
+	union {
+		u64	*ymmh_regs;
+		struct ymmh_struct *ymmh;
+	};
 };
 
 extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index c3862e5fdd6d..4fd598785f6d 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -57,19 +57,23 @@ enum perf_event_x86_regs {
 
 enum {
 	PERF_REG_X86_XMM,
+	PERF_REG_X86_YMM,
 	PERF_REG_X86_MAX_SIMD_REGS,
 };
 
 enum {
 	PERF_X86_SIMD_XMM_REGS      = 16,
-	PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_XMM_REGS,
+	PERF_X86_SIMD_YMM_REGS      = 16,
+	PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_YMM_REGS,
 };
 
 #define PERF_X86_SIMD_VEC_MASK		GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
 
 enum {
 	PERF_X86_XMM_QWORDS      = 2,
-	PERF_X86_SIMD_QWORDS_MAX = PERF_X86_XMM_QWORDS,
+	PERF_X86_YMMH_QWORDS     = 2,
+	PERF_X86_YMM_QWORDS      = 4,
+	PERF_X86_SIMD_QWORDS_MAX = PERF_X86_YMM_QWORDS,
 };
 
 #endif /* _ASM_X86_PERF_REGS_H */
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 6fd691cb7e64..1fcf8fa76607 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -68,6 +68,11 @@ void perf_simd_reg_check(struct pt_regs *regs, u64 ignore,
 	    !perf_regs->xmm_regs)
 		*nr_vectors = 0;
 
+	if (!(ignore & XFEATURE_MASK_YMM) &&
+	    *vec_qwords >= PERF_X86_YMM_QWORDS &&
+	    !perf_regs->ymmh_regs)
+		*vec_qwords = PERF_X86_XMM_QWORDS;
+
 	*nr_pred = 0;
 }
 
@@ -95,6 +100,7 @@ u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
 			u16 qwords_idx, bool pred)
 {
 	struct x86_perf_regs *perf_regs = container_of(regs, struct x86_perf_regs, regs);
+	int index;
 
 	if (pred)
 		return 0;
@@ -107,6 +113,11 @@ u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
 		if (!perf_regs->xmm_regs)
 			return 0;
 		return perf_regs->xmm_regs[idx * PERF_X86_XMM_QWORDS + qwords_idx];
+	} else if (qwords_idx < PERF_X86_YMM_QWORDS) {
+		if (!perf_regs->ymmh_regs)
+			return 0;
+		index = idx * PERF_X86_YMMH_QWORDS + qwords_idx - PERF_X86_XMM_QWORDS;
+		return perf_regs->ymmh_regs[index];
 	}
 
 	return 0;
@@ -123,7 +134,8 @@ int perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
 		if (vec_mask)
 			return -EINVAL;
 	} else {
-		if (vec_qwords != PERF_X86_XMM_QWORDS)
+		if (vec_qwords != PERF_X86_XMM_QWORDS &&
+		    vec_qwords != PERF_X86_YMM_QWORDS)
 			return -EINVAL;
 		if (vec_mask & ~PERF_X86_SIMD_VEC_MASK)
 			return -EINVAL;
-- 
2.34.1



* [Patch v4 09/17] perf/x86: Add ZMM into sample_simd_vec_regs
  2025-09-25  6:11 [Patch v4 00/17] Support vector and more extended registers in perf Dapeng Mi
                   ` (7 preceding siblings ...)
  2025-09-25  6:12 ` [Patch v4 08/17] perf/x86: Add YMM into sample_simd_vec_regs Dapeng Mi
@ 2025-09-25  6:12 ` Dapeng Mi
  2025-09-25  6:12 ` [Patch v4 10/17] perf/x86: Add OPMASK into sample_simd_pred_reg Dapeng Mi
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 22+ messages in thread
From: Dapeng Mi @ 2025-09-25  6:12 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Kan Liang,
	Andi Kleen, Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Dapeng Mi, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

ZMM0-15 are composed of XMM, YMMH, and ZMMH parts, so three XSAVE
components are needed to get the complete value.
ZMM16-31 (and the corresponding YMM16-31/XMM16-31) are also supported;
they only require the Hi16_ZMM XSAVE component.

Internally, the XMM, YMMH, ZMMH and Hi16_ZMM parts are stored in
different structures, which follow the XSAVE format, but the output
dumps each ZMM (or Hi16 XMM/YMM/ZMM) register as a whole.

A vector register width of 8 qwords implies ZMM.
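
As a rough sketch, requesting the full ZMM0-31 set with the new attr
fields might look as below (field names follow the uapi additions in
this series; the event, period, and error handling are placeholders):

	struct perf_event_attr attr = {
		.type                       = PERF_TYPE_HARDWARE,
		.config                     = PERF_COUNT_HW_CPU_CYCLES,
		.sample_period              = 100000,
		.sample_type                = PERF_SAMPLE_REGS_INTR,
		/* non-zero switches to the new SIMD register ABI */
		.sample_simd_regs_enabled   = 1,
		/* 8 qwords per vector register selects ZMM width */
		.sample_simd_vec_reg_qwords = 8,
		/* bits 0-31 request ZMM0-31; bits 16-31 need Hi16_ZMM */
		.sample_simd_vec_reg_intr   = 0xffffffffULL,
	};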

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/core.c                | 16 ++++++++++++++++
 arch/x86/events/perf_event.h          | 19 +++++++++++++++++++
 arch/x86/include/asm/perf_event.h     |  8 ++++++++
 arch/x86/include/uapi/asm/perf_regs.h | 11 +++++++++--
 arch/x86/kernel/perf_regs.c           | 24 +++++++++++++++++++++++-
 5 files changed, 75 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 8543b96eeb58..87572b85d234 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -426,6 +426,10 @@ static void x86_pmu_get_ext_regs(struct x86_perf_regs *perf_regs, u64 mask)
 
 	if (valid_mask & XFEATURE_MASK_YMM)
 		perf_regs->ymmh = get_xsave_addr(xsave, XFEATURE_YMM);
+	if (valid_mask & XFEATURE_MASK_ZMM_Hi256)
+		perf_regs->zmmh = get_xsave_addr(xsave, XFEATURE_ZMM_Hi256);
+	if (valid_mask & XFEATURE_MASK_Hi16_ZMM)
+		perf_regs->h16zmm = get_xsave_addr(xsave, XFEATURE_Hi16_ZMM);
 }
 
 static void release_ext_regs_buffers(void)
@@ -731,6 +735,12 @@ int x86_pmu_hw_config(struct perf_event *event)
 			if (event_needs_ymm(event) &&
 			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_YMM))
 				return -EINVAL;
+			if (event_needs_low16_zmm(event) &&
+			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_ZMM_Hi256))
+				return -EINVAL;
+			if (event_needs_high16_zmm(event) &&
+			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_Hi16_ZMM))
+				return -EINVAL;
 		}
 	}
 
@@ -1884,6 +1894,12 @@ static void x86_pmu_setup_extended_regs_data(struct perf_event *event,
 	perf_regs->ymmh_regs = NULL;
 	if (event_needs_ymm(event))
 		mask |= XFEATURE_MASK_YMM;
+	perf_regs->zmmh_regs = NULL;
+	if (event_needs_low16_zmm(event))
+		mask |= XFEATURE_MASK_ZMM_Hi256;
+	perf_regs->h16zmm_regs = NULL;
+	if (event_needs_high16_zmm(event))
+		mask |= XFEATURE_MASK_Hi16_ZMM;
 
 	mask &= ~ignore_mask;
 	if (mask)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 3196191791a7..3d6a5739d86e 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -154,6 +154,25 @@ static inline bool event_needs_ymm(struct perf_event *event)
 	return false;
 }
 
+static inline bool event_needs_low16_zmm(struct perf_event *event)
+{
+	if (event->attr.sample_simd_regs_enabled &&
+	    event->attr.sample_simd_vec_reg_qwords >= PERF_X86_ZMM_QWORDS)
+		return true;
+
+	return false;
+}
+
+static inline bool event_needs_high16_zmm(struct perf_event *event)
+{
+	if (event->attr.sample_simd_regs_enabled &&
+	    (fls64(event->attr.sample_simd_vec_reg_intr) > PERF_X86_H16ZMM_BASE ||
+	     fls64(event->attr.sample_simd_vec_reg_user) > PERF_X86_H16ZMM_BASE))
+		return true;
+
+	return false;
+}
+
 struct amd_nb {
 	int nb_id;  /* NorthBridge id */
 	int refcnt; /* reference count */
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index fd5338a89ba3..44e89adedc61 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -605,6 +605,14 @@ struct x86_perf_regs {
 		u64	*ymmh_regs;
 		struct ymmh_struct *ymmh;
 	};
+	union {
+		u64	*zmmh_regs;
+		struct avx_512_zmm_uppers_state *zmmh;
+	};
+	union {
+		u64	*h16zmm_regs;
+		struct avx_512_hi16_state *h16zmm;
+	};
 };
 
 extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index 4fd598785f6d..96db454c7923 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -58,22 +58,29 @@ enum perf_event_x86_regs {
 enum {
 	PERF_REG_X86_XMM,
 	PERF_REG_X86_YMM,
+	PERF_REG_X86_ZMM,
 	PERF_REG_X86_MAX_SIMD_REGS,
 };
 
 enum {
 	PERF_X86_SIMD_XMM_REGS      = 16,
 	PERF_X86_SIMD_YMM_REGS      = 16,
-	PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_YMM_REGS,
+	PERF_X86_SIMD_ZMMH_REGS     = 16,
+	PERF_X86_SIMD_ZMM_REGS      = 32,
+	PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_ZMM_REGS,
 };
 
 #define PERF_X86_SIMD_VEC_MASK		GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
 
+#define PERF_X86_H16ZMM_BASE		PERF_X86_SIMD_ZMMH_REGS
+
 enum {
 	PERF_X86_XMM_QWORDS      = 2,
 	PERF_X86_YMMH_QWORDS     = 2,
 	PERF_X86_YMM_QWORDS      = 4,
-	PERF_X86_SIMD_QWORDS_MAX = PERF_X86_YMM_QWORDS,
+	PERF_X86_ZMMH_QWORDS     = 4,
+	PERF_X86_ZMM_QWORDS      = 8,
+	PERF_X86_SIMD_QWORDS_MAX = PERF_X86_ZMM_QWORDS,
 };
 
 #endif /* _ASM_X86_PERF_REGS_H */
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 1fcf8fa76607..8d877b2be957 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -73,6 +73,16 @@ void perf_simd_reg_check(struct pt_regs *regs, u64 ignore,
 	    !perf_regs->ymmh_regs)
 		*vec_qwords = PERF_X86_XMM_QWORDS;
 
+	if (!(ignore & XFEATURE_MASK_ZMM_Hi256) &&
+	    *vec_qwords >= PERF_X86_ZMM_QWORDS &&
+	    !perf_regs->zmmh_regs)
+		*vec_qwords = PERF_X86_YMM_QWORDS;
+
+	if (!(ignore & XFEATURE_MASK_Hi16_ZMM) &&
+	    *nr_vectors > PERF_X86_H16ZMM_BASE &&
+	    !perf_regs->h16zmm_regs)
+		*nr_vectors = PERF_X86_H16ZMM_BASE;
+
 	*nr_pred = 0;
 }
 
@@ -109,6 +119,12 @@ u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
 			 qwords_idx >= PERF_X86_SIMD_QWORDS_MAX))
 		return 0;
 
+	if (idx >= PERF_X86_H16ZMM_BASE) {
+		if (!perf_regs->h16zmm_regs)
+			return 0;
+		return perf_regs->h16zmm_regs[idx * PERF_X86_ZMM_QWORDS + qwords_idx];
+	}
+
 	if (qwords_idx < PERF_X86_XMM_QWORDS) {
 		if (!perf_regs->xmm_regs)
 			return 0;
@@ -118,6 +134,11 @@ u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
 			return 0;
 		index = idx * PERF_X86_YMMH_QWORDS + qwords_idx - PERF_X86_XMM_QWORDS;
 		return perf_regs->ymmh_regs[index];
+	} else if (qwords_idx < PERF_X86_ZMM_QWORDS) {
+		if (!perf_regs->zmmh_regs)
+			return 0;
+		index = idx * PERF_X86_ZMMH_QWORDS + qwords_idx - PERF_X86_YMM_QWORDS;
+		return perf_regs->zmmh_regs[index];
 	}
 
 	return 0;
@@ -135,7 +156,8 @@ int perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
 			return -EINVAL;
 	} else {
 		if (vec_qwords != PERF_X86_XMM_QWORDS &&
-		    vec_qwords != PERF_X86_YMM_QWORDS)
+		    vec_qwords != PERF_X86_YMM_QWORDS &&
+		    vec_qwords != PERF_X86_ZMM_QWORDS)
 			return -EINVAL;
 		if (vec_mask & ~PERF_X86_SIMD_VEC_MASK)
 			return -EINVAL;
-- 
2.34.1



* [Patch v4 10/17] perf/x86: Add OPMASK into sample_simd_pred_reg
  2025-09-25  6:11 [Patch v4 00/17] Support vector and more extended registers in perf Dapeng Mi
                   ` (8 preceding siblings ...)
  2025-09-25  6:12 ` [Patch v4 09/17] perf/x86: Add ZMM " Dapeng Mi
@ 2025-09-25  6:12 ` Dapeng Mi
  2025-09-25  6:12 ` [Patch v4 11/17] perf/x86: Add eGPRs into sample_regs Dapeng Mi
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 22+ messages in thread
From: Dapeng Mi @ 2025-09-25  6:12 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Kan Liang,
	Andi Kleen, Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Dapeng Mi, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

The OPMASK registers are the SIMD predicate registers. Add them to
sample_simd_pred_reg. Each OPMASK register is 1 qword wide, and there
are 8 of them.
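
A hedged configuration sketch for sampling the OPMASK registers via the
new predicate fields (names follow the uapi additions in this series;
note that sample_simd_pred_reg_qwords aliases sample_simd_regs_enabled,
so a non-zero value also selects the new ABI):

	struct perf_event_attr attr = {
		.type                        = PERF_TYPE_HARDWARE,
		.config                      = PERF_COUNT_HW_CPU_CYCLES,
		.sample_period               = 100000,
		.sample_type                 = PERF_SAMPLE_REGS_INTR,
		/* OPMASK is 1 qword wide; non-zero also enables the new ABI */
		.sample_simd_pred_reg_qwords = 1,
		/* bits 0-7 request OPMASK0-7 */
		.sample_simd_pred_reg_intr   = 0xff,
	};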

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/core.c                |  8 ++++++++
 arch/x86/events/perf_event.h          |  9 +++++++++
 arch/x86/include/asm/perf_event.h     |  4 ++++
 arch/x86/include/uapi/asm/perf_regs.h |  8 ++++++++
 arch/x86/kernel/perf_regs.c           | 19 +++++++++++++++----
 5 files changed, 44 insertions(+), 4 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 87572b85d234..c942c6f808ca 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -430,6 +430,8 @@ static void x86_pmu_get_ext_regs(struct x86_perf_regs *perf_regs, u64 mask)
 		perf_regs->zmmh = get_xsave_addr(xsave, XFEATURE_ZMM_Hi256);
 	if (valid_mask & XFEATURE_MASK_Hi16_ZMM)
 		perf_regs->h16zmm = get_xsave_addr(xsave, XFEATURE_Hi16_ZMM);
+	if (valid_mask & XFEATURE_MASK_OPMASK)
+		perf_regs->opmask = get_xsave_addr(xsave, XFEATURE_OPMASK);
 }
 
 static void release_ext_regs_buffers(void)
@@ -741,6 +743,9 @@ int x86_pmu_hw_config(struct perf_event *event)
 			if (event_needs_high16_zmm(event) &&
 			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_Hi16_ZMM))
 				return -EINVAL;
+			if (event_needs_opmask(event) &&
+			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_OPMASK))
+				return -EINVAL;
 		}
 	}
 
@@ -1900,6 +1905,9 @@ static void x86_pmu_setup_extended_regs_data(struct perf_event *event,
 	perf_regs->h16zmm_regs = NULL;
 	if (event_needs_high16_zmm(event))
 		mask |= XFEATURE_MASK_Hi16_ZMM;
+	perf_regs->opmask_regs = NULL;
+	if (event_needs_opmask(event))
+		mask |= XFEATURE_MASK_OPMASK;
 
 	mask &= ~ignore_mask;
 	if (mask)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 3d6a5739d86e..4584de1c79a3 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -173,6 +173,15 @@ static inline bool event_needs_high16_zmm(struct perf_event *event)
 	return false;
 }
 
+static inline bool event_needs_opmask(struct perf_event *event)
+{
+	if (event->attr.sample_simd_regs_enabled &&
+	    (event->attr.sample_simd_pred_reg_intr || event->attr.sample_simd_pred_reg_user))
+		return true;
+
+	return false;
+}
+
 struct amd_nb {
 	int nb_id;  /* NorthBridge id */
 	int refcnt; /* reference count */
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 44e89adedc61..d8cac3f9f8df 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -613,6 +613,10 @@ struct x86_perf_regs {
 		u64	*h16zmm_regs;
 		struct avx_512_hi16_state *h16zmm;
 	};
+	union {
+		u64	*opmask_regs;
+		struct avx_512_opmask_state *opmask;
+	};
 };
 
 extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index 96db454c7923..6f29fd9495a2 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -60,6 +60,9 @@ enum {
 	PERF_REG_X86_YMM,
 	PERF_REG_X86_ZMM,
 	PERF_REG_X86_MAX_SIMD_REGS,
+
+	PERF_REG_X86_OPMASK = 0,
+	PERF_REG_X86_MAX_PRED_REGS = 1,
 };
 
 enum {
@@ -68,13 +71,18 @@ enum {
 	PERF_X86_SIMD_ZMMH_REGS     = 16,
 	PERF_X86_SIMD_ZMM_REGS      = 32,
 	PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_ZMM_REGS,
+
+	PERF_X86_SIMD_OPMASK_REGS   = 8,
+	PERF_X86_SIMD_PRED_REGS_MAX = PERF_X86_SIMD_OPMASK_REGS,
 };
 
+#define PERF_X86_SIMD_PRED_MASK		GENMASK(PERF_X86_SIMD_PRED_REGS_MAX - 1, 0)
 #define PERF_X86_SIMD_VEC_MASK		GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
 
 #define PERF_X86_H16ZMM_BASE		PERF_X86_SIMD_ZMMH_REGS
 
 enum {
+	PERF_X86_OPMASK_QWORDS   = 1,
 	PERF_X86_XMM_QWORDS      = 2,
 	PERF_X86_YMMH_QWORDS     = 2,
 	PERF_X86_YMM_QWORDS      = 4,
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 8d877b2be957..f6e9cde11ba1 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -83,7 +83,9 @@ void perf_simd_reg_check(struct pt_regs *regs, u64 ignore,
 	    !perf_regs->h16zmm_regs)
 		*nr_vectors = PERF_X86_H16ZMM_BASE;
 
-	*nr_pred = 0;
+	if (!(ignore & XFEATURE_MASK_OPMASK) &&
+	    *nr_pred && !perf_regs->opmask_regs)
+		*nr_pred = 0;
 }
 
 u64 perf_reg_value(struct pt_regs *regs, int idx)
@@ -112,8 +114,14 @@ u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
 	struct x86_perf_regs *perf_regs = container_of(regs, struct x86_perf_regs, regs);
 	int index;
 
-	if (pred)
-		return 0;
+	if (pred) {
+		if (WARN_ON_ONCE(idx >= PERF_X86_SIMD_PRED_REGS_MAX ||
+				 qwords_idx >= PERF_X86_OPMASK_QWORDS))
+			return 0;
+		if (!perf_regs->opmask_regs)
+			return 0;
+		return perf_regs->opmask_regs[idx];
+	}
 
 	if (WARN_ON_ONCE(idx >= PERF_X86_SIMD_VEC_REGS_MAX ||
 			 qwords_idx >= PERF_X86_SIMD_QWORDS_MAX))
@@ -162,7 +170,10 @@ int perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
 		if (vec_mask & ~PERF_X86_SIMD_VEC_MASK)
 			return -EINVAL;
 	}
-	if (pred_mask)
+
+	if (pred_qwords != PERF_X86_OPMASK_QWORDS)
+		return -EINVAL;
+	if (pred_mask & ~PERF_X86_SIMD_PRED_MASK)
 		return -EINVAL;
 
 	return 0;
-- 
2.34.1



* [Patch v4 11/17] perf/x86: Add eGPRs into sample_regs
  2025-09-25  6:11 [Patch v4 00/17] Support vector and more extended registers in perf Dapeng Mi
                   ` (9 preceding siblings ...)
  2025-09-25  6:12 ` [Patch v4 10/17] perf/x86: Add OPMASK into sample_simd_pred_reg Dapeng Mi
@ 2025-09-25  6:12 ` Dapeng Mi
  2025-09-25  6:12 ` [Patch v4 12/17] perf/x86: Add SSP " Dapeng Mi
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 22+ messages in thread
From: Dapeng Mi @ 2025-09-25  6:12 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Kan Liang,
	Andi Kleen, Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Dapeng Mi, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

The eGPRs are only supported when the new SIMD register configuration
method is used, which moves XMM to sample_simd_vec_regs. The freed
sample_regs space can then be reclaimed for the eGPRs.

The eGPRs are retrieved via XSAVE. They are only supported on X86_64.
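
A hedged sketch of requesting the eGPRs through the reclaimed
sample_regs bits, given an already-initialized struct perf_event_attr
attr (the PERF_REG_X86_R16..R31 values come from the perf_regs.h change
below; the surrounding setup is illustrative):

	__u64 egprs = 0;
	int i;

	/* PERF_REG_X86_R16..R31 select the APX eGPRs in the new ABI */
	for (i = PERF_REG_X86_R16; i <= PERF_REG_X86_R31; i++)
		egprs |= 1ULL << i;

	attr.sample_simd_regs_enabled = 1;	/* new ABI, XMM bits reclaimed */
	attr.sample_regs_intr        |= egprs;	/* eGPRs retrieved via XSAVE */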

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/core.c                | 41 ++++++++++++++++++---------
 arch/x86/events/perf_event.h          | 10 +++++++
 arch/x86/include/asm/perf_event.h     |  4 +++
 arch/x86/include/uapi/asm/perf_regs.h | 26 ++++++++++++++++-
 arch/x86/kernel/perf_regs.c           | 31 ++++++++++----------
 5 files changed, 83 insertions(+), 29 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index c942c6f808ca..a435610f4d4a 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -432,6 +432,8 @@ static void x86_pmu_get_ext_regs(struct x86_perf_regs *perf_regs, u64 mask)
 		perf_regs->h16zmm = get_xsave_addr(xsave, XFEATURE_Hi16_ZMM);
 	if (valid_mask & XFEATURE_MASK_OPMASK)
 		perf_regs->opmask = get_xsave_addr(xsave, XFEATURE_OPMASK);
+	if (valid_mask & XFEATURE_MASK_APX)
+		perf_regs->egpr = get_xsave_addr(xsave, XFEATURE_APX);
 }
 
 static void release_ext_regs_buffers(void)
@@ -709,22 +711,21 @@ int x86_pmu_hw_config(struct perf_event *event)
 	}
 
 	if (event->attr.sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)) {
-		/*
-		 * Besides the general purpose registers, XMM registers may
-		 * be collected as well.
-		 */
-		if (event_has_extended_regs(event)) {
-			if (!(event->pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS))
-				return -EINVAL;
-			if (!event->attr.precise_ip)
-				return -EINVAL;
-			if (event->attr.sample_simd_regs_enabled)
-				return -EINVAL;
-		}
-
 		if (event_has_simd_regs(event)) {
+			u64 reserved = ~GENMASK_ULL(PERF_REG_X86_64_MAX - 1, 0);
+
 			if (!(event->pmu->capabilities & PERF_PMU_CAP_SIMD_REGS))
 				return -EINVAL;
+			/*
+			 * The XMM space in the perf_event_x86_regs is reclaimed
+			 * for eGPRs and other general registers.
+			 */
+			if (event->attr.sample_regs_user & reserved ||
+			    event->attr.sample_regs_intr & reserved)
+				return -EINVAL;
+			if (event_needs_egprs(event) &&
+			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_APX))
+				return -EINVAL;
 			/* Not require any vector registers but set width */
 			if (event->attr.sample_simd_vec_reg_qwords &&
 			    !event->attr.sample_simd_vec_reg_intr &&
@@ -746,6 +747,17 @@ int x86_pmu_hw_config(struct perf_event *event)
 			if (event_needs_opmask(event) &&
 			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_OPMASK))
 				return -EINVAL;
+		} else {
+			/*
+			 * Besides the general purpose registers, XMM registers may
+			 * be collected as well.
+			 */
+			if (event_has_extended_regs(event)) {
+				if (!(event->pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS))
+					return -EINVAL;
+				if (!event->attr.precise_ip)
+					return -EINVAL;
+			}
 		}
 	}
 
@@ -1908,6 +1920,9 @@ static void x86_pmu_setup_extended_regs_data(struct perf_event *event,
 	perf_regs->opmask_regs = NULL;
 	if (event_needs_opmask(event))
 		mask |= XFEATURE_MASK_OPMASK;
+	perf_regs->egpr_regs = NULL;
+	if (event_needs_egprs(event))
+		mask |= XFEATURE_MASK_APX;
 
 	mask &= ~ignore_mask;
 	if (mask)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 4584de1c79a3..3dd0e669ddd4 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -182,6 +182,16 @@ static inline bool event_needs_opmask(struct perf_event *event)
 	return false;
 }
 
+static inline bool event_needs_egprs(struct perf_event *event)
+{
+	if (event->attr.sample_simd_regs_enabled &&
+	    (event->attr.sample_regs_user & PERF_X86_EGPRS_MASK ||
+	     event->attr.sample_regs_intr & PERF_X86_EGPRS_MASK))
+		return true;
+
+	return false;
+}
+
 struct amd_nb {
 	int nb_id;  /* NorthBridge id */
 	int refcnt; /* reference count */
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index d8cac3f9f8df..73c2064c13f9 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -617,6 +617,10 @@ struct x86_perf_regs {
 		u64	*opmask_regs;
 		struct avx_512_opmask_state *opmask;
 	};
+	union {
+		u64	*egpr_regs;
+		struct apx_state *egpr;
+	};
 };
 
 extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index 6f29fd9495a2..38644de89815 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -27,9 +27,32 @@ enum perf_event_x86_regs {
 	PERF_REG_X86_R13,
 	PERF_REG_X86_R14,
 	PERF_REG_X86_R15,
+	/*
+	 * The EGPRs and XMM have overlaps. Only one can be used
+	 * at a time. For the ABI type PERF_SAMPLE_REGS_ABI_SIMD,
+	 * utilize EGPRs. For the other ABI type, XMM is used.
+	 *
+	 * Extended GPRs (EGPRs)
+	 */
+	PERF_REG_X86_R16,
+	PERF_REG_X86_R17,
+	PERF_REG_X86_R18,
+	PERF_REG_X86_R19,
+	PERF_REG_X86_R20,
+	PERF_REG_X86_R21,
+	PERF_REG_X86_R22,
+	PERF_REG_X86_R23,
+	PERF_REG_X86_R24,
+	PERF_REG_X86_R25,
+	PERF_REG_X86_R26,
+	PERF_REG_X86_R27,
+	PERF_REG_X86_R28,
+	PERF_REG_X86_R29,
+	PERF_REG_X86_R30,
+	PERF_REG_X86_R31,
 	/* These are the limits for the GPRs. */
 	PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
-	PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1,
+	PERF_REG_X86_64_MAX = PERF_REG_X86_R31 + 1,
 
 	/* These all need two bits set because they are 128bit */
 	PERF_REG_X86_XMM0  = 32,
@@ -54,6 +77,7 @@ enum perf_event_x86_regs {
 };
 
 #define PERF_REG_EXTENDED_MASK	(~((1ULL << PERF_REG_X86_XMM0) - 1))
+#define PERF_X86_EGPRS_MASK	GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16)
 
 enum {
 	PERF_REG_X86_XMM,
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index f6e9cde11ba1..b98b47a79d02 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -92,14 +92,22 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 {
 	struct x86_perf_regs *perf_regs;
 
-	if (idx >= PERF_REG_X86_XMM0 && idx < PERF_REG_X86_XMM_MAX) {
+	if (idx > PERF_REG_X86_R15) {
 		perf_regs = container_of(regs, struct x86_perf_regs, regs);
-		/* SIMD registers are moved to dedicated sample_simd_vec_reg */
-		if (perf_regs->abi & PERF_SAMPLE_REGS_ABI_SIMD)
-			return 0;
-		if (!perf_regs->xmm_regs)
-			return 0;
-		return perf_regs->xmm_regs[idx - PERF_REG_X86_XMM0];
+
+		if (perf_regs->abi & PERF_SAMPLE_REGS_ABI_SIMD) {
+			if (idx <= PERF_REG_X86_R31) {
+				if (!perf_regs->egpr_regs)
+					return 0;
+				return perf_regs->egpr_regs[idx - PERF_REG_X86_R16];
+			}
+		} else {
+			if (idx >= PERF_REG_X86_XMM0 && idx < PERF_REG_X86_XMM_MAX) {
+				if (!perf_regs->xmm_regs)
+					return 0;
+				return perf_regs->xmm_regs[idx - PERF_REG_X86_XMM0];
+			}
+		}
 	}
 
 	if (WARN_ON_ONCE(idx >= ARRAY_SIZE(pt_regs_offset)))
@@ -183,14 +191,7 @@ int perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
 				 ~((1ULL << PERF_REG_X86_MAX) - 1))
 
 #ifdef CONFIG_X86_32
-#define REG_NOSUPPORT ((1ULL << PERF_REG_X86_R8) | \
-		       (1ULL << PERF_REG_X86_R9) | \
-		       (1ULL << PERF_REG_X86_R10) | \
-		       (1ULL << PERF_REG_X86_R11) | \
-		       (1ULL << PERF_REG_X86_R12) | \
-		       (1ULL << PERF_REG_X86_R13) | \
-		       (1ULL << PERF_REG_X86_R14) | \
-		       (1ULL << PERF_REG_X86_R15))
+#define REG_NOSUPPORT GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R8)
 
 int perf_reg_validate(u64 mask)
 {
-- 
2.34.1



* [Patch v4 12/17] perf/x86: Add SSP into sample_regs
  2025-09-25  6:11 [Patch v4 00/17] Support vector and more extended registers in perf Dapeng Mi
                   ` (10 preceding siblings ...)
  2025-09-25  6:12 ` [Patch v4 11/17] perf/x86: Add eGPRs into sample_regs Dapeng Mi
@ 2025-09-25  6:12 ` Dapeng Mi
  2025-09-25  6:12 ` [Patch v4 13/17] perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS Dapeng Mi
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 22+ messages in thread
From: Dapeng Mi @ 2025-09-25  6:12 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Kan Liang,
	Andi Kleen, Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Dapeng Mi, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

The SSP is only supported when the new SIMD register configuration
method is used, which moves XMM to sample_simd_vec_regs. The freed
sample_regs space can then be reclaimed for the SSP.

The SSP is retrieved via XSAVE. It is only supported on X86_64.
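
Similarly, a minimal sketch for sampling the user shadow stack pointer,
given an already-initialized struct perf_event_attr attr
(PERF_REG_X86_SSP is added below; the rest of the setup is
illustrative):

	attr.sample_type             |= PERF_SAMPLE_REGS_USER;
	attr.sample_simd_regs_enabled = 1;	/* new ABI, XMM bits reclaimed */
	attr.sample_regs_user        |= 1ULL << PERF_REG_X86_SSP;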

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/core.c                | 11 ++++++++++-
 arch/x86/events/perf_event.h          | 10 ++++++++++
 arch/x86/include/asm/perf_event.h     |  4 ++++
 arch/x86/include/uapi/asm/perf_regs.h |  7 +++++--
 arch/x86/kernel/perf_regs.c           |  8 +++++++-
 5 files changed, 36 insertions(+), 4 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index a435610f4d4a..7c29c9029379 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -434,6 +434,8 @@ static void x86_pmu_get_ext_regs(struct x86_perf_regs *perf_regs, u64 mask)
 		perf_regs->opmask = get_xsave_addr(xsave, XFEATURE_OPMASK);
 	if (valid_mask & XFEATURE_MASK_APX)
 		perf_regs->egpr = get_xsave_addr(xsave, XFEATURE_APX);
+	if (valid_mask & XFEATURE_MASK_CET_USER)
+		perf_regs->cet = get_xsave_addr(xsave, XFEATURE_CET_USER);
 }
 
 static void release_ext_regs_buffers(void)
@@ -712,7 +714,7 @@ int x86_pmu_hw_config(struct perf_event *event)
 
 	if (event->attr.sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)) {
 		if (event_has_simd_regs(event)) {
-			u64 reserved = ~GENMASK_ULL(PERF_REG_X86_64_MAX - 1, 0);
+			u64 reserved = ~GENMASK_ULL(PERF_REG_MISC_MAX - 1, 0);
 
 			if (!(event->pmu->capabilities & PERF_PMU_CAP_SIMD_REGS))
 				return -EINVAL;
@@ -726,6 +728,10 @@ int x86_pmu_hw_config(struct perf_event *event)
 			if (event_needs_egprs(event) &&
 			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_APX))
 				return -EINVAL;
+			if (event_needs_ssp(event) &&
+			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_CET_USER))
+				return -EINVAL;
+
 			/* Not require any vector registers but set width */
 			if (event->attr.sample_simd_vec_reg_qwords &&
 			    !event->attr.sample_simd_vec_reg_intr &&
@@ -1923,6 +1929,9 @@ static void x86_pmu_setup_extended_regs_data(struct perf_event *event,
 	perf_regs->egpr_regs = NULL;
 	if (event_needs_egprs(event))
 		mask |= XFEATURE_MASK_APX;
+	perf_regs->cet_regs = NULL;
+	if (event_needs_ssp(event))
+		mask |= XFEATURE_MASK_CET_USER;
 
 	mask &= ~ignore_mask;
 	if (mask)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 3dd0e669ddd4..6ff4aa23833f 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -192,6 +192,16 @@ static inline bool event_needs_egprs(struct perf_event *event)
 	return false;
 }
 
+static inline bool event_needs_ssp(struct perf_event *event)
+{
+	if (event->attr.sample_simd_regs_enabled &&
+	    (event->attr.sample_regs_user & BIT_ULL(PERF_REG_X86_SSP) ||
+	     event->attr.sample_regs_intr & BIT_ULL(PERF_REG_X86_SSP)))
+		return true;
+
+	return false;
+}
+
 struct amd_nb {
 	int nb_id;  /* NorthBridge id */
 	int refcnt; /* reference count */
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 73c2064c13f9..9d10299355c5 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -621,6 +621,10 @@ struct x86_perf_regs {
 		u64	*egpr_regs;
 		struct apx_state *egpr;
 	};
+	union {
+		u64	*cet_regs;
+		struct cet_user_state *cet;
+	};
 };
 
 extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index 38644de89815..0cf0490c47b2 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -28,9 +28,9 @@ enum perf_event_x86_regs {
 	PERF_REG_X86_R14,
 	PERF_REG_X86_R15,
 	/*
-	 * The EGPRs and XMM have overlaps. Only one can be used
+	 * The EGPRs/SSP and XMM have overlaps. Only one can be used
 	 * at a time. For the ABI type PERF_SAMPLE_REGS_ABI_SIMD,
-	 * utilize EGPRs. For the other ABI type, XMM is used.
+	 * utilize EGPRs/SSP. For the other ABI type, XMM is used.
 	 *
 	 * Extended GPRs (EGPRs)
 	 */
@@ -54,6 +54,9 @@ enum perf_event_x86_regs {
 	PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
 	PERF_REG_X86_64_MAX = PERF_REG_X86_R31 + 1,
 
+	PERF_REG_X86_SSP,
+	PERF_REG_MISC_MAX = PERF_REG_X86_SSP + 1,
+
 	/* These all need two bits set because they are 128bit */
 	PERF_REG_X86_XMM0  = 32,
 	PERF_REG_X86_XMM1  = 34,
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index b98b47a79d02..4d519867a3ef 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -101,6 +101,11 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 					return 0;
 				return perf_regs->egpr_regs[idx - PERF_REG_X86_R16];
 			}
+			if (idx == PERF_REG_X86_SSP) {
+				if (!perf_regs->cet_regs)
+					return 0;
+				return perf_regs->cet_regs[1];
+			}
 		} else {
 			if (idx >= PERF_REG_X86_XMM0 && idx < PERF_REG_X86_XMM_MAX) {
 				if (!perf_regs->xmm_regs)
@@ -191,7 +196,8 @@ int perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
 				 ~((1ULL << PERF_REG_X86_MAX) - 1))
 
 #ifdef CONFIG_X86_32
-#define REG_NOSUPPORT GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R8)
+#define REG_NOSUPPORT (GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R8) | \
+		       BIT_ULL(PERF_REG_X86_SSP))
 
 int perf_reg_validate(u64 mask)
 {
-- 
2.34.1



* [Patch v4 13/17] perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS
  2025-09-25  6:11 [Patch v4 00/17] Support vector and more extended registers in perf Dapeng Mi
                   ` (11 preceding siblings ...)
  2025-09-25  6:12 ` [Patch v4 12/17] perf/x86: Add SSP " Dapeng Mi
@ 2025-09-25  6:12 ` Dapeng Mi
  2025-09-25  6:12 ` [Patch v4 14/17] perf tools: Only support legacy regs for the PT and PERF_REGS_MASK Dapeng Mi
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 22+ messages in thread
From: Dapeng Mi @ 2025-09-25  6:12 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Kan Liang,
	Andi Kleen, Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Dapeng Mi, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

Enable PERF_PMU_CAP_SIMD_REGS if there is XSAVES support for YMM, ZMM,
OPMASK, eGPRs, or SSP.

Disable large PEBS for these registers since PEBS HW doesn't support
them yet.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/intel/core.c | 48 ++++++++++++++++++++++++++++++++++--
 1 file changed, 46 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 2575ec0d2b77..dd46629c1ce6 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4030,8 +4030,32 @@ static unsigned long intel_pmu_large_pebs_flags(struct perf_event *event)
 		flags &= ~PERF_SAMPLE_TIME;
 	if (!event->attr.exclude_kernel)
 		flags &= ~PERF_SAMPLE_REGS_USER;
-	if (event->attr.sample_regs_user & ~PEBS_GP_REGS)
-		flags &= ~(PERF_SAMPLE_REGS_USER | PERF_SAMPLE_REGS_INTR);
+	if (event->attr.sample_simd_regs_enabled) {
+		u64 nolarge = PERF_X86_EGPRS_MASK | BIT_ULL(PERF_REG_X86_SSP);
+
+		/*
+		 * PEBS HW can only collect the XMM0-XMM15 for now.
+		 * Disable large PEBS for other vector registers, predicate
+		 * registers, eGPRs, and SSP.
+		 */
+		if (event->attr.sample_regs_user & nolarge ||
+		    fls64(event->attr.sample_simd_vec_reg_user) > PERF_X86_H16ZMM_BASE ||
+		    event->attr.sample_simd_pred_reg_user)
+			flags &= ~PERF_SAMPLE_REGS_USER;
+
+		if (event->attr.sample_regs_intr & nolarge ||
+		    fls64(event->attr.sample_simd_vec_reg_intr) > PERF_X86_H16ZMM_BASE ||
+		    event->attr.sample_simd_pred_reg_intr)
+			flags &= ~PERF_SAMPLE_REGS_INTR;
+
+		if (event->attr.sample_simd_vec_reg_qwords > PERF_X86_XMM_QWORDS)
+			flags &= ~(PERF_SAMPLE_REGS_USER | PERF_SAMPLE_REGS_INTR);
+	} else {
+		if (event->attr.sample_regs_user & ~PEBS_GP_REGS)
+			flags &= ~PERF_SAMPLE_REGS_USER;
+		if (event->attr.sample_regs_intr & ~PEBS_GP_REGS)
+			flags &= ~PERF_SAMPLE_REGS_INTR;
+	}
 	return flags;
 }
 
@@ -5292,6 +5316,26 @@ static void intel_extended_regs_init(struct pmu *pmu)
 
 	x86_pmu.ext_regs_mask |= XFEATURE_MASK_SSE;
 	x86_get_pmu(smp_processor_id())->capabilities |= PERF_PMU_CAP_EXTENDED_REGS;
+
+	if (boot_cpu_has(X86_FEATURE_AVX) &&
+	    cpu_has_xfeatures(XFEATURE_MASK_YMM, NULL))
+		x86_pmu.ext_regs_mask |= XFEATURE_MASK_YMM;
+	if (boot_cpu_has(X86_FEATURE_APX) &&
+	    cpu_has_xfeatures(XFEATURE_MASK_APX, NULL))
+		x86_pmu.ext_regs_mask |= XFEATURE_MASK_APX;
+	if (boot_cpu_has(X86_FEATURE_AVX512F)) {
+		if (cpu_has_xfeatures(XFEATURE_MASK_OPMASK, NULL))
+			x86_pmu.ext_regs_mask |= XFEATURE_MASK_OPMASK;
+		if (cpu_has_xfeatures(XFEATURE_MASK_ZMM_Hi256, NULL))
+			x86_pmu.ext_regs_mask |= XFEATURE_MASK_ZMM_Hi256;
+		if (cpu_has_xfeatures(XFEATURE_MASK_Hi16_ZMM, NULL))
+			x86_pmu.ext_regs_mask |= XFEATURE_MASK_Hi16_ZMM;
+	}
+	if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		x86_pmu.ext_regs_mask |= XFEATURE_MASK_CET_USER;
+
+	if (x86_pmu.ext_regs_mask != XFEATURE_MASK_SSE)
+		x86_get_pmu(smp_processor_id())->capabilities |= PERF_PMU_CAP_SIMD_REGS;
 }
 
 static void update_pmu_cap(struct pmu *pmu)
-- 
2.34.1



* [Patch v4 14/17] perf tools: Only support legacy regs for the PT and PERF_REGS_MASK
  2025-09-25  6:11 [Patch v4 00/17] Support vector and more extended registers in perf Dapeng Mi
                   ` (12 preceding siblings ...)
  2025-09-25  6:12 ` [Patch v4 13/17] perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS Dapeng Mi
@ 2025-09-25  6:12 ` Dapeng Mi
  2025-09-25  6:12 ` [Patch v4 15/17] perf tools: headers: Sync with the kernel sources Dapeng Mi
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 22+ messages in thread
From: Dapeng Mi @ 2025-09-25  6:12 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Kan Liang,
	Andi Kleen, Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Dapeng Mi, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

PERF_REG_X86_64_MAX is going to be extended to cover more registers,
e.g., the eGPRs.
However, Intel PT and PERF_REGS_MASK are not touched in this POC, so
use PERF_REG_X86_R15 + 1 in place of PERF_REG_X86_64_MAX there.

Acked-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 tools/perf/arch/x86/include/perf_regs.h | 2 +-
 tools/perf/util/intel-pt.c              | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/perf/arch/x86/include/perf_regs.h b/tools/perf/arch/x86/include/perf_regs.h
index f209ce2c1dd9..793fb597b03f 100644
--- a/tools/perf/arch/x86/include/perf_regs.h
+++ b/tools/perf/arch/x86/include/perf_regs.h
@@ -17,7 +17,7 @@ void perf_regs_load(u64 *regs);
 		       (1ULL << PERF_REG_X86_ES) | \
 		       (1ULL << PERF_REG_X86_FS) | \
 		       (1ULL << PERF_REG_X86_GS))
-#define PERF_REGS_MASK (((1ULL << PERF_REG_X86_64_MAX) - 1) & ~REG_NOSUPPORT)
+#define PERF_REGS_MASK (((1ULL << (PERF_REG_X86_R15 + 1)) - 1) & ~REG_NOSUPPORT)
 #define PERF_SAMPLE_REGS_ABI PERF_SAMPLE_REGS_ABI_64
 #endif
 
diff --git a/tools/perf/util/intel-pt.c b/tools/perf/util/intel-pt.c
index 9b1011fe4826..a9585524f2e1 100644
--- a/tools/perf/util/intel-pt.c
+++ b/tools/perf/util/intel-pt.c
@@ -2181,7 +2181,7 @@ static u64 *intel_pt_add_gp_regs(struct regs_dump *intr_regs, u64 *pos,
 	u32 bit;
 	int i;
 
-	for (i = 0, bit = 1; i < PERF_REG_X86_64_MAX; i++, bit <<= 1) {
+	for (i = 0, bit = 1; i < PERF_REG_X86_R15 + 1; i++, bit <<= 1) {
 		/* Get the PEBS gp_regs array index */
 		int n = pebs_gp_regs[i] - 1;
 
-- 
2.34.1



* [Patch v4 15/17] perf tools: headers: Sync with the kernel sources
  2025-09-25  6:11 [Patch v4 00/17] Support vector and more extended registers in perf Dapeng Mi
                   ` (13 preceding siblings ...)
  2025-09-25  6:12 ` [Patch v4 14/17] perf tools: Only support legacy regs for the PT and PERF_REGS_MASK Dapeng Mi
@ 2025-09-25  6:12 ` Dapeng Mi
  2025-09-25  6:12 ` [Patch v4 16/17] perf tools: parse-regs: Support the new SIMD format Dapeng Mi
  2025-09-25  6:12 ` [Patch v4 17/17] perf tools: regs: Support to dump regs for PERF_SAMPLE_REGS_ABI_SIMD Dapeng Mi
  16 siblings, 0 replies; 22+ messages in thread
From: Dapeng Mi @ 2025-09-25  6:12 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Kan Liang,
	Andi Kleen, Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Dapeng Mi, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

Update include/uapi/linux/perf_event.h and
arch/x86/include/uapi/asm/perf_regs.h to support extended regs.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 tools/arch/x86/include/uapi/asm/perf_regs.h | 65 ++++++++++++++++++++-
 tools/include/uapi/linux/perf_event.h       | 45 ++++++++++++--
 2 files changed, 105 insertions(+), 5 deletions(-)

diff --git a/tools/arch/x86/include/uapi/asm/perf_regs.h b/tools/arch/x86/include/uapi/asm/perf_regs.h
index 7c9d2bb3833b..0cf0490c47b2 100644
--- a/tools/arch/x86/include/uapi/asm/perf_regs.h
+++ b/tools/arch/x86/include/uapi/asm/perf_regs.h
@@ -27,9 +27,35 @@ enum perf_event_x86_regs {
 	PERF_REG_X86_R13,
 	PERF_REG_X86_R14,
 	PERF_REG_X86_R15,
+	/*
+	 * The EGPRs/SSP and XMM have overlaps. Only one can be used
+	 * at a time. For the ABI type PERF_SAMPLE_REGS_ABI_SIMD,
+	 * utilize EGPRs/SSP. For the other ABI type, XMM is used.
+	 *
+	 * Extended GPRs (EGPRs)
+	 */
+	PERF_REG_X86_R16,
+	PERF_REG_X86_R17,
+	PERF_REG_X86_R18,
+	PERF_REG_X86_R19,
+	PERF_REG_X86_R20,
+	PERF_REG_X86_R21,
+	PERF_REG_X86_R22,
+	PERF_REG_X86_R23,
+	PERF_REG_X86_R24,
+	PERF_REG_X86_R25,
+	PERF_REG_X86_R26,
+	PERF_REG_X86_R27,
+	PERF_REG_X86_R28,
+	PERF_REG_X86_R29,
+	PERF_REG_X86_R30,
+	PERF_REG_X86_R31,
 	/* These are the limits for the GPRs. */
 	PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
-	PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1,
+	PERF_REG_X86_64_MAX = PERF_REG_X86_R31 + 1,
+
+	PERF_REG_X86_SSP,
+	PERF_REG_MISC_MAX = PERF_REG_X86_SSP + 1,
 
 	/* These all need two bits set because they are 128bit */
 	PERF_REG_X86_XMM0  = 32,
@@ -54,5 +80,42 @@ enum perf_event_x86_regs {
 };
 
 #define PERF_REG_EXTENDED_MASK	(~((1ULL << PERF_REG_X86_XMM0) - 1))
+#define PERF_X86_EGPRS_MASK	GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16)
+
+enum {
+	PERF_REG_X86_XMM,
+	PERF_REG_X86_YMM,
+	PERF_REG_X86_ZMM,
+	PERF_REG_X86_MAX_SIMD_REGS,
+
+	PERF_REG_X86_OPMASK = 0,
+	PERF_REG_X86_MAX_PRED_REGS = 1,
+};
+
+enum {
+	PERF_X86_SIMD_XMM_REGS      = 16,
+	PERF_X86_SIMD_YMM_REGS      = 16,
+	PERF_X86_SIMD_ZMMH_REGS     = 16,
+	PERF_X86_SIMD_ZMM_REGS      = 32,
+	PERF_X86_SIMD_VEC_REGS_MAX  = PERF_X86_SIMD_ZMM_REGS,
+
+	PERF_X86_SIMD_OPMASK_REGS   = 8,
+	PERF_X86_SIMD_PRED_REGS_MAX = PERF_X86_SIMD_OPMASK_REGS,
+};
+
+#define PERF_X86_SIMD_PRED_MASK		GENMASK(PERF_X86_SIMD_PRED_REGS_MAX - 1, 0)
+#define PERF_X86_SIMD_VEC_MASK		GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
+
+#define PERF_X86_H16ZMM_BASE		PERF_X86_SIMD_ZMMH_REGS
+
+enum {
+	PERF_X86_OPMASK_QWORDS   = 1,
+	PERF_X86_XMM_QWORDS      = 2,
+	PERF_X86_YMMH_QWORDS     = 2,
+	PERF_X86_YMM_QWORDS      = 4,
+	PERF_X86_ZMMH_QWORDS     = 4,
+	PERF_X86_ZMM_QWORDS      = 8,
+	PERF_X86_SIMD_QWORDS_MAX = PERF_X86_ZMM_QWORDS,
+};
 
 #endif /* _ASM_X86_PERF_REGS_H */
diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index 78a362b80027..e69bc3b7a815 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -314,8 +314,9 @@ enum {
  */
 enum perf_sample_regs_abi {
 	PERF_SAMPLE_REGS_ABI_NONE		= 0,
-	PERF_SAMPLE_REGS_ABI_32			= 1,
-	PERF_SAMPLE_REGS_ABI_64			= 2,
+	PERF_SAMPLE_REGS_ABI_32			= (1 << 0),
+	PERF_SAMPLE_REGS_ABI_64			= (1 << 1),
+	PERF_SAMPLE_REGS_ABI_SIMD		= (1 << 2),
 };
 
 /*
@@ -382,6 +383,7 @@ enum perf_event_read_format {
 #define PERF_ATTR_SIZE_VER6			120	/* Add: aux_sample_size */
 #define PERF_ATTR_SIZE_VER7			128	/* Add: sig_data */
 #define PERF_ATTR_SIZE_VER8			136	/* Add: config3 */
+#define PERF_ATTR_SIZE_VER9			168	/* Add: sample_simd_{pred,vec}_reg_* */
 
 /*
  * 'struct perf_event_attr' contains various attributes that define
@@ -543,6 +545,25 @@ struct perf_event_attr {
 	__u64	sig_data;
 
 	__u64	config3; /* extension of config2 */
+
+
+	/*
+	 * Defines set of SIMD registers to dump on samples.
+	 * The sample_simd_regs_enabled !=0 implies the
+	 * set of SIMD registers is used to config all SIMD registers.
+	 * If !sample_simd_regs_enabled, sample_regs_XXX may be used to
+	 * config some SIMD registers on X86.
+	 */
+	union {
+		__u16 sample_simd_regs_enabled;
+		__u16 sample_simd_pred_reg_qwords;
+	};
+	__u32 sample_simd_pred_reg_intr;
+	__u32 sample_simd_pred_reg_user;
+	__u16 sample_simd_vec_reg_qwords;
+	__u64 sample_simd_vec_reg_intr;
+	__u64 sample_simd_vec_reg_user;
+	__u32 __reserved_4;
 };
 
 /*
@@ -1016,7 +1037,15 @@ enum perf_event_type {
 	 *      } && PERF_SAMPLE_BRANCH_STACK
 	 *
 	 *	{ u64			abi; # enum perf_sample_regs_abi
-	 *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
+	 *	  u64			regs[weight(mask)];
+	 *	  struct {
+	 *		u16 nr_vectors;
+	 *		u16 vector_qwords;
+	 *		u16 nr_pred;
+	 *		u16 pred_qwords;
+	 *		u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
+	 *	  } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
+	 *	} && PERF_SAMPLE_REGS_USER
 	 *
 	 *	{ u64			size;
 	 *	  char			data[size];
@@ -1043,7 +1072,15 @@ enum perf_event_type {
 	 *	{ u64			data_src; } && PERF_SAMPLE_DATA_SRC
 	 *	{ u64			transaction; } && PERF_SAMPLE_TRANSACTION
 	 *	{ u64			abi; # enum perf_sample_regs_abi
-	 *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
+	 *	  u64			regs[weight(mask)];
+	 *	  struct {
+	 *		u16 nr_vectors;
+	 *		u16 vector_qwords;
+	 *		u16 nr_pred;
+	 *		u16 pred_qwords;
+	 *		u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
+	 *	  } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
+	 *	} && PERF_SAMPLE_REGS_INTR
 	 *	{ u64			phys_addr;} && PERF_SAMPLE_PHYS_ADDR
 	 *	{ u64			cgroup;} && PERF_SAMPLE_CGROUP
 	 *	{ u64			data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
-- 
2.34.1



* [Patch v4 16/17] perf tools: parse-regs: Support the new SIMD format
  2025-09-25  6:11 [Patch v4 00/17] Support vector and more extended registers in perf Dapeng Mi
                   ` (14 preceding siblings ...)
  2025-09-25  6:12 ` [Patch v4 15/17] perf tools: headers: Sync with the kernel sources Dapeng Mi
@ 2025-09-25  6:12 ` Dapeng Mi
  2025-09-25  6:12 ` [Patch v4 17/17] perf tools: regs: Support to dump regs for PERF_SAMPLE_REGS_ABI_SIMD Dapeng Mi
  16 siblings, 0 replies; 22+ messages in thread
From: Dapeng Mi @ 2025-09-25  6:12 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Kan Liang,
	Andi Kleen, Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Dapeng Mi, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

Add has_cap_simd_regs() to check whether the new SIMD format is
available and, if so, retrieve the supported mask and qwords.

Add several __weak functions to return qwords and mask for vector and
pred registers.

Only support collecting the vector and predicate registers as a whole, and
only the superset. For example, with -I XMM,YMM, only all 16 YMM registers
are collected.

Examples:
 $perf record -I?
 available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
 R11 R12 R13 R14 R15 SSP XMM0-31 YMM0-31 ZMM0-31 OPMASK0-7

 $perf record --user-regs=?
 available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
 R11 R12 R13 R14 R15 SSP XMM0-31 YMM0-31 ZMM0-31 OPMASK0-7
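
For reference, a minimal sketch of the perf_event_attr setup the tool ends
up doing when the user asks for the YMM superset of the user registers
(field and macro names as introduced elsewhere in this series; the event
itself is just a placeholder):

	struct perf_event_attr attr = {
		.type				= PERF_TYPE_HARDWARE,
		.config				= PERF_COUNT_HW_CPU_CYCLES,
		.sample_type			= PERF_SAMPLE_REGS_USER,
		.exclude_kernel			= 1,
		/* non-zero: the new SIMD register scheme is in use */
		.sample_simd_regs_enabled	= 1,
		/* 4 qwords per vector register selects YMM */
		.sample_simd_vec_reg_qwords	= PERF_X86_YMM_QWORDS,
		/* request all YMM registers */
		.sample_simd_vec_reg_user	= BIT_ULL(PERF_X86_SIMD_YMM_REGS) - 1,
	};

	/* sample_period sits in an unnamed union, init it separately */
	attr.sample_period = 1;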

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 tools/perf/arch/x86/util/perf_regs.c      | 443 +++++++++++++++++++++-
 tools/perf/util/evsel.c                   |  25 ++
 tools/perf/util/parse-regs-options.c      | 133 ++++++-
 tools/perf/util/perf_event_attr_fprintf.c |   6 +
 tools/perf/util/perf_regs.c               |  54 +++
 tools/perf/util/perf_regs.h               |  10 +
 tools/perf/util/record.h                  |   6 +
 7 files changed, 663 insertions(+), 14 deletions(-)

diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
index 12fd93f04802..2e7a93d34cd1 100644
--- a/tools/perf/arch/x86/util/perf_regs.c
+++ b/tools/perf/arch/x86/util/perf_regs.c
@@ -13,6 +13,49 @@
 #include "../../../util/pmu.h"
 #include "../../../util/pmus.h"
 
+static const struct sample_reg sample_reg_masks_ext[] = {
+	SMPL_REG(AX, PERF_REG_X86_AX),
+	SMPL_REG(BX, PERF_REG_X86_BX),
+	SMPL_REG(CX, PERF_REG_X86_CX),
+	SMPL_REG(DX, PERF_REG_X86_DX),
+	SMPL_REG(SI, PERF_REG_X86_SI),
+	SMPL_REG(DI, PERF_REG_X86_DI),
+	SMPL_REG(BP, PERF_REG_X86_BP),
+	SMPL_REG(SP, PERF_REG_X86_SP),
+	SMPL_REG(IP, PERF_REG_X86_IP),
+	SMPL_REG(FLAGS, PERF_REG_X86_FLAGS),
+	SMPL_REG(CS, PERF_REG_X86_CS),
+	SMPL_REG(SS, PERF_REG_X86_SS),
+#ifdef HAVE_ARCH_X86_64_SUPPORT
+	SMPL_REG(R8, PERF_REG_X86_R8),
+	SMPL_REG(R9, PERF_REG_X86_R9),
+	SMPL_REG(R10, PERF_REG_X86_R10),
+	SMPL_REG(R11, PERF_REG_X86_R11),
+	SMPL_REG(R12, PERF_REG_X86_R12),
+	SMPL_REG(R13, PERF_REG_X86_R13),
+	SMPL_REG(R14, PERF_REG_X86_R14),
+	SMPL_REG(R15, PERF_REG_X86_R15),
+	SMPL_REG(R16, PERF_REG_X86_R16),
+	SMPL_REG(R17, PERF_REG_X86_R17),
+	SMPL_REG(R18, PERF_REG_X86_R18),
+	SMPL_REG(R19, PERF_REG_X86_R19),
+	SMPL_REG(R20, PERF_REG_X86_R20),
+	SMPL_REG(R21, PERF_REG_X86_R21),
+	SMPL_REG(R22, PERF_REG_X86_R22),
+	SMPL_REG(R23, PERF_REG_X86_R23),
+	SMPL_REG(R24, PERF_REG_X86_R24),
+	SMPL_REG(R25, PERF_REG_X86_R25),
+	SMPL_REG(R26, PERF_REG_X86_R26),
+	SMPL_REG(R27, PERF_REG_X86_R27),
+	SMPL_REG(R28, PERF_REG_X86_R28),
+	SMPL_REG(R29, PERF_REG_X86_R29),
+	SMPL_REG(R30, PERF_REG_X86_R30),
+	SMPL_REG(R31, PERF_REG_X86_R31),
+	SMPL_REG(SSP, PERF_REG_X86_SSP),
+#endif
+	SMPL_REG_END
+};
+
 static const struct sample_reg sample_reg_masks[] = {
 	SMPL_REG(AX, PERF_REG_X86_AX),
 	SMPL_REG(BX, PERF_REG_X86_BX),
@@ -276,27 +319,377 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
 	return SDT_ARG_VALID;
 }
 
+static bool support_simd_reg(u64 sample_type, u16 qwords, u64 mask, bool pred)
+{
+	struct perf_event_attr attr = {
+		.type				= PERF_TYPE_HARDWARE,
+		.config				= PERF_COUNT_HW_CPU_CYCLES,
+		.sample_type			= sample_type,
+		.disabled			= 1,
+		.exclude_kernel			= 1,
+		.sample_simd_regs_enabled	= 1,
+	};
+	int fd;
+
+	attr.sample_period = 1;
+
+	if (!pred) {
+		attr.sample_simd_vec_reg_qwords = qwords;
+		if (sample_type == PERF_SAMPLE_REGS_INTR)
+			attr.sample_simd_vec_reg_intr = mask;
+		else
+			attr.sample_simd_vec_reg_user = mask;
+	} else {
+		attr.sample_simd_pred_reg_qwords = PERF_X86_OPMASK_QWORDS;
+		if (sample_type == PERF_SAMPLE_REGS_INTR)
+			attr.sample_simd_pred_reg_intr = PERF_X86_SIMD_PRED_MASK;
+		else
+			attr.sample_simd_pred_reg_user = PERF_X86_SIMD_PRED_MASK;
+	}
+
+	if (perf_pmus__num_core_pmus() > 1) {
+		struct perf_pmu *pmu = NULL;
+		__u64 type = PERF_TYPE_RAW;
+
+		/*
+		 * The same register set is supported among different hybrid PMUs.
+		 * Only check the first available one.
+		 */
+		while ((pmu = perf_pmus__scan_core(pmu)) != NULL) {
+			type = pmu->type;
+			break;
+		}
+		attr.config |= type << PERF_PMU_TYPE_SHIFT;
+	}
+
+	event_attr_init(&attr);
+
+	fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
+	if (fd != -1) {
+		close(fd);
+		return true;
+	}
+
+	return false;
+}
+
+static bool __arch_simd_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
+{
+	bool supported = false;
+	u64 bits;
+
+	*mask = 0;
+	*qwords = 0;
+
+	switch (reg) {
+	case PERF_REG_X86_XMM:
+		bits = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
+		supported = support_simd_reg(sample_type, PERF_X86_XMM_QWORDS, bits, false);
+		if (supported) {
+			*mask = bits;
+			*qwords = PERF_X86_XMM_QWORDS;
+		}
+		break;
+	case PERF_REG_X86_YMM:
+		bits = BIT_ULL(PERF_X86_SIMD_YMM_REGS) - 1;
+		supported = support_simd_reg(sample_type, PERF_X86_YMM_QWORDS, bits, false);
+		if (supported) {
+			*mask = bits;
+			*qwords = PERF_X86_YMM_QWORDS;
+		}
+		break;
+	case PERF_REG_X86_ZMM:
+		bits = BIT_ULL(PERF_X86_SIMD_ZMM_REGS) - 1;
+		supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
+		if (supported) {
+			*mask = bits;
+			*qwords = PERF_X86_ZMM_QWORDS;
+			break;
+		}
+
+		bits = BIT_ULL(PERF_X86_SIMD_ZMMH_REGS) - 1;
+		supported = support_simd_reg(sample_type, PERF_X86_ZMM_QWORDS, bits, false);
+		if (supported) {
+			*mask = bits;
+			*qwords = PERF_X86_ZMMH_QWORDS;
+		}
+		break;
+	default:
+		break;
+	}
+
+	return supported;
+}
+
+static bool __arch_pred_reg_mask(u64 sample_type, int reg, uint64_t *mask, u16 *qwords)
+{
+	bool supported = false;
+	u64 bits;
+
+	*mask = 0;
+	*qwords = 0;
+
+	switch (reg) {
+	case PERF_REG_X86_OPMASK:
+		bits = BIT_ULL(PERF_X86_SIMD_OPMASK_REGS) - 1;
+		supported = support_simd_reg(sample_type, PERF_X86_OPMASK_QWORDS, bits, true);
+		if (supported) {
+			*mask = bits;
+			*qwords = PERF_X86_OPMASK_QWORDS;
+		}
+		break;
+	default:
+		break;
+	}
+
+	return supported;
+}
+
+static bool has_cap_simd_regs(void)
+{
+	uint64_t mask = BIT_ULL(PERF_X86_SIMD_XMM_REGS) - 1;
+	u16 qwords = PERF_X86_XMM_QWORDS;
+	static bool has_cap_simd_regs;
+	static bool cached;
+
+	if (cached)
+		return has_cap_simd_regs;
+
+	has_cap_simd_regs = __arch_simd_reg_mask(PERF_SAMPLE_REGS_INTR,
+						 PERF_REG_X86_XMM, &mask, &qwords);
+	has_cap_simd_regs |= __arch_simd_reg_mask(PERF_SAMPLE_REGS_USER,
+						 PERF_REG_X86_XMM, &mask, &qwords);
+	cached = true;
+
+	return has_cap_simd_regs;
+}
+
+
+static const struct sample_reg sample_simd_reg_masks[] = {
+	SMPL_REG(XMM, PERF_REG_X86_XMM),
+	SMPL_REG(YMM, PERF_REG_X86_YMM),
+	SMPL_REG(ZMM, PERF_REG_X86_ZMM),
+	SMPL_REG_END
+};
+
+static const struct sample_reg sample_pred_reg_masks[] = {
+	SMPL_REG(OPMASK, PERF_REG_X86_OPMASK),
+	SMPL_REG_END
+};
+
+const struct sample_reg *arch__sample_simd_reg_masks(void)
+{
+	return sample_simd_reg_masks;
+}
+
+const struct sample_reg *arch__sample_pred_reg_masks(void)
+{
+	return sample_pred_reg_masks;
+}
+
+static bool x86_intr_simd_updated;
+static u64 x86_intr_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
+static u16 x86_intr_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
+static bool x86_user_simd_updated;
+static u64 x86_user_simd_mask[PERF_REG_X86_MAX_SIMD_REGS];
+static u16 x86_user_simd_qwords[PERF_REG_X86_MAX_SIMD_REGS];
+
+static bool x86_intr_pred_updated;
+static u64 x86_intr_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
+static u16 x86_intr_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
+static bool x86_user_pred_updated;
+static u64 x86_user_pred_mask[PERF_REG_X86_MAX_PRED_REGS];
+static u16 x86_user_pred_qwords[PERF_REG_X86_MAX_PRED_REGS];
+
+static uint64_t __arch__simd_reg_mask(u64 sample_type)
+{
+	const struct sample_reg *r = NULL;
+	bool supported;
+	u64 mask = 0;
+	int reg;
+
+	if (!has_cap_simd_regs())
+		return 0;
+
+	for (r = arch__sample_simd_reg_masks(); r->name; r++) {
+		supported = false;
+
+		if (!r->mask)
+			continue;
+		reg = fls64(r->mask) - 1;
+
+		if (reg >= PERF_REG_X86_MAX_SIMD_REGS)
+			break;
+		if (sample_type == PERF_SAMPLE_REGS_INTR)
+			supported = __arch_simd_reg_mask(sample_type, reg,
+							 &x86_intr_simd_mask[reg],
+							 &x86_intr_simd_qwords[reg]);
+		else if (sample_type == PERF_SAMPLE_REGS_USER)
+			supported = __arch_simd_reg_mask(sample_type, reg,
+							 &x86_user_simd_mask[reg],
+							 &x86_user_simd_qwords[reg]);
+		if (supported)
+			mask |= BIT_ULL(reg);
+	}
+
+	if (sample_type == PERF_SAMPLE_REGS_INTR)
+		x86_intr_simd_updated = true;
+	else
+		x86_user_simd_updated = true;
+
+	return mask;
+}
+
+static uint64_t __arch__pred_reg_mask(u64 sample_type)
+{
+	const struct sample_reg *r = NULL;
+	bool supported;
+	u64 mask = 0;
+	int reg;
+
+	if (!has_cap_simd_regs())
+		return 0;
+
+	for (r = arch__sample_pred_reg_masks(); r->name; r++) {
+		supported = false;
+
+		if (!r->mask)
+			continue;
+		reg = fls64(r->mask) - 1;
+
+		if (reg >= PERF_REG_X86_MAX_PRED_REGS)
+			break;
+		if (sample_type == PERF_SAMPLE_REGS_INTR)
+			supported = __arch_pred_reg_mask(sample_type, reg,
+							 &x86_intr_pred_mask[reg],
+							 &x86_intr_pred_qwords[reg]);
+		else if (sample_type == PERF_SAMPLE_REGS_USER)
+			supported = __arch_pred_reg_mask(sample_type, reg,
+							 &x86_user_pred_mask[reg],
+							 &x86_user_pred_qwords[reg]);
+		if (supported)
+			mask |= BIT_ULL(reg);
+	}
+
+	if (sample_type == PERF_SAMPLE_REGS_INTR)
+		x86_intr_pred_updated = true;
+	else
+		x86_user_pred_updated = true;
+
+	return mask;
+}
+
+uint64_t arch__intr_simd_reg_mask(void)
+{
+	return __arch__simd_reg_mask(PERF_SAMPLE_REGS_INTR);
+}
+
+uint64_t arch__user_simd_reg_mask(void)
+{
+	return __arch__simd_reg_mask(PERF_SAMPLE_REGS_USER);
+}
+
+uint64_t arch__intr_pred_reg_mask(void)
+{
+	return __arch__pred_reg_mask(PERF_SAMPLE_REGS_INTR);
+}
+
+uint64_t arch__user_pred_reg_mask(void)
+{
+	return __arch__pred_reg_mask(PERF_SAMPLE_REGS_USER);
+}
+
+static uint64_t arch__simd_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
+{
+	uint64_t mask = 0;
+
+	*qwords = 0;
+	if (reg < PERF_REG_X86_MAX_SIMD_REGS) {
+		if (intr) {
+			*qwords = x86_intr_simd_qwords[reg];
+			mask = x86_intr_simd_mask[reg];
+		} else {
+			*qwords = x86_user_simd_qwords[reg];
+			mask = x86_user_simd_mask[reg];
+		}
+	}
+
+	return mask;
+}
+
+static uint64_t arch__pred_reg_bitmap_qwords(int reg, u16 *qwords, bool intr)
+{
+	uint64_t mask = 0;
+
+	*qwords = 0;
+	if (reg < PERF_REG_X86_MAX_PRED_REGS) {
+		if (intr) {
+			*qwords = x86_intr_pred_qwords[reg];
+			mask = x86_intr_pred_mask[reg];
+		} else {
+			*qwords = x86_user_pred_qwords[reg];
+			mask = x86_user_pred_mask[reg];
+		}
+	}
+
+	return mask;
+}
+
+uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords)
+{
+	if (!x86_intr_simd_updated)
+		arch__intr_simd_reg_mask();
+	return arch__simd_reg_bitmap_qwords(reg, qwords, true);
+}
+
+uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords)
+{
+	if (!x86_user_simd_updated)
+		arch__user_simd_reg_mask();
+	return arch__simd_reg_bitmap_qwords(reg, qwords, false);
+}
+
+uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords)
+{
+	if (!x86_intr_pred_updated)
+		arch__intr_pred_reg_mask();
+	return arch__pred_reg_bitmap_qwords(reg, qwords, true);
+}
+
+uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords)
+{
+	if (!x86_user_pred_updated)
+		arch__user_pred_reg_mask();
+	return arch__pred_reg_bitmap_qwords(reg, qwords, false);
+}
+
 const struct sample_reg *arch__sample_reg_masks(void)
 {
+	if (has_cap_simd_regs())
+		return sample_reg_masks_ext;
 	return sample_reg_masks;
 }
 
-uint64_t arch__intr_reg_mask(void)
+static uint64_t __arch__reg_mask(u64 sample_type, u64 mask, bool has_simd_regs)
 {
 	struct perf_event_attr attr = {
-		.type			= PERF_TYPE_HARDWARE,
-		.config			= PERF_COUNT_HW_CPU_CYCLES,
-		.sample_type		= PERF_SAMPLE_REGS_INTR,
-		.sample_regs_intr	= PERF_REG_EXTENDED_MASK,
-		.precise_ip		= 1,
-		.disabled 		= 1,
-		.exclude_kernel		= 1,
+		.type				= PERF_TYPE_HARDWARE,
+		.config				= PERF_COUNT_HW_CPU_CYCLES,
+		.sample_type			= sample_type,
+		.disabled			= 1,
+		.precise_ip			= 1,
+		.exclude_kernel			= 1,
+		.sample_simd_regs_enabled	= has_simd_regs,
 	};
 	int fd;
 	/*
 	 * In an unnamed union, init it here to build on older gcc versions
 	 */
 	attr.sample_period = 1;
+	if (sample_type == PERF_SAMPLE_REGS_INTR)
+		attr.sample_regs_intr = mask;
+	else
+		attr.sample_regs_user = mask;
 
 	if (perf_pmus__num_core_pmus() > 1) {
 		struct perf_pmu *pmu = NULL;
@@ -318,13 +711,41 @@ uint64_t arch__intr_reg_mask(void)
 	fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
 	if (fd != -1) {
 		close(fd);
-		return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
+		return mask;
 	}
 
-	return PERF_REGS_MASK;
+	return 0;
+}
+
+uint64_t arch__intr_reg_mask(void)
+{
+	uint64_t mask = PERF_REGS_MASK;
+
+	if (has_cap_simd_regs()) {
+		mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
+					 GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
+					 true);
+		mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
+					 BIT_ULL(PERF_REG_X86_SSP),
+					 true);
+	} else
+		mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR, PERF_REG_EXTENDED_MASK, false);
+
+	return mask;
 }
 
 uint64_t arch__user_reg_mask(void)
 {
-	return PERF_REGS_MASK;
+	uint64_t mask = PERF_REGS_MASK;
+
+	if (has_cap_simd_regs()) {
+		mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
+					 GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
+					 true);
+		mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
+					 BIT_ULL(PERF_REG_X86_SSP),
+					 true);
+	}
+
+	return mask;
 }
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index d264c143b592..98996e672794 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1387,12 +1387,37 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
 		evsel__set_sample_bit(evsel, REGS_INTR);
 	}
 
+	if ((opts->sample_intr_vec_regs || opts->sample_intr_pred_regs) &&
+	    !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
+		/* A non-zero pred qwords implies the set of SIMD registers is used */
+		if (opts->sample_pred_regs_qwords)
+			attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
+		else
+			attr->sample_simd_pred_reg_qwords = 1;
+		attr->sample_simd_vec_reg_intr = opts->sample_intr_vec_regs;
+		attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
+		attr->sample_simd_pred_reg_intr = opts->sample_intr_pred_regs;
+		evsel__set_sample_bit(evsel, REGS_INTR);
+	}
+
 	if (opts->sample_user_regs && !evsel->no_aux_samples &&
 	    !evsel__is_dummy_event(evsel)) {
 		attr->sample_regs_user |= opts->sample_user_regs;
 		evsel__set_sample_bit(evsel, REGS_USER);
 	}
 
+	if ((opts->sample_user_vec_regs || opts->sample_user_pred_regs) &&
+	    !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
+		if (opts->sample_pred_regs_qwords)
+			attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
+		else
+			attr->sample_simd_pred_reg_qwords = 1;
+		attr->sample_simd_vec_reg_user = opts->sample_user_vec_regs;
+		attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
+		attr->sample_simd_pred_reg_user = opts->sample_user_pred_regs;
+		evsel__set_sample_bit(evsel, REGS_USER);
+	}
+
 	if (target__has_cpu(&opts->target) || opts->sample_cpu)
 		evsel__set_sample_bit(evsel, CPU);
 
diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
index cda1c620968e..c66d0769096b 100644
--- a/tools/perf/util/parse-regs-options.c
+++ b/tools/perf/util/parse-regs-options.c
@@ -4,19 +4,104 @@
 #include <stdint.h>
 #include <string.h>
 #include <stdio.h>
+#include <linux/bitops.h>
 #include "util/debug.h"
 #include <subcmd/parse-options.h>
 #include "util/perf_regs.h"
 #include "util/parse-regs-options.h"
+#include "record.h"
+
+static void __print_simd_regs(bool intr, uint64_t simd_mask, uint64_t pred_mask)
+{
+	const struct sample_reg *r = NULL;
+	uint64_t bitmap = 0;
+	u16 qwords = 0;
+	int idx;
+
+	for (r = arch__sample_simd_reg_masks(); r->name; r++) {
+		if (r->mask & simd_mask) {
+			idx = fls64(r->mask) - 1;
+			if (intr)
+				bitmap = arch__intr_simd_reg_bitmap_qwords(idx, &qwords);
+			else
+				bitmap = arch__user_simd_reg_bitmap_qwords(idx, &qwords);
+			if (bitmap)
+				fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
+		}
+	}
+
+	for (r = arch__sample_pred_reg_masks(); r->name; r++) {
+		if (r->mask & pred_mask) {
+			idx = fls64(r->mask) - 1;
+			if (intr)
+				bitmap = arch__intr_pred_reg_bitmap_qwords(idx, &qwords);
+			else
+				bitmap = arch__user_pred_reg_bitmap_qwords(idx, &qwords);
+			if (bitmap)
+				fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
+		}
+	}
+}
+
+static uint64_t __get_simd_reg_bitmask_qwords(bool intr, char *reg_name, u16 *qwords)
+{
+	const struct sample_reg *r = NULL;
+	uint64_t bitmap = 0;
+	int idx;
+
+	*qwords = 0;
+	for (r = arch__sample_simd_reg_masks(); r->name; r++) {
+		if (!strcasecmp(reg_name, r->name)) {
+			if (!fls64(r->mask))
+				continue;
+			idx = fls64(r->mask) - 1;
+			if (intr)
+				bitmap = arch__intr_simd_reg_bitmap_qwords(idx, qwords);
+			else
+				bitmap = arch__user_simd_reg_bitmap_qwords(idx, qwords);
+			break;
+		}
+	}
+
+	return bitmap;
+}
+
+static uint64_t __get_pred_reg_bitmask_qwords(bool intr, char *reg_name, u16 *qwords)
+{
+	const struct sample_reg *r = NULL;
+	uint64_t bitmap = 0;
+	int idx;
+
+	*qwords = 0;
+	for (r = arch__sample_pred_reg_masks(); r->name; r++) {
+		if (!strcasecmp(reg_name, r->name)) {
+			if (!fls64(r->mask))
+				continue;
+			idx = fls64(r->mask) - 1;
+			if (intr)
+				bitmap = arch__intr_pred_reg_bitmap_qwords(idx, qwords);
+			else
+				bitmap = arch__user_pred_reg_bitmap_qwords(idx, qwords);
+			break;
+		}
+	}
+
+	return bitmap;
+}
 
 static int
 __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
 {
 	uint64_t *mode = (uint64_t *)opt->value;
 	const struct sample_reg *r = NULL;
+	struct record_opts *opts;
 	char *s, *os = NULL, *p;
 	int ret = -1;
 	uint64_t mask;
+	uint64_t simd_mask;
+	uint64_t pred_mask;
+	uint64_t bitmap = 0;
+	u16 qwords = 0;
 
 	if (unset)
 		return 0;
@@ -27,10 +112,17 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
 	if (*mode)
 		return -1;
 
-	if (intr)
+	if (intr) {
+		opts = container_of(opt->value, struct record_opts, sample_intr_regs);
 		mask = arch__intr_reg_mask();
-	else
+		simd_mask = arch__intr_simd_reg_mask();
+		pred_mask = arch__intr_pred_reg_mask();
+	} else {
+		opts = container_of(opt->value, struct record_opts, sample_user_regs);
 		mask = arch__user_reg_mask();
+		simd_mask = arch__user_simd_reg_mask();
+		pred_mask = arch__user_pred_reg_mask();
+	}
 
 	/* str may be NULL in case no arg is passed to -I */
 	if (str) {
@@ -50,10 +142,45 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
 					if (r->mask & mask)
 						fprintf(stderr, "%s ", r->name);
 				}
+				if (simd_mask || pred_mask)
+					__print_simd_regs(intr, simd_mask, pred_mask);
+
 				fputc('\n', stderr);
 				/* just printing available regs */
 				goto error;
 			}
+
+			if (simd_mask) {
+				bitmap = __get_simd_reg_bitmask_qwords(intr, s, &qwords);
+
+				/* Just need the highest qwords */
+				if (qwords > opts->sample_vec_regs_qwords) {
+					opts->sample_vec_regs_qwords = qwords;
+					if (intr)
+						opts->sample_intr_vec_regs = bitmap;
+					else
+						opts->sample_user_vec_regs = bitmap;
+				}
+
+				if (bitmap)
+					goto next;
+			}
+			if (pred_mask) {
+				bitmap = __get_pred_reg_bitmask_qwords(intr, s, &qwords);
+
+				/* Just need the highest qwords */
+				if (qwords > opts->sample_pred_regs_qwords) {
+					opts->sample_pred_regs_qwords = qwords;
+					if (intr)
+						opts->sample_intr_pred_regs = bitmap;
+					else
+						opts->sample_user_pred_regs = bitmap;
+				}
+
+				if (bitmap)
+					goto next;
+			}
+
 			for (r = arch__sample_reg_masks(); r->name; r++) {
 				if ((r->mask & mask) && !strcasecmp(s, r->name))
 					break;
@@ -65,7 +192,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
 			}
 
 			*mode |= r->mask;
-
+next:
 			if (!p)
 				break;
 
diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
index 66b666d9ce64..fb0366d050cf 100644
--- a/tools/perf/util/perf_event_attr_fprintf.c
+++ b/tools/perf/util/perf_event_attr_fprintf.c
@@ -360,6 +360,12 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
 	PRINT_ATTRf(aux_start_paused, p_unsigned);
 	PRINT_ATTRf(aux_pause, p_unsigned);
 	PRINT_ATTRf(aux_resume, p_unsigned);
+	PRINT_ATTRf(sample_simd_pred_reg_qwords, p_unsigned);
+	PRINT_ATTRf(sample_simd_pred_reg_intr, p_hex);
+	PRINT_ATTRf(sample_simd_pred_reg_user, p_hex);
+	PRINT_ATTRf(sample_simd_vec_reg_qwords, p_unsigned);
+	PRINT_ATTRf(sample_simd_vec_reg_intr, p_hex);
+	PRINT_ATTRf(sample_simd_vec_reg_user, p_hex);
 
 	return ret;
 }
diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
index 44b90bbf2d07..107bbf7dbcfe 100644
--- a/tools/perf/util/perf_regs.c
+++ b/tools/perf/util/perf_regs.c
@@ -21,6 +21,50 @@ uint64_t __weak arch__user_reg_mask(void)
 	return 0;
 }
 
+uint64_t __weak arch__intr_simd_reg_mask(void)
+{
+	return 0;
+}
+
+uint64_t __weak arch__user_simd_reg_mask(void)
+{
+	return 0;
+}
+
+uint64_t __weak arch__intr_pred_reg_mask(void)
+{
+	return 0;
+}
+
+uint64_t __weak arch__user_pred_reg_mask(void)
+{
+	return 0;
+}
+
+uint64_t __weak arch__intr_simd_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
+{
+	*qwords = 0;
+	return 0;
+}
+
+uint64_t __weak arch__user_simd_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
+{
+	*qwords = 0;
+	return 0;
+}
+
+uint64_t __weak arch__intr_pred_reg_bitmap_qwords(int reg  __maybe_unused, u16 *qwords)
+{
+	*qwords = 0;
+	return 0;
+}
+
+uint64_t __weak arch__user_pred_reg_bitmap_qwords(int reg __maybe_unused, u16 *qwords)
+{
+	*qwords = 0;
+	return 0;
+}
+
 static const struct sample_reg sample_reg_masks[] = {
 	SMPL_REG_END
 };
@@ -30,6 +74,16 @@ const struct sample_reg * __weak arch__sample_reg_masks(void)
 	return sample_reg_masks;
 }
 
+const struct sample_reg * __weak arch__sample_simd_reg_masks(void)
+{
+	return sample_reg_masks;
+}
+
+const struct sample_reg * __weak arch__sample_pred_reg_masks(void)
+{
+	return sample_reg_masks;
+}
+
 const char *perf_reg_name(int id, const char *arch)
 {
 	const char *reg_name = NULL;
diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
index f2d0736d65cc..cd98f9b9f964 100644
--- a/tools/perf/util/perf_regs.h
+++ b/tools/perf/util/perf_regs.h
@@ -27,6 +27,16 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op);
 uint64_t arch__intr_reg_mask(void);
 uint64_t arch__user_reg_mask(void);
 const struct sample_reg *arch__sample_reg_masks(void);
+const struct sample_reg *arch__sample_simd_reg_masks(void);
+const struct sample_reg *arch__sample_pred_reg_masks(void);
+uint64_t arch__intr_simd_reg_mask(void);
+uint64_t arch__user_simd_reg_mask(void);
+uint64_t arch__intr_pred_reg_mask(void);
+uint64_t arch__user_pred_reg_mask(void);
+uint64_t arch__intr_simd_reg_bitmap_qwords(int reg, u16 *qwords);
+uint64_t arch__user_simd_reg_bitmap_qwords(int reg, u16 *qwords);
+uint64_t arch__intr_pred_reg_bitmap_qwords(int reg, u16 *qwords);
+uint64_t arch__user_pred_reg_bitmap_qwords(int reg, u16 *qwords);
 
 const char *perf_reg_name(int id, const char *arch);
 int perf_reg_value(u64 *valp, struct regs_dump *regs, int id);
diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
index ea3a6c4657ee..825ffb4cc53f 100644
--- a/tools/perf/util/record.h
+++ b/tools/perf/util/record.h
@@ -59,7 +59,13 @@ struct record_opts {
 	unsigned int  user_freq;
 	u64	      branch_stack;
 	u64	      sample_intr_regs;
+	u64	      sample_intr_vec_regs;
 	u64	      sample_user_regs;
+	u64	      sample_user_vec_regs;
+	u16	      sample_pred_regs_qwords;
+	u16	      sample_vec_regs_qwords;
+	u16	      sample_intr_pred_regs;
+	u16	      sample_user_pred_regs;
 	u64	      default_interval;
 	u64	      user_interval;
 	size_t	      auxtrace_snapshot_size;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [Patch v4 17/17] perf tools: regs: Support to dump regs for PERF_SAMPLE_REGS_ABI_SIMD
  2025-09-25  6:11 [Patch v4 00/17] Support vector and more extended registers in perf Dapeng Mi
                   ` (15 preceding siblings ...)
  2025-09-25  6:12 ` [Patch v4 16/17] perf tools: parse-regs: Support the new SIMD format Dapeng Mi
@ 2025-09-25  6:12 ` Dapeng Mi
  16 siblings, 0 replies; 22+ messages in thread
From: Dapeng Mi @ 2025-09-25  6:12 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Kan Liang,
	Andi Kleen, Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Dapeng Mi, Dapeng Mi

From: Kan Liang <kan.liang@linux.intel.com>

Support the new PERF_SAMPLE_REGS_ABI_SIMD ABI. Dump the data in
perf report -D. Only the superset of the vector registers is displayed
for now.

Example:

 $perf record -e cycles:p -IXMM,YMM,OPMASK,SSP ./test
 $perf report -D
 ... ...
 237538985992962 0x454d0 [0x480]: PERF_RECORD_SAMPLE(IP, 0x1):
 179370/179370: 0xffffffff969627fc period: 124999 addr: 0
 ... intr regs: mask 0x20000000000 ABI 64-bit
 .... SSP   0x0000000000000000
 ... SIMD ABI nr_vectors 32 vector_qwords 4 nr_pred 8 pred_qwords 1
 .... YMM  [0] 0x0000000000004000
 .... YMM  [0] 0x000055e828695270
 .... YMM  [0] 0x0000000000000000
 .... YMM  [0] 0x0000000000000000
 .... YMM  [1] 0x000055e8286990e0
 .... YMM  [1] 0x000055e828698dd0
 .... YMM  [1] 0x0000000000000000
 .... YMM  [1] 0x0000000000000000
 ... ...
 .... YMM  [31] 0x0000000000000000
 .... YMM  [31] 0x0000000000000000
 .... YMM  [31] 0x0000000000000000
 .... YMM  [31] 0x0000000000000000
 .... OPMASK[0] 0x0000000000100221
 .... OPMASK[1] 0x0000000000000020
 .... OPMASK[2] 0x000000007fffffff
 .... OPMASK[3] 0x0000000000000000
 .... OPMASK[4] 0x0000000000000000
 .... OPMASK[5] 0x0000000000000000
 .... OPMASK[6] 0x0000000000000000
 .... OPMASK[7] 0x0000000000000000
 ... ...
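
The SIMD payload follows the usual { u64 abi; u64 regs[weight(mask)]; }
block in the sample. A minimal sketch of how a consumer walks it, matching
the layout documented in the header-sync patch (struct and variable names
here are illustrative only):

	if (abi & PERF_SAMPLE_REGS_ABI_SIMD) {
		struct simd_hdr {
			__u16 nr_vectors;
			__u16 vector_qwords;
			__u16 nr_pred;
			__u16 pred_qwords;
		} *hdr = p;		/* p points just past regs[] */
		__u64 *data = (__u64 *)(hdr + 1);
		size_t qwords = (size_t)hdr->nr_vectors * hdr->vector_qwords +
				(size_t)hdr->nr_pred * hdr->pred_qwords;

		/* vector register i starts at data[i * hdr->vector_qwords] */
		p = data + qwords;
	}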

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 tools/perf/util/evsel.c                       | 20 +++++
 .../perf/util/perf-regs-arch/perf_regs_x86.c  | 43 ++++++++++
 tools/perf/util/sample.h                      | 10 +++
 tools/perf/util/session.c                     | 78 +++++++++++++++++--
 4 files changed, 143 insertions(+), 8 deletions(-)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 98996e672794..e7404b2e1e24 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -3259,6 +3259,16 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
 			regs->mask = mask;
 			regs->regs = (u64 *)array;
 			array = (void *)array + sz;
+
+			if (regs->abi & PERF_SAMPLE_REGS_ABI_SIMD) {
+				regs->config = *(u64 *)array;
+				array = (void *)array + sizeof(u64);
+				regs->data = (u64 *)array;
+				sz = (regs->nr_vectors * regs->vector_qwords +
+				      regs->nr_pred * regs->pred_qwords) * sizeof(u64);
+				OVERFLOW_CHECK(array, sz, max_size);
+				array = (void *)array + sz;
+			}
 		}
 	}
 
@@ -3316,6 +3326,16 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
 			regs->mask = mask;
 			regs->regs = (u64 *)array;
 			array = (void *)array + sz;
+
+			if (regs->abi & PERF_SAMPLE_REGS_ABI_SIMD) {
+				regs->config = *(u64 *)array;
+				array = (void *)array + sizeof(u64);
+				regs->data = (u64 *)array;
+				sz = (regs->nr_vectors * regs->vector_qwords +
+				      regs->nr_pred * regs->pred_qwords) * sizeof(u64);
+				OVERFLOW_CHECK(array, sz, max_size);
+				array = (void *)array + sz;
+			}
 		}
 	}
 
diff --git a/tools/perf/util/perf-regs-arch/perf_regs_x86.c b/tools/perf/util/perf-regs-arch/perf_regs_x86.c
index 708954a9d35d..32dac438b12d 100644
--- a/tools/perf/util/perf-regs-arch/perf_regs_x86.c
+++ b/tools/perf/util/perf-regs-arch/perf_regs_x86.c
@@ -5,6 +5,49 @@
 
 const char *__perf_reg_name_x86(int id)
 {
+	if (id > PERF_REG_X86_R15 && arch__intr_simd_reg_mask()) {
+		switch (id) {
+		case PERF_REG_X86_R16:
+			return "R16";
+		case PERF_REG_X86_R17:
+			return "R17";
+		case PERF_REG_X86_R18:
+			return "R18";
+		case PERF_REG_X86_R19:
+			return "R19";
+		case PERF_REG_X86_R20:
+			return "R20";
+		case PERF_REG_X86_R21:
+			return "R21";
+		case PERF_REG_X86_R22:
+			return "R22";
+		case PERF_REG_X86_R23:
+			return "R23";
+		case PERF_REG_X86_R24:
+			return "R24";
+		case PERF_REG_X86_R25:
+			return "R25";
+		case PERF_REG_X86_R26:
+			return "R26";
+		case PERF_REG_X86_R27:
+			return "R27";
+		case PERF_REG_X86_R28:
+			return "R28";
+		case PERF_REG_X86_R29:
+			return "R29";
+		case PERF_REG_X86_R30:
+			return "R30";
+		case PERF_REG_X86_R31:
+			return "R31";
+		case PERF_REG_X86_SSP:
+			return "SSP";
+		default:
+			return NULL;
+		}
+
+		return NULL;
+	}
+
 	switch (id) {
 	case PERF_REG_X86_AX:
 		return "AX";
diff --git a/tools/perf/util/sample.h b/tools/perf/util/sample.h
index fae834144ef4..3b247e0e8242 100644
--- a/tools/perf/util/sample.h
+++ b/tools/perf/util/sample.h
@@ -12,6 +12,16 @@ struct regs_dump {
 	u64 abi;
 	u64 mask;
 	u64 *regs;
+	union {
+		u64 config;
+		struct {
+			u16 nr_vectors;
+			u16 vector_qwords;
+			u16 nr_pred;
+			u16 pred_qwords;
+		};
+	};
+	u64 *data;
 
 	/* Cached values/mask filled by first register access. */
 	u64 cache_regs[PERF_SAMPLE_REGS_CACHE_SIZE];
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 26ae078278cd..4cf6afa37d79 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -927,18 +927,78 @@ static void regs_dump__printf(u64 mask, u64 *regs, const char *arch)
 	}
 }
 
-static const char *regs_abi[] = {
-	[PERF_SAMPLE_REGS_ABI_NONE] = "none",
-	[PERF_SAMPLE_REGS_ABI_32] = "32-bit",
-	[PERF_SAMPLE_REGS_ABI_64] = "64-bit",
-};
+static void simd_regs_dump__printf(struct regs_dump *regs, bool intr)
+{
+	const char *name = "unknown";
+	const struct sample_reg *r;
+	int i, idx = 0;
+	u16 qwords;
+	int reg_idx;
+
+	if (!(regs->abi & PERF_SAMPLE_REGS_ABI_SIMD))
+		return;
+
+	printf("... SIMD ABI nr_vectors %d vector_qwords %d nr_pred %d pred_qwords %d\n",
+	       regs->nr_vectors, regs->vector_qwords,
+	       regs->nr_pred, regs->pred_qwords);
+
+	for (r = arch__sample_simd_reg_masks(); r->name; r++) {
+		if (!fls64(r->mask))
+			continue;
+		reg_idx = fls64(r->mask) - 1;
+		if (intr)
+			arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
+		else
+			arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
+		if (regs->vector_qwords == qwords) {
+			name = r->name;
+			break;
+		}
+	}
+
+	for (i = 0; i < regs->nr_vectors; i++) {
+		printf(".... %-5s[%d] 0x%016" PRIx64 "\n", name, i, regs->data[idx++]);
+		printf(".... %-5s[%d] 0x%016" PRIx64 "\n", name, i, regs->data[idx++]);
+		if (regs->vector_qwords > 2) {
+			printf(".... %-5s[%d] 0x%016" PRIx64 "\n", name, i, regs->data[idx++]);
+			printf(".... %-5s[%d] 0x%016" PRIx64 "\n", name, i, regs->data[idx++]);
+		}
+		if (regs->vector_qwords > 4) {
+			printf(".... %-5s[%d] 0x%016" PRIx64 "\n", name, i, regs->data[idx++]);
+			printf(".... %-5s[%d] 0x%016" PRIx64 "\n", name, i, regs->data[idx++]);
+			printf(".... %-5s[%d] 0x%016" PRIx64 "\n", name, i, regs->data[idx++]);
+			printf(".... %-5s[%d] 0x%016" PRIx64 "\n", name, i, regs->data[idx++]);
+		}
+	}
+
+	name = "unknown";
+	for (r = arch__sample_pred_reg_masks(); r->name; r++) {
+		if (!fls64(r->mask))
+			continue;
+		reg_idx = fls64(r->mask) - 1;
+		if (intr)
+			arch__intr_pred_reg_bitmap_qwords(reg_idx, &qwords);
+		else
+			arch__user_pred_reg_bitmap_qwords(reg_idx, &qwords);
+		if (regs->pred_qwords == qwords) {
+			name = r->name;
+			break;
+		}
+	}
+	for (i = 0; i < regs->nr_pred; i++)
+		printf(".... %-5s[%d] 0x%016" PRIx64 "\n", name, i, regs->data[idx++]);
+}
 
 static inline const char *regs_dump_abi(struct regs_dump *d)
 {
-	if (d->abi > PERF_SAMPLE_REGS_ABI_64)
-		return "unknown";
+	if (!d->abi)
+		return "none";
+	if (d->abi & PERF_SAMPLE_REGS_ABI_32)
+		return "32-bit";
+	else if (d->abi & PERF_SAMPLE_REGS_ABI_64)
+		return "64-bit";
 
-	return regs_abi[d->abi];
+	return "unknown";
 }
 
 static void regs__printf(const char *type, struct regs_dump *regs, const char *arch)
@@ -964,6 +1024,7 @@ static void regs_user__printf(struct perf_sample *sample, const char *arch)
 
 	if (user_regs->regs)
 		regs__printf("user", user_regs, arch);
+	simd_regs_dump__printf(user_regs, false);
 }
 
 static void regs_intr__printf(struct perf_sample *sample, const char *arch)
@@ -977,6 +1038,7 @@ static void regs_intr__printf(struct perf_sample *sample, const char *arch)
 
 	if (intr_regs->regs)
 		regs__printf("intr", intr_regs, arch);
+	simd_regs_dump__printf(intr_regs, true);
 }
 
 static void stack_user__printf(struct stack_dump *dump)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [Patch v4 03/17] x86/fpu/xstate: Add xsaves_nmi
  2025-09-25  6:11 ` [Patch v4 03/17] x86/fpu/xstate: Add xsaves_nmi Dapeng Mi
@ 2025-09-25 15:07   ` Dave Hansen
  2025-09-28  5:31     ` Mi, Dapeng
  0 siblings, 1 reply; 22+ messages in thread
From: Dave Hansen @ 2025-09-25 15:07 UTC (permalink / raw)
  To: Dapeng Mi, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Kan Liang,
	Andi Kleen, Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Dapeng Mi

On 9/24/25 23:11, Dapeng Mi wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
> 
> There is a hardware feature (Intel PEBS XMMs group), which can handle
> XSAVE "snapshots" from random code running. This just provides another
> XSAVE data source at a random time.
> 
> Add an interface to retrieve the actual register contents when the NMI
> hit. The interface is different from the other interfaces of FPU. The
> other mechanisms that deal with xstate try to get something coherent.
> But this interface is *in*coherent. There's no telling what was in the
> registers when a NMI hits. It writes whatever was in the registers when
> the NMI hit. It's the invoker's responsibility to make sure the contents
> are properly filtered before exposing them to the end user.
> 
> The support of the supervisor state components is required. The
> compacted storage format is preferred. So the XSAVES is used.

The changelog here is looking a bit munged from the last time I looked
at it. It's getting a bit hard to read. I'd probably run it through your
favorite LLM (and proofread it after of course) to make it more readable.

Ditto for the comments.

Also, what supervisor components are involved here? Aren't we just
talking about [XYZ]MM's?

> +/**
> + * xsaves_nmi - Save selected components to a kernel xstate buffer in NMI
> + * @xstate:	Pointer to the buffer
> + * @mask:	Feature mask to select the components to save
> + *
> + * The @xstate buffer must be 64 byte aligned.
> + *
> + * Caution: The interface is different from the other interfaces of FPU.
> + * The other mechanisms that deal with xstate try to get something coherent.
> + * But this interface is *in*coherent. There's no telling what was in the
> + * registers when a NMI hits. It writes whatever was in the registers when
> + * the NMI hit.
> + * The only user for the interface is perf_event. There is already a
> + * hardware feature (See Intel PEBS XMMs group), which can handle XSAVE
> + * "snapshots" from random code running. This just provides another XSAVE
> + * data source at a random time.
> + * This function can only be invoked in an NMI. It returns the *ACTUAL*
> + * register contents when the NMI hit.
> + */

First, please use actual paragraphs. This isn't a manpage.

But this whole comment kinda rubs me the wrong way.

For instance, I don't think we need to relitigate the XSAVE architecture
with the "The @xstate buffer must be 64 byte aligned." comment. Even if
we did, that's just silly when you could put a one-liner WARN_ON() in
the function which would be a billion times better than a comment.

I'm not sure what "interfaces of FPU" means. I know it came mostly out
of some earlier mails I wrote. But could we trim this down, please?

We basically want to scare anyone else away that might be tempted to use
this.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Patch v4 03/17] x86/fpu/xstate: Add xsaves_nmi
  2025-09-25 15:07   ` Dave Hansen
@ 2025-09-28  5:31     ` Mi, Dapeng
  2025-09-29 19:01       ` Dave Hansen
  0 siblings, 1 reply; 22+ messages in thread
From: Mi, Dapeng @ 2025-09-28  5:31 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Dave Hansen, Ian Rogers, Adrian Hunter, Jiri Olsa,
	Alexander Shishkin, Kan Liang, Andi Kleen, Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Dapeng Mi


On 9/25/2025 11:07 PM, Dave Hansen wrote:
> On 9/24/25 23:11, Dapeng Mi wrote:
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> There is a hardware feature (Intel PEBS XMMs group), which can handle
>> XSAVE "snapshots" from random code running. This just provides another
>> XSAVE data source at a random time.
>>
>> Add an interface to retrieve the actual register contents when the NMI
>> hit. The interface is different from the other interfaces of FPU. The
>> other mechanisms that deal with xstate try to get something coherent.
>> But this interface is *in*coherent. There's no telling what was in the
>> registers when a NMI hits. It writes whatever was in the registers when
>> the NMI hit. It's the invoker's responsibility to make sure the contents
>> are properly filtered before exposing them to the end user.
>>
>> The support of the supervisor state components is required. The
>> compacted storage format is preferred. So the XSAVES is used.
> The changelog here is looking a bit munged from the last time I looked
> at it. It's getting a bit hard to read. I'd probably run it through your
> favorite LLM (and proofread it after of course) to make it more readable.
>
> Ditto for the comments.

Sure. Thanks.


>
> Also, what supervisor components are involved here? Aren't we just
> talking about [XYZ]MM's?

Besides the SIMD registers [XYZ]MM, the CET_USR (only SSP) and APX eGPRs
would be supported as well.
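
In xstate-component terms, the request mask would roughly look like the
sketch below (the SSE/AVX/AVX-512/CET_USER mask names are the existing
kernel ones; the APX mask name is assumed):

	u64 mask = XFEATURE_MASK_SSE | XFEATURE_MASK_YMM |
		   XFEATURE_MASK_OPMASK | XFEATURE_MASK_ZMM_Hi256 |
		   XFEATURE_MASK_Hi16_ZMM | XFEATURE_MASK_CET_USER |
		   XFEATURE_MASK_APX;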


>
>> +/**
>> + * xsaves_nmi - Save selected components to a kernel xstate buffer in NMI
>> + * @xstate:	Pointer to the buffer
>> + * @mask:	Feature mask to select the components to save
>> + *
>> + * The @xstate buffer must be 64 byte aligned.
>> + *
>> + * Caution: The interface is different from the other interfaces of FPU.
>> + * The other mechanisms that deal with xstate try to get something coherent.
>> + * But this interface is *in*coherent. There's no telling what was in the
>> + * registers when a NMI hits. It writes whatever was in the registers when
>> + * the NMI hit.
>> + * The only user for the interface is perf_event. There is already a
>> + * hardware feature (See Intel PEBS XMMs group), which can handle XSAVE
>> + * "snapshots" from random code running. This just provides another XSAVE
>> + * data source at a random time.
>> + * This function can only be invoked in an NMI. It returns the *ACTUAL*
>> + * register contents when the NMI hit.
>> + */
> First, please use actual paragraphs. This isn't a manpage.
>
> But this whole comment kinda rubs me the wrong way.
>
> For instance, I don't think we need to relitigate the XSAVE architecture
> with the "The @xstate buffer must be 64 byte aligned." comment. Even if
> we did, that's just silly when you could put a one-liner WARN_ON() in
> the function which would be a billion times better than a comment.

Yes, thanks.
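
Something like the following at the top of xsaves_nmi() should be enough,
presumably (just a sketch using the generic WARN_ON_ONCE()/IS_ALIGNED()
helpers, not taken from the posted patch):

	WARN_ON_ONCE(!IS_ALIGNED((unsigned long)xstate, 64));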


>
> I'm not sure what "interfaces of FPU" means. I know it came mostly out
> of some earlier mails I wrote. But could we trim this down, please?

I suppose the "interfaces of FPU" refers to the xsaves() helper. Sure, I
will rewrite the comments and make them more accurate. Thanks.


>
> We basically want to scare anyone else away that might be tempted to use
> this.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Patch v4 03/17] x86/fpu/xstate: Add xsaves_nmi
  2025-09-28  5:31     ` Mi, Dapeng
@ 2025-09-29 19:01       ` Dave Hansen
  2025-09-30  2:44         ` Mi, Dapeng
  0 siblings, 1 reply; 22+ messages in thread
From: Dave Hansen @ 2025-09-29 19:01 UTC (permalink / raw)
  To: Mi, Dapeng, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Dave Hansen, Ian Rogers,
	Adrian Hunter, Jiri Olsa, Alexander Shishkin, Kan Liang,
	Andi Kleen, Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Dapeng Mi

On 9/27/25 22:31, Mi, Dapeng wrote:
>> Also, what supervisor components are involved here? Aren't we just
>> talking about [XYZ]MM's?
> Besides the SIMD registers [XYZ]MM, the CET_USR (only SSP) and APX eGPRs
> would be supported as well.

We should think long and hard about whether to use XSAVE for the CET
SSP. I'm not convinced it's worth it.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Patch v4 03/17] x86/fpu/xstate: Add xsaves_nmi
  2025-09-29 19:01       ` Dave Hansen
@ 2025-09-30  2:44         ` Mi, Dapeng
  0 siblings, 0 replies; 22+ messages in thread
From: Mi, Dapeng @ 2025-09-30  2:44 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Dave Hansen, Ian Rogers, Adrian Hunter, Jiri Olsa,
	Alexander Shishkin, Kan Liang, Andi Kleen, Eranian Stephane
  Cc: Mark Rutland, broonie, Ravi Bangoria, linux-kernel,
	linux-perf-users, Dapeng Mi


On 9/30/2025 3:01 AM, Dave Hansen wrote:
> On 9/27/25 22:31, Mi, Dapeng wrote:
>>> Also, what supervisor components are involved here? Aren't we just
>>> talking about [XYZ]MM's?
>> Besides the SIMD registers [XYZ]MM, the CET_USR (only SSP) and APX eGPRs
>> would be supported as well.
> We should think long and hard about whether to use XSAVE for the CET
> SSP. I'm not convinced it's worth it.

Yeah, it's indeed inefficient to read only the SSP register by using the
xsaves instruction. Do you think it's safe enough to read the IA32_PL3_SSP
MSR directly with the rdmsr instruction?

IMO, it seems good enough to read IA32_PL3_SSP directly with rdmsr in NMI
context, since we just need to know the real value of IA32_PL3_SSP when the
NMI hits. Thanks.
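
(A minimal sketch of that alternative, assuming the rdmsrl() helper and
the MSR_IA32_PL3_SSP definition from msr-index.h:)

	u64 ssp;

	rdmsrl(MSR_IA32_PL3_SSP, ssp);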



^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2025-09-30  2:44 UTC | newest]

Thread overview: 22+ messages
2025-09-25  6:11 [Patch v4 00/17] Support vector and more extended registers in perf Dapeng Mi
2025-09-25  6:11 ` [Patch v4 01/17] perf/x86: Use x86_perf_regs in the x86 nmi handler Dapeng Mi
2025-09-25  6:11 ` [Patch v4 02/17] perf/x86: Setup the regs data Dapeng Mi
2025-09-25  6:11 ` [Patch v4 03/17] x86/fpu/xstate: Add xsaves_nmi Dapeng Mi
2025-09-25 15:07   ` Dave Hansen
2025-09-28  5:31     ` Mi, Dapeng
2025-09-29 19:01       ` Dave Hansen
2025-09-30  2:44         ` Mi, Dapeng
2025-09-25  6:12 ` [Patch v4 04/17] perf: Move has_extended_regs() to header file Dapeng Mi
2025-09-25  6:12 ` [Patch v4 05/17] perf/x86: Support XMM register for non-PEBS and REGS_USER Dapeng Mi
2025-09-25  6:12 ` [Patch v4 06/17] perf: Support SIMD registers Dapeng Mi
2025-09-25  6:12 ` [Patch v4 07/17] perf/x86: Move XMM to sample_simd_vec_regs Dapeng Mi
2025-09-25  6:12 ` [Patch v4 08/17] perf/x86: Add YMM into sample_simd_vec_regs Dapeng Mi
2025-09-25  6:12 ` [Patch v4 09/17] perf/x86: Add ZMM " Dapeng Mi
2025-09-25  6:12 ` [Patch v4 10/17] perf/x86: Add OPMASK into sample_simd_pred_reg Dapeng Mi
2025-09-25  6:12 ` [Patch v4 11/17] perf/x86: Add eGPRs into sample_regs Dapeng Mi
2025-09-25  6:12 ` [Patch v4 12/17] perf/x86: Add SSP " Dapeng Mi
2025-09-25  6:12 ` [Patch v4 13/17] perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS Dapeng Mi
2025-09-25  6:12 ` [Patch v4 14/17] perf tools: Only support legacy regs for the PT and PERF_REGS_MASK Dapeng Mi
2025-09-25  6:12 ` [Patch v4 15/17] perf tools: headers: Sync with the kernel sources Dapeng Mi
2025-09-25  6:12 ` [Patch v4 16/17] perf tools: parse-regs: Support the new SIMD format Dapeng Mi
2025-09-25  6:12 ` [Patch v4 17/17] perf tools: regs: Support to dump regs for PERF_SAMPLE_REGS_ABI_SIMD Dapeng Mi
