* [PATCH V3 00/17] Support vector and more extended registers in perf
@ 2025-08-15 21:34 kan.liang
  2025-08-15 21:34 ` [PATCH V3 01/17] perf/x86: Use x86_perf_regs in the x86 nmi handler kan.liang
                   ` (16 more replies)
  0 siblings, 17 replies; 32+ messages in thread
From: kan.liang @ 2025-08-15 21:34 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, tglx, dave.hansen, irogers,
	adrian.hunter, jolsa, alexander.shishkin, linux-kernel
  Cc: dapeng1.mi, ak, zide.chen, mark.rutland, broonie, ravi.bangoria,
	eranian, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Changes since V2:
- Use the FPU format for the x86_pmu.ext_regs_mask as well
- Add a check before invoking xsaves_nmi()
- Add perf_simd_reg_check() to retrieve the number of available
  registers. If the kernel fails to get the requested registers, e.g.,
  XSAVES fails, nothing is dumped to user space (V2 dumped all 0s).
- Add POC perf tool patches

Changes since V1:
- Apply the new interfaces to configure and dump the SIMD registers
- Utilize the existing FPU functions, e.g., xstate_calculate_size()
  and get_xsave_addr().

Starting from Intel Ice Lake, the XMM registers can be collected in
a PEBS record. More registers, e.g., YMM, ZMM, OPMASK, SSP, and APX,
will be added in the upcoming architectural PEBS as well. But that
requires hardware support.

The patch set provides a software solution to mitigate the hardware
requirement. It utilizes the XSAVES instruction to retrieve the
requested registers in the overflow handler. The feature is no longer
limited to PEBS events or specific platforms.
The hardware solution (if available) is still preferred, since it has
lower overhead (especially with large PEBS) and is more accurate.

In theory, the solution should work on all x86 platforms, but I only
have newer Intel platforms to test. The patch set only enables the
feature for Intel Ice Lake and later platforms.

The new registers include YMM, ZMM, OPMASK, SSP, and APX. The
sample_regs_user/intr bitmaps have run out of bits. A new field in
struct perf_event_attr is required to configure these registers.

After a long discussion in V1,
https://lore.kernel.org/lkml/3f1c9a9e-cb63-47ff-a5e9-06555fa6cc9a@linux.intel.com/

the new fields look like below.
@@ -543,6 +545,25 @@ struct perf_event_attr {
        __u64   sig_data;

        __u64   config3; /* extension of config2 */
+
+
+       /*
+        * Defines set of SIMD registers to dump on samples.
+        * The sample_simd_regs_enabled !=0 implies the
+        * set of SIMD registers is used to config all SIMD registers.
+        * If !sample_simd_regs_enabled, sample_regs_XXX may be used to
+        * config some SIMD registers on X86.
+        */
+       union {
+               __u16 sample_simd_regs_enabled;
+               __u16 sample_simd_pred_reg_qwords;
+       };
+       __u32 sample_simd_pred_reg_intr;
+       __u32 sample_simd_pred_reg_user;
+       __u16 sample_simd_vec_reg_qwords;
+       __u64 sample_simd_vec_reg_intr;
+       __u64 sample_simd_vec_reg_user;
+       __u32 __reserved_4;
 };
@@ -1016,7 +1037,15 @@ enum perf_event_type {
         *      } && PERF_SAMPLE_BRANCH_STACK
         *
         *      { u64                   abi; # enum perf_sample_regs_abi
-        *        u64                   regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
+        *        u64                   regs[weight(mask)];
+        *        struct {
+        *              u16 nr_vectors;
+        *              u16 vector_qwords;
+        *              u16 nr_pred;
+        *              u16 pred_qwords;
+        *              u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
+        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
+        *      } && PERF_SAMPLE_REGS_USER
         *
         *      { u64                   size;
         *        char                  data[size];
@@ -1043,7 +1072,15 @@ enum perf_event_type {
         *      { u64                   data_src; } && PERF_SAMPLE_DATA_SRC
         *      { u64                   transaction; } && PERF_SAMPLE_TRANSACTION
         *      { u64                   abi; # enum perf_sample_regs_abi
-        *        u64                   regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
+        *        u64                   regs[weight(mask)];
+        *        struct {
+        *              u16 nr_vectors;
+        *              u16 vector_qwords;
+        *              u16 nr_pred;
+        *              u16 pred_qwords;
+        *              u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
+        *        } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
+        *      } && PERF_SAMPLE_REGS_INTR
         *      { u64                   phys_addr;} && PERF_SAMPLE_PHYS_ADDR
         *      { u64                   cgroup;} && PERF_SAMPLE_CGROUP
         *      { u64                   data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE


Since there is only one vector qwords field, the tool should set the
qwords of the widest requested vector register. For example, if the
end user wants XMM0 and YMM1, the vector qwords should be 4 and the
vector mask 0x3. YMM0 and YMM1 will be dumped to user space. It's the
tool's responsibility to present XMM0 and YMM1 to the end user.
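
As a minimal sketch of that configuration (using the proposed V3 uapi
fields above; error handling and the perf_event_open() call omitted),
a tool could request XMM0 and YMM1 on samples like this:

	/* Sketch only: request registers 0-1 as 4-qword (YMM) vectors. */
	struct perf_event_attr attr = {};

	attr.size = sizeof(attr);		/* >= PERF_ATTR_SIZE_VER9 */
	attr.sample_type |= PERF_SAMPLE_REGS_INTR;
	/* Non-zero enables the SIMD scheme; this field aliases
	 * sample_simd_regs_enabled in the union. */
	attr.sample_simd_pred_reg_qwords = 1;
	attr.sample_simd_vec_reg_qwords = 4;	/* widest request: YMM */
	attr.sample_simd_vec_reg_intr = 0x3;	/* vector registers 0 and 1 */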

The POC perf tool patches for testing purposes are also attached.

Examples:
 $perf record -I?
 available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
 R11 R12 R13 R14 R15 SSP XMM0-31 YMM0-31 ZMM0-31 OPMASK0-7

 $perf record --user-regs=?
 available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
 R11 R12 R13 R14 R15 SSP XMM0-31 YMM0-31 ZMM0-31 OPMASK0-7

 $perf record -e cycles:p -IXMM,YMM,OPMASK,SSP ./test
 $perf report -D
 ... ...
 237538985992962 0x454d0 [0x480]: PERF_RECORD_SAMPLE(IP, 0x1):
 179370/179370: 0xffffffff969627fc period: 124999 addr: 0
 ... intr regs: mask 0x20000000000 ABI 64-bit
 .... SSP   0x0000000000000000
 ... SIMD ABI nr_vectors 32 vector_qwords 4 nr_pred 8 pred_qwords 1
 .... YMM  [0] 0x0000000000004000
 .... YMM  [0] 0x000055e828695270
 .... YMM  [0] 0x0000000000000000
 .... YMM  [0] 0x0000000000000000
 .... YMM  [1] 0x000055e8286990e0
 .... YMM  [1] 0x000055e828698dd0
 .... YMM  [1] 0x0000000000000000
 .... YMM  [1] 0x0000000000000000
 ... ...
 .... YMM  [31] 0x0000000000000000
 .... YMM  [31] 0x0000000000000000
 .... YMM  [31] 0x0000000000000000
 .... YMM  [31] 0x0000000000000000
 .... OPMASK[0] 0x0000000000100221
 .... OPMASK[1] 0x0000000000000020
 .... OPMASK[2] 0x000000007fffffff
 .... OPMASK[3] 0x0000000000000000
 .... OPMASK[4] 0x0000000000000000
 .... OPMASK[5] 0x0000000000000000
 .... OPMASK[6] 0x0000000000000000
 .... OPMASK[7] 0x0000000000000000
 ... ...
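
For reference, the following sketch shows my reading of how a consumer
walks the SIMD block that the above format appends after the legacy
u64 regs[] array (assumes <string.h> and <linux/types.h>; this is not
tool code from this series):

	/* Sketch: skip the SIMD block when (abi & PERF_SAMPLE_REGS_ABI_SIMD). */
	static const char *skip_simd_regs(const char *p)
	{
		__u16 hdr[4];	/* nr_vectors, vector_qwords, nr_pred, pred_qwords */

		memcpy(hdr, p, sizeof(hdr));
		p += sizeof(hdr);
		/* u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords] */
		return p + (hdr[0] * hdr[1] + hdr[2] * hdr[3]) * sizeof(__u64);
	}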

Kan Liang (17):
  perf/x86: Use x86_perf_regs in the x86 nmi handler
  perf/x86: Setup the regs data
  x86/fpu/xstate: Add xsaves_nmi
  perf: Move has_extended_regs() to header file
  perf/x86: Support XMM register for non-PEBS and REGS_USER
  perf: Support SIMD registers
  perf/x86: Move XMM to sample_simd_vec_regs
  perf/x86: Add YMM into sample_simd_vec_regs
  perf/x86: Add ZMM into sample_simd_vec_regs
  perf/x86: Add OPMASK into sample_simd_pred_reg
  perf/x86: Add eGPRs into sample_regs
  perf/x86: Add SSP into sample_regs
  perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS
  perf/x86/regs: Only support legacy regs for the PT and PERF_REGS_MASK
    for now
  tools headers: Sync with the kernel sources
  perf parse-regs: Support the new SIMD format
  perf regs: Support the PERF_SAMPLE_REGS_ABI_SIMD

 arch/x86/events/core.c                        | 268 +++++++++++++++++-
 arch/x86/events/intel/core.c                  |  73 ++++-
 arch/x86/events/intel/ds.c                    |  12 +-
 arch/x86/events/perf_event.h                  |  23 ++
 arch/x86/include/asm/fpu/xstate.h             |   3 +
 arch/x86/include/asm/perf_event.h             |  30 +-
 arch/x86/include/uapi/asm/perf_regs.h         |  44 ++-
 arch/x86/kernel/fpu/xstate.c                  |  32 ++-
 arch/x86/kernel/perf_regs.c                   | 127 ++++++++-
 include/linux/perf_event.h                    |  21 ++
 include/linux/perf_regs.h                     |   9 +
 include/uapi/linux/perf_event.h               |  47 ++-
 kernel/events/core.c                          | 109 ++++++-
 tools/arch/x86/include/uapi/asm/perf_regs.h   |  44 ++-
 tools/include/uapi/linux/perf_event.h         |  47 ++-
 tools/perf/arch/x86/include/perf_regs.h       |   2 +-
 tools/perf/arch/x86/util/perf_regs.c          | 257 ++++++++++++++++-
 tools/perf/util/evsel.c                       |  41 +++
 tools/perf/util/intel-pt.c                    |   2 +-
 tools/perf/util/parse-regs-options.c          |  60 +++-
 .../perf/util/perf-regs-arch/perf_regs_x86.c  |  45 +++
 tools/perf/util/perf_event_attr_fprintf.c     |   6 +
 tools/perf/util/perf_regs.c                   |  29 ++
 tools/perf/util/perf_regs.h                   |  13 +-
 tools/perf/util/record.h                      |   6 +
 tools/perf/util/sample.h                      |  10 +
 tools/perf/util/session.c                     |  62 +++-
 27 files changed, 1339 insertions(+), 83 deletions(-)

-- 
2.38.1



* [PATCH V3 01/17] perf/x86: Use x86_perf_regs in the x86 nmi handler
  2025-08-15 21:34 [PATCH V3 00/17] Support vector and more extended registers in perf kan.liang
@ 2025-08-15 21:34 ` kan.liang
  2025-08-15 21:34 ` [PATCH V3 02/17] perf/x86: Setup the regs data kan.liang
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: kan.liang @ 2025-08-15 21:34 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, tglx, dave.hansen, irogers,
	adrian.hunter, jolsa, alexander.shishkin, linux-kernel
  Cc: dapeng1.mi, ak, zide.chen, mark.rutland, broonie, ravi.bangoria,
	eranian, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

More and more registers will be supported in the overflow handler,
e.g., more vector registers, SSP, etc. The generic struct pt_regs
cannot store all of them. Use an x86-specific struct x86_perf_regs
instead.

The struct pt_regs *regs is still passed to x86_pmu_handle_irq(). There
is no functional change for the existing code.

AMD IBS's NMI handler doesn't utilize the static call
x86_pmu_handle_irq(), so the x86_perf_regs struct doesn't apply to
AMD IBS. It can be added separately later, once AMD IBS supports more
registers.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/core.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 7610f26dfbd9..64a7a8aa2e38 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1752,6 +1752,7 @@ void perf_events_lapic_init(void)
 static int
 perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
 {
+	struct x86_perf_regs x86_regs;
 	u64 start_clock;
 	u64 finish_clock;
 	int ret;
@@ -1764,7 +1765,8 @@ perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
 		return NMI_DONE;
 
 	start_clock = sched_clock();
-	ret = static_call(x86_pmu_handle_irq)(regs);
+	x86_regs.regs = *regs;
+	ret = static_call(x86_pmu_handle_irq)(&x86_regs.regs);
 	finish_clock = sched_clock();
 
 	perf_sample_event_took(finish_clock - start_clock);
-- 
2.38.1



* [PATCH V3 02/17] perf/x86: Setup the regs data
  2025-08-15 21:34 [PATCH V3 00/17] Support vector and more extended registers in perf kan.liang
  2025-08-15 21:34 ` [PATCH V3 01/17] perf/x86: Use x86_perf_regs in the x86 nmi handler kan.liang
@ 2025-08-15 21:34 ` kan.liang
  2025-08-15 21:34 ` [PATCH V3 03/17] x86/fpu/xstate: Add xsaves_nmi kan.liang
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: kan.liang @ 2025-08-15 21:34 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, tglx, dave.hansen, irogers,
	adrian.hunter, jolsa, alexander.shishkin, linux-kernel
  Cc: dapeng1.mi, ak, zide.chen, mark.rutland, broonie, ravi.bangoria,
	eranian, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The current code relies on the generic code to set up the regs data.
That will not work well once more registers are introduced.
Introduce an x86-specific x86_pmu_setup_regs_data().
For now, it does the same as the generic code. More x86-specific code
will be added later along with the new registers.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/core.c       | 32 ++++++++++++++++++++++++++++++++
 arch/x86/events/intel/ds.c   |  4 +++-
 arch/x86/events/perf_event.h |  4 ++++
 3 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 64a7a8aa2e38..c601ad761534 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1685,6 +1685,38 @@ static void x86_pmu_del(struct perf_event *event, int flags)
 	static_call_cond(x86_pmu_del)(event);
 }
 
+void x86_pmu_setup_regs_data(struct perf_event *event,
+			     struct perf_sample_data *data,
+			     struct pt_regs *regs)
+{
+	u64 sample_type = event->attr.sample_type;
+
+	if (sample_type & PERF_SAMPLE_REGS_USER) {
+		if (user_mode(regs)) {
+			data->regs_user.abi = perf_reg_abi(current);
+			data->regs_user.regs = regs;
+		} else if (!(current->flags & PF_KTHREAD)) {
+			perf_get_regs_user(&data->regs_user, regs);
+		} else {
+			data->regs_user.abi = PERF_SAMPLE_REGS_ABI_NONE;
+			data->regs_user.regs = NULL;
+		}
+		data->dyn_size += sizeof(u64);
+		if (data->regs_user.regs)
+			data->dyn_size += hweight64(event->attr.sample_regs_user) * sizeof(u64);
+		data->sample_flags |= PERF_SAMPLE_REGS_USER;
+	}
+
+	if (sample_type & PERF_SAMPLE_REGS_INTR) {
+		data->regs_intr.regs = regs;
+		data->regs_intr.abi = perf_reg_abi(current);
+		data->dyn_size += sizeof(u64);
+		if (data->regs_intr.regs)
+			data->dyn_size += hweight64(event->attr.sample_regs_intr) * sizeof(u64);
+		data->sample_flags |= PERF_SAMPLE_REGS_INTR;
+	}
+}
+
 int x86_pmu_handle_irq(struct pt_regs *regs)
 {
 	struct perf_sample_data data;
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index c0b7ac1c7594..e67d8a03ddfe 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -2126,8 +2126,10 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
 			regs->flags &= ~PERF_EFLAGS_EXACT;
 		}
 
-		if (sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER))
+		if (sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)) {
 			adaptive_pebs_save_regs(regs, gprs);
+			x86_pmu_setup_regs_data(event, data, regs);
+		}
 	}
 
 	if (format_group & PEBS_DATACFG_MEMINFO) {
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 2b969386dcdd..12682a059608 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1278,6 +1278,10 @@ void x86_pmu_enable_event(struct perf_event *event);
 
 int x86_pmu_handle_irq(struct pt_regs *regs);
 
+void x86_pmu_setup_regs_data(struct perf_event *event,
+			     struct perf_sample_data *data,
+			     struct pt_regs *regs);
+
 void x86_pmu_show_pmu_cap(struct pmu *pmu);
 
 static inline int x86_pmu_num_counters(struct pmu *pmu)
-- 
2.38.1



* [PATCH V3 03/17] x86/fpu/xstate: Add xsaves_nmi
  2025-08-15 21:34 [PATCH V3 00/17] Support vector and more extended registers in perf kan.liang
  2025-08-15 21:34 ` [PATCH V3 01/17] perf/x86: Use x86_perf_regs in the x86 nmi handler kan.liang
  2025-08-15 21:34 ` [PATCH V3 02/17] perf/x86: Setup the regs data kan.liang
@ 2025-08-15 21:34 ` kan.liang
  2025-08-15 21:34 ` [PATCH V3 04/17] perf: Move has_extended_regs() to header file kan.liang
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: kan.liang @ 2025-08-15 21:34 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, tglx, dave.hansen, irogers,
	adrian.hunter, jolsa, alexander.shishkin, linux-kernel
  Cc: dapeng1.mi, ak, zide.chen, mark.rutland, broonie, ravi.bangoria,
	eranian, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

There is a hardware feature (the Intel PEBS XMMs group) which can
take XSAVE "snapshots" of randomly running code. This just provides
another XSAVE data source at a random time.

Add an interface to retrieve the actual register contents when an NMI
hits. The interface is different from the other FPU interfaces. The
other mechanisms that deal with xstate try to get something coherent.
But this interface is *in*coherent. There's no telling what was in the
registers when an NMI hits. It writes whatever was in the registers at
that moment. It's the invoker's responsibility to make sure the
contents are properly filtered before exposing them to the end user.

Support for the supervisor state components is required, and the
compacted storage format is preferred, so XSAVES is used.

Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/include/asm/fpu/xstate.h |  1 +
 arch/x86/kernel/fpu/xstate.c      | 30 ++++++++++++++++++++++++++++++
 2 files changed, 31 insertions(+)

diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index b308a76afbb7..0c8b9251c29f 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -107,6 +107,7 @@ int xfeature_size(int xfeature_nr);
 
 void xsaves(struct xregs_state *xsave, u64 mask);
 void xrstors(struct xregs_state *xsave, u64 mask);
+void xsaves_nmi(struct xregs_state *xsave, u64 mask);
 
 int xfd_enable_feature(u64 xfd_err);
 
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 9aa9ac8399ae..8602683fcb12 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1448,6 +1448,36 @@ void xrstors(struct xregs_state *xstate, u64 mask)
 	WARN_ON_ONCE(err);
 }
 
+/**
+ * xsaves_nmi - Save selected components to a kernel xstate buffer in NMI
+ * @xstate:	Pointer to the buffer
+ * @mask:	Feature mask to select the components to save
+ *
+ * The @xstate buffer must be 64 byte aligned.
+ *
+ * Caution: The interface is different from the other interfaces of FPU.
+ * The other mechanisms that deal with xstate try to get something coherent.
+ * But this interface is *in*coherent. There's no telling what was in the
+ * registers when an NMI hits. It writes whatever was in the registers when
+ * the NMI hit.
+ * The only user for the interface is perf_event. There is already a
+ * hardware feature (See Intel PEBS XMMs group), which can handle XSAVE
+ * "snapshots" from random code running. This just provides another XSAVE
+ * data source at a random time.
+ * This function can only be invoked in an NMI. It returns the *ACTUAL*
+ * register contents when the NMI hit.
+ */
+void xsaves_nmi(struct xregs_state *xstate, u64 mask)
+{
+	int err;
+
+	if (!in_nmi())
+		return;
+
+	XSTATE_OP(XSAVES, xstate, (u32)mask, (u32)(mask >> 32), err);
+	WARN_ON_ONCE(err);
+}
+
 #if IS_ENABLED(CONFIG_KVM)
 void fpstate_clear_xstate_component(struct fpstate *fpstate, unsigned int xfeature)
 {
-- 
2.38.1



* [PATCH V3 04/17] perf: Move has_extended_regs() to header file
  2025-08-15 21:34 [PATCH V3 00/17] Support vector and more extended registers in perf kan.liang
                   ` (2 preceding siblings ...)
  2025-08-15 21:34 ` [PATCH V3 03/17] x86/fpu/xstate: Add xsaves_nmi kan.liang
@ 2025-08-15 21:34 ` kan.liang
  2025-08-15 21:34 ` [PATCH V3 05/17] perf/x86: Support XMM register for non-PEBS and REGS_USER kan.liang
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: kan.liang @ 2025-08-15 21:34 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, tglx, dave.hansen, irogers,
	adrian.hunter, jolsa, alexander.shishkin, linux-kernel
  Cc: dapeng1.mi, ak, zide.chen, mark.rutland, broonie, ravi.bangoria,
	eranian, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The function will also be used in arch-specific code.

Rename it to follow the naming convention of the existing functions.

No functional change.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 include/linux/perf_event.h | 8 ++++++++
 kernel/events/core.c       | 8 +-------
 2 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index ec9d96025683..444b162f3f92 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1526,6 +1526,14 @@ perf_event__output_id_sample(struct perf_event *event,
 extern void
 perf_log_lost_samples(struct perf_event *event, u64 lost);
 
+static inline bool event_has_extended_regs(struct perf_event *event)
+{
+	struct perf_event_attr *attr = &event->attr;
+
+	return (attr->sample_regs_user & PERF_REG_EXTENDED_MASK) ||
+	       (attr->sample_regs_intr & PERF_REG_EXTENDED_MASK);
+}
+
 static inline bool event_has_any_exclude_flag(struct perf_event *event)
 {
 	struct perf_event_attr *attr = &event->attr;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0db36b2b2448..95a7b6f5af09 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -12527,12 +12527,6 @@ int perf_pmu_unregister(struct pmu *pmu)
 }
 EXPORT_SYMBOL_GPL(perf_pmu_unregister);
 
-static inline bool has_extended_regs(struct perf_event *event)
-{
-	return (event->attr.sample_regs_user & PERF_REG_EXTENDED_MASK) ||
-	       (event->attr.sample_regs_intr & PERF_REG_EXTENDED_MASK);
-}
-
 static int perf_try_init_event(struct pmu *pmu, struct perf_event *event)
 {
 	struct perf_event_context *ctx = NULL;
@@ -12567,7 +12561,7 @@ static int perf_try_init_event(struct pmu *pmu, struct perf_event *event)
 		goto err_pmu;
 
 	if (!(pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS) &&
-	    has_extended_regs(event)) {
+	    event_has_extended_regs(event)) {
 		ret = -EOPNOTSUPP;
 		goto err_destroy;
 	}
-- 
2.38.1



* [PATCH V3 05/17] perf/x86: Support XMM register for non-PEBS and REGS_USER
  2025-08-15 21:34 [PATCH V3 00/17] Support vector and more extended registers in perf kan.liang
                   ` (3 preceding siblings ...)
  2025-08-15 21:34 ` [PATCH V3 04/17] perf: Move has_extended_regs() to header file kan.liang
@ 2025-08-15 21:34 ` kan.liang
  2025-08-19 13:39   ` Peter Zijlstra
  2025-08-15 21:34 ` [PATCH V3 06/17] perf: Support SIMD registers kan.liang
                   ` (11 subsequent siblings)
  16 siblings, 1 reply; 32+ messages in thread
From: kan.liang @ 2025-08-15 21:34 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, tglx, dave.hansen, irogers,
	adrian.hunter, jolsa, alexander.shishkin, linux-kernel
  Cc: dapeng1.mi, ak, zide.chen, mark.rutland, broonie, ravi.bangoria,
	eranian, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Collecting the XMM registers in a PEBS record has been supported since
Ice Lake, but non-PEBS events don't support the feature. It's possible
to retrieve the XMM registers via XSAVE for non-PEBS events. Add that
to make the feature complete.

To utilize XSAVE, a 64-byte-aligned buffer is required. Add a per-CPU
ext_regs_buf to store the vector registers. The size of the buffer is
~2K. kzalloc_node() is used because there's a _guarantee_ that all
kmalloc() allocations with power-of-2 sizes are naturally aligned, and
thus 64-byte aligned.

Extend the support to both REGS_USER and REGS_INTR. For REGS_USER,
perf_get_regs_user() returns the regs from task_pt_regs(current),
which is a struct pt_regs. It needs to be copied into the per-CPU
struct x86_perf_regs x86_user_regs.
For PEBS, the HW support is still preferred. The XMM registers should
be retrieved from PEBS records.

More vector registers could be supported later. Add ext_regs_mask to
track the supported vector register groups.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/core.c            | 127 +++++++++++++++++++++++++-----
 arch/x86/events/intel/core.c      |  27 +++++++
 arch/x86/events/intel/ds.c        |  10 ++-
 arch/x86/events/perf_event.h      |   9 ++-
 arch/x86/include/asm/fpu/xstate.h |   2 +
 arch/x86/include/asm/perf_event.h |   5 +-
 arch/x86/kernel/fpu/xstate.c      |   2 +-
 7 files changed, 157 insertions(+), 25 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index c601ad761534..f27c58f4c815 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -406,6 +406,61 @@ set_ext_hw_attr(struct hw_perf_event *hwc, struct perf_event *event)
 	return x86_pmu_extra_regs(val, event);
 }
 
+static DEFINE_PER_CPU(struct xregs_state *, ext_regs_buf);
+
+static void x86_pmu_get_ext_regs(struct x86_perf_regs *perf_regs, u64 mask)
+{
+	struct xregs_state *xsave = per_cpu(ext_regs_buf, smp_processor_id());
+	u64 valid_mask = x86_pmu.ext_regs_mask & mask;
+
+	if (WARN_ON_ONCE(!xsave))
+		return;
+
+	xsaves_nmi(xsave, valid_mask);
+
+	/* Filtered by what XSAVE really gives */
+	valid_mask &= xsave->header.xfeatures;
+
+	if (valid_mask & XFEATURE_MASK_SSE)
+		perf_regs->xmm_space = xsave->i387.xmm_space;
+}
+
+static void release_ext_regs_buffers(void)
+{
+	int cpu;
+
+	if (!x86_pmu.ext_regs_mask)
+		return;
+
+	for_each_possible_cpu(cpu) {
+		kfree(per_cpu(ext_regs_buf, cpu));
+		per_cpu(ext_regs_buf, cpu) = NULL;
+	}
+}
+
+static void reserve_ext_regs_buffers(void)
+{
+	unsigned int size;
+	int cpu;
+
+	if (!x86_pmu.ext_regs_mask)
+		return;
+
+	size = xstate_calculate_size(x86_pmu.ext_regs_mask, true);
+
+	for_each_possible_cpu(cpu) {
+		per_cpu(ext_regs_buf, cpu) = kzalloc_node(size, GFP_KERNEL,
+							  cpu_to_node(cpu));
+		if (!per_cpu(ext_regs_buf, cpu))
+			goto err;
+	}
+
+	return;
+
+err:
+	release_ext_regs_buffers();
+}
+
 int x86_reserve_hardware(void)
 {
 	int err = 0;
@@ -418,6 +473,7 @@ int x86_reserve_hardware(void)
 			} else {
 				reserve_ds_buffers();
 				reserve_lbr_buffers();
+				reserve_ext_regs_buffers();
 			}
 		}
 		if (!err)
@@ -434,6 +490,7 @@ void x86_release_hardware(void)
 		release_pmc_hardware();
 		release_ds_buffers();
 		release_lbr_buffers();
+		release_ext_regs_buffers();
 		mutex_unlock(&pmc_reserve_mutex);
 	}
 }
@@ -642,21 +699,18 @@ int x86_pmu_hw_config(struct perf_event *event)
 			return -EINVAL;
 	}
 
-	/* sample_regs_user never support XMM registers */
-	if (unlikely(event->attr.sample_regs_user & PERF_REG_EXTENDED_MASK))
-		return -EINVAL;
-	/*
-	 * Besides the general purpose registers, XMM registers may
-	 * be collected in PEBS on some platforms, e.g. Icelake
-	 */
-	if (unlikely(event->attr.sample_regs_intr & PERF_REG_EXTENDED_MASK)) {
-		if (!(event->pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS))
-			return -EINVAL;
-
-		if (!event->attr.precise_ip)
-			return -EINVAL;
+	if (event->attr.sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)) {
+		/*
+		 * Besides the general purpose registers, XMM registers may
+		 * be collected as well.
+		 */
+		if (event_has_extended_regs(event)) {
+			if (!(event->pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS))
+				return -EINVAL;
+			if (!(x86_pmu.ext_regs_mask & XFEATURE_MASK_SSE))
+				return -EINVAL;
+		}
 	}
-
 	return x86_setup_perfctr(event);
 }
 
@@ -1685,25 +1739,51 @@ static void x86_pmu_del(struct perf_event *event, int flags)
 	static_call_cond(x86_pmu_del)(event);
 }
 
+static DEFINE_PER_CPU(struct x86_perf_regs, x86_user_regs);
+
+static struct x86_perf_regs *
+x86_pmu_perf_get_regs_user(struct perf_sample_data *data,
+			   struct pt_regs *regs)
+{
+	struct x86_perf_regs *x86_regs_user = this_cpu_ptr(&x86_user_regs);
+	struct perf_regs regs_user;
+
+	perf_get_regs_user(&regs_user, regs);
+	data->regs_user.abi = regs_user.abi;
+	if (regs_user.regs) {
+		x86_regs_user->regs = *regs_user.regs;
+		data->regs_user.regs = &x86_regs_user->regs;
+	} else
+		data->regs_user.regs = NULL;
+	return x86_regs_user;
+}
+
 void x86_pmu_setup_regs_data(struct perf_event *event,
 			     struct perf_sample_data *data,
-			     struct pt_regs *regs)
+			     struct pt_regs *regs,
+			     u64 ignore_mask)
 {
-	u64 sample_type = event->attr.sample_type;
+	struct x86_perf_regs *perf_regs = container_of(regs, struct x86_perf_regs, regs);
+	struct perf_event_attr *attr = &event->attr;
+	u64 sample_type = attr->sample_type;
+	u64 mask = 0;
+
+	if (!(attr->sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)))
+		return;
 
 	if (sample_type & PERF_SAMPLE_REGS_USER) {
 		if (user_mode(regs)) {
 			data->regs_user.abi = perf_reg_abi(current);
 			data->regs_user.regs = regs;
 		} else if (!(current->flags & PF_KTHREAD)) {
-			perf_get_regs_user(&data->regs_user, regs);
+			perf_regs = x86_pmu_perf_get_regs_user(data, regs);
 		} else {
 			data->regs_user.abi = PERF_SAMPLE_REGS_ABI_NONE;
 			data->regs_user.regs = NULL;
 		}
 		data->dyn_size += sizeof(u64);
 		if (data->regs_user.regs)
-			data->dyn_size += hweight64(event->attr.sample_regs_user) * sizeof(u64);
+			data->dyn_size += hweight64(attr->sample_regs_user) * sizeof(u64);
 		data->sample_flags |= PERF_SAMPLE_REGS_USER;
 	}
 
@@ -1712,9 +1792,18 @@ void x86_pmu_setup_regs_data(struct perf_event *event,
 		data->regs_intr.abi = perf_reg_abi(current);
 		data->dyn_size += sizeof(u64);
 		if (data->regs_intr.regs)
-			data->dyn_size += hweight64(event->attr.sample_regs_intr) * sizeof(u64);
+			data->dyn_size += hweight64(attr->sample_regs_intr) * sizeof(u64);
 		data->sample_flags |= PERF_SAMPLE_REGS_INTR;
 	}
+
+	if (event_has_extended_regs(event)) {
+		perf_regs->xmm_regs = NULL;
+		mask |= XFEATURE_MASK_SSE;
+	}
+
+	mask &= ~ignore_mask;
+	if (mask)
+		x86_pmu_get_ext_regs(perf_regs, mask);
 }
 
 int x86_pmu_handle_irq(struct pt_regs *regs)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index c2fb729c270e..bd16f91dea1c 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3284,6 +3284,8 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
 		if (has_branch_stack(event))
 			intel_pmu_lbr_save_brstack(&data, cpuc, event);
 
+		x86_pmu_setup_regs_data(event, &data, regs, 0);
+
 		perf_event_overflow(event, &data, regs);
 	}
 
@@ -5272,6 +5274,29 @@ static inline bool intel_pmu_broken_perf_cap(void)
 	return false;
 }
 
+static void intel_extended_regs_init(struct pmu *pmu)
+{
+	/*
+	 * Extend the vector registers support to non-PEBS.
+	 * The feature is limited to newer Intel machines with
+	 * PEBS V4+ or archPerfmonExt (0x23) enabled for now.
+	 * In theory, the vector registers can be retrieved as
+	 * long as the CPU supports. The support for the old
+	 * long as the CPU supports them. Support for older
+	 * requirement.
+	 * Only support the extension when XSAVES is available.
+	 */
+	if (!boot_cpu_has(X86_FEATURE_XSAVES))
+		return;
+
+	if (!boot_cpu_has(X86_FEATURE_XMM) ||
+	    !cpu_has_xfeatures(XFEATURE_MASK_SSE, NULL))
+		return;
+
+	x86_pmu.ext_regs_mask |= XFEATURE_MASK_SSE;
+	x86_get_pmu(smp_processor_id())->capabilities |= PERF_PMU_CAP_EXTENDED_REGS;
+}
+
 static void update_pmu_cap(struct pmu *pmu)
 {
 	unsigned int cntr, fixed_cntr, ecx, edx;
@@ -5306,6 +5331,8 @@ static void update_pmu_cap(struct pmu *pmu)
 		/* Perf Metric (Bit 15) and PEBS via PT (Bit 16) are hybrid enumeration */
 		rdmsrq(MSR_IA32_PERF_CAPABILITIES, hybrid(pmu, intel_cap).capabilities);
 	}
+
+	intel_extended_regs_init(pmu);
 }
 
 static void intel_pmu_check_hybrid_pmus(struct x86_hybrid_pmu *pmu)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index e67d8a03ddfe..9cdece014ac0 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1415,8 +1415,7 @@ static u64 pebs_update_adaptive_cfg(struct perf_event *event)
 	if (gprs || (attr->precise_ip < 2) || tsx_weight)
 		pebs_data_cfg |= PEBS_DATACFG_GP;
 
-	if ((sample_type & PERF_SAMPLE_REGS_INTR) &&
-	    (attr->sample_regs_intr & PERF_REG_EXTENDED_MASK))
+	if (event_has_extended_regs(event))
 		pebs_data_cfg |= PEBS_DATACFG_XMMS;
 
 	if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
@@ -2127,8 +2126,12 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
 		}
 
 		if (sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)) {
+			u64 mask = 0;
+
 			adaptive_pebs_save_regs(regs, gprs);
-			x86_pmu_setup_regs_data(event, data, regs);
+			if (format_group & PEBS_DATACFG_XMMS)
+				mask |= XFEATURE_MASK_SSE;
+			x86_pmu_setup_regs_data(event, data, regs, mask);
 		}
 	}
 
@@ -2755,6 +2758,7 @@ void __init intel_pebs_init(void)
 				x86_pmu.flags |= PMU_FL_PEBS_ALL;
 				x86_pmu.pebs_capable = ~0ULL;
 				pebs_qual = "-baseline";
+				x86_pmu.ext_regs_mask |= XFEATURE_MASK_SSE;
 				x86_get_pmu(smp_processor_id())->capabilities |= PERF_PMU_CAP_EXTENDED_REGS;
 			} else {
 				/* Only basic record supported */
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 12682a059608..7bf24842b1dc 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -992,6 +992,12 @@ struct x86_pmu {
 	struct extra_reg *extra_regs;
 	unsigned int flags;
 
+	/*
+	 * Extended regs, e.g., vector registers
+	 * Utilize the same format as the XFEATURE_MASK_*
+	 */
+	u64		ext_regs_mask;
+
 	/*
 	 * Intel host/guest support (KVM)
 	 */
@@ -1280,7 +1286,8 @@ int x86_pmu_handle_irq(struct pt_regs *regs);
 
 void x86_pmu_setup_regs_data(struct perf_event *event,
 			     struct perf_sample_data *data,
-			     struct pt_regs *regs);
+			     struct pt_regs *regs,
+			     u64 ignore_mask);
 
 void x86_pmu_show_pmu_cap(struct pmu *pmu);
 
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 0c8b9251c29f..58bbdf9226d1 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -109,6 +109,8 @@ void xsaves(struct xregs_state *xsave, u64 mask);
 void xrstors(struct xregs_state *xsave, u64 mask);
 void xsaves_nmi(struct xregs_state *xsave, u64 mask);
 
+unsigned int xstate_calculate_size(u64 xfeatures, bool compacted);
+
 int xfd_enable_feature(u64 xfd_err);
 
 #ifdef CONFIG_X86_64
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 70d1d94aca7e..f36f04bc95f1 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -592,7 +592,10 @@ extern void perf_events_lapic_init(void);
 struct pt_regs;
 struct x86_perf_regs {
 	struct pt_regs	regs;
-	u64		*xmm_regs;
+	union {
+		u64	*xmm_regs;
+		u32	*xmm_space;	/* for xsaves */
+	};
 };
 
 extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 8602683fcb12..4747b29608cd 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -583,7 +583,7 @@ static bool __init check_xstate_against_struct(int nr)
 	return true;
 }
 
-static unsigned int xstate_calculate_size(u64 xfeatures, bool compacted)
+unsigned int xstate_calculate_size(u64 xfeatures, bool compacted)
 {
 	unsigned int topmost = fls64(xfeatures) -  1;
 	unsigned int offset, i;
-- 
2.38.1



* [PATCH V3 06/17] perf: Support SIMD registers
  2025-08-15 21:34 [PATCH V3 00/17] Support vector and more extended registers in perf kan.liang
                   ` (4 preceding siblings ...)
  2025-08-15 21:34 ` [PATCH V3 05/17] perf/x86: Support XMM register for non-PEBS and REGS_USER kan.liang
@ 2025-08-15 21:34 ` kan.liang
  2025-08-20  9:55   ` Mi, Dapeng
  2025-08-15 21:34 ` [PATCH V3 07/17] perf/x86: Move XMM to sample_simd_vec_regs kan.liang
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 32+ messages in thread
From: kan.liang @ 2025-08-15 21:34 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, tglx, dave.hansen, irogers,
	adrian.hunter, jolsa, alexander.shishkin, linux-kernel
  Cc: dapeng1.mi, ak, zide.chen, mark.rutland, broonie, ravi.bangoria,
	eranian, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Users may be interested in the SIMD registers in a sample while
profiling. The current sample_regs_XXX doesn't have enough space for
all SIMD registers.

Add a set of sample_simd_{pred,vec}_reg_* fields in
struct perf_event_attr to define the SIMD registers to dump on
samples.
X86 currently supports the XMM registers via sample_regs_XXX. To
utilize the new SIMD register configuration method,
sample_simd_regs_enabled should always be set. If it is, the XMM
space in sample_regs_XXX is reserved for other usage.

The SIMD registers are wider than 64 bits. A new output format is
introduced. The number and width of the SIMD registers are dumped
first, followed by the register values. The number and width are the
same as the user's configuration for now. If, for some reason (e.g.,
on ARM), they differ, an arch-specific perf_output_sample_simd_regs()
can be implemented separately later.
Add a new ABI, PERF_SAMPLE_REGS_ABI_SIMD, to indicate the new format.
The enum perf_sample_regs_abi becomes a bitmap now. There should be no
impact on existing tools, since the version and bitmap encodings are
the same for values 1 and 2.

Add three new __weak functions to retrieve the number of available
registers, validate the configuration of the SIMD registers, and
retrieve the SIMD register values. The arch-specific implementations
follow in subsequent patches.

Add a new flag, PERF_PMU_CAP_SIMD_REGS, to indicate that the PMU is
capable of dumping SIMD registers. Error out if the
sample_simd_{pred,vec}_reg_* fields are mistakenly set for a PMU that
doesn't have the capability.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 include/linux/perf_event.h      |  13 ++++
 include/linux/perf_regs.h       |   9 +++
 include/uapi/linux/perf_event.h |  47 +++++++++++++--
 kernel/events/core.c            | 101 +++++++++++++++++++++++++++++++-
 4 files changed, 162 insertions(+), 8 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 444b162f3f92..205361b7de2e 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -305,6 +305,7 @@ struct perf_event_pmu_context;
 #define PERF_PMU_CAP_EXTENDED_HW_TYPE	0x0100
 #define PERF_PMU_CAP_AUX_PAUSE		0x0200
 #define PERF_PMU_CAP_AUX_PREFER_LARGE	0x0400
+#define PERF_PMU_CAP_SIMD_REGS		0x0800
 
 /**
  * pmu::scope
@@ -1526,6 +1527,18 @@ perf_event__output_id_sample(struct perf_event *event,
 extern void
 perf_log_lost_samples(struct perf_event *event, u64 lost);
 
+static inline bool event_has_simd_regs(struct perf_event *event)
+{
+	struct perf_event_attr *attr = &event->attr;
+
+	return attr->sample_simd_regs_enabled != 0 ||
+	       attr->sample_simd_pred_reg_intr != 0 ||
+	       attr->sample_simd_pred_reg_user != 0 ||
+	       attr->sample_simd_vec_reg_qwords != 0 ||
+	       attr->sample_simd_vec_reg_intr != 0 ||
+	       attr->sample_simd_vec_reg_user != 0;
+}
+
 static inline bool event_has_extended_regs(struct perf_event *event)
 {
 	struct perf_event_attr *attr = &event->attr;
diff --git a/include/linux/perf_regs.h b/include/linux/perf_regs.h
index f632c5725f16..0172682b18fd 100644
--- a/include/linux/perf_regs.h
+++ b/include/linux/perf_regs.h
@@ -9,6 +9,15 @@ struct perf_regs {
 	struct pt_regs	*regs;
 };
 
+int perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
+			   u16 pred_qwords, u32 pred_mask);
+u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
+			u16 qwords_idx, bool pred);
+void perf_simd_reg_check(struct pt_regs *regs,
+			 u64 mask, u16 *nr_vectors, u16 *vec_qwords,
+			 u16 pred_mask, u16 *nr_pred, u16 *pred_qwords);
+
+
 #ifdef CONFIG_HAVE_PERF_REGS
 #include <asm/perf_regs.h>
 
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 78a362b80027..2e9b16acbed6 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -313,9 +313,10 @@ enum {
  * Values to determine ABI of the registers dump.
  */
 enum perf_sample_regs_abi {
-	PERF_SAMPLE_REGS_ABI_NONE		= 0,
-	PERF_SAMPLE_REGS_ABI_32			= 1,
-	PERF_SAMPLE_REGS_ABI_64			= 2,
+	PERF_SAMPLE_REGS_ABI_NONE		= 0x00,
+	PERF_SAMPLE_REGS_ABI_32			= 0x01,
+	PERF_SAMPLE_REGS_ABI_64			= 0x02,
+	PERF_SAMPLE_REGS_ABI_SIMD		= 0x04,
 };
 
 /*
@@ -382,6 +383,7 @@ enum perf_event_read_format {
 #define PERF_ATTR_SIZE_VER6			120	/* Add: aux_sample_size */
 #define PERF_ATTR_SIZE_VER7			128	/* Add: sig_data */
 #define PERF_ATTR_SIZE_VER8			136	/* Add: config3 */
+#define PERF_ATTR_SIZE_VER9			168	/* Add: sample_simd_{pred,vec}_reg_* */
 
 /*
  * 'struct perf_event_attr' contains various attributes that define
@@ -543,6 +545,25 @@ struct perf_event_attr {
 	__u64	sig_data;
 
 	__u64	config3; /* extension of config2 */
+
+
+	/*
+	 * Defines set of SIMD registers to dump on samples.
+	 * The sample_simd_regs_enabled !=0 implies the
+	 * set of SIMD registers is used to config all SIMD registers.
+	 * If !sample_simd_regs_enabled, sample_regs_XXX may be used to
+	 * config some SIMD registers on X86.
+	 */
+	union {
+		__u16 sample_simd_regs_enabled;
+		__u16 sample_simd_pred_reg_qwords;
+	};
+	__u32 sample_simd_pred_reg_intr;
+	__u32 sample_simd_pred_reg_user;
+	__u16 sample_simd_vec_reg_qwords;
+	__u64 sample_simd_vec_reg_intr;
+	__u64 sample_simd_vec_reg_user;
+	__u32 __reserved_4;
 };
 
 /*
@@ -1016,7 +1037,15 @@ enum perf_event_type {
 	 *      } && PERF_SAMPLE_BRANCH_STACK
 	 *
 	 *	{ u64			abi; # enum perf_sample_regs_abi
-	 *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
+	 *	  u64			regs[weight(mask)];
+	 *	  struct {
+	 *		u16 nr_vectors;
+	 *		u16 vector_qwords;
+	 *		u16 nr_pred;
+	 *		u16 pred_qwords;
+	 *		u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
+	 *	  } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
+	 *	} && PERF_SAMPLE_REGS_USER
 	 *
 	 *	{ u64			size;
 	 *	  char			data[size];
@@ -1043,7 +1072,15 @@ enum perf_event_type {
 	 *	{ u64			data_src; } && PERF_SAMPLE_DATA_SRC
 	 *	{ u64			transaction; } && PERF_SAMPLE_TRANSACTION
 	 *	{ u64			abi; # enum perf_sample_regs_abi
-	 *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
+	 *	  u64			regs[weight(mask)];
+	 *	  struct {
+	 *		u16 nr_vectors;
+	 *		u16 vector_qwords;
+	 *		u16 nr_pred;
+	 *		u16 pred_qwords;
+	 *		u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
+	 *	  } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
+	 *	} && PERF_SAMPLE_REGS_INTR
 	 *	{ u64			phys_addr;} && PERF_SAMPLE_PHYS_ADDR
 	 *	{ u64			cgroup;} && PERF_SAMPLE_CGROUP
 	 *	{ u64			data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 95a7b6f5af09..dd8cf3c7fb7a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7408,6 +7408,47 @@ perf_output_sample_regs(struct perf_output_handle *handle,
 	}
 }
 
+static void
+perf_output_sample_simd_regs(struct perf_output_handle *handle,
+			     struct perf_event *event,
+			     struct pt_regs *regs,
+			     u64 mask, u16 pred_mask)
+{
+	u16 pred_qwords = event->attr.sample_simd_pred_reg_qwords;
+	u16 vec_qwords = event->attr.sample_simd_vec_reg_qwords;
+	u16 nr_pred = hweight16(pred_mask);
+	u16 nr_vectors = hweight64(mask);
+	int bit;
+	u64 val;
+	u16 i;
+
+	/* Get the number of available regs */
+	perf_simd_reg_check(regs, mask, &nr_vectors, &vec_qwords,
+			    pred_mask, &nr_pred, &pred_qwords);
+
+	perf_output_put(handle, nr_vectors);
+	perf_output_put(handle, vec_qwords);
+	perf_output_put(handle, nr_pred);
+	perf_output_put(handle, pred_qwords);
+
+	if (nr_vectors) {
+		for_each_set_bit(bit, (unsigned long *)&mask, sizeof(mask) * BITS_PER_BYTE) {
+			for (i = 0; i < vec_qwords; i++) {
+				val = perf_simd_reg_value(regs, bit, i, false);
+				perf_output_put(handle, val);
+			}
+		}
+	}
+	if (nr_pred) {
+		for_each_set_bit(bit, (unsigned long *)&pred_mask, sizeof(pred_mask) * BITS_PER_BYTE) {
+			for (i = 0; i < pred_qwords; i++) {
+				val = perf_simd_reg_value(regs, bit, i, true);
+				perf_output_put(handle, val);
+			}
+		}
+	}
+}
+
 static void perf_sample_regs_user(struct perf_regs *regs_user,
 				  struct pt_regs *regs)
 {
@@ -7429,6 +7470,25 @@ static void perf_sample_regs_intr(struct perf_regs *regs_intr,
 	regs_intr->abi  = perf_reg_abi(current);
 }
 
+int __weak perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
+				  u16 pred_qwords, u32 pred_mask)
+{
+	return vec_qwords || vec_mask || pred_qwords || pred_mask ? -ENOSYS : 0;
+}
+
+u64 __weak perf_simd_reg_value(struct pt_regs *regs, int idx,
+			       u16 qwords_idx, bool pred)
+{
+	return 0;
+}
+
+void __weak perf_simd_reg_check(struct pt_regs *regs,
+				u64 mask, u16 *nr_vectors, u16 *vec_qwords,
+				u16 pred_mask, u16 *nr_pred, u16 *pred_qwords)
+{
+	*nr_vectors = 0;
+	*nr_pred = 0;
+}
 
 /*
  * Get remaining task size from user stack pointer.
@@ -7961,10 +8021,17 @@ void perf_output_sample(struct perf_output_handle *handle,
 		perf_output_put(handle, abi);
 
 		if (abi) {
-			u64 mask = event->attr.sample_regs_user;
+			struct perf_event_attr *attr = &event->attr;
+			u64 mask = attr->sample_regs_user;
 			perf_output_sample_regs(handle,
 						data->regs_user.regs,
 						mask);
+			if (abi & PERF_SAMPLE_REGS_ABI_SIMD) {
+				perf_output_sample_simd_regs(handle, event,
+							     data->regs_user.regs,
+							     attr->sample_simd_vec_reg_user,
+							     attr->sample_simd_pred_reg_user);
+			}
 		}
 	}
 
@@ -7992,11 +8059,18 @@ void perf_output_sample(struct perf_output_handle *handle,
 		perf_output_put(handle, abi);
 
 		if (abi) {
-			u64 mask = event->attr.sample_regs_intr;
+			struct perf_event_attr *attr = &event->attr;
+			u64 mask = attr->sample_regs_intr;
 
 			perf_output_sample_regs(handle,
 						data->regs_intr.regs,
 						mask);
+			if (abi & PERF_SAMPLE_REGS_ABI_SIMD) {
+				perf_output_sample_simd_regs(handle, event,
+							     data->regs_intr.regs,
+							     attr->sample_simd_vec_reg_intr,
+							     attr->sample_simd_pred_reg_intr);
+			}
 		}
 	}
 
@@ -12560,6 +12634,12 @@ static int perf_try_init_event(struct pmu *pmu, struct perf_event *event)
 	if (ret)
 		goto err_pmu;
 
+	if (!(pmu->capabilities & PERF_PMU_CAP_SIMD_REGS) &&
+	    event_has_simd_regs(event)) {
+		ret = -EOPNOTSUPP;
+		goto err_destroy;
+	}
+
 	if (!(pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS) &&
 	    event_has_extended_regs(event)) {
 		ret = -EOPNOTSUPP;
@@ -13101,6 +13181,12 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
 		ret = perf_reg_validate(attr->sample_regs_user);
 		if (ret)
 			return ret;
+		ret = perf_simd_reg_validate(attr->sample_simd_vec_reg_qwords,
+					     attr->sample_simd_vec_reg_user,
+					     attr->sample_simd_pred_reg_qwords,
+					     attr->sample_simd_pred_reg_user);
+		if (ret)
+			return ret;
 	}
 
 	if (attr->sample_type & PERF_SAMPLE_STACK_USER) {
@@ -13121,8 +13207,17 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
 	if (!attr->sample_max_stack)
 		attr->sample_max_stack = sysctl_perf_event_max_stack;
 
-	if (attr->sample_type & PERF_SAMPLE_REGS_INTR)
+	if (attr->sample_type & PERF_SAMPLE_REGS_INTR) {
 		ret = perf_reg_validate(attr->sample_regs_intr);
+		if (ret)
+			return ret;
+		ret = perf_simd_reg_validate(attr->sample_simd_vec_reg_qwords,
+					     attr->sample_simd_vec_reg_intr,
+					     attr->sample_simd_pred_reg_qwords,
+					     attr->sample_simd_pred_reg_intr);
+		if (ret)
+			return ret;
+	}
 
 #ifndef CONFIG_CGROUP_PERF
 	if (attr->sample_type & PERF_SAMPLE_CGROUP)
-- 
2.38.1



* [PATCH V3 07/17] perf/x86: Move XMM to sample_simd_vec_regs
  2025-08-15 21:34 [PATCH V3 00/17] Support vector and more extended registers in perf kan.liang
                   ` (5 preceding siblings ...)
  2025-08-15 21:34 ` [PATCH V3 06/17] perf: Support SIMD registers kan.liang
@ 2025-08-15 21:34 ` kan.liang
  2025-08-15 21:34 ` [PATCH V3 08/17] perf/x86: Add YMM into sample_simd_vec_regs kan.liang
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: kan.liang @ 2025-08-15 21:34 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, tglx, dave.hansen, irogers,
	adrian.hunter, jolsa, alexander.shishkin, linux-kernel
  Cc: dapeng1.mi, ak, zide.chen, mark.rutland, broonie, ravi.bangoria,
	eranian, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

XMM0-15 are SIMD registers. Move them from sample_regs to
sample_simd_vec_regs. Reject access to the extended space of
sample_regs if the new sample_simd_vec_regs is used.

perf_reg_value() requires the ABI to understand the layout of
sample_regs. Add the ABI information to struct x86_perf_regs.

Implement the x86-specific perf_simd_reg_validate() to validate the
SIMD register configuration from the user tool. Only XMM0-15 are
supported now. More registers will be added in the following patches.
Implement the x86-specific perf_simd_reg_value() to retrieve the XMM
values.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/core.c                | 38 ++++++++++++++++-
 arch/x86/events/intel/ds.c            |  2 +-
 arch/x86/events/perf_event.h          | 12 ++++++
 arch/x86/include/asm/perf_event.h     |  1 +
 arch/x86/include/uapi/asm/perf_regs.h |  6 +++
 arch/x86/kernel/perf_regs.c           | 61 ++++++++++++++++++++++++++-
 6 files changed, 117 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index f27c58f4c815..1789b91c95c6 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -709,6 +709,22 @@ int x86_pmu_hw_config(struct perf_event *event)
 				return -EINVAL;
 			if (!(x86_pmu.ext_regs_mask & XFEATURE_MASK_SSE))
 				return -EINVAL;
+			if (event->attr.sample_simd_regs_enabled)
+				return -EINVAL;
+		}
+
+		if (event_has_simd_regs(event)) {
+			if (!(event->pmu->capabilities & PERF_PMU_CAP_SIMD_REGS))
+				return -EINVAL;
+			/* Width is set but no vector registers are requested */
+			if (event->attr.sample_simd_vec_reg_qwords &&
+			    !event->attr.sample_simd_vec_reg_intr &&
+			    !event->attr.sample_simd_vec_reg_user)
+				return -EINVAL;
+			/* The vector registers set is not supported */
+			if (event->attr.sample_simd_vec_reg_qwords >= PERF_X86_XMM_QWORDS &&
+			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_SSE))
+				return -EINVAL;
 		}
 	}
 	return x86_setup_perfctr(event);
@@ -1784,6 +1800,16 @@ void x86_pmu_setup_regs_data(struct perf_event *event,
 		data->dyn_size += sizeof(u64);
 		if (data->regs_user.regs)
 			data->dyn_size += hweight64(attr->sample_regs_user) * sizeof(u64);
+		if (attr->sample_simd_regs_enabled && data->regs_user.abi) {
+			/* num and qwords of vector and pred registers */
+			data->dyn_size += sizeof(u64);
+			/* data[] */
+			data->dyn_size += hweight64(attr->sample_simd_vec_reg_user) *
+					  sizeof(u64) *
+					  attr->sample_simd_vec_reg_qwords;
+			data->regs_user.abi |= PERF_SAMPLE_REGS_ABI_SIMD;
+		}
+		perf_regs->abi = data->regs_user.abi;
 		data->sample_flags |= PERF_SAMPLE_REGS_USER;
 	}
 
@@ -1793,10 +1819,20 @@ void x86_pmu_setup_regs_data(struct perf_event *event,
 		data->dyn_size += sizeof(u64);
 		if (data->regs_intr.regs)
 			data->dyn_size += hweight64(attr->sample_regs_intr) * sizeof(u64);
+		if (attr->sample_simd_regs_enabled && data->regs_intr.abi) {
+			/* num and qwords of vector and pred registers */
+			data->dyn_size += sizeof(u64);
+			/* data[] */
+			data->dyn_size += hweight64(attr->sample_simd_vec_reg_intr) *
+					  sizeof(u64) *
+					  attr->sample_simd_vec_reg_qwords;
+			data->regs_intr.abi |= PERF_SAMPLE_REGS_ABI_SIMD;
+		}
+		perf_regs->abi = data->regs_intr.abi;
 		data->sample_flags |= PERF_SAMPLE_REGS_INTR;
 	}
 
-	if (event_has_extended_regs(event)) {
+	if (event_needs_xmm(event)) {
 		perf_regs->xmm_regs = NULL;
 		mask |= XFEATURE_MASK_SSE;
 	}
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 9cdece014ac0..4887f6ea7dde 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1415,7 +1415,7 @@ static u64 pebs_update_adaptive_cfg(struct perf_event *event)
 	if (gprs || (attr->precise_ip < 2) || tsx_weight)
 		pebs_data_cfg |= PEBS_DATACFG_GP;
 
-	if (event_has_extended_regs(event))
+	if (event_needs_xmm(event))
 		pebs_data_cfg |= PEBS_DATACFG_XMMS;
 
 	if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 7bf24842b1dc..6f22ed718a75 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -133,6 +133,18 @@ static inline bool is_acr_event_group(struct perf_event *event)
 	return check_leader_group(event->group_leader, PERF_X86_EVENT_ACR);
 }
 
+static inline bool event_needs_xmm(struct perf_event *event)
+{
+	if (event->attr.sample_simd_regs_enabled &&
+	    event->attr.sample_simd_vec_reg_qwords >= PERF_X86_XMM_QWORDS)
+		return true;
+
+	if (!event->attr.sample_simd_regs_enabled &&
+	    event_has_extended_regs(event))
+		return true;
+	return false;
+}
+
 struct amd_nb {
 	int nb_id;  /* NorthBridge id */
 	int refcnt; /* reference count */
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index f36f04bc95f1..538219c59979 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -592,6 +592,7 @@ extern void perf_events_lapic_init(void);
 struct pt_regs;
 struct x86_perf_regs {
 	struct pt_regs	regs;
+	u64		abi;
 	union {
 		u64	*xmm_regs;
 		u32	*xmm_space;	/* for xsaves */
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index 7c9d2bb3833b..bd8af802f757 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -55,4 +55,10 @@ enum perf_event_x86_regs {
 
 #define PERF_REG_EXTENDED_MASK	(~((1ULL << PERF_REG_X86_XMM0) - 1))
 
+#define PERF_X86_SIMD_VEC_REGS_MAX	16
+#define PERF_X86_SIMD_VEC_MASK		GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
+
+#define PERF_X86_XMM_QWORDS		2
+#define PERF_X86_SIMD_QWORDS_MAX	PERF_X86_XMM_QWORDS
+
 #endif /* _ASM_X86_PERF_REGS_H */
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 624703af80a1..397357c5896b 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -57,12 +57,27 @@ static unsigned int pt_regs_offset[PERF_REG_X86_MAX] = {
 #endif
 };
 
+void perf_simd_reg_check(struct pt_regs *regs,
+			 u64 mask, u16 *nr_vectors, u16 *vec_qwords,
+			 u16 pred_mask, u16 *nr_pred, u16 *pred_qwords)
+{
+	struct x86_perf_regs *perf_regs = container_of(regs, struct x86_perf_regs, regs);
+
+	if (*vec_qwords >= PERF_X86_XMM_QWORDS && !perf_regs->xmm_regs)
+		*nr_vectors = 0;
+
+	*nr_pred = 0;
+}
+
 u64 perf_reg_value(struct pt_regs *regs, int idx)
 {
 	struct x86_perf_regs *perf_regs;
 
 	if (idx >= PERF_REG_X86_XMM0 && idx < PERF_REG_X86_XMM_MAX) {
 		perf_regs = container_of(regs, struct x86_perf_regs, regs);
+		/* SIMD registers are moved to dedicated sample_simd_vec_reg */
+		if (perf_regs->abi & PERF_SAMPLE_REGS_ABI_SIMD)
+			return 0;
 		if (!perf_regs->xmm_regs)
 			return 0;
 		return perf_regs->xmm_regs[idx - PERF_REG_X86_XMM0];
@@ -74,6 +89,49 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 	return regs_get_register(regs, pt_regs_offset[idx]);
 }
 
+u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
+			u16 qwords_idx, bool pred)
+{
+	struct x86_perf_regs *perf_regs = container_of(regs, struct x86_perf_regs, regs);
+
+	if (pred)
+		return 0;
+
+	if (WARN_ON_ONCE(idx >= PERF_X86_SIMD_VEC_REGS_MAX ||
+			 qwords_idx >= PERF_X86_SIMD_QWORDS_MAX))
+		return 0;
+
+	if (qwords_idx < PERF_X86_XMM_QWORDS) {
+		if (!perf_regs->xmm_regs)
+			return 0;
+		return perf_regs->xmm_regs[idx * PERF_X86_XMM_QWORDS + qwords_idx];
+	}
+
+	return 0;
+}
+
+int perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
+			   u16 pred_qwords, u32 pred_mask)
+{
+	/* pred_qwords implies sample_simd_{pred,vec}_reg_* are supported */
+	if (!pred_qwords)
+		return 0;
+
+	if (!vec_qwords) {
+		if (vec_mask)
+			return -EINVAL;
+	} else {
+		if (vec_qwords != PERF_X86_XMM_QWORDS)
+			return -EINVAL;
+		if (vec_mask & ~PERF_X86_SIMD_VEC_MASK)
+			return -EINVAL;
+	}
+	if (pred_mask)
+		return -EINVAL;
+
+	return 0;
+}
+
 #define PERF_REG_X86_RESERVED	(((1ULL << PERF_REG_X86_XMM0) - 1) & \
 				 ~((1ULL << PERF_REG_X86_MAX) - 1))
 
@@ -114,7 +172,8 @@ void perf_get_regs_user(struct perf_regs *regs_user,
 
 int perf_reg_validate(u64 mask)
 {
-	if (!mask || (mask & (REG_NOSUPPORT | PERF_REG_X86_RESERVED)))
+	/* The mask could be 0 if only the SIMD registers are interested */
+	if (mask & (REG_NOSUPPORT | PERF_REG_X86_RESERVED))
 		return -EINVAL;
 
 	return 0;
-- 
2.38.1


* [PATCH V3 08/17] perf/x86: Add YMM into sample_simd_vec_regs
  2025-08-15 21:34 [PATCH V3 00/17] Support vector and more extended registers in perf kan.liang
                   ` (6 preceding siblings ...)
  2025-08-15 21:34 ` [PATCH V3 07/17] perf/x86: Move XMM to sample_simd_vec_regs kan.liang
@ 2025-08-15 21:34 ` kan.liang
  2025-08-20  9:59   ` Mi, Dapeng
  2025-08-15 21:34 ` [PATCH V3 09/17] perf/x86: Add ZMM " kan.liang
                   ` (8 subsequent siblings)
  16 siblings, 1 reply; 32+ messages in thread
From: kan.liang @ 2025-08-15 21:34 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, tglx, dave.hansen, irogers,
	adrian.hunter, jolsa, alexander.shishkin, linux-kernel
  Cc: dapeng1.mi, ak, zide.chen, mark.rutland, broonie, ravi.bangoria,
	eranian, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Each of YMM0-15 is composed of an XMM part and a YMMH part, so two
XSAVE components are required to get the complete value. Internally,
the XMM and YMMH parts are stored in different structures, which follow
the XSAVE format, but the output dumps each YMM register as a whole.

A sample_simd_vec_reg_qwords value of 4 implies YMM.
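
For illustration, a consumer could reconstruct the YMM values from the
sample payload as sketched below. Only the nr_vectors/vector_qwords/data
layout comes from this series; dump_ymm() and its parameters are made up:

#include <stdio.h>
#include <linux/types.h>

/* data points at the u64 array that follows the four u16 counters */
static void dump_ymm(const __u64 *data, unsigned int nr_vectors,
		     unsigned int vector_qwords)
{
	unsigned int i;

	if (vector_qwords != 4)		/* qwords 4 == YMM */
		return;

	for (i = 0; i < nr_vectors; i++) {
		const __u64 *v = &data[i * vector_qwords];

		/* v[0..1] come from XMM, v[2..3] from YMMH */
		printf("YMM%u: 0x%016llx%016llx%016llx%016llx\n", i,
		       (unsigned long long)v[3], (unsigned long long)v[2],
		       (unsigned long long)v[1], (unsigned long long)v[0]);
	}
}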

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/core.c                | 13 +++++++++++++
 arch/x86/include/asm/perf_event.h     |  4 ++++
 arch/x86/include/uapi/asm/perf_regs.h |  4 +++-
 arch/x86/kernel/perf_regs.c           | 10 +++++++++-
 4 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 1789b91c95c6..aebd4e56dff1 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -423,6 +423,9 @@ static void x86_pmu_get_ext_regs(struct x86_perf_regs *perf_regs, u64 mask)
 
 	if (valid_mask & XFEATURE_MASK_SSE)
 		perf_regs->xmm_space = xsave->i387.xmm_space;
+
+	if (valid_mask & XFEATURE_MASK_YMM)
+		perf_regs->ymmh = get_xsave_addr(xsave, XFEATURE_YMM);
 }
 
 static void release_ext_regs_buffers(void)
@@ -725,6 +728,9 @@ int x86_pmu_hw_config(struct perf_event *event)
 			if (event->attr.sample_simd_vec_reg_qwords >= PERF_X86_XMM_QWORDS &&
 			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_SSE))
 				return -EINVAL;
+			if (event->attr.sample_simd_vec_reg_qwords >= PERF_X86_YMM_QWORDS &&
+			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_YMM))
+				return -EINVAL;
 		}
 	}
 	return x86_setup_perfctr(event);
@@ -1837,6 +1843,13 @@ void x86_pmu_setup_regs_data(struct perf_event *event,
 		mask |= XFEATURE_MASK_SSE;
 	}
 
+	if (attr->sample_simd_regs_enabled) {
+		if (attr->sample_simd_vec_reg_qwords >= PERF_X86_YMM_QWORDS) {
+			perf_regs->ymmh_regs = NULL;
+			mask |= XFEATURE_MASK_YMM;
+		}
+	}
+
 	mask &= ~ignore_mask;
 	if (mask)
 		x86_pmu_get_ext_regs(perf_regs, mask);
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 538219c59979..81e3143fd91a 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -597,6 +597,10 @@ struct x86_perf_regs {
 		u64	*xmm_regs;
 		u32	*xmm_space;	/* for xsaves */
 	};
+	union {
+		u64	*ymmh_regs;
+		struct ymmh_struct *ymmh;
+	};
 };
 
 extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index bd8af802f757..feb3e8f80761 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -59,6 +59,8 @@ enum perf_event_x86_regs {
 #define PERF_X86_SIMD_VEC_MASK		GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
 
 #define PERF_X86_XMM_QWORDS		2
-#define PERF_X86_SIMD_QWORDS_MAX	PERF_X86_XMM_QWORDS
+#define PERF_X86_YMM_QWORDS		4
+#define PERF_X86_YMMH_QWORDS		(PERF_X86_YMM_QWORDS / 2)
+#define PERF_X86_SIMD_QWORDS_MAX	PERF_X86_YMM_QWORDS
 
 #endif /* _ASM_X86_PERF_REGS_H */
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 397357c5896b..d94bc687e4bf 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -66,6 +66,9 @@ void perf_simd_reg_check(struct pt_regs *regs,
 	if (*vec_qwords >= PERF_X86_XMM_QWORDS && !perf_regs->xmm_regs)
 		*nr_vectors = 0;
 
+	if (*vec_qwords >= PERF_X86_YMM_QWORDS && !perf_regs->xmm_regs)
+		*vec_qwords = PERF_X86_XMM_QWORDS;
+
 	*nr_pred = 0;
 }
 
@@ -105,6 +108,10 @@ u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
 		if (!perf_regs->xmm_regs)
 			return 0;
 		return perf_regs->xmm_regs[idx * PERF_X86_XMM_QWORDS + qwords_idx];
+	} else if (qwords_idx < PERF_X86_YMM_QWORDS) {
+		if (!perf_regs->ymmh_regs)
+			return 0;
+		return perf_regs->ymmh_regs[idx * PERF_X86_YMMH_QWORDS + qwords_idx - PERF_X86_XMM_QWORDS];
 	}
 
 	return 0;
@@ -121,7 +128,8 @@ int perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
 		if (vec_mask)
 			return -EINVAL;
 	} else {
-		if (vec_qwords != PERF_X86_XMM_QWORDS)
+		if (vec_qwords != PERF_X86_XMM_QWORDS &&
+		    vec_qwords != PERF_X86_YMM_QWORDS)
 			return -EINVAL;
 		if (vec_mask & ~PERF_X86_SIMD_VEC_MASK)
 			return -EINVAL;
-- 
2.38.1


* [PATCH V3 09/17] perf/x86: Add ZMM into sample_simd_vec_regs
  2025-08-15 21:34 [PATCH V3 00/17] Support vector and more extended registers in perf kan.liang
                   ` (7 preceding siblings ...)
  2025-08-15 21:34 ` [PATCH V3 08/17] perf/x86: Add YMM into sample_simd_vec_regs kan.liang
@ 2025-08-15 21:34 ` kan.liang
  2025-08-15 21:34 ` [PATCH V3 10/17] perf/x86: Add OPMASK into sample_simd_pred_reg kan.liang
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: kan.liang @ 2025-08-15 21:34 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, tglx, dave.hansen, irogers,
	adrian.hunter, jolsa, alexander.shishkin, linux-kernel
  Cc: dapeng1.mi, ak, zide.chen, mark.rutland, broonie, ravi.bangoria,
	eranian, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Each of ZMM0-15 is composed of XMM, YMMH, and ZMMH parts, so three
XSAVE components are required to get the complete value.
ZMM16-31/YMM16-31/XMM16-31 are also supported; they only require the
XSAVE Hi16_ZMM component.

Internally, the XMM, YMMH, ZMMH and Hi16_ZMM parts are stored in
different structures, which follow the XSAVE format, but the output
dumps each ZMM (or Hi16 XMM/YMM/ZMM) register as a whole.

A sample_simd_vec_reg_qwords value of 8 implies ZMM.
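
To make the mapping explicit, qword q of vector register idx is sourced
as follows. This hypothetical helper only mirrors the lookup done by
perf_simd_reg_value() below; it is not part of the series:

/* Which XSAVE component backs qword q of vector register idx? */
static const char *zmm_qword_source(int idx, int q)
{
	if (idx >= 16)
		return "Hi16_ZMM";		/* ZMM16-31: all 8 qwords */
	if (q < 2)
		return "SSE (XMM)";		/* qwords 0-1 */
	if (q < 4)
		return "YMM (YMMH)";		/* qwords 2-3 */
	return "ZMM_Hi256 (ZMMH)";		/* qwords 4-7 */
}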

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/core.c                | 20 ++++++++++++++++++++
 arch/x86/include/asm/perf_event.h     |  8 ++++++++
 arch/x86/include/uapi/asm/perf_regs.h |  8 ++++++--
 arch/x86/kernel/perf_regs.c           | 19 ++++++++++++++++++-
 4 files changed, 52 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index aebd4e56dff1..85b739fe1693 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -426,6 +426,10 @@ static void x86_pmu_get_ext_regs(struct x86_perf_regs *perf_regs, u64 mask)
 
 	if (valid_mask & XFEATURE_MASK_YMM)
 		perf_regs->ymmh = get_xsave_addr(xsave, XFEATURE_YMM);
+	if (valid_mask & XFEATURE_MASK_ZMM_Hi256)
+		perf_regs->zmmh = get_xsave_addr(xsave, XFEATURE_ZMM_Hi256);
+	if (valid_mask & XFEATURE_MASK_Hi16_ZMM)
+		perf_regs->h16zmm = get_xsave_addr(xsave, XFEATURE_Hi16_ZMM);
 }
 
 static void release_ext_regs_buffers(void)
@@ -731,6 +735,13 @@ int x86_pmu_hw_config(struct perf_event *event)
 			if (event->attr.sample_simd_vec_reg_qwords >= PERF_X86_YMM_QWORDS &&
 			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_YMM))
 				return -EINVAL;
+			if (event->attr.sample_simd_vec_reg_qwords >= PERF_X86_ZMM_QWORDS &&
+			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_ZMM_Hi256))
+				return -EINVAL;
+			if ((fls64(event->attr.sample_simd_vec_reg_intr) > PERF_X86_H16ZMM_BASE ||
+			     fls64(event->attr.sample_simd_vec_reg_user) > PERF_X86_H16ZMM_BASE) &&
+			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_Hi16_ZMM))
+				return -EINVAL;
 		}
 	}
 	return x86_setup_perfctr(event);
@@ -1848,6 +1859,15 @@ void x86_pmu_setup_regs_data(struct perf_event *event,
 			perf_regs->ymmh_regs = NULL;
 			mask |= XFEATURE_MASK_YMM;
 		}
+		if (attr->sample_simd_vec_reg_qwords >= PERF_X86_ZMM_QWORDS) {
+			perf_regs->zmmh_regs = NULL;
+			mask |= XFEATURE_MASK_ZMM_Hi256;
+		}
+		if (fls64(attr->sample_simd_vec_reg_intr) > PERF_X86_H16ZMM_BASE ||
+		    fls64(attr->sample_simd_vec_reg_user) > PERF_X86_H16ZMM_BASE) {
+			perf_regs->h16zmm_regs = NULL;
+			mask |= XFEATURE_MASK_Hi16_ZMM;
+		}
 	}
 
 	mask &= ~ignore_mask;
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 81e3143fd91a..2d78bd9649bd 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -601,6 +601,14 @@ struct x86_perf_regs {
 		u64	*ymmh_regs;
 		struct ymmh_struct *ymmh;
 	};
+	union {
+		u64	*zmmh_regs;
+		struct avx_512_zmm_uppers_state *zmmh;
+	};
+	union {
+		u64	*h16zmm_regs;
+		struct avx_512_hi16_state *h16zmm;
+	};
 };
 
 extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index feb3e8f80761..f74e3ba65be2 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -55,12 +55,16 @@ enum perf_event_x86_regs {
 
 #define PERF_REG_EXTENDED_MASK	(~((1ULL << PERF_REG_X86_XMM0) - 1))
 
-#define PERF_X86_SIMD_VEC_REGS_MAX	16
+#define PERF_X86_SIMD_VEC_REGS_MAX	32
 #define PERF_X86_SIMD_VEC_MASK		GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
 
+#define PERF_X86_H16ZMM_BASE		16
+
 #define PERF_X86_XMM_QWORDS		2
 #define PERF_X86_YMM_QWORDS		4
 #define PERF_X86_YMMH_QWORDS		(PERF_X86_YMM_QWORDS / 2)
-#define PERF_X86_SIMD_QWORDS_MAX	PERF_X86_YMM_QWORDS
+#define PERF_X86_ZMM_QWORDS		8
+#define PERF_X86_ZMMH_QWORDS		(PERF_X86_ZMM_QWORDS / 2)
+#define PERF_X86_SIMD_QWORDS_MAX	PERF_X86_ZMM_QWORDS
 
 #endif /* _ASM_X86_PERF_REGS_H */
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index d94bc687e4bf..f04c44d3d356 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -69,6 +69,12 @@ void perf_simd_reg_check(struct pt_regs *regs,
 	if (*vec_qwords >= PERF_X86_YMM_QWORDS && !perf_regs->xmm_regs)
 		*vec_qwords = PERF_X86_XMM_QWORDS;
 
+	if (*vec_qwords >= PERF_X86_ZMM_QWORDS && !perf_regs->zmmh_regs)
+		*vec_qwords = PERF_X86_YMM_QWORDS;
+
+	if (*nr_vectors > PERF_X86_H16ZMM_BASE && !perf_regs->h16zmm_regs)
+		*nr_vectors = PERF_X86_H16ZMM_BASE;
+
 	*nr_pred = 0;
 }
 
@@ -104,6 +110,12 @@ u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
 			 qwords_idx >= PERF_X86_SIMD_QWORDS_MAX))
 		return 0;
 
+	if (idx >= PERF_X86_H16ZMM_BASE) {
+		if (!perf_regs->h16zmm_regs)
+			return 0;
+		return perf_regs->h16zmm_regs[idx * PERF_X86_ZMM_QWORDS + qwords_idx];
+	}
+
 	if (qwords_idx < PERF_X86_XMM_QWORDS) {
 		if (!perf_regs->xmm_regs)
 			return 0;
@@ -112,6 +124,10 @@ u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
 		if (!perf_regs->ymmh_regs)
 			return 0;
 		return perf_regs->ymmh_regs[idx * PERF_X86_YMMH_QWORDS + qwords_idx - PERF_X86_XMM_QWORDS];
+	} else if (qwords_idx < PERF_X86_ZMM_QWORDS) {
+		if (!perf_regs->zmmh_regs)
+			return 0;
+		return perf_regs->zmmh_regs[idx * PERF_X86_ZMMH_QWORDS + qwords_idx - PERF_X86_YMM_QWORDS];
 	}
 
 	return 0;
@@ -129,7 +145,8 @@ int perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
 			return -EINVAL;
 	} else {
 		if (vec_qwords != PERF_X86_XMM_QWORDS &&
-		    vec_qwords != PERF_X86_YMM_QWORDS)
+		    vec_qwords != PERF_X86_YMM_QWORDS &&
+		    vec_qwords != PERF_X86_ZMM_QWORDS)
 			return -EINVAL;
 		if (vec_mask & ~PERF_X86_SIMD_VEC_MASK)
 			return -EINVAL;
-- 
2.38.1


* [PATCH V3 10/17] perf/x86: Add OPMASK into sample_simd_pred_reg
  2025-08-15 21:34 [PATCH V3 00/17] Support vector and more extended registers in perf kan.liang
                   ` (8 preceding siblings ...)
  2025-08-15 21:34 ` [PATCH V3 09/17] perf/x86: Add ZMM " kan.liang
@ 2025-08-15 21:34 ` kan.liang
  2025-08-15 21:34 ` [PATCH V3 11/17] perf/x86: Add eGPRs into sample_regs kan.liang
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: kan.liang @ 2025-08-15 21:34 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, tglx, dave.hansen, irogers,
	adrian.hunter, jolsa, alexander.shishkin, linux-kernel
  Cc: dapeng1.mi, ak, zide.chen, mark.rutland, broonie, ravi.bangoria,
	eranian, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The OPMASK registers are the SIMD predicate registers. Add them into
sample_simd_pred_reg. Each OPMASK register is one qword wide, and there
are eight of them.
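
A minimal sketch of an attr that requests all eight OPMASK registers on
overflow (the event choice and period are arbitrary; the qwords and mask
values come from this patch):

struct perf_event_attr attr = {
	.type				= PERF_TYPE_HARDWARE,
	.config				= PERF_COUNT_HW_CPU_CYCLES,
	.sample_period			= 100000,
	.sample_type			= PERF_SAMPLE_REGS_INTR,
	/* a nonzero pred qwords also selects the new SIMD scheme */
	.sample_simd_pred_reg_qwords	= 1,	/* PERF_X86_OPMASK_QWORDS */
	.sample_simd_pred_reg_intr	= 0xff,	/* K0-K7 */
};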

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/core.c                | 13 +++++++++++++
 arch/x86/include/asm/perf_event.h     |  4 ++++
 arch/x86/include/uapi/asm/perf_regs.h |  3 +++
 arch/x86/kernel/perf_regs.c           | 18 ++++++++++++++----
 4 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 85b739fe1693..1fa550efcdfa 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -430,6 +430,8 @@ static void x86_pmu_get_ext_regs(struct x86_perf_regs *perf_regs, u64 mask)
 		perf_regs->zmmh = get_xsave_addr(xsave, XFEATURE_ZMM_Hi256);
 	if (valid_mask & XFEATURE_MASK_Hi16_ZMM)
 		perf_regs->h16zmm = get_xsave_addr(xsave, XFEATURE_Hi16_ZMM);
+	if (valid_mask & XFEATURE_MASK_OPMASK)
+		perf_regs->opmask = get_xsave_addr(xsave, XFEATURE_OPMASK);
 }
 
 static void release_ext_regs_buffers(void)
@@ -1824,6 +1826,9 @@ void x86_pmu_setup_regs_data(struct perf_event *event,
 			data->dyn_size += hweight64(attr->sample_simd_vec_reg_user) *
 					  sizeof(u64) *
 					  attr->sample_simd_vec_reg_qwords;
+			data->dyn_size += hweight32(attr->sample_simd_pred_reg_user) *
+					  sizeof(u64) *
+					  attr->sample_simd_pred_reg_qwords;
 			data->regs_user.abi |= PERF_SAMPLE_REGS_ABI_SIMD;
 		}
 		perf_regs->abi = data->regs_user.abi;
@@ -1843,6 +1848,9 @@ void x86_pmu_setup_regs_data(struct perf_event *event,
 			data->dyn_size += hweight64(attr->sample_simd_vec_reg_intr) *
 					  sizeof(u64) *
 					  attr->sample_simd_vec_reg_qwords;
+			data->dyn_size += hweight32(attr->sample_simd_pred_reg_intr) *
+					  sizeof(u64) *
+					  attr->sample_simd_pred_reg_qwords;
 			data->regs_intr.abi |= PERF_SAMPLE_REGS_ABI_SIMD;
 		}
 		perf_regs->abi = data->regs_intr.abi;
@@ -1868,6 +1876,11 @@ void x86_pmu_setup_regs_data(struct perf_event *event,
 			perf_regs->h16zmm_regs = NULL;
 			mask |= XFEATURE_MASK_Hi16_ZMM;
 		}
+		if (attr->sample_simd_pred_reg_intr ||
+		    attr->sample_simd_pred_reg_user) {
+			perf_regs->opmask_regs = NULL;
+			mask |= XFEATURE_MASK_OPMASK;
+		}
 	}
 
 	mask &= ~ignore_mask;
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 2d78bd9649bd..dda677022882 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -609,6 +609,10 @@ struct x86_perf_regs {
 		u64	*h16zmm_regs;
 		struct avx_512_hi16_state *h16zmm;
 	};
+	union {
+		u64	*opmask_regs;
+		struct avx_512_opmask_state *opmask;
+	};
 };
 
 extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index f74e3ba65be2..dd7bd1dd8d39 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -55,11 +55,14 @@ enum perf_event_x86_regs {
 
 #define PERF_REG_EXTENDED_MASK	(~((1ULL << PERF_REG_X86_XMM0) - 1))
 
+#define PERF_X86_SIMD_PRED_REGS_MAX	8
+#define PERF_X86_SIMD_PRED_MASK		GENMASK(PERF_X86_SIMD_PRED_REGS_MAX - 1, 0)
 #define PERF_X86_SIMD_VEC_REGS_MAX	32
 #define PERF_X86_SIMD_VEC_MASK		GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
 
 #define PERF_X86_H16ZMM_BASE		16
 
+#define PERF_X86_OPMASK_QWORDS		1
 #define PERF_X86_XMM_QWORDS		2
 #define PERF_X86_YMM_QWORDS		4
 #define PERF_X86_YMMH_QWORDS		(PERF_X86_YMM_QWORDS / 2)
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index f04c44d3d356..5e815f806605 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -75,7 +75,8 @@ void perf_simd_reg_check(struct pt_regs *regs,
 	if (*nr_vectors > PERF_X86_H16ZMM_BASE && !perf_regs->h16zmm_regs)
 		*nr_vectors = PERF_X86_H16ZMM_BASE;
 
-	*nr_pred = 0;
+	if (*nr_pred && !perf_regs->opmask_regs)
+		*nr_pred = 0;
 }
 
 u64 perf_reg_value(struct pt_regs *regs, int idx)
@@ -103,8 +104,14 @@ u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
 {
 	struct x86_perf_regs *perf_regs = container_of(regs, struct x86_perf_regs, regs);
 
-	if (pred)
-		return 0;
+	if (pred) {
+		if (WARN_ON_ONCE(idx >= PERF_X86_SIMD_PRED_REGS_MAX ||
+				 qwords_idx >= PERF_X86_OPMASK_QWORDS))
+			return 0;
+		if (!perf_regs->opmask_regs)
+			return 0;
+		return perf_regs->opmask_regs[idx];
+	}
 
 	if (WARN_ON_ONCE(idx >= PERF_X86_SIMD_VEC_REGS_MAX ||
 			 qwords_idx >= PERF_X86_SIMD_QWORDS_MAX))
@@ -151,7 +158,10 @@ int perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
 		if (vec_mask & ~PERF_X86_SIMD_VEC_MASK)
 			return -EINVAL;
 	}
-	if (pred_mask)
+
+	if (pred_qwords != PERF_X86_OPMASK_QWORDS)
+		return -EINVAL;
+	if (pred_mask & ~PERF_X86_SIMD_PRED_MASK)
 		return -EINVAL;
 
 	return 0;
-- 
2.38.1


* [PATCH V3 11/17] perf/x86: Add eGPRs into sample_regs
  2025-08-15 21:34 [PATCH V3 00/17] Support vector and more extended registers in perf kan.liang
                   ` (9 preceding siblings ...)
  2025-08-15 21:34 ` [PATCH V3 10/17] perf/x86: Add OPMASK into sample_simd_pred_reg kan.liang
@ 2025-08-15 21:34 ` kan.liang
  2025-08-20 10:01   ` Mi, Dapeng
  2025-08-15 21:34 ` [PATCH V3 12/17] perf/x86: Add SSP " kan.liang
                   ` (5 subsequent siblings)
  16 siblings, 1 reply; 32+ messages in thread
From: kan.liang @ 2025-08-15 21:34 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, tglx, dave.hansen, irogers,
	adrian.hunter, jolsa, alexander.shishkin, linux-kernel
  Cc: dapeng1.mi, ak, zide.chen, mark.rutland, broonie, ravi.bangoria,
	eranian, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The eGPRs are only supported when the new SIMD register configuration
method is used, which moves the XMM registers to sample_simd_vec_regs.
The bit space freed that way can be reclaimed for the eGPRs.

The eGPRs are retrieved by XSAVE and are only supported on X86_64.
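
For example, sampling R16-R31 could look like the illustrative snippet
below. Note that the SIMD scheme has to be enabled
(sample_simd_regs_enabled != 0) even if no vector register is requested:

attr.sample_type		 = PERF_SAMPLE_REGS_INTR;
attr.sample_simd_pred_reg_qwords = 1;	/* sample_simd_regs_enabled != 0 */
attr.sample_regs_intr		 = PERF_X86_EGPRS_MASK;	/* R16-R31 */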

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/core.c                | 39 +++++++++++++++++++++------
 arch/x86/include/asm/perf_event.h     |  4 +++
 arch/x86/include/uapi/asm/perf_regs.h | 26 ++++++++++++++++--
 arch/x86/kernel/perf_regs.c           | 31 ++++++++++-----------
 4 files changed, 75 insertions(+), 25 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 1fa550efcdfa..f816290defc1 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -432,6 +432,8 @@ static void x86_pmu_get_ext_regs(struct x86_perf_regs *perf_regs, u64 mask)
 		perf_regs->h16zmm = get_xsave_addr(xsave, XFEATURE_Hi16_ZMM);
 	if (valid_mask & XFEATURE_MASK_OPMASK)
 		perf_regs->opmask = get_xsave_addr(xsave, XFEATURE_OPMASK);
+	if (valid_mask & XFEATURE_MASK_APX)
+		perf_regs->egpr = get_xsave_addr(xsave, XFEATURE_APX);
 }
 
 static void release_ext_regs_buffers(void)
@@ -709,17 +711,33 @@ int x86_pmu_hw_config(struct perf_event *event)
 	}
 
 	if (event->attr.sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)) {
-		/*
-		 * Besides the general purpose registers, XMM registers may
-		 * be collected as well.
-		 */
-		if (event_has_extended_regs(event)) {
-			if (!(event->pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS))
+		if (event->attr.sample_simd_regs_enabled) {
+			u64 reserved = ~GENMASK_ULL(PERF_REG_X86_64_MAX - 1, 0);
+
+			if (!(event->pmu->capabilities & PERF_PMU_CAP_SIMD_REGS))
 				return -EINVAL;
-			if (!(x86_pmu.ext_regs_mask & XFEATURE_MASK_SSE))
+			/*
+			 * The XMM space in the perf_event_x86_regs is reclaimed
+			 * for eGPRs and other general registers.
+			 */
+			if (event->attr.sample_regs_user & reserved ||
+			    event->attr.sample_regs_intr & reserved)
 				return -EINVAL;
-			if (event->attr.sample_simd_regs_enabled)
+			if ((event->attr.sample_regs_user & PERF_X86_EGPRS_MASK ||
+			     event->attr.sample_regs_intr & PERF_X86_EGPRS_MASK) &&
+			     !(x86_pmu.ext_regs_mask & XFEATURE_MASK_APX))
 				return -EINVAL;
+		} else {
+			/*
+			 * Besides the general purpose registers, XMM registers may
+			 * be collected as well.
+			 */
+			if (event_has_extended_regs(event)) {
+				if (!(event->pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS))
+					return -EINVAL;
+				if (!(x86_pmu.ext_regs_mask & XFEATURE_MASK_SSE))
+					return -EINVAL;
+			}
 		}
 
 		if (event_has_simd_regs(event)) {
@@ -1881,6 +1899,11 @@ void x86_pmu_setup_regs_data(struct perf_event *event,
 			perf_regs->opmask_regs = NULL;
 			mask |= XFEATURE_MASK_OPMASK;
 		}
+		if (attr->sample_regs_user & PERF_X86_EGPRS_MASK ||
+		    attr->sample_regs_intr & PERF_X86_EGPRS_MASK) {
+			perf_regs->egpr_regs = NULL;
+			mask |= XFEATURE_MASK_APX;
+		}
 	}
 
 	mask &= ~ignore_mask;
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index dda677022882..4400cb66bc8e 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -613,6 +613,10 @@ struct x86_perf_regs {
 		u64	*opmask_regs;
 		struct avx_512_opmask_state *opmask;
 	};
+	union {
+		u64	*egpr_regs;
+		struct apx_state *egpr;
+	};
 };
 
 extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index dd7bd1dd8d39..cd0f6804debf 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -27,11 +27,31 @@ enum perf_event_x86_regs {
 	PERF_REG_X86_R13,
 	PERF_REG_X86_R14,
 	PERF_REG_X86_R15,
+	/* Extended GPRs (EGPRs) */
+	PERF_REG_X86_R16,
+	PERF_REG_X86_R17,
+	PERF_REG_X86_R18,
+	PERF_REG_X86_R19,
+	PERF_REG_X86_R20,
+	PERF_REG_X86_R21,
+	PERF_REG_X86_R22,
+	PERF_REG_X86_R23,
+	PERF_REG_X86_R24,
+	PERF_REG_X86_R25,
+	PERF_REG_X86_R26,
+	PERF_REG_X86_R27,
+	PERF_REG_X86_R28,
+	PERF_REG_X86_R29,
+	PERF_REG_X86_R30,
+	PERF_REG_X86_R31,
 	/* These are the limits for the GPRs. */
 	PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
-	PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1,
+	PERF_REG_X86_64_MAX = PERF_REG_X86_R31 + 1,
 
-	/* These all need two bits set because they are 128bit */
+	/*
+	 * These all need two bits set because they are 128bit.
+	 * These are only available when !PERF_SAMPLE_REGS_ABI_SIMD
+	 */
 	PERF_REG_X86_XMM0  = 32,
 	PERF_REG_X86_XMM1  = 34,
 	PERF_REG_X86_XMM2  = 36,
@@ -55,6 +75,8 @@ enum perf_event_x86_regs {
 
 #define PERF_REG_EXTENDED_MASK	(~((1ULL << PERF_REG_X86_XMM0) - 1))
 
+#define PERF_X86_EGPRS_MASK		GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16)
+
 #define PERF_X86_SIMD_PRED_REGS_MAX	8
 #define PERF_X86_SIMD_PRED_MASK		GENMASK(PERF_X86_SIMD_PRED_REGS_MAX - 1, 0)
 #define PERF_X86_SIMD_VEC_REGS_MAX	32
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 5e815f806605..b6e50194ff3e 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -83,14 +83,22 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 {
 	struct x86_perf_regs *perf_regs;
 
-	if (idx >= PERF_REG_X86_XMM0 && idx < PERF_REG_X86_XMM_MAX) {
+	if (idx > PERF_REG_X86_R15) {
 		perf_regs = container_of(regs, struct x86_perf_regs, regs);
-		/* SIMD registers are moved to dedicated sample_simd_vec_reg */
-		if (perf_regs->abi & PERF_SAMPLE_REGS_ABI_SIMD)
-			return 0;
-		if (!perf_regs->xmm_regs)
-			return 0;
-		return perf_regs->xmm_regs[idx - PERF_REG_X86_XMM0];
+
+		if (perf_regs->abi & PERF_SAMPLE_REGS_ABI_SIMD) {
+			if (idx <= PERF_REG_X86_R31) {
+				if (!perf_regs->egpr_regs)
+					return 0;
+				return perf_regs->egpr_regs[idx - PERF_REG_X86_R16];
+			}
+		} else {
+			if (idx >= PERF_REG_X86_XMM0 && idx < PERF_REG_X86_XMM_MAX) {
+				if (!perf_regs->xmm_regs)
+					return 0;
+				return perf_regs->xmm_regs[idx - PERF_REG_X86_XMM0];
+			}
+		}
 	}
 
 	if (WARN_ON_ONCE(idx >= ARRAY_SIZE(pt_regs_offset)))
@@ -171,14 +179,7 @@ int perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
 				 ~((1ULL << PERF_REG_X86_MAX) - 1))
 
 #ifdef CONFIG_X86_32
-#define REG_NOSUPPORT ((1ULL << PERF_REG_X86_R8) | \
-		       (1ULL << PERF_REG_X86_R9) | \
-		       (1ULL << PERF_REG_X86_R10) | \
-		       (1ULL << PERF_REG_X86_R11) | \
-		       (1ULL << PERF_REG_X86_R12) | \
-		       (1ULL << PERF_REG_X86_R13) | \
-		       (1ULL << PERF_REG_X86_R14) | \
-		       (1ULL << PERF_REG_X86_R15))
+#define REG_NOSUPPORT GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R8)
 
 int perf_reg_validate(u64 mask)
 {
-- 
2.38.1


* [PATCH V3 12/17] perf/x86: Add SSP into sample_regs
  2025-08-15 21:34 [PATCH V3 00/17] Support vector and more extended registers in perf kan.liang
                   ` (10 preceding siblings ...)
  2025-08-15 21:34 ` [PATCH V3 11/17] perf/x86: Add eGPRs into sample_regs kan.liang
@ 2025-08-15 21:34 ` kan.liang
  2025-08-15 21:34 ` [PATCH V3 13/17] perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS kan.liang
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: kan.liang @ 2025-08-15 21:34 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, tglx, dave.hansen, irogers,
	adrian.hunter, jolsa, alexander.shishkin, linux-kernel
  Cc: dapeng1.mi, ak, zide.chen, mark.rutland, broonie, ravi.bangoria,
	eranian, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The SSP is only supported when the new SIMD register configuration
method is used, which moves the XMM registers to sample_simd_vec_regs.
The bit space freed that way can be reclaimed for the SSP.

The SSP is retrieved by XSAVE and is only supported on X86_64.
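
A corresponding sketch for the SSP (illustrative only):

attr.sample_simd_pred_reg_qwords = 1;	/* enable the SIMD scheme */
attr.sample_regs_intr |= 1ULL << PERF_REG_X86_SSP;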

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/core.c                | 14 +++++++++++++-
 arch/x86/include/asm/perf_event.h     |  4 ++++
 arch/x86/include/uapi/asm/perf_regs.h |  3 +++
 arch/x86/kernel/perf_regs.c           |  8 +++++++-
 4 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index f816290defc1..b0c8b24975cb 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -434,6 +434,8 @@ static void x86_pmu_get_ext_regs(struct x86_perf_regs *perf_regs, u64 mask)
 		perf_regs->opmask = get_xsave_addr(xsave, XFEATURE_OPMASK);
 	if (valid_mask & XFEATURE_MASK_APX)
 		perf_regs->egpr = get_xsave_addr(xsave, XFEATURE_APX);
+	if (valid_mask & XFEATURE_MASK_CET_USER)
+		perf_regs->cet = get_xsave_addr(xsave, XFEATURE_CET_USER);
 }
 
 static void release_ext_regs_buffers(void)
@@ -712,7 +714,7 @@ int x86_pmu_hw_config(struct perf_event *event)
 
 	if (event->attr.sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)) {
 		if (event->attr.sample_simd_regs_enabled) {
-			u64 reserved = ~GENMASK_ULL(PERF_REG_X86_64_MAX - 1, 0);
+			u64 reserved = ~GENMASK_ULL(PERF_REG_MISC_MAX - 1, 0);
 
 			if (!(event->pmu->capabilities & PERF_PMU_CAP_SIMD_REGS))
 				return -EINVAL;
@@ -727,6 +729,11 @@ int x86_pmu_hw_config(struct perf_event *event)
 			     event->attr.sample_regs_intr & PERF_X86_EGPRS_MASK) &&
 			     !(x86_pmu.ext_regs_mask & XFEATURE_MASK_APX))
 				return -EINVAL;
+			if ((event->attr.sample_regs_user & BIT_ULL(PERF_REG_X86_SSP) ||
+			     event->attr.sample_regs_intr & BIT_ULL(PERF_REG_X86_SSP)) &&
+			     !(x86_pmu.ext_regs_mask & XFEATURE_MASK_CET_USER))
+				return -EINVAL;
+
 		} else {
 			/*
 			 * Besides the general purpose registers, XMM registers may
@@ -1904,6 +1911,11 @@ void x86_pmu_setup_regs_data(struct perf_event *event,
 			perf_regs->egpr_regs = NULL;
 			mask |= XFEATURE_MASK_APX;
 		}
+		if (attr->sample_regs_user & BIT_ULL(PERF_REG_X86_SSP) ||
+		    attr->sample_regs_intr & BIT_ULL(PERF_REG_X86_SSP)) {
+			perf_regs->cet_regs = NULL;
+			mask |= XFEATURE_MASK_CET_USER;
+		}
 	}
 
 	mask &= ~ignore_mask;
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 4400cb66bc8e..28ddff38d232 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -617,6 +617,10 @@ struct x86_perf_regs {
 		u64	*egpr_regs;
 		struct apx_state *egpr;
 	};
+	union {
+		u64	*cet_regs;
+		struct cet_user_state *cet;
+	};
 };
 
 extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index cd0f6804debf..4d88cb18acb9 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -48,6 +48,9 @@ enum perf_event_x86_regs {
 	PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
 	PERF_REG_X86_64_MAX = PERF_REG_X86_R31 + 1,
 
+	PERF_REG_X86_SSP,
+	PERF_REG_MISC_MAX = PERF_REG_X86_SSP + 1,
+
 	/*
 	 * These all need two bits set because they are 128bit.
 	 * These are only available when !PERF_SAMPLE_REGS_ABI_SIMD
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index b6e50194ff3e..d579fa3223c0 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -92,6 +92,11 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 					return 0;
 				return perf_regs->egpr_regs[idx - PERF_REG_X86_R16];
 			}
+			if (idx == PERF_REG_X86_SSP) {
+				if (!perf_regs->cet_regs)
+					return 0;
+				return perf_regs->cet_regs[1];
+			}
 		} else {
 			if (idx >= PERF_REG_X86_XMM0 && idx < PERF_REG_X86_XMM_MAX) {
 				if (!perf_regs->xmm_regs)
@@ -179,7 +184,8 @@ int perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
 				 ~((1ULL << PERF_REG_X86_MAX) - 1))
 
 #ifdef CONFIG_X86_32
-#define REG_NOSUPPORT GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R8)
+#define REG_NOSUPPORT (GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R8) | \
+		       BIT_ULL(PERF_REG_X86_SSP))
 
 int perf_reg_validate(u64 mask)
 {
-- 
2.38.1


* [PATCH V3 13/17] perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS
  2025-08-15 21:34 [PATCH V3 00/17] Support vector and more extended registers in perf kan.liang
                   ` (11 preceding siblings ...)
  2025-08-15 21:34 ` [PATCH V3 12/17] perf/x86: Add SSP " kan.liang
@ 2025-08-15 21:34 ` kan.liang
  2025-08-15 21:34 ` [POC PATCH 14/17] perf/x86/regs: Only support legacy regs for the PT and PERF_REGS_MASK for now kan.liang
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: kan.liang @ 2025-08-15 21:34 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, tglx, dave.hansen, irogers,
	adrian.hunter, jolsa, alexander.shishkin, linux-kernel
  Cc: dapeng1.mi, ak, zide.chen, mark.rutland, broonie, ravi.bangoria,
	eranian, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Enable PERF_PMU_CAP_SIMD_REGS if there is XSAVES support for YMM, ZMM,
OPMASK, eGPRs, or SSP.

Disable large PEBS for these registers since PEBS HW doesn't support
them yet.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/intel/core.c | 46 ++++++++++++++++++++++++++++++++++--
 1 file changed, 44 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index bd16f91dea1c..c09176400377 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4033,8 +4033,30 @@ static unsigned long intel_pmu_large_pebs_flags(struct perf_event *event)
 		flags &= ~PERF_SAMPLE_TIME;
 	if (!event->attr.exclude_kernel)
 		flags &= ~PERF_SAMPLE_REGS_USER;
-	if (event->attr.sample_regs_user & ~PEBS_GP_REGS)
-		flags &= ~(PERF_SAMPLE_REGS_USER | PERF_SAMPLE_REGS_INTR);
+	if (event->attr.sample_simd_regs_enabled) {
+		u64 nolarge = PERF_X86_EGPRS_MASK | BIT_ULL(PERF_REG_X86_SSP);
+
+		/*
+		 * PEBS HW can only collect the XMM0-XMM15 for now.
+		 * Disable large PEBS for other vector registers, predicate
+		 * registers, eGPRs, and SSP.
+		 */
+		if (event->attr.sample_regs_user & nolarge ||
+		    fls64(event->attr.sample_simd_vec_reg_user) > PERF_X86_H16ZMM_BASE ||
+		    event->attr.sample_simd_pred_reg_user)
+			flags &= ~PERF_SAMPLE_REGS_USER;
+
+		if (event->attr.sample_regs_intr & nolarge ||
+		    fls64(event->attr.sample_simd_vec_reg_intr) > PERF_X86_H16ZMM_BASE ||
+		    event->attr.sample_simd_pred_reg_intr)
+			flags &= ~PERF_SAMPLE_REGS_INTR;
+
+		if (event->attr.sample_simd_vec_reg_qwords > PERF_X86_XMM_QWORDS)
+			flags &= ~(PERF_SAMPLE_REGS_USER | PERF_SAMPLE_REGS_INTR);
+	} else {
+		if (event->attr.sample_regs_user & ~PEBS_GP_REGS)
+			flags &= ~(PERF_SAMPLE_REGS_USER | PERF_SAMPLE_REGS_INTR);
+	}
 	return flags;
 }
 
@@ -5295,6 +5317,26 @@ static void intel_extended_regs_init(struct pmu *pmu)
 
 	x86_pmu.ext_regs_mask |= XFEATURE_MASK_SSE;
 	x86_get_pmu(smp_processor_id())->capabilities |= PERF_PMU_CAP_EXTENDED_REGS;
+
+	if (boot_cpu_has(X86_FEATURE_AVX) &&
+	    cpu_has_xfeatures(XFEATURE_MASK_YMM, NULL))
+		x86_pmu.ext_regs_mask |= XFEATURE_MASK_YMM;
+	if (boot_cpu_has(X86_FEATURE_APX) &&
+	    cpu_has_xfeatures(XFEATURE_MASK_APX, NULL))
+		x86_pmu.ext_regs_mask |= XFEATURE_MASK_APX;
+	if (boot_cpu_has(X86_FEATURE_AVX512F)) {
+		if (cpu_has_xfeatures(XFEATURE_MASK_OPMASK, NULL))
+			x86_pmu.ext_regs_mask |= XFEATURE_MASK_OPMASK;
+		if (cpu_has_xfeatures(XFEATURE_MASK_ZMM_Hi256, NULL))
+			x86_pmu.ext_regs_mask |= XFEATURE_MASK_ZMM_Hi256;
+		if (cpu_has_xfeatures(XFEATURE_MASK_Hi16_ZMM, NULL))
+			x86_pmu.ext_regs_mask |= XFEATURE_MASK_Hi16_ZMM;
+	}
+	if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		x86_pmu.ext_regs_mask |= XFEATURE_MASK_CET_USER;
+
+	if (x86_pmu.ext_regs_mask != XFEATURE_MASK_SSE)
+		x86_get_pmu(smp_processor_id())->capabilities |= PERF_PMU_CAP_SIMD_REGS;
 }
 
 static void update_pmu_cap(struct pmu *pmu)
-- 
2.38.1


* [POC PATCH 14/17] perf/x86/regs: Only support legacy regs for the PT and PERF_REGS_MASK for now
  2025-08-15 21:34 [PATCH V3 00/17] Support vector and more extended registers in perf kan.liang
                   ` (12 preceding siblings ...)
  2025-08-15 21:34 ` [PATCH V3 13/17] perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS kan.liang
@ 2025-08-15 21:34 ` kan.liang
  2025-08-25  9:07   ` Adrian Hunter
  2025-08-15 21:34 ` [POC PATCH 15/17] tools headers: Sync with the kernel sources kan.liang
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 32+ messages in thread
From: kan.liang @ 2025-08-15 21:34 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, tglx, dave.hansen, irogers,
	adrian.hunter, jolsa, alexander.shishkin, linux-kernel
  Cc: dapeng1.mi, ak, zide.chen, mark.rutland, broonie, ravi.bangoria,
	eranian, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

PERF_REG_X86_64_MAX is going to be updated to cover more registers,
e.g., the eGPRs. However, Intel PT and PERF_REGS_MASK are not touched
in this POC, so use PERF_REG_X86_R15 + 1 in place of
PERF_REG_X86_64_MAX for them.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 tools/perf/arch/x86/include/perf_regs.h | 2 +-
 tools/perf/util/intel-pt.c              | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/perf/arch/x86/include/perf_regs.h b/tools/perf/arch/x86/include/perf_regs.h
index f209ce2c1dd9..793fb597b03f 100644
--- a/tools/perf/arch/x86/include/perf_regs.h
+++ b/tools/perf/arch/x86/include/perf_regs.h
@@ -17,7 +17,7 @@ void perf_regs_load(u64 *regs);
 		       (1ULL << PERF_REG_X86_ES) | \
 		       (1ULL << PERF_REG_X86_FS) | \
 		       (1ULL << PERF_REG_X86_GS))
-#define PERF_REGS_MASK (((1ULL << PERF_REG_X86_64_MAX) - 1) & ~REG_NOSUPPORT)
+#define PERF_REGS_MASK (((1ULL << (PERF_REG_X86_R15 + 1)) - 1) & ~REG_NOSUPPORT)
 #define PERF_SAMPLE_REGS_ABI PERF_SAMPLE_REGS_ABI_64
 #endif
 
diff --git a/tools/perf/util/intel-pt.c b/tools/perf/util/intel-pt.c
index 9b1011fe4826..a9585524f2e1 100644
--- a/tools/perf/util/intel-pt.c
+++ b/tools/perf/util/intel-pt.c
@@ -2181,7 +2181,7 @@ static u64 *intel_pt_add_gp_regs(struct regs_dump *intr_regs, u64 *pos,
 	u32 bit;
 	int i;
 
-	for (i = 0, bit = 1; i < PERF_REG_X86_64_MAX; i++, bit <<= 1) {
+	for (i = 0, bit = 1; i < PERF_REG_X86_R15 + 1; i++, bit <<= 1) {
 		/* Get the PEBS gp_regs array index */
 		int n = pebs_gp_regs[i] - 1;
 
-- 
2.38.1


* [POC PATCH 15/17] tools headers: Sync with the kernel sources
  2025-08-15 21:34 [PATCH V3 00/17] Support vector and more extended registers in perf kan.liang
                   ` (13 preceding siblings ...)
  2025-08-15 21:34 ` [POC PATCH 14/17] perf/x86/regs: Only support legacy regs for the PT and PERF_REGS_MASK for now kan.liang
@ 2025-08-15 21:34 ` kan.liang
  2025-08-15 21:34 ` [POC PATCH 16/17] perf parse-regs: Support the new SIMD format kan.liang
  2025-08-15 21:34 ` [POC PATCH 17/17] perf regs: Support the PERF_SAMPLE_REGS_ABI_SIMD kan.liang
  16 siblings, 0 replies; 32+ messages in thread
From: kan.liang @ 2025-08-15 21:34 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, tglx, dave.hansen, irogers,
	adrian.hunter, jolsa, alexander.shishkin, linux-kernel
  Cc: dapeng1.mi, ak, zide.chen, mark.rutland, broonie, ravi.bangoria,
	eranian, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Update include/uapi/linux/perf_event.h and
arch/x86/include/uapi/asm/perf_regs.h to support extended regs.
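
For reference, a consumer could skip over the extended block as sketched
below. Only the layout (abi word, regs[weight(mask)], then the four u16
counters and the data array) is taken from this series; the helper and
its names are hypothetical:

#include <linux/types.h>
#include <linux/perf_event.h>

struct simd_regs_hdr {
	__u16 nr_vectors;
	__u16 vector_qwords;
	__u16 nr_pred;
	__u16 pred_qwords;
};

/* p points at the abi word of a REGS_USER/REGS_INTR dump */
static const __u64 *skip_sample_regs(const __u64 *p, __u64 sample_regs_mask)
{
	__u64 abi = *p++;

	if (abi == PERF_SAMPLE_REGS_ABI_NONE)
		return p;

	p += __builtin_popcountll(sample_regs_mask);	/* regs[weight(mask)] */

	if (abi & PERF_SAMPLE_REGS_ABI_SIMD) {
		const struct simd_regs_hdr *hdr = (const void *)p;

		p++;		/* the four u16 counters occupy one u64 */
		p += hdr->nr_vectors * hdr->vector_qwords;
		p += hdr->nr_pred * hdr->pred_qwords;
	}

	return p;
}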

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 tools/arch/x86/include/uapi/asm/perf_regs.h | 44 ++++++++++++++++++-
 tools/include/uapi/linux/perf_event.h       | 47 ++++++++++++++++++---
 2 files changed, 84 insertions(+), 7 deletions(-)

diff --git a/tools/arch/x86/include/uapi/asm/perf_regs.h b/tools/arch/x86/include/uapi/asm/perf_regs.h
index 7c9d2bb3833b..4d88cb18acb9 100644
--- a/tools/arch/x86/include/uapi/asm/perf_regs.h
+++ b/tools/arch/x86/include/uapi/asm/perf_regs.h
@@ -27,11 +27,34 @@ enum perf_event_x86_regs {
 	PERF_REG_X86_R13,
 	PERF_REG_X86_R14,
 	PERF_REG_X86_R15,
+	/* Extended GPRs (EGPRs) */
+	PERF_REG_X86_R16,
+	PERF_REG_X86_R17,
+	PERF_REG_X86_R18,
+	PERF_REG_X86_R19,
+	PERF_REG_X86_R20,
+	PERF_REG_X86_R21,
+	PERF_REG_X86_R22,
+	PERF_REG_X86_R23,
+	PERF_REG_X86_R24,
+	PERF_REG_X86_R25,
+	PERF_REG_X86_R26,
+	PERF_REG_X86_R27,
+	PERF_REG_X86_R28,
+	PERF_REG_X86_R29,
+	PERF_REG_X86_R30,
+	PERF_REG_X86_R31,
 	/* These are the limits for the GPRs. */
 	PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
-	PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1,
+	PERF_REG_X86_64_MAX = PERF_REG_X86_R31 + 1,
 
-	/* These all need two bits set because they are 128bit */
+	PERF_REG_X86_SSP,
+	PERF_REG_MISC_MAX = PERF_REG_X86_SSP + 1,
+
+	/*
+	 * These all need two bits set because they are 128bit.
+	 * These are only available when !PERF_SAMPLE_REGS_ABI_SIMD
+	 */
 	PERF_REG_X86_XMM0  = 32,
 	PERF_REG_X86_XMM1  = 34,
 	PERF_REG_X86_XMM2  = 36,
@@ -55,4 +78,21 @@ enum perf_event_x86_regs {
 
 #define PERF_REG_EXTENDED_MASK	(~((1ULL << PERF_REG_X86_XMM0) - 1))
 
+#define PERF_X86_EGPRS_MASK		GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16)
+
+#define PERF_X86_SIMD_PRED_REGS_MAX	8
+#define PERF_X86_SIMD_PRED_MASK		GENMASK(PERF_X86_SIMD_PRED_REGS_MAX - 1, 0)
+#define PERF_X86_SIMD_VEC_REGS_MAX	32
+#define PERF_X86_SIMD_VEC_MASK		GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
+
+#define PERF_X86_H16ZMM_BASE		16
+
+#define PERF_X86_OPMASK_QWORDS		1
+#define PERF_X86_XMM_QWORDS		2
+#define PERF_X86_YMM_QWORDS		4
+#define PERF_X86_YMMH_QWORDS		(PERF_X86_YMM_QWORDS / 2)
+#define PERF_X86_ZMM_QWORDS		8
+#define PERF_X86_ZMMH_QWORDS		(PERF_X86_ZMM_QWORDS / 2)
+#define PERF_X86_SIMD_QWORDS_MAX	PERF_X86_ZMM_QWORDS
+
 #endif /* _ASM_X86_PERF_REGS_H */
diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index 78a362b80027..2e9b16acbed6 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -313,9 +313,10 @@ enum {
  * Values to determine ABI of the registers dump.
  */
 enum perf_sample_regs_abi {
-	PERF_SAMPLE_REGS_ABI_NONE		= 0,
-	PERF_SAMPLE_REGS_ABI_32			= 1,
-	PERF_SAMPLE_REGS_ABI_64			= 2,
+	PERF_SAMPLE_REGS_ABI_NONE		= 0x00,
+	PERF_SAMPLE_REGS_ABI_32			= 0x01,
+	PERF_SAMPLE_REGS_ABI_64			= 0x02,
+	PERF_SAMPLE_REGS_ABI_SIMD		= 0x04,
 };
 
 /*
@@ -382,6 +383,7 @@ enum perf_event_read_format {
 #define PERF_ATTR_SIZE_VER6			120	/* Add: aux_sample_size */
 #define PERF_ATTR_SIZE_VER7			128	/* Add: sig_data */
 #define PERF_ATTR_SIZE_VER8			136	/* Add: config3 */
+#define PERF_ATTR_SIZE_VER9			168	/* Add: sample_simd_{pred,vec}_reg_* */
 
 /*
  * 'struct perf_event_attr' contains various attributes that define
@@ -543,6 +545,25 @@ struct perf_event_attr {
 	__u64	sig_data;
 
 	__u64	config3; /* extension of config2 */
+
+
+	/*
+	 * Defines set of SIMD registers to dump on samples.
+	 * The sample_simd_regs_enabled !=0 implies the
+	 * set of SIMD registers is used to config all SIMD registers.
+	 * If !sample_simd_regs_enabled, sample_regs_XXX may be used to
+	 * config some SIMD registers on X86.
+	 */
+	union {
+		__u16 sample_simd_regs_enabled;
+		__u16 sample_simd_pred_reg_qwords;
+	};
+	__u32 sample_simd_pred_reg_intr;
+	__u32 sample_simd_pred_reg_user;
+	__u16 sample_simd_vec_reg_qwords;
+	__u64 sample_simd_vec_reg_intr;
+	__u64 sample_simd_vec_reg_user;
+	__u32 __reserved_4;
 };
 
 /*
@@ -1016,7 +1037,15 @@ enum perf_event_type {
 	 *      } && PERF_SAMPLE_BRANCH_STACK
 	 *
 	 *	{ u64			abi; # enum perf_sample_regs_abi
-	 *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
+	 *	  u64			regs[weight(mask)];
+	 *	  struct {
+	 *		u16 nr_vectors;
+	 *		u16 vector_qwords;
+	 *		u16 nr_pred;
+	 *		u16 pred_qwords;
+	 *		u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
+	 *	  } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
+	 *	} && PERF_SAMPLE_REGS_USER
 	 *
 	 *	{ u64			size;
 	 *	  char			data[size];
@@ -1043,7 +1072,15 @@ enum perf_event_type {
 	 *	{ u64			data_src; } && PERF_SAMPLE_DATA_SRC
 	 *	{ u64			transaction; } && PERF_SAMPLE_TRANSACTION
 	 *	{ u64			abi; # enum perf_sample_regs_abi
-	 *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
+	 *	  u64			regs[weight(mask)];
+	 *	  struct {
+	 *		u16 nr_vectors;
+	 *		u16 vector_qwords;
+	 *		u16 nr_pred;
+	 *		u16 pred_qwords;
+	 *		u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
+	 *	  } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
+	 *	} && PERF_SAMPLE_REGS_INTR
 	 *	{ u64			phys_addr;} && PERF_SAMPLE_PHYS_ADDR
 	 *	{ u64			cgroup;} && PERF_SAMPLE_CGROUP
 	 *	{ u64			data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
-- 
2.38.1


* [POC PATCH 16/17] perf parse-regs: Support the new SIMD format
  2025-08-15 21:34 [PATCH V3 00/17] Support vector and more extended registers in perf kan.liang
                   ` (14 preceding siblings ...)
  2025-08-15 21:34 ` [POC PATCH 15/17] tools headers: Sync with the kernel sources kan.liang
@ 2025-08-15 21:34 ` kan.liang
  2025-08-20 10:04   ` Mi, Dapeng
  2025-08-21  3:35   ` Mi, Dapeng
  2025-08-15 21:34 ` [POC PATCH 17/17] perf regs: Support the PERF_SAMPLE_REGS_ABI_SIMD kan.liang
  16 siblings, 2 replies; 32+ messages in thread
From: kan.liang @ 2025-08-15 21:34 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, tglx, dave.hansen, irogers,
	adrian.hunter, jolsa, alexander.shishkin, linux-kernel
  Cc: dapeng1.mi, ak, zide.chen, mark.rutland, broonie, ravi.bangoria,
	eranian, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Add has_cap_simd_regs() to check whether the new SIMD format is
available. If it is, retrieve the supported mask and qwords.

Add several __weak functions that return the qwords and mask for the
vector and predicate registers.

Only collecting the vector and predicate registers as a whole is
supported, and only the superset is taken. For example, with -I XMM,YMM
only all 16 YMM registers are collected.

Examples:
 $perf record -I?
 available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
 R11 R12 R13 R14 R15 SSP XMM0-31 YMM0-31 ZMM0-31 OPMASK0-7

 $perf record --user-regs=?
 available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
 R11 R12 R13 R14 R15 SSP XMM0-31 YMM0-31 ZMM0-31 OPMASK0-7
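
 A hypothetical invocation that collects the ZMM and OPMASK supersets
 would then be:

 $perf record -IZMM,OPMASK -- ./workload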

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 tools/perf/arch/x86/util/perf_regs.c      | 257 +++++++++++++++++++++-
 tools/perf/util/evsel.c                   |  25 +++
 tools/perf/util/parse-regs-options.c      |  60 ++++-
 tools/perf/util/perf_event_attr_fprintf.c |   6 +
 tools/perf/util/perf_regs.c               |  29 +++
 tools/perf/util/perf_regs.h               |  13 +-
 tools/perf/util/record.h                  |   6 +
 7 files changed, 381 insertions(+), 15 deletions(-)

diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
index 12fd93f04802..78027df1af9a 100644
--- a/tools/perf/arch/x86/util/perf_regs.c
+++ b/tools/perf/arch/x86/util/perf_regs.c
@@ -13,6 +13,49 @@
 #include "../../../util/pmu.h"
 #include "../../../util/pmus.h"
 
+static const struct sample_reg sample_reg_masks_ext[] = {
+	SMPL_REG(AX, PERF_REG_X86_AX),
+	SMPL_REG(BX, PERF_REG_X86_BX),
+	SMPL_REG(CX, PERF_REG_X86_CX),
+	SMPL_REG(DX, PERF_REG_X86_DX),
+	SMPL_REG(SI, PERF_REG_X86_SI),
+	SMPL_REG(DI, PERF_REG_X86_DI),
+	SMPL_REG(BP, PERF_REG_X86_BP),
+	SMPL_REG(SP, PERF_REG_X86_SP),
+	SMPL_REG(IP, PERF_REG_X86_IP),
+	SMPL_REG(FLAGS, PERF_REG_X86_FLAGS),
+	SMPL_REG(CS, PERF_REG_X86_CS),
+	SMPL_REG(SS, PERF_REG_X86_SS),
+#ifdef HAVE_ARCH_X86_64_SUPPORT
+	SMPL_REG(R8, PERF_REG_X86_R8),
+	SMPL_REG(R9, PERF_REG_X86_R9),
+	SMPL_REG(R10, PERF_REG_X86_R10),
+	SMPL_REG(R11, PERF_REG_X86_R11),
+	SMPL_REG(R12, PERF_REG_X86_R12),
+	SMPL_REG(R13, PERF_REG_X86_R13),
+	SMPL_REG(R14, PERF_REG_X86_R14),
+	SMPL_REG(R15, PERF_REG_X86_R15),
+	SMPL_REG(R16, PERF_REG_X86_R16),
+	SMPL_REG(R17, PERF_REG_X86_R17),
+	SMPL_REG(R18, PERF_REG_X86_R18),
+	SMPL_REG(R19, PERF_REG_X86_R19),
+	SMPL_REG(R20, PERF_REG_X86_R20),
+	SMPL_REG(R21, PERF_REG_X86_R21),
+	SMPL_REG(R22, PERF_REG_X86_R22),
+	SMPL_REG(R23, PERF_REG_X86_R23),
+	SMPL_REG(R24, PERF_REG_X86_R24),
+	SMPL_REG(R25, PERF_REG_X86_R25),
+	SMPL_REG(R26, PERF_REG_X86_R26),
+	SMPL_REG(R27, PERF_REG_X86_R27),
+	SMPL_REG(R28, PERF_REG_X86_R28),
+	SMPL_REG(R29, PERF_REG_X86_R29),
+	SMPL_REG(R30, PERF_REG_X86_R30),
+	SMPL_REG(R31, PERF_REG_X86_R31),
+	SMPL_REG(SSP, PERF_REG_X86_SSP),
+#endif
+	SMPL_REG_END
+};
+
 static const struct sample_reg sample_reg_masks[] = {
 	SMPL_REG(AX, PERF_REG_X86_AX),
 	SMPL_REG(BX, PERF_REG_X86_BX),
@@ -276,27 +319,159 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
 	return SDT_ARG_VALID;
 }
 
+static bool support_simd_reg(u64 sample_type, u16 qwords, u64 mask, bool pred)
+{
+	struct perf_event_attr attr = {
+		.type				= PERF_TYPE_HARDWARE,
+		.config				= PERF_COUNT_HW_CPU_CYCLES,
+		.sample_type			= sample_type,
+		.disabled 			= 1,
+		.exclude_kernel			= 1,
+		.sample_simd_regs_enabled	= 1,
+	};
+	int fd;
+
+	attr.sample_period = 1;
+
+	if (!pred) {
+		attr.sample_simd_vec_reg_qwords = qwords;
+		if (sample_type == PERF_SAMPLE_REGS_INTR)
+			attr.sample_simd_vec_reg_intr = mask;
+		else
+			attr.sample_simd_vec_reg_user = mask;
+	} else {
+		attr.sample_simd_pred_reg_qwords = PERF_X86_OPMASK_QWORDS;
+		if (sample_type == PERF_SAMPLE_REGS_INTR)
+			attr.sample_simd_pred_reg_intr = PERF_X86_SIMD_PRED_MASK;
+		else
+			attr.sample_simd_pred_reg_user = PERF_X86_SIMD_PRED_MASK;
+	}
+
+	if (perf_pmus__num_core_pmus() > 1) {
+		struct perf_pmu *pmu = NULL;
+		__u64 type = PERF_TYPE_RAW;
+
+		/*
+		 * The same register set is supported among different hybrid PMUs.
+		 * Only check the first available one.
+		 */
+		while ((pmu = perf_pmus__scan_core(pmu)) != NULL) {
+			type = pmu->type;
+			break;
+		}
+		attr.config |= type << PERF_PMU_TYPE_SHIFT;
+	}
+
+	event_attr_init(&attr);
+
+	fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
+	if (fd != -1) {
+		close(fd);
+		return true;
+	}
+
+	return false;
+}
+
+static uint64_t intr_simd_mask, user_simd_mask, pred_mask;
+static u16	intr_simd_qwords, user_simd_qwords, pred_qwords;
+
+static bool get_simd_reg_mask(u64 sample_type)
+{
+	u64 mask = GENMASK_ULL(PERF_X86_H16ZMM_BASE - 1, 0);
+	u16 qwords = PERF_X86_ZMM_QWORDS;
+
+	if (support_simd_reg(sample_type, qwords, mask, false)) {
+		if (support_simd_reg(sample_type, qwords, PERF_X86_SIMD_VEC_MASK, false))
+			mask = PERF_X86_SIMD_VEC_MASK;
+	} else {
+		qwords = PERF_X86_YMM_QWORDS;
+		if (!support_simd_reg(sample_type, qwords, mask, false)) {
+			qwords = PERF_X86_XMM_QWORDS;
+			if (!support_simd_reg(sample_type, qwords, mask, false)) {
+				qwords = 0;
+				mask = 0;
+			}
+		}
+	}
+
+	if (sample_type == PERF_SAMPLE_REGS_INTR) {
+		intr_simd_mask = mask;
+		intr_simd_qwords = qwords;
+	} else {
+		user_simd_mask = mask;
+		user_simd_qwords = qwords;
+	}
+
+	if (support_simd_reg(sample_type, qwords, mask, true)) {
+		pred_mask = PERF_X86_SIMD_PRED_MASK;
+		pred_qwords = PERF_X86_OPMASK_QWORDS;
+	}
+
+	return true;
+}
+
+static bool has_cap_simd_regs(void)
+{
+	static bool has_cap_simd_regs;
+	static bool cached;
+
+	if (cached)
+		return has_cap_simd_regs;
+
+	cached = true;
+	has_cap_simd_regs = get_simd_reg_mask(PERF_SAMPLE_REGS_INTR);
+	has_cap_simd_regs |= get_simd_reg_mask(PERF_SAMPLE_REGS_USER);
+
+	return has_cap_simd_regs;
+}
+
 const struct sample_reg *arch__sample_reg_masks(void)
 {
+	if (has_cap_simd_regs())
+		return sample_reg_masks_ext;
 	return sample_reg_masks;
 }
 
-uint64_t arch__intr_reg_mask(void)
+static const struct sample_reg sample_simd_reg_masks_empty[] = {
+	SMPL_REG_END
+};
+
+static const struct sample_reg sample_simd_reg_masks[] = {
+	SMPL_REG(XMM, 1),
+	SMPL_REG(YMM, 2),
+	SMPL_REG(ZMM, 3),
+	SMPL_REG(OPMASK, 32),
+	SMPL_REG_END
+};
+
+const struct sample_reg *arch__sample_simd_reg_masks(void)
+{
+	if (has_cap_simd_regs())
+		return sample_simd_reg_masks;
+	return sample_simd_reg_masks_empty;
+}
+
+static uint64_t __arch__reg_mask(u64 sample_type, u64 mask, bool has_simd_regs)
 {
 	struct perf_event_attr attr = {
-		.type			= PERF_TYPE_HARDWARE,
-		.config			= PERF_COUNT_HW_CPU_CYCLES,
-		.sample_type		= PERF_SAMPLE_REGS_INTR,
-		.sample_regs_intr	= PERF_REG_EXTENDED_MASK,
-		.precise_ip		= 1,
-		.disabled 		= 1,
-		.exclude_kernel		= 1,
+		.type				= PERF_TYPE_HARDWARE,
+		.config				= PERF_COUNT_HW_CPU_CYCLES,
+		.sample_type			= sample_type,
+		.precise_ip			= 1,
+		.disabled 			= 1,
+		.exclude_kernel			= 1,
+		.sample_simd_regs_enabled	= has_simd_regs,
 	};
 	int fd;
 	/*
 	 * In an unnamed union, init it here to build on older gcc versions
 	 */
 	attr.sample_period = 1;
+	if (sample_type == PERF_SAMPLE_REGS_INTR)
+		attr.sample_regs_intr = mask;
+	else
+		attr.sample_regs_user = mask;
 
 	if (perf_pmus__num_core_pmus() > 1) {
 		struct perf_pmu *pmu = NULL;
@@ -318,13 +493,73 @@ uint64_t arch__intr_reg_mask(void)
 	fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
 	if (fd != -1) {
 		close(fd);
-		return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
+		return mask;
 	}
 
-	return PERF_REGS_MASK;
+	return 0;
+}
+
+uint64_t arch__intr_reg_mask(void)
+{
+	uint64_t mask = PERF_REGS_MASK;
+
+	if (has_cap_simd_regs()) {
+		mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
+					 GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
+					 true);
+		mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
+					 BIT_ULL(PERF_REG_X86_SSP),
+					 true);
+	} else
+		mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR, PERF_REG_EXTENDED_MASK, false);
+
+	return mask;
 }
 
 uint64_t arch__user_reg_mask(void)
 {
-	return PERF_REGS_MASK;
+	uint64_t mask = PERF_REGS_MASK;
+
+	if (has_cap_simd_regs()) {
+		mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
+					 GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
+					 true);
+		mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
+					 BIT_ULL(PERF_REG_X86_SSP),
+					 true);
+	}
+
+	return mask;
+}
+
+uint64_t arch__intr_simd_reg_mask(u16 *qwords)
+{
+	if (!has_cap_simd_regs())
+		return 0;
+	*qwords = intr_simd_qwords;
+	return intr_simd_mask;
+}
+
+uint64_t arch__user_simd_reg_mask(u16 *qwords)
+{
+	if (!has_cap_simd_regs())
+		return 0;
+	*qwords = user_simd_qwords;
+	return user_simd_mask;
+}
+
+uint64_t arch__intr_pred_reg_mask(u16 *qwords)
+{
+	if (!has_cap_simd_regs())
+		return 0;
+	*qwords = pred_qwords;
+	return pred_mask;
+}
+
+uint64_t arch__user_pred_reg_mask(u16 *qwords)
+{
+	if (!has_cap_simd_regs())
+		return 0;
+	*qwords = pred_qwords;
+	return pred_mask;
 }
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index d55482f094bf..af6e1c843fc5 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1402,12 +1402,37 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
 		evsel__set_sample_bit(evsel, REGS_INTR);
 	}
 
+	if ((opts->sample_intr_vec_regs || opts->sample_intr_pred_regs) &&
+	    !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
+		/* A nonzero pred qwords implies the set of SIMD registers is used */
+		if (opts->sample_pred_regs_qwords)
+			attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
+		else
+			attr->sample_simd_pred_reg_qwords = 1;
+		attr->sample_simd_vec_reg_intr = opts->sample_intr_vec_regs;
+		attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
+		attr->sample_simd_pred_reg_intr = opts->sample_intr_pred_regs;
+		evsel__set_sample_bit(evsel, REGS_INTR);
+	}
+
 	if (opts->sample_user_regs && !evsel->no_aux_samples &&
 	    !evsel__is_dummy_event(evsel)) {
 		attr->sample_regs_user |= opts->sample_user_regs;
 		evsel__set_sample_bit(evsel, REGS_USER);
 	}
 
+	if ((opts->sample_user_vec_regs || opts->sample_user_pred_regs) &&
+	    !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
+		if (opts->sample_pred_regs_qwords)
+			attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
+		else
+			attr->sample_simd_pred_reg_qwords = 1;
+		attr->sample_simd_vec_reg_user = opts->sample_user_vec_regs;
+		attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
+		attr->sample_simd_pred_reg_user = opts->sample_user_pred_regs;
+		evsel__set_sample_bit(evsel, REGS_USER);
+	}
+
 	if (target__has_cpu(&opts->target) || opts->sample_cpu)
 		evsel__set_sample_bit(evsel, CPU);
 
diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
index cda1c620968e..27266038352f 100644
--- a/tools/perf/util/parse-regs-options.c
+++ b/tools/perf/util/parse-regs-options.c
@@ -4,20 +4,26 @@
 #include <stdint.h>
 #include <string.h>
 #include <stdio.h>
+#include <linux/bitops.h>
 #include "util/debug.h"
 #include <subcmd/parse-options.h>
 #include "util/perf_regs.h"
 #include "util/parse-regs-options.h"
+#include "record.h"
 
 static int
 __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
 {
 	uint64_t *mode = (uint64_t *)opt->value;
 	const struct sample_reg *r = NULL;
+	u16 simd_qwords, pred_qwords;
+	u64 simd_mask, pred_mask;
+	struct record_opts *opts;
 	char *s, *os = NULL, *p;
 	int ret = -1;
 	uint64_t mask;
 
+
 	if (unset)
 		return 0;
 
@@ -27,10 +33,17 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
 	if (*mode)
 		return -1;
 
-	if (intr)
+	if (intr) {
+		opts = container_of(opt->value, struct record_opts, sample_intr_regs);
 		mask = arch__intr_reg_mask();
-	else
+		simd_mask = arch__intr_simd_reg_mask(&simd_qwords);
+		pred_mask = arch__intr_pred_reg_mask(&pred_qwords);
+	} else {
+		opts = container_of(opt->value, struct record_opts, sample_user_regs);
 		mask = arch__user_reg_mask();
+		simd_mask = arch__user_simd_reg_mask(&simd_qwords);
+		pred_mask = arch__user_pred_reg_mask(&pred_qwords);
+	}
 
 	/* str may be NULL in case no arg is passed to -I */
 	if (str) {
@@ -50,10 +63,51 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
 					if (r->mask & mask)
 						fprintf(stderr, "%s ", r->name);
 				}
+				for (r = arch__sample_simd_reg_masks(); r->name; r++) {
+					if (pred_qwords == r->qwords.pred) {
+						fprintf(stderr, "%s0-%d ", r->name, fls64(pred_mask) - 1);
+						continue;
+					}
+					if (simd_qwords >= r->mask)
+						fprintf(stderr, "%s0-%d ", r->name, fls64(simd_mask) - 1);
+				}
+
 				fputc('\n', stderr);
 				/* just printing available regs */
 				goto error;
 			}
+
+			if (simd_mask || pred_mask) {
+				u16 vec_regs_qwords = 0, pred_regs_qwords = 0;
+
+				for (r = arch__sample_simd_reg_masks(); r->name; r++) {
+					if (!strcasecmp(s, r->name)) {
+						vec_regs_qwords = r->qwords.vec;
+						pred_regs_qwords = r->qwords.pred;
+						break;
+					}
+				}
+
+				/* Just need the highest qwords */
+				if (vec_regs_qwords > opts->sample_vec_regs_qwords) {
+					opts->sample_vec_regs_qwords = vec_regs_qwords;
+					if (intr)
+						opts->sample_intr_vec_regs = simd_mask;
+					else
+						opts->sample_user_vec_regs = simd_mask;
+				}
+				if (pred_regs_qwords > opts->sample_pred_regs_qwords) {
+					opts->sample_pred_regs_qwords = pred_regs_qwords;
+					if (intr)
+						opts->sample_intr_pred_regs = pred_mask;
+					else
+						opts->sample_user_pred_regs = pred_mask;
+				}
+
+				if (r->name)
+					goto next;
+			}
+
 			for (r = arch__sample_reg_masks(); r->name; r++) {
 				if ((r->mask & mask) && !strcasecmp(s, r->name))
 					break;
@@ -65,7 +119,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
 			}
 
 			*mode |= r->mask;
-
+next:
 			if (!p)
 				break;
 
diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
index 66b666d9ce64..fb0366d050cf 100644
--- a/tools/perf/util/perf_event_attr_fprintf.c
+++ b/tools/perf/util/perf_event_attr_fprintf.c
@@ -360,6 +360,12 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
 	PRINT_ATTRf(aux_start_paused, p_unsigned);
 	PRINT_ATTRf(aux_pause, p_unsigned);
 	PRINT_ATTRf(aux_resume, p_unsigned);
+	PRINT_ATTRf(sample_simd_pred_reg_qwords, p_unsigned);
+	PRINT_ATTRf(sample_simd_pred_reg_intr, p_hex);
+	PRINT_ATTRf(sample_simd_pred_reg_user, p_hex);
+	PRINT_ATTRf(sample_simd_vec_reg_qwords, p_unsigned);
+	PRINT_ATTRf(sample_simd_vec_reg_intr, p_hex);
+	PRINT_ATTRf(sample_simd_vec_reg_user, p_hex);
 
 	return ret;
 }
diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
index 44b90bbf2d07..0744c77b4ac8 100644
--- a/tools/perf/util/perf_regs.c
+++ b/tools/perf/util/perf_regs.c
@@ -21,6 +21,30 @@ uint64_t __weak arch__user_reg_mask(void)
 	return 0;
 }
 
+uint64_t __weak arch__intr_simd_reg_mask(u16 *qwords)
+{
+	*qwords = 0;
+	return 0;
+}
+
+uint64_t __weak arch__user_simd_reg_mask(u16 *qwords)
+{
+	*qwords = 0;
+	return 0;
+}
+
+uint64_t __weak arch__intr_pred_reg_mask(u16 *qwords)
+{
+	*qwords = 0;
+	return 0;
+}
+
+uint64_t __weak arch__user_pred_reg_mask(u16 *qwords)
+{
+	*qwords = 0;
+	return 0;
+}
+
 static const struct sample_reg sample_reg_masks[] = {
 	SMPL_REG_END
 };
@@ -30,6 +54,11 @@ const struct sample_reg * __weak arch__sample_reg_masks(void)
 	return sample_reg_masks;
 }
 
+const struct sample_reg * __weak arch__sample_simd_reg_masks(void)
+{
+	return sample_reg_masks;
+}
+
 const char *perf_reg_name(int id, const char *arch)
 {
 	const char *reg_name = NULL;
diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
index f2d0736d65cc..b932caa73a8a 100644
--- a/tools/perf/util/perf_regs.h
+++ b/tools/perf/util/perf_regs.h
@@ -9,7 +9,13 @@ struct regs_dump;
 
 struct sample_reg {
 	const char *name;
-	uint64_t mask;
+	union {
+		struct {
+			uint32_t vec;
+			uint32_t pred;
+		} qwords;
+		uint64_t mask;
+	};
 };
 
 #define SMPL_REG_MASK(b) (1ULL << (b))
@@ -27,6 +33,11 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op);
 uint64_t arch__intr_reg_mask(void);
 uint64_t arch__user_reg_mask(void);
 const struct sample_reg *arch__sample_reg_masks(void);
+const struct sample_reg *arch__sample_simd_reg_masks(void);
+uint64_t arch__intr_simd_reg_mask(u16 *qwords);
+uint64_t arch__user_simd_reg_mask(u16 *qwords);
+uint64_t arch__intr_pred_reg_mask(u16 *qwords);
+uint64_t arch__user_pred_reg_mask(u16 *qwords);
 
 const char *perf_reg_name(int id, const char *arch);
 int perf_reg_value(u64 *valp, struct regs_dump *regs, int id);
diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
index ea3a6c4657ee..825ffb4cc53f 100644
--- a/tools/perf/util/record.h
+++ b/tools/perf/util/record.h
@@ -59,7 +59,13 @@ struct record_opts {
 	unsigned int  user_freq;
 	u64	      branch_stack;
 	u64	      sample_intr_regs;
+	u64	      sample_intr_vec_regs;
 	u64	      sample_user_regs;
+	u64	      sample_user_vec_regs;
+	u16	      sample_pred_regs_qwords;
+	u16	      sample_vec_regs_qwords;
+	u16	      sample_intr_pred_regs;
+	u16	      sample_user_pred_regs;
 	u64	      default_interval;
 	u64	      user_interval;
 	size_t	      auxtrace_snapshot_size;
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [POC PATCH 17/17] perf regs: Support the PERF_SAMPLE_REGS_ABI_SIMD
  2025-08-15 21:34 [PATCH V3 00/17] Support vector and more extended registers in perf kan.liang
                   ` (15 preceding siblings ...)
  2025-08-15 21:34 ` [POC PATCH 16/17] perf parse-regs: Support the new SIMD format kan.liang
@ 2025-08-15 21:34 ` kan.liang
  16 siblings, 0 replies; 32+ messages in thread
From: kan.liang @ 2025-08-15 21:34 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, tglx, dave.hansen, irogers,
	adrian.hunter, jolsa, alexander.shishkin, linux-kernel
  Cc: dapeng1.mi, ak, zide.chen, mark.rutland, broonie, ravi.bangoria,
	eranian, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Support the new PERF_SAMPLE_REGS_ABI_SIMD and dump the data in
perf report -D. For now, only the superset of the requested vector
registers (e.g., YMM when both XMM and YMM are requested) is displayed.

Example:

 $perf record -e cycles:p -IXMM,YMM,OPMASK,SSP ./test
 $perf report -D
 ... ...
 237538985992962 0x454d0 [0x480]: PERF_RECORD_SAMPLE(IP, 0x1):
 179370/179370: 0xffffffff969627fc period: 124999 addr: 0
 ... intr regs: mask 0x20000000000 ABI 64-bit
 .... SSP   0x0000000000000000
 ... SIMD ABI nr_vectors 32 vector_qwords 4 nr_pred 8 pred_qwords 1
 .... YMM  [0] 0x0000000000004000
 .... YMM  [0] 0x000055e828695270
 .... YMM  [0] 0x0000000000000000
 .... YMM  [0] 0x0000000000000000
 .... YMM  [1] 0x000055e8286990e0
 .... YMM  [1] 0x000055e828698dd0
 .... YMM  [1] 0x0000000000000000
 .... YMM  [1] 0x0000000000000000
 ... ...
 .... YMM  [31] 0x0000000000000000
 .... YMM  [31] 0x0000000000000000
 .... YMM  [31] 0x0000000000000000
 .... YMM  [31] 0x0000000000000000
 .... OPMASK[0] 0x0000000000100221
 .... OPMASK[1] 0x0000000000000020
 .... OPMASK[2] 0x000000007fffffff
 .... OPMASK[3] 0x0000000000000000
 .... OPMASK[4] 0x0000000000000000
 .... OPMASK[5] 0x0000000000000000
 .... OPMASK[6] 0x0000000000000000
 .... OPMASK[7] 0x0000000000000000
 ... ...
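
(For this configuration the SIMD block in each record is
nr_vectors * vector_qwords + nr_pred * pred_qwords = 32 * 4 + 8 * 1 =
136 qwords, i.e. 1088 bytes, plus the four u16 header fields printed
above.)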

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 tools/perf/util/evsel.c                       | 18 ++++++
 .../perf/util/perf-regs-arch/perf_regs_x86.c  | 45 ++++++++++++++
 tools/perf/util/sample.h                      | 10 +++
 tools/perf/util/session.c                     | 62 ++++++++++++++++---
 4 files changed, 127 insertions(+), 8 deletions(-)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 7cfb0aab5dd9..e0c0ebfafc23 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -3233,6 +3233,15 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
 			regs->mask = mask;
 			regs->regs = (u64 *)array;
 			array = (void *)array + sz;
+
+			if (regs->abi & PERF_SAMPLE_REGS_ABI_SIMD) {
+				regs->config = *(u64 *)array;
+				array = (void *)array + sizeof(u64);
+				regs->data = (u64 *)array;
+				sz = (regs->nr_vectors * regs->vector_qwords + regs->nr_pred * regs->pred_qwords) * sizeof(u64);
+				OVERFLOW_CHECK(array, sz, max_size);
+				array = (void *)array + sz;
+			}
 		}
 	}
 
@@ -3290,6 +3299,15 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
 			regs->mask = mask;
 			regs->regs = (u64 *)array;
 			array = (void *)array + sz;
+
+			if (regs->abi & PERF_SAMPLE_REGS_ABI_SIMD) {
+				regs->config = *(u64 *)array;
+				array = (void *)array + sizeof(u64);
+				regs->data = (u64 *)array;
+				sz = (regs->nr_vectors * regs->vector_qwords + regs->nr_pred * regs->pred_qwords) * sizeof(u64);
+				OVERFLOW_CHECK(array, sz, max_size);
+				array = (void *)array + sz;
+			}
 		}
 	}
 
diff --git a/tools/perf/util/perf-regs-arch/perf_regs_x86.c b/tools/perf/util/perf-regs-arch/perf_regs_x86.c
index 708954a9d35d..b494f4504052 100644
--- a/tools/perf/util/perf-regs-arch/perf_regs_x86.c
+++ b/tools/perf/util/perf-regs-arch/perf_regs_x86.c
@@ -5,6 +5,51 @@
 
 const char *__perf_reg_name_x86(int id)
 {
+	u16 qwords;
+
+	if (id > PERF_REG_X86_R15 && arch__intr_simd_reg_mask(&qwords)) {
+		switch (id) {
+		case PERF_REG_X86_R16:
+			return "R16";
+		case PERF_REG_X86_R17:
+			return "R17";
+		case PERF_REG_X86_R18:
+			return "R18";
+		case PERF_REG_X86_R19:
+			return "R19";
+		case PERF_REG_X86_R20:
+			return "R20";
+		case PERF_REG_X86_R21:
+			return "R21";
+		case PERF_REG_X86_R22:
+			return "R22";
+		case PERF_REG_X86_R23:
+			return "R23";
+		case PERF_REG_X86_R24:
+			return "R24";
+		case PERF_REG_X86_R25:
+			return "R25";
+		case PERF_REG_X86_R26:
+			return "R26";
+		case PERF_REG_X86_R27:
+			return "R27";
+		case PERF_REG_X86_R28:
+			return "R28";
+		case PERF_REG_X86_R29:
+			return "R29";
+		case PERF_REG_X86_R30:
+			return "R30";
+		case PERF_REG_X86_R31:
+			return "R31";
+		case PERF_REG_X86_SSP:
+			return "SSP";
+		default:
+			return NULL;
+		}
+
+		return NULL;
+	}
+
 	switch (id) {
 	case PERF_REG_X86_AX:
 		return "AX";
diff --git a/tools/perf/util/sample.h b/tools/perf/util/sample.h
index 0e96240052e9..36ac4519014b 100644
--- a/tools/perf/util/sample.h
+++ b/tools/perf/util/sample.h
@@ -12,6 +12,16 @@ struct regs_dump {
 	u64 abi;
 	u64 mask;
 	u64 *regs;
+	union {
+		u64 config;
+		struct {
+			u16 nr_vectors;
+			u16 vector_qwords;
+			u16 nr_pred;
+			u16 pred_qwords;
+		};
+	};
+	u64 *data;
 
 	/* Cached values/mask filled by first register access. */
 	u64 cache_regs[PERF_SAMPLE_REGS_CACHE_SIZE];
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index a320672c264e..6f931abe2050 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -922,18 +922,62 @@ static void regs_dump__printf(u64 mask, u64 *regs, const char *arch)
 	}
 }
 
-static const char *regs_abi[] = {
-	[PERF_SAMPLE_REGS_ABI_NONE] = "none",
-	[PERF_SAMPLE_REGS_ABI_32] = "32-bit",
-	[PERF_SAMPLE_REGS_ABI_64] = "64-bit",
-};
+static void simd_regs_dump__printf(struct regs_dump *regs)
+{
+	const char *name = "unknown";
+	const struct sample_reg *r;
+	int i, idx = 0;
+
+	if (!(regs->abi & PERF_SAMPLE_REGS_ABI_SIMD))
+		return;
+
+	printf("... SIMD ABI nr_vectors %d vector_qwords %d nr_pred %d pred_qwords %d\n",
+	       regs->nr_vectors, regs->vector_qwords,
+	       regs->nr_pred, regs->pred_qwords);
+
+	for (r = arch__sample_simd_reg_masks(); r->name; r++) {
+		if (regs->vector_qwords == r->qwords.vec) {
+			name = r->name;
+			break;
+		}
+	}
+
+	for (i = 0; i < regs->nr_vectors; i++) {
+		printf(".... %-5s[%d] 0x%016" PRIx64 "\n", name, i, regs->data[idx++]);
+		printf(".... %-5s[%d] 0x%016" PRIx64 "\n", name, i, regs->data[idx++]);
+		if (regs->vector_qwords > 2) {
+			printf(".... %-5s[%d] 0x%016" PRIx64 "\n", name, i, regs->data[idx++]);
+			printf(".... %-5s[%d] 0x%016" PRIx64 "\n", name, i, regs->data[idx++]);
+		}
+		if (regs->vector_qwords > 4) {
+			printf(".... %-5s[%d] 0x%016" PRIx64 "\n", name, i, regs->data[idx++]);
+			printf(".... %-5s[%d] 0x%016" PRIx64 "\n", name, i, regs->data[idx++]);
+			printf(".... %-5s[%d] 0x%016" PRIx64 "\n", name, i, regs->data[idx++]);
+			printf(".... %-5s[%d] 0x%016" PRIx64 "\n", name, i, regs->data[idx++]);
+		}
+	}
+
+	name = "unknown";
+	for (r = arch__sample_simd_reg_masks(); r->name; r++) {
+		if (r->qwords.pred && regs->pred_qwords == r->qwords.pred) {
+			name = r->name;
+			break;
+		}
+	}
+	for (i = 0; i < regs->nr_pred; i++)
+		printf(".... %-5s[%d] 0x%016" PRIx64 "\n", name, i, regs->data[idx++]);
+}
 
 static inline const char *regs_dump_abi(struct regs_dump *d)
 {
-	if (d->abi > PERF_SAMPLE_REGS_ABI_64)
-		return "unknown";
+	if (!d->abi)
+		return "none";
+	if (d->abi & PERF_SAMPLE_REGS_ABI_32)
+		return "32-bit";
+	else if (d->abi & PERF_SAMPLE_REGS_ABI_64)
+		return "64-bit";
 
-	return regs_abi[d->abi];
+	return "unknown";
 }
 
 static void regs__printf(const char *type, struct regs_dump *regs, const char *arch)
@@ -946,6 +990,8 @@ static void regs__printf(const char *type, struct regs_dump *regs, const char *a
 	       regs_dump_abi(regs));
 
 	regs_dump__printf(mask, regs->regs, arch);
+
+	simd_regs_dump__printf(regs);
 }
 
 static void regs_user__printf(struct perf_sample *sample, const char *arch)
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH V3 05/17] perf/x86: Support XMM register for non-PEBS and REGS_USER
  2025-08-15 21:34 ` [PATCH V3 05/17] perf/x86: Support XMM register for non-PEBS and REGS_USER kan.liang
@ 2025-08-19 13:39   ` Peter Zijlstra
  2025-08-19 15:55     ` Liang, Kan
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2025-08-19 13:39 UTC (permalink / raw)
  To: kan.liang
  Cc: mingo, acme, namhyung, tglx, dave.hansen, irogers, adrian.hunter,
	jolsa, alexander.shishkin, linux-kernel, dapeng1.mi, ak,
	zide.chen, mark.rutland, broonie, ravi.bangoria, eranian

On Fri, Aug 15, 2025 at 02:34:23PM -0700, kan.liang@linux.intel.com wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
> 
> Collecting the XMM registers in a PEBS record has been supported since
> the Icelake. But non-PEBS events don't support the feature. It's
> possible to retrieve the XMM registers from the XSAVE for non-PEBS.
> Add it to make the feature complete.
> 
> To utilize the XSAVE, a 64-byte aligned buffer is required. Add a
> per-CPU ext_regs_buf to store the vector registers. The size of the
> buffer is ~2K. kzalloc_node() is used because there's a _guarantee_
> that all kmalloc()'s with powers of 2 are naturally aligned and also
> 64b aligned.
> 
> Extend the support for both REGS_USER and REGS_INTR. For REGS_USER, the
> perf_get_regs_user() returns the regs from the task_pt_regs(current),
> which is struct pt_regs. Need to move it to local struct x86_perf_regs
> x86_user_regs.
> For PEBS, the HW support is still preferred. The XMM should be retrieved
> from PEBS records.
> 
> There could be more vector registers supported later. Add ext_regs_mask
> to track the supported vector register group.


I'm a little confused... *again* :-)

Specifically, we should consider two sets of registers:

 - the live set, as per the CPU (XSAVE)
 - the stored set, as per x86_task_fpu()

regs_intr should always get a copy of the live set; however
regs_user should not. It might need a copy of the x86_task_fpu() instead
of the live set, depending on TIF_NEED_FPU_LOAD (more or less, we need
another variable set in kernel_fpu_begin_mask() *after*
save_fpregs_to_fpstate() is completed).
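
Roughly, for the regs_user side (a sketch, not real code;
read_live_xsave() is a stand-in for the XSAVES-into-buffer path this
series adds):

	if (test_thread_flag(TIF_NEED_FPU_LOAD)) {
		/* the user FPU state was already saved to memory */
		xs = &x86_task_fpu(current)->fpstate->regs.xsave;
	} else {
		/* the user FPU state is still live in the CPU */
		xs = read_live_xsave();
	}

(modulo the race against kernel_fpu_begin_mask() above, hence the extra
flag set after save_fpregs_to_fpstate() completes).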

I don't see this code make this distinction.

Consider getting a sample while the kernel is doing some avx enhanced
crypto and such.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V3 05/17] perf/x86: Support XMM register for non-PEBS and REGS_USER
  2025-08-19 13:39   ` Peter Zijlstra
@ 2025-08-19 15:55     ` Liang, Kan
  2025-08-20  9:46       ` Mi, Dapeng
  0 siblings, 1 reply; 32+ messages in thread
From: Liang, Kan @ 2025-08-19 15:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, acme, namhyung, tglx, dave.hansen, irogers, adrian.hunter,
	jolsa, alexander.shishkin, linux-kernel, dapeng1.mi, ak,
	zide.chen, mark.rutland, broonie, ravi.bangoria, eranian



On 2025-08-19 6:39 a.m., Peter Zijlstra wrote:
> On Fri, Aug 15, 2025 at 02:34:23PM -0700, kan.liang@linux.intel.com wrote:
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> Collecting the XMM registers in a PEBS record has been supported since
>> the Icelake. But non-PEBS events don't support the feature. It's
>> possible to retrieve the XMM registers from the XSAVE for non-PEBS.
>> Add it to make the feature complete.
>>
>> To utilize the XSAVE, a 64-byte aligned buffer is required. Add a
>> per-CPU ext_regs_buf to store the vector registers. The size of the
>> buffer is ~2K. kzalloc_node() is used because there's a _guarantee_
>> that all kmalloc()'s with powers of 2 are naturally aligned and also
>> 64b aligned.
>>
>> Extend the support for both REGS_USER and REGS_INTR. For REGS_USER, the
>> perf_get_regs_user() returns the regs from the task_pt_regs(current),
>> which is struct pt_regs. Need to move it to local struct x86_perf_regs
>> x86_user_regs.
>> For PEBS, the HW support is still preferred. The XMM should be retrieved
>> from PEBS records.
>>
>> There could be more vector registers supported later. Add ext_regs_mask
>> to track the supported vector register group.
> 
> 
> I'm a little confused... *again* :-)
> 
> Specifically, we should consider two sets of registers:
> 
>  - the live set, as per the CPU (XSAVE)
>  - the stored set, as per x86_task_fpu()
> 
> regs_intr should always get a copy of the live set; however
> regs_user should not. It might need a copy of the x86_task_fpu() instead
> of the live set, depending on TIF_NEED_FPU_LOAD (more or less, we need
> another variable set in kernel_fpu_begin_mask() *after*
> save_fpregs_to_fpstate() is completed).
> 
> I don't see this code make this distinction.
> 
> Consider getting a sample while the kernel is doing some avx enhanced
> crypto and such.

The regs_user only needs a register set when the NMI hits user mode
(user_mode(regs)) or a non-kernel thread (!(current->flags &
PF_KTHREAD)). The live set is good enough for both cases.

I think the kernel crypto work should run in a kernel thread
(current->flags & PF_KTHREAD). If so, regs_user should return NULL.
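
For reference, the generic code already gates regs_user along these
lines (perf_sample_regs_user() in kernel/events/core.c):

	if (user_mode(regs)) {
		regs_user->abi = perf_reg_abi(current);
		regs_user->regs = regs;
	} else if (!(current->flags & PF_KTHREAD)) {
		perf_get_regs_user(regs_user, regs);
	} else {
		regs_user->abi = PERF_SAMPLE_REGS_ABI_NONE;
		regs_user->regs = NULL;
	}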

Thanks,
Kan


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V3 05/17] perf/x86: Support XMM register for non-PEBS and REGS_USER
  2025-08-19 15:55     ` Liang, Kan
@ 2025-08-20  9:46       ` Mi, Dapeng
  2025-08-20 18:03         ` Liang, Kan
  0 siblings, 1 reply; 32+ messages in thread
From: Mi, Dapeng @ 2025-08-20  9:46 UTC (permalink / raw)
  To: Liang, Kan, Peter Zijlstra
  Cc: mingo, acme, namhyung, tglx, dave.hansen, irogers, adrian.hunter,
	jolsa, alexander.shishkin, linux-kernel, ak, zide.chen,
	mark.rutland, broonie, ravi.bangoria, eranian


On 8/19/2025 11:55 PM, Liang, Kan wrote:
>
> On 2025-08-19 6:39 a.m., Peter Zijlstra wrote:
>> On Fri, Aug 15, 2025 at 02:34:23PM -0700, kan.liang@linux.intel.com wrote:
>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>
>>> Collecting the XMM registers in a PEBS record has been supported since
>>> the Icelake. But non-PEBS events don't support the feature. It's
>>> possible to retrieve the XMM registers from the XSAVE for non-PEBS.
>>> Add it to make the feature complete.
>>>
>>> To utilize the XSAVE, a 64-byte aligned buffer is required. Add a
>>> per-CPU ext_regs_buf to store the vector registers. The size of the
>>> buffer is ~2K. kzalloc_node() is used because there's a _guarantee_
>>> that all kmalloc()'s with powers of 2 are naturally aligned and also
>>> 64b aligned.
>>>
>>> Extend the support for both REGS_USER and REGS_INTR. For REGS_USER, the
>>> perf_get_regs_user() returns the regs from the task_pt_regs(current),
>>> which is struct pt_regs. Need to move it to local struct x86_perf_regs
>>> x86_user_regs.
>>> For PEBS, the HW support is still preferred. The XMM should be retrieved
>>> from PEBS records.
>>>
>>> There could be more vector registers supported later. Add ext_regs_mask
>>> to track the supported vector register group.
>>
>> I'm a little confused... *again* :-)
>>
>> Specifically, we should consider two sets of registers:
>>
>>  - the live set, as per the CPU (XSAVE)
>>  - the stored set, as per x86_task_fpu()
>>
>> regs_intr should always get a copy of the live set; however
>> regs_user should not. It might need a copy of the x86_task_fpu() instead
>> of the live set, depending on TIF_NEED_FPU_LOAD (more or less, we need
>> another variable set in kernel_fpu_begin_mask() *after*
>> save_fpregs_to_fpstate() is completed).
>>
>> I don't see this code make this distinction.
>>
>> Consider getting a sample while the kernel is doing some avx enhanced
>> crypto and such.
> The regs_user only needs a register set when the NMI hits user mode
> (user_mode(regs)) or a non-kernel thread (!(current->flags &
> PF_KTHREAD)). The live set is good enough for both cases.

It's fine if the NMI hits user mode, but if the NMI hits kernel mode
in a non-kernel thread (!(current->flags & PF_KTHREAD)), won't the
kernel-space SIMD/eGPR regs be exposed to user space via the user-regs
option? I'm not sure whether the kernel really uses these SIMD/eGPR
regs right now, but it seems like a risk.


>
> I think the kernel crypto work should run in a kernel thread
> (current->flags & PF_KTHREAD). If so, regs_user should return NULL.
>
> Thanks,
> Kan
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V3 06/17] perf: Support SIMD registers
  2025-08-15 21:34 ` [PATCH V3 06/17] perf: Support SIMD registers kan.liang
@ 2025-08-20  9:55   ` Mi, Dapeng
  2025-08-20 18:08     ` Liang, Kan
  0 siblings, 1 reply; 32+ messages in thread
From: Mi, Dapeng @ 2025-08-20  9:55 UTC (permalink / raw)
  To: kan.liang, peterz, mingo, acme, namhyung, tglx, dave.hansen,
	irogers, adrian.hunter, jolsa, alexander.shishkin, linux-kernel
  Cc: ak, zide.chen, mark.rutland, broonie, ravi.bangoria, eranian


On 8/16/2025 5:34 AM, kan.liang@linux.intel.com wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
>
> The users may be interested in the SIMD registers in a sample while
> profiling. The current sample_regs_XXX doesn't have enough space for all
> SIMD registers.
>
> Add sets of the sample_simd_{pred,vec}_reg_* in the
> struct perf_event_attr to define a set of SIMD registers to dump on
> samples.
> The current X86 supports the XMM registers in sample_regs_XXX. To
> utilize the new SIMD registers configuration method, the
> sample_simd_regs_enabled should always be set. If so, the XMM space in
> the sample_regs_XXX is reserved for other usage.
>
> The SIMD registers are wider than 64. A new output format is introduced.
> The number and width of SIMD registers will be dumped first, following
> the register values. The number and width are the same as the user's
> configuration now. If, for some reason (e.g., ARM) they are different,
> an ARCH-specific perf_output_sample_simd_regs can be implemented later
> separately.
> Add a new ABI, PERF_SAMPLE_REGS_ABI_SIMD, to indicate the new format.
> The enum perf_sample_regs_abi becomes a bitmap now. There should be no
> impact on the existing tool, since the version and bitmap are the same
> for 1 and 2.
>
> Add three new __weak functions to retrieve the number of available
> registers, validate the configuration of the SIMD registers, and
> retrieve the SIMD registers. The ARCH-specific functions will be
> implemented in the following patches.
>
> Add a new flag PERF_PMU_CAP_SIMD_REGS to indicate that the PMU has the
> capability to support SIMD registers dumping. Error out if the
> sample_simd_{pred,vec}_reg_* mistakenly set for a PMU that doesn't have
> the capability.
>
> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> ---
>  include/linux/perf_event.h      |  13 ++++
>  include/linux/perf_regs.h       |   9 +++
>  include/uapi/linux/perf_event.h |  47 +++++++++++++--
>  kernel/events/core.c            | 101 +++++++++++++++++++++++++++++++-
>  4 files changed, 162 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 444b162f3f92..205361b7de2e 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -305,6 +305,7 @@ struct perf_event_pmu_context;
>  #define PERF_PMU_CAP_EXTENDED_HW_TYPE	0x0100
>  #define PERF_PMU_CAP_AUX_PAUSE		0x0200
>  #define PERF_PMU_CAP_AUX_PREFER_LARGE	0x0400
> +#define PERF_PMU_CAP_SIMD_REGS		0x0800
>  
>  /**
>   * pmu::scope
> @@ -1526,6 +1527,18 @@ perf_event__output_id_sample(struct perf_event *event,
>  extern void
>  perf_log_lost_samples(struct perf_event *event, u64 lost);
>  
> +static inline bool event_has_simd_regs(struct perf_event *event)
> +{
> +	struct perf_event_attr *attr = &event->attr;
> +
> +	return attr->sample_simd_regs_enabled != 0 ||
> +	       attr->sample_simd_pred_reg_intr != 0 ||
> +	       attr->sample_simd_pred_reg_user != 0 ||
> +	       attr->sample_simd_vec_reg_qwords != 0 ||
> +	       attr->sample_simd_vec_reg_intr != 0 ||
> +	       attr->sample_simd_vec_reg_user != 0;
> +}
> +
>  static inline bool event_has_extended_regs(struct perf_event *event)
>  {
>  	struct perf_event_attr *attr = &event->attr;
> diff --git a/include/linux/perf_regs.h b/include/linux/perf_regs.h
> index f632c5725f16..0172682b18fd 100644
> --- a/include/linux/perf_regs.h
> +++ b/include/linux/perf_regs.h
> @@ -9,6 +9,15 @@ struct perf_regs {
>  	struct pt_regs	*regs;
>  };
>  
> +int perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
> +			   u16 pred_qwords, u32 pred_mask);
> +u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
> +			u16 qwords_idx, bool pred);
> +void perf_simd_reg_check(struct pt_regs *regs,
> +			 u64 mask, u16 *nr_vectors, u16 *vec_qwords,
> +			 u16 pred_mask, u16 *nr_pred, u16 *pred_qwords);
> +
> +
>  #ifdef CONFIG_HAVE_PERF_REGS
>  #include <asm/perf_regs.h>
>  
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 78a362b80027..2e9b16acbed6 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -313,9 +313,10 @@ enum {
>   * Values to determine ABI of the registers dump.
>   */
>  enum perf_sample_regs_abi {
> -	PERF_SAMPLE_REGS_ABI_NONE		= 0,
> -	PERF_SAMPLE_REGS_ABI_32			= 1,
> -	PERF_SAMPLE_REGS_ABI_64			= 2,
> +	PERF_SAMPLE_REGS_ABI_NONE		= 0x00,
> +	PERF_SAMPLE_REGS_ABI_32			= 0x01,
> +	PERF_SAMPLE_REGS_ABI_64			= 0x02,
> +	PERF_SAMPLE_REGS_ABI_SIMD		= 0x04,

Better to change the definition to an explicit bit-shift format, so it
clearly indicates that the ABI is a bitmap:

enum perf_sample_regs_abi {
	PERF_SAMPLE_REGS_ABI_NONE	= 0,
	PERF_SAMPLE_REGS_ABI_32		= 1 << 0,
	PERF_SAMPLE_REGS_ABI_64		= 1 << 1,
	PERF_SAMPLE_REGS_ABI_SIMD	= 1 << 2,
};
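
With explicit bit positions it is also clearer that consumers are
expected to test the flags independently, e.g. (hypothetical consumer
code):

	if (abi & PERF_SAMPLE_REGS_ABI_SIMD)
		parse_simd_regs(...);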



>  };
>  
>  /*
> @@ -382,6 +383,7 @@ enum perf_event_read_format {
>  #define PERF_ATTR_SIZE_VER6			120	/* Add: aux_sample_size */
>  #define PERF_ATTR_SIZE_VER7			128	/* Add: sig_data */
>  #define PERF_ATTR_SIZE_VER8			136	/* Add: config3 */
> +#define PERF_ATTR_SIZE_VER9			168	/* Add: sample_simd_{pred,vec}_reg_* */
>  
>  /*
>   * 'struct perf_event_attr' contains various attributes that define
> @@ -543,6 +545,25 @@ struct perf_event_attr {
>  	__u64	sig_data;
>  
>  	__u64	config3; /* extension of config2 */
> +
> +
> +	/*
> +	 * Defines set of SIMD registers to dump on samples.
> +	 * The sample_simd_regs_enabled !=0 implies the
> +	 * set of SIMD registers is used to config all SIMD registers.
> +	 * If !sample_simd_regs_enabled, sample_regs_XXX may be used to
> +	 * config some SIMD registers on X86.
> +	 */
> +	union {
> +		__u16 sample_simd_regs_enabled;
> +		__u16 sample_simd_pred_reg_qwords;
> +	};
> +	__u32 sample_simd_pred_reg_intr;
> +	__u32 sample_simd_pred_reg_user;
> +	__u16 sample_simd_vec_reg_qwords;
> +	__u64 sample_simd_vec_reg_intr;
> +	__u64 sample_simd_vec_reg_user;
> +	__u32 __reserved_4;
>  };
>  
>  /*
> @@ -1016,7 +1037,15 @@ enum perf_event_type {
>  	 *      } && PERF_SAMPLE_BRANCH_STACK
>  	 *
>  	 *	{ u64			abi; # enum perf_sample_regs_abi
> -	 *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
> +	 *	  u64			regs[weight(mask)];
> +	 *	  struct {
> +	 *		u16 nr_vectors;
> +	 *		u16 vector_qwords;
> +	 *		u16 nr_pred;
> +	 *		u16 pred_qwords;
> +	 *		u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
> +	 *	  } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
> +	 *	} && PERF_SAMPLE_REGS_USER
>  	 *
>  	 *	{ u64			size;
>  	 *	  char			data[size];
> @@ -1043,7 +1072,15 @@ enum perf_event_type {
>  	 *	{ u64			data_src; } && PERF_SAMPLE_DATA_SRC
>  	 *	{ u64			transaction; } && PERF_SAMPLE_TRANSACTION
>  	 *	{ u64			abi; # enum perf_sample_regs_abi
> -	 *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
> +	 *	  u64			regs[weight(mask)];
> +	 *	  struct {
> +	 *		u16 nr_vectors;
> +	 *		u16 vector_qwords;
> +	 *		u16 nr_pred;
> +	 *		u16 pred_qwords;
> +	 *		u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
> +	 *	  } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
> +	 *	} && PERF_SAMPLE_REGS_INTR
>  	 *	{ u64			phys_addr;} && PERF_SAMPLE_PHYS_ADDR
>  	 *	{ u64			cgroup;} && PERF_SAMPLE_CGROUP
>  	 *	{ u64			data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 95a7b6f5af09..dd8cf3c7fb7a 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -7408,6 +7408,47 @@ perf_output_sample_regs(struct perf_output_handle *handle,
>  	}
>  }
>  
> +static void
> +perf_output_sample_simd_regs(struct perf_output_handle *handle,
> +			     struct perf_event *event,
> +			     struct pt_regs *regs,
> +			     u64 mask, u16 pred_mask)
> +{
> +	u16 pred_qwords = event->attr.sample_simd_pred_reg_qwords;
> +	u16 vec_qwords = event->attr.sample_simd_vec_reg_qwords;
> +	u16 nr_pred = hweight16(pred_mask);
> +	u16 nr_vectors = hweight64(mask);
> +	int bit;
> +	u64 val;
> +	u16 i;
> +
> +	/* Get the number of available regs */
> +	perf_simd_reg_check(regs, mask, &nr_vectors, &vec_qwords,
> +			    pred_mask, &nr_pred, &pred_qwords);
> +
> +	perf_output_put(handle, nr_vectors);
> +	perf_output_put(handle, vec_qwords);
> +	perf_output_put(handle, nr_pred);
> +	perf_output_put(handle, pred_qwords);
> +
> +	if (nr_vectors) {
> +		for_each_set_bit(bit, (unsigned long *)&mask, sizeof(mask) * BITS_PER_BYTE) {
> +			for (i = 0; i < vec_qwords; i++) {
> +				val = perf_simd_reg_value(regs, bit, i, false);
> +				perf_output_put(handle, val);
> +			}
> +		}
> +	}
> +	if (nr_pred) {
> +		for_each_set_bit(bit, (unsigned long *)&pred_mask, sizeof(pred_mask) * BITS_PER_BYTE) {
> +			for (i = 0; i < pred_qwords; i++) {
> +				val = perf_simd_reg_value(regs, bit, i, true);
> +				perf_output_put(handle, val);
> +			}
> +		}
> +	}
> +}
> +
>  static void perf_sample_regs_user(struct perf_regs *regs_user,
>  				  struct pt_regs *regs)
>  {
> @@ -7429,6 +7470,25 @@ static void perf_sample_regs_intr(struct perf_regs *regs_intr,
>  	regs_intr->abi  = perf_reg_abi(current);
>  }
>  
> +int __weak perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
> +				  u16 pred_qwords, u32 pred_mask)
> +{
> +	return vec_qwords || vec_mask || pred_qwords || pred_mask ? -ENOSYS : 0;
> +}
> +
> +u64 __weak perf_simd_reg_value(struct pt_regs *regs, int idx,
> +			       u16 qwords_idx, bool pred)
> +{
> +	return 0;
> +}
> +
> +void __weak perf_simd_reg_check(struct pt_regs *regs,
> +				u64 mask, u16 *nr_vectors, u16 *vec_qwords,
> +				u16 pred_mask, u16 *nr_pred, u16 *pred_qwords)
> +{
> +	*nr_vectors = 0;
> +	*nr_pred = 0;
> +}
>  
>  /*
>   * Get remaining task size from user stack pointer.
> @@ -7961,10 +8021,17 @@ void perf_output_sample(struct perf_output_handle *handle,
>  		perf_output_put(handle, abi);
>  
>  		if (abi) {
> -			u64 mask = event->attr.sample_regs_user;
> +			struct perf_event_attr *attr = &event->attr;
> +			u64 mask = attr->sample_regs_user;
>  			perf_output_sample_regs(handle,
>  						data->regs_user.regs,
>  						mask);
> +			if (abi & PERF_SAMPLE_REGS_ABI_SIMD) {
> +				perf_output_sample_simd_regs(handle, event,
> +							     data->regs_user.regs,
> +							     attr->sample_simd_vec_reg_user,
> +							     attr->sample_simd_pred_reg_user);
> +			}
>  		}
>  	}
>  
> @@ -7992,11 +8059,18 @@ void perf_output_sample(struct perf_output_handle *handle,
>  		perf_output_put(handle, abi);
>  
>  		if (abi) {
> -			u64 mask = event->attr.sample_regs_intr;
> +			struct perf_event_attr *attr = &event->attr;
> +			u64 mask = attr->sample_regs_intr;
>  
>  			perf_output_sample_regs(handle,
>  						data->regs_intr.regs,
>  						mask);
> +			if (abi & PERF_SAMPLE_REGS_ABI_SIMD) {
> +				perf_output_sample_simd_regs(handle, event,
> +							     data->regs_intr.regs,
> +							     attr->sample_simd_vec_reg_intr,
> +							     attr->sample_simd_pred_reg_intr);
> +			}
>  		}
>  	}
>  
> @@ -12560,6 +12634,12 @@ static int perf_try_init_event(struct pmu *pmu, struct perf_event *event)
>  	if (ret)
>  		goto err_pmu;
>  
> +	if (!(pmu->capabilities & PERF_PMU_CAP_SIMD_REGS) &&
> +	    event_has_simd_regs(event)) {
> +		ret = -EOPNOTSUPP;
> +		goto err_destroy;
> +	}
> +
>  	if (!(pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS) &&
>  	    event_has_extended_regs(event)) {
>  		ret = -EOPNOTSUPP;
> @@ -13101,6 +13181,12 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
>  		ret = perf_reg_validate(attr->sample_regs_user);
>  		if (ret)
>  			return ret;
> +		ret = perf_simd_reg_validate(attr->sample_simd_vec_reg_qwords,
> +					     attr->sample_simd_vec_reg_user,
> +					     attr->sample_simd_pred_reg_qwords,
> +					     attr->sample_simd_pred_reg_user);
> +		if (ret)
> +			return ret;
>  	}
>  
>  	if (attr->sample_type & PERF_SAMPLE_STACK_USER) {
> @@ -13121,8 +13207,17 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
>  	if (!attr->sample_max_stack)
>  		attr->sample_max_stack = sysctl_perf_event_max_stack;
>  
> -	if (attr->sample_type & PERF_SAMPLE_REGS_INTR)
> +	if (attr->sample_type & PERF_SAMPLE_REGS_INTR) {
>  		ret = perf_reg_validate(attr->sample_regs_intr);
> +		if (ret)
> +			return ret;
> +		ret = perf_simd_reg_validate(attr->sample_simd_vec_reg_qwords,
> +					     attr->sample_simd_vec_reg_intr,
> +					     attr->sample_simd_pred_reg_qwords,
> +					     attr->sample_simd_pred_reg_intr);
> +		if (ret)
> +			return ret;
> +	}
>  
>  #ifndef CONFIG_CGROUP_PERF
>  	if (attr->sample_type & PERF_SAMPLE_CGROUP)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V3 08/17] perf/x86: Add YMM into sample_simd_vec_regs
  2025-08-15 21:34 ` [PATCH V3 08/17] perf/x86: Add YMM into sample_simd_vec_regs kan.liang
@ 2025-08-20  9:59   ` Mi, Dapeng
  2025-08-20 18:10     ` Liang, Kan
  0 siblings, 1 reply; 32+ messages in thread
From: Mi, Dapeng @ 2025-08-20  9:59 UTC (permalink / raw)
  To: kan.liang, peterz, mingo, acme, namhyung, tglx, dave.hansen,
	irogers, adrian.hunter, jolsa, alexander.shishkin, linux-kernel
  Cc: ak, zide.chen, mark.rutland, broonie, ravi.bangoria, eranian


On 8/16/2025 5:34 AM, kan.liang@linux.intel.com wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
>
> The YMM0-15 is composed of XMM and YMMH. It requires 2 XSAVE commands to
> get the complete value. Internally, the XMM and YMMH are stored in
> different structures, which follow the XSAVE format. But the output
> dumps the YMM as a whole.
>
> The qwords 4 imply YMM.
>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> ---
>  arch/x86/events/core.c                | 13 +++++++++++++
>  arch/x86/include/asm/perf_event.h     |  4 ++++
>  arch/x86/include/uapi/asm/perf_regs.h |  4 +++-
>  arch/x86/kernel/perf_regs.c           | 10 +++++++++-
>  4 files changed, 29 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 1789b91c95c6..aebd4e56dff1 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -423,6 +423,9 @@ static void x86_pmu_get_ext_regs(struct x86_perf_regs *perf_regs, u64 mask)
>  
>  	if (valid_mask & XFEATURE_MASK_SSE)
>  		perf_regs->xmm_space = xsave->i387.xmm_space;
> +
> +	if (valid_mask & XFEATURE_MASK_YMM)
> +		perf_regs->ymmh = get_xsave_addr(xsave, XFEATURE_YMM);
>  }
>  
>  static void release_ext_regs_buffers(void)
> @@ -725,6 +728,9 @@ int x86_pmu_hw_config(struct perf_event *event)
>  			if (event->attr.sample_simd_vec_reg_qwords >= PERF_X86_XMM_QWORDS &&
>  			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_SSE))
>  				return -EINVAL;
> +			if (event->attr.sample_simd_vec_reg_qwords >= PERF_X86_YMM_QWORDS &&
> +			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_YMM))
> +				return -EINVAL;
>  		}
>  	}
>  	return x86_setup_perfctr(event);
> @@ -1837,6 +1843,13 @@ void x86_pmu_setup_regs_data(struct perf_event *event,
>  		mask |= XFEATURE_MASK_SSE;
>  	}
>  
> +	if (attr->sample_simd_regs_enabled) {
> +		if (attr->sample_simd_vec_reg_qwords >= PERF_X86_YMM_QWORDS) {
> +			perf_regs->ymmh_regs = NULL;
> +			mask |= XFEATURE_MASK_YMM;
> +		}
> +	}
> +
>  	mask &= ~ignore_mask;
>  	if (mask)
>  		x86_pmu_get_ext_regs(perf_regs, mask);
> diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
> index 538219c59979..81e3143fd91a 100644
> --- a/arch/x86/include/asm/perf_event.h
> +++ b/arch/x86/include/asm/perf_event.h
> @@ -597,6 +597,10 @@ struct x86_perf_regs {
>  		u64	*xmm_regs;
>  		u32	*xmm_space;	/* for xsaves */
>  	};
> +	union {
> +		u64	*ymmh_regs;
> +		struct ymmh_struct *ymmh;
> +	};
>  };
>  
>  extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
> diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
> index bd8af802f757..feb3e8f80761 100644
> --- a/arch/x86/include/uapi/asm/perf_regs.h
> +++ b/arch/x86/include/uapi/asm/perf_regs.h
> @@ -59,6 +59,8 @@ enum perf_event_x86_regs {
>  #define PERF_X86_SIMD_VEC_MASK		GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
>  
>  #define PERF_X86_XMM_QWORDS		2
> -#define PERF_X86_SIMD_QWORDS_MAX	PERF_X86_XMM_QWORDS
> +#define PERF_X86_YMM_QWORDS		4
> +#define PERF_X86_YMMH_QWORDS		(PERF_X86_YMM_QWORDS / 2)
> +#define PERF_X86_SIMD_QWORDS_MAX	PERF_X86_YMM_QWORDS
>  
>  #endif /* _ASM_X86_PERF_REGS_H */
> diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
> index 397357c5896b..d94bc687e4bf 100644
> --- a/arch/x86/kernel/perf_regs.c
> +++ b/arch/x86/kernel/perf_regs.c
> @@ -66,6 +66,9 @@ void perf_simd_reg_check(struct pt_regs *regs,
>  	if (*vec_qwords >= PERF_X86_XMM_QWORDS && !perf_regs->xmm_regs)
>  		*nr_vectors = 0;
>  
> +	if (*vec_qwords >= PERF_X86_YMM_QWORDS && !perf_regs->xmm_regs)

should be "!perf_regs->ymmh_regs"?


> +		*vec_qwords = PERF_X86_XMM_QWORDS;
> +
>  	*nr_pred = 0;
>  }
>  
> @@ -105,6 +108,10 @@ u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
>  		if (!perf_regs->xmm_regs)
>  			return 0;
>  		return perf_regs->xmm_regs[idx * PERF_X86_XMM_QWORDS + qwords_idx];
> +	} else if (qwords_idx < PERF_X86_YMM_QWORDS) {
> +		if (!perf_regs->ymmh_regs)
> +			return 0;
> +		return perf_regs->ymmh_regs[idx * PERF_X86_YMMH_QWORDS + qwords_idx - PERF_X86_XMM_QWORDS];
>  	}
>  
>  	return 0;
> @@ -121,7 +128,8 @@ int perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
>  		if (vec_mask)
>  			return -EINVAL;
>  	} else {
> -		if (vec_qwords != PERF_X86_XMM_QWORDS)
> +		if (vec_qwords != PERF_X86_XMM_QWORDS &&
> +		    vec_qwords != PERF_X86_YMM_QWORDS)
>  			return -EINVAL;
>  		if (vec_mask & ~PERF_X86_SIMD_VEC_MASK)
>  			return -EINVAL;

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V3 11/17] perf/x86: Add eGPRs into sample_regs
  2025-08-15 21:34 ` [PATCH V3 11/17] perf/x86: Add eGPRs into sample_regs kan.liang
@ 2025-08-20 10:01   ` Mi, Dapeng
  0 siblings, 0 replies; 32+ messages in thread
From: Mi, Dapeng @ 2025-08-20 10:01 UTC (permalink / raw)
  To: kan.liang, peterz, mingo, acme, namhyung, tglx, dave.hansen,
	irogers, adrian.hunter, jolsa, alexander.shishkin, linux-kernel
  Cc: ak, zide.chen, mark.rutland, broonie, ravi.bangoria, eranian


On 8/16/2025 5:34 AM, kan.liang@linux.intel.com wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
>
> The eGPRs is only supported when the new SIMD registers configuration
> method is used, which moves the XMM to sample_simd_vec_regs. So the
> space can be reclaimed for the eGPRs.
>
> The eGPRs is retrieved by XSAVE. Only support the eGPRs for X86_64.
>
> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> ---
>  arch/x86/events/core.c                | 39 +++++++++++++++++++++------
>  arch/x86/include/asm/perf_event.h     |  4 +++
>  arch/x86/include/uapi/asm/perf_regs.h | 26 ++++++++++++++++--
>  arch/x86/kernel/perf_regs.c           | 31 ++++++++++-----------
>  4 files changed, 75 insertions(+), 25 deletions(-)
>
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 1fa550efcdfa..f816290defc1 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -432,6 +432,8 @@ static void x86_pmu_get_ext_regs(struct x86_perf_regs *perf_regs, u64 mask)
>  		perf_regs->h16zmm = get_xsave_addr(xsave, XFEATURE_Hi16_ZMM);
>  	if (valid_mask & XFEATURE_MASK_OPMASK)
>  		perf_regs->opmask = get_xsave_addr(xsave, XFEATURE_OPMASK);
> +	if (valid_mask & XFEATURE_MASK_APX)
> +		perf_regs->egpr = get_xsave_addr(xsave, XFEATURE_APX);
>  }
>  
>  static void release_ext_regs_buffers(void)
> @@ -709,17 +711,33 @@ int x86_pmu_hw_config(struct perf_event *event)
>  	}
>  
>  	if (event->attr.sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)) {
> -		/*
> -		 * Besides the general purpose registers, XMM registers may
> -		 * be collected as well.
> -		 */
> -		if (event_has_extended_regs(event)) {
> -			if (!(event->pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS))
> +		if (event->attr.sample_simd_regs_enabled) {
> +			u64 reserved = ~GENMASK_ULL(PERF_REG_X86_64_MAX - 1, 0);
> +
> +			if (!(event->pmu->capabilities & PERF_PMU_CAP_SIMD_REGS))
>  				return -EINVAL;
> -			if (!(x86_pmu.ext_regs_mask & XFEATURE_MASK_SSE))
> +			/*
> +			 * The XMM space in the perf_event_x86_regs is reclaimed
> +			 * for eGPRs and other general registers.
> +			 */
> +			if (event->attr.sample_regs_user & reserved ||
> +			    event->attr.sample_regs_intr & reserved)
>  				return -EINVAL;
> -			if (event->attr.sample_simd_regs_enabled)
> +			if ((event->attr.sample_regs_user & PERF_X86_EGPRS_MASK ||
> +			     event->attr.sample_regs_intr & PERF_X86_EGPRS_MASK) &&
> +			     !(x86_pmu.ext_regs_mask & XFEATURE_MASK_APX))
>  				return -EINVAL;
> +		} else {
> +			/*
> +			 * Besides the general purpose registers, XMM registers may
> +			 * be collected as well.
> +			 */
> +			if (event_has_extended_regs(event)) {
> +				if (!(event->pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS))
> +					return -EINVAL;
> +				if (!(x86_pmu.ext_regs_mask & XFEATURE_MASK_SSE))
> +					return -EINVAL;
> +			}
>  		}
>  
>  		if (event_has_simd_regs(event)) {
> @@ -1881,6 +1899,11 @@ void x86_pmu_setup_regs_data(struct perf_event *event,
>  			perf_regs->opmask_regs = NULL;
>  			mask |= XFEATURE_MASK_OPMASK;
>  		}
> +		if (attr->sample_regs_user & PERF_X86_EGPRS_MASK ||
> +		    attr->sample_regs_intr & PERF_X86_EGPRS_MASK) {
> +			perf_regs->egpr_regs = NULL;
> +			mask |= XFEATURE_MASK_APX;
> +		}
>  	}
>  
>  	mask &= ~ignore_mask;
> diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
> index dda677022882..4400cb66bc8e 100644
> --- a/arch/x86/include/asm/perf_event.h
> +++ b/arch/x86/include/asm/perf_event.h
> @@ -613,6 +613,10 @@ struct x86_perf_regs {
>  		u64	*opmask_regs;
>  		struct avx_512_opmask_state *opmask;
>  	};
> +	union {
> +		u64	*egpr_regs;
> +		struct apx_state *egpr;
> +	};
>  };
>  
>  extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
> diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
> index dd7bd1dd8d39..cd0f6804debf 100644
> --- a/arch/x86/include/uapi/asm/perf_regs.h
> +++ b/arch/x86/include/uapi/asm/perf_regs.h
> @@ -27,11 +27,31 @@ enum perf_event_x86_regs {
>  	PERF_REG_X86_R13,
>  	PERF_REG_X86_R14,
>  	PERF_REG_X86_R15,
> +	/* Extended GPRs (EGPRs) */
> +	PERF_REG_X86_R16,
> +	PERF_REG_X86_R17,
> +	PERF_REG_X86_R18,
> +	PERF_REG_X86_R19,
> +	PERF_REG_X86_R20,
> +	PERF_REG_X86_R21,
> +	PERF_REG_X86_R22,
> +	PERF_REG_X86_R23,
> +	PERF_REG_X86_R24,
> +	PERF_REG_X86_R25,
> +	PERF_REG_X86_R26,
> +	PERF_REG_X86_R27,
> +	PERF_REG_X86_R28,
> +	PERF_REG_X86_R29,
> +	PERF_REG_X86_R30,
> +	PERF_REG_X86_R31,
>  	/* These are the limits for the GPRs. */
>  	PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
> -	PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1,
> +	PERF_REG_X86_64_MAX = PERF_REG_X86_R31 + 1,
>  
> -	/* These all need two bits set because they are 128bit */
> +	/*
> +	 * These all need two bits set because they are 128bit.
> +	 * These are only available when !PERF_SAMPLE_REGS_ABI_SIMD
> +	 */

The eGPR indexes overlap with the XMM indexes. Users may get confused
by this; we'd better add comments to explain it.
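
Concretely, counting from the enum: R16-R31 take bit positions 24-39,
so R24-R31 reuse bits 32-39, which are the XMM0-XMM3 bit pairs in the
legacy (!sample_simd_regs_enabled) layout.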


>  	PERF_REG_X86_XMM0  = 32,
>  	PERF_REG_X86_XMM1  = 34,
>  	PERF_REG_X86_XMM2  = 36,
> @@ -55,6 +75,8 @@ enum perf_event_x86_regs {
>  
>  #define PERF_REG_EXTENDED_MASK	(~((1ULL << PERF_REG_X86_XMM0) - 1))
>  
> +#define PERF_X86_EGPRS_MASK		GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16)
> +
>  #define PERF_X86_SIMD_PRED_REGS_MAX	8
>  #define PERF_X86_SIMD_PRED_MASK		GENMASK(PERF_X86_SIMD_PRED_REGS_MAX - 1, 0)
>  #define PERF_X86_SIMD_VEC_REGS_MAX	32
> diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
> index 5e815f806605..b6e50194ff3e 100644
> --- a/arch/x86/kernel/perf_regs.c
> +++ b/arch/x86/kernel/perf_regs.c
> @@ -83,14 +83,22 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
>  {
>  	struct x86_perf_regs *perf_regs;
>  
> -	if (idx >= PERF_REG_X86_XMM0 && idx < PERF_REG_X86_XMM_MAX) {
> +	if (idx > PERF_REG_X86_R15) {
>  		perf_regs = container_of(regs, struct x86_perf_regs, regs);
> -		/* SIMD registers are moved to dedicated sample_simd_vec_reg */
> -		if (perf_regs->abi & PERF_SAMPLE_REGS_ABI_SIMD)
> -			return 0;
> -		if (!perf_regs->xmm_regs)
> -			return 0;
> -		return perf_regs->xmm_regs[idx - PERF_REG_X86_XMM0];
> +
> +		if (perf_regs->abi & PERF_SAMPLE_REGS_ABI_SIMD) {
> +			if (idx <= PERF_REG_X86_R31) {
> +				if (!perf_regs->egpr_regs)
> +					return 0;
> +				return perf_regs->egpr_regs[idx - PERF_REG_X86_R16];
> +			}
> +		} else {
> +			if (idx >= PERF_REG_X86_XMM0 && idx < PERF_REG_X86_XMM_MAX) {
> +				if (!perf_regs->xmm_regs)
> +					return 0;
> +				return perf_regs->xmm_regs[idx - PERF_REG_X86_XMM0];
> +			}
> +		}
>  	}
>  
>  	if (WARN_ON_ONCE(idx >= ARRAY_SIZE(pt_regs_offset)))
> @@ -171,14 +179,7 @@ int perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
>  				 ~((1ULL << PERF_REG_X86_MAX) - 1))
>  
>  #ifdef CONFIG_X86_32
> -#define REG_NOSUPPORT ((1ULL << PERF_REG_X86_R8) | \
> -		       (1ULL << PERF_REG_X86_R9) | \
> -		       (1ULL << PERF_REG_X86_R10) | \
> -		       (1ULL << PERF_REG_X86_R11) | \
> -		       (1ULL << PERF_REG_X86_R12) | \
> -		       (1ULL << PERF_REG_X86_R13) | \
> -		       (1ULL << PERF_REG_X86_R14) | \
> -		       (1ULL << PERF_REG_X86_R15))
> +#define REG_NOSUPPORT GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R8)
>  
>  int perf_reg_validate(u64 mask)
>  {

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [POC PATCH 16/17] perf parse-regs: Support the new SIMD format
  2025-08-15 21:34 ` [POC PATCH 16/17] perf parse-regs: Support the new SIMD format kan.liang
@ 2025-08-20 10:04   ` Mi, Dapeng
  2025-08-20 18:18     ` Liang, Kan
  2025-08-21  3:35   ` Mi, Dapeng
  1 sibling, 1 reply; 32+ messages in thread
From: Mi, Dapeng @ 2025-08-20 10:04 UTC (permalink / raw)
  To: kan.liang, peterz, mingo, acme, namhyung, tglx, dave.hansen,
	irogers, adrian.hunter, jolsa, alexander.shishkin, linux-kernel
  Cc: ak, zide.chen, mark.rutland, broonie, ravi.bangoria, eranian


On 8/16/2025 5:34 AM, kan.liang@linux.intel.com wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
>
> Add has_cap_simd_regs() to check if the new SIMD format is available.
> If yes, get the possible mask and qwords.
>
> Add several __weak functions to return qwords and mask for vector and
> pred registers.
>
> Only support collecting the vector and pred as a whole, and only the
> superset. For example, -I XMM,YMM. Only collect all 16 YMMs.
>
> Examples:
>  $perf record -I?
>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>  R11 R12 R13 R14 R15 SSP XMM0-31 YMM0-31 ZMM0-31 OPMASK0-7

I haven't had time to fully review this patch yet, but the output on
SPR seems incorrect.

./perf record -I?
available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10 R11
R12 R13 R14 R15 XMM0--1 YMM0--1 ZMM0--1

(Looks like the pred_qwords == r->qwords.pred check in __parse_regs()
also matches the vector entries when pred_qwords is 0, and
fls64(pred_mask) - 1 then prints -1 for the empty mask.)


>
>  $perf record --user-regs=?
>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>  R11 R12 R13 R14 R15 SSP XMM0-31 YMM0-31 ZMM0-31 OPMASK0-7
>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> ---
>  tools/perf/arch/x86/util/perf_regs.c      | 257 +++++++++++++++++++++-
>  tools/perf/util/evsel.c                   |  25 +++
>  tools/perf/util/parse-regs-options.c      |  60 ++++-
>  tools/perf/util/perf_event_attr_fprintf.c |   6 +
>  tools/perf/util/perf_regs.c               |  29 +++
>  tools/perf/util/perf_regs.h               |  13 +-
>  tools/perf/util/record.h                  |   6 +
>  7 files changed, 381 insertions(+), 15 deletions(-)
>
> diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
> index 12fd93f04802..78027df1af9a 100644
> --- a/tools/perf/arch/x86/util/perf_regs.c
> +++ b/tools/perf/arch/x86/util/perf_regs.c
> @@ -13,6 +13,49 @@
>  #include "../../../util/pmu.h"
>  #include "../../../util/pmus.h"
>  
> +static const struct sample_reg sample_reg_masks_ext[] = {
> +	SMPL_REG(AX, PERF_REG_X86_AX),
> +	SMPL_REG(BX, PERF_REG_X86_BX),
> +	SMPL_REG(CX, PERF_REG_X86_CX),
> +	SMPL_REG(DX, PERF_REG_X86_DX),
> +	SMPL_REG(SI, PERF_REG_X86_SI),
> +	SMPL_REG(DI, PERF_REG_X86_DI),
> +	SMPL_REG(BP, PERF_REG_X86_BP),
> +	SMPL_REG(SP, PERF_REG_X86_SP),
> +	SMPL_REG(IP, PERF_REG_X86_IP),
> +	SMPL_REG(FLAGS, PERF_REG_X86_FLAGS),
> +	SMPL_REG(CS, PERF_REG_X86_CS),
> +	SMPL_REG(SS, PERF_REG_X86_SS),
> +#ifdef HAVE_ARCH_X86_64_SUPPORT
> +	SMPL_REG(R8, PERF_REG_X86_R8),
> +	SMPL_REG(R9, PERF_REG_X86_R9),
> +	SMPL_REG(R10, PERF_REG_X86_R10),
> +	SMPL_REG(R11, PERF_REG_X86_R11),
> +	SMPL_REG(R12, PERF_REG_X86_R12),
> +	SMPL_REG(R13, PERF_REG_X86_R13),
> +	SMPL_REG(R14, PERF_REG_X86_R14),
> +	SMPL_REG(R15, PERF_REG_X86_R15),
> +	SMPL_REG(R16, PERF_REG_X86_R16),
> +	SMPL_REG(R17, PERF_REG_X86_R17),
> +	SMPL_REG(R18, PERF_REG_X86_R18),
> +	SMPL_REG(R19, PERF_REG_X86_R19),
> +	SMPL_REG(R20, PERF_REG_X86_R20),
> +	SMPL_REG(R21, PERF_REG_X86_R21),
> +	SMPL_REG(R22, PERF_REG_X86_R22),
> +	SMPL_REG(R23, PERF_REG_X86_R23),
> +	SMPL_REG(R24, PERF_REG_X86_R24),
> +	SMPL_REG(R25, PERF_REG_X86_R25),
> +	SMPL_REG(R26, PERF_REG_X86_R26),
> +	SMPL_REG(R27, PERF_REG_X86_R27),
> +	SMPL_REG(R28, PERF_REG_X86_R28),
> +	SMPL_REG(R29, PERF_REG_X86_R29),
> +	SMPL_REG(R30, PERF_REG_X86_R30),
> +	SMPL_REG(R31, PERF_REG_X86_R31),
> +	SMPL_REG(SSP, PERF_REG_X86_SSP),
> +#endif
> +	SMPL_REG_END
> +};
> +
>  static const struct sample_reg sample_reg_masks[] = {
>  	SMPL_REG(AX, PERF_REG_X86_AX),
>  	SMPL_REG(BX, PERF_REG_X86_BX),
> @@ -276,27 +319,159 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
>  	return SDT_ARG_VALID;
>  }
>  
> +static bool support_simd_reg(u64 sample_type, u16 qwords, u64 mask, bool pred)
> +{
> +	struct perf_event_attr attr = {
> +		.type				= PERF_TYPE_HARDWARE,
> +		.config				= PERF_COUNT_HW_CPU_CYCLES,
> +		.sample_type			= sample_type,
> +		.disabled 			= 1,
> +		.exclude_kernel			= 1,
> +		.sample_simd_regs_enabled	= 1,
> +	};
> +	int fd;
> +
> +	attr.sample_period = 1;
> +
> +	if (!pred) {
> +		attr.sample_simd_vec_reg_qwords = qwords;
> +		if (sample_type == PERF_SAMPLE_REGS_INTR)
> +			attr.sample_simd_vec_reg_intr = mask;
> +		else
> +			attr.sample_simd_vec_reg_user = mask;
> +	} else {
> +		attr.sample_simd_pred_reg_qwords = PERF_X86_OPMASK_QWORDS;
> +		if (sample_type == PERF_SAMPLE_REGS_INTR)
> +			attr.sample_simd_pred_reg_intr = PERF_X86_SIMD_PRED_MASK;
> +		else
> +			attr.sample_simd_pred_reg_user = PERF_X86_SIMD_PRED_MASK;
> +	}
> +
> +	if (perf_pmus__num_core_pmus() > 1) {
> +		struct perf_pmu *pmu = NULL;
> +		__u64 type = PERF_TYPE_RAW;
> +
> +		/*
> +		 * The same register set is supported among different hybrid PMUs.
> +		 * Only check the first available one.
> +		 */
> +		while ((pmu = perf_pmus__scan_core(pmu)) != NULL) {
> +			type = pmu->type;
> +			break;
> +		}
> +		attr.config |= type << PERF_PMU_TYPE_SHIFT;
> +	}
> +
> +	event_attr_init(&attr);
> +
> +	fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
> +	if (fd != -1) {
> +		close(fd);
> +		return true;
> +	}
> +
> +	return false;
> +}
> +
> +static uint64_t intr_simd_mask, user_simd_mask, pred_mask;
> +static u16	intr_simd_qwords, user_simd_qwords, pred_qwords;
> +
> +static bool get_simd_reg_mask(u64 sample_type)
> +{
> +	u64 mask = GENMASK_ULL(PERF_X86_H16ZMM_BASE - 1, 0);
> +	u16 qwords = PERF_X86_ZMM_QWORDS;
> +
> +	if (support_simd_reg(sample_type, qwords, mask, false)) {
> +		if (support_simd_reg(sample_type, qwords, PERF_X86_SIMD_VEC_MASK, false))
> +			mask = PERF_X86_SIMD_VEC_MASK;
> +	} else {
> +		qwords = PERF_X86_YMM_QWORDS;
> +		if (!support_simd_reg(sample_type, qwords, mask, false)) {
> +			qwords = PERF_X86_XMM_QWORDS;
> +			if (!support_simd_reg(sample_type, qwords, mask, false)) {
> +				qwords = 0;
> +				mask = 0;
> +			}
> +		}
> +	}
> +
> +	if (sample_type == PERF_SAMPLE_REGS_INTR) {
> +		intr_simd_mask = mask;
> +		intr_simd_qwords = qwords;
> +	} else {
> +		user_simd_mask = mask;
> +		user_simd_qwords = qwords;
> +	}
> +
> +	if (support_simd_reg(sample_type, qwords, mask, true)) {
> +		pred_mask = PERF_X86_SIMD_PRED_MASK;
> +		pred_qwords = PERF_X86_OPMASK_QWORDS;
> +	}
> +
> +	return true;
> +}
> +
> +static bool has_cap_simd_regs(void)
> +{
> +	static bool has_cap_simd_regs;
> +	static bool cached;
> +
> +	if (cached)
> +		return has_cap_simd_regs;
> +
> +	cached = true;
> +	has_cap_simd_regs = get_simd_reg_mask(PERF_SAMPLE_REGS_INTR);
> +	has_cap_simd_regs |= get_simd_reg_mask(PERF_SAMPLE_REGS_USER);
> +
> +	return has_cap_simd_regs;
> +}
> +
>  const struct sample_reg *arch__sample_reg_masks(void)
>  {
> +	if (has_cap_simd_regs())
> +		return sample_reg_masks_ext;
>  	return sample_reg_masks;
>  }
>  
> -uint64_t arch__intr_reg_mask(void)
> +static const struct sample_reg sample_simd_reg_masks_empty[] = {
> +	SMPL_REG_END
> +};
> +
> +static const struct sample_reg sample_simd_reg_masks[] = {
> +	SMPL_REG(XMM, 1),
> +	SMPL_REG(YMM, 2),
> +	SMPL_REG(ZMM, 3),
> +	SMPL_REG(OPMASK, 32),
> +	SMPL_REG_END
> +};
> +
> +const struct sample_reg *arch__sample_simd_reg_masks(void)
> +{
> +	if (has_cap_simd_regs())
> +		return sample_simd_reg_masks;
> +	return sample_simd_reg_masks_empty;
> +}
> +
> +static uint64_t __arch__reg_mask(u64 sample_type, u64 mask, bool has_simd_regs)
>  {
>  	struct perf_event_attr attr = {
> -		.type			= PERF_TYPE_HARDWARE,
> -		.config			= PERF_COUNT_HW_CPU_CYCLES,
> -		.sample_type		= PERF_SAMPLE_REGS_INTR,
> -		.sample_regs_intr	= PERF_REG_EXTENDED_MASK,
> -		.precise_ip		= 1,
> -		.disabled 		= 1,
> -		.exclude_kernel		= 1,
> +		.type				= PERF_TYPE_HARDWARE,
> +		.config				= PERF_COUNT_HW_CPU_CYCLES,
> +		.sample_type			= sample_type,
> +		.precise_ip			= 1,
> +		.disabled 			= 1,
> +		.exclude_kernel			= 1,
> +		.sample_simd_regs_enabled	= has_simd_regs,
>  	};
>  	int fd;
>  	/*
>  	 * In an unnamed union, init it here to build on older gcc versions
>  	 */
>  	attr.sample_period = 1;
> +	if (sample_type == PERF_SAMPLE_REGS_INTR)
> +		attr.sample_regs_intr = mask;
> +	else
> +		attr.sample_regs_user = mask;
>  
>  	if (perf_pmus__num_core_pmus() > 1) {
>  		struct perf_pmu *pmu = NULL;
> @@ -318,13 +493,73 @@ uint64_t arch__intr_reg_mask(void)
>  	fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
>  	if (fd != -1) {
>  		close(fd);
> -		return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
> +		return mask;
>  	}
>  
> -	return PERF_REGS_MASK;
> +	return 0;
> +}
> +
> +uint64_t arch__intr_reg_mask(void)
> +{
> +	uint64_t mask = PERF_REGS_MASK;
> +
> +	if (has_cap_simd_regs()) {
> +		mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
> +					 GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
> +					 true);
> +		mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
> +					 BIT_ULL(PERF_REG_X86_SSP),
> +					 true);
> +	} else
> +		mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR, PERF_REG_EXTENDED_MASK, false);
> +
> +	return mask;
>  }
>  
>  uint64_t arch__user_reg_mask(void)
>  {
> -	return PERF_REGS_MASK;
> +	uint64_t mask = PERF_REGS_MASK;
> +
> +	if (has_cap_simd_regs()) {
> +		mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
> +					 GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
> +					 true);
> +		mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
> +					 BIT_ULL(PERF_REG_X86_SSP),
> +					 true);
> +	}
> +
> +	return mask;
> +}
> +
> +uint64_t arch__intr_simd_reg_mask(u16 *qwords)
> +{
> +	if (!has_cap_simd_regs())
> +		return 0;
> +	*qwords = intr_simd_qwords;
> +	return intr_simd_mask;
> +}
> +
> +uint64_t arch__user_simd_reg_mask(u16 *qwords)
> +{
> +	if (!has_cap_simd_regs())
> +		return 0;
> +	*qwords = user_simd_qwords;
> +	return user_simd_mask;
> +}
> +
> +uint64_t arch__intr_pred_reg_mask(u16 *qwords)
> +{
> +	if (!has_cap_simd_regs())
> +		return 0;
> +	*qwords = pred_qwords;
> +	return pred_mask;
> +}
> +
> +uint64_t arch__user_pred_reg_mask(u16 *qwords)
> +{
> +	if (!has_cap_simd_regs())
> +		return 0;
> +	*qwords = pred_qwords;
> +	return pred_mask;
>  }
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index d55482f094bf..af6e1c843fc5 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -1402,12 +1402,37 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
>  		evsel__set_sample_bit(evsel, REGS_INTR);
>  	}
>  
> +	if ((opts->sample_intr_vec_regs || opts->sample_intr_pred_regs) &&
> +	    !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
> +		/* A nonzero pred qwords implies the new set of SIMD registers is used */
> +		if (opts->sample_pred_regs_qwords)
> +			attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
> +		else
> +			attr->sample_simd_pred_reg_qwords = 1;
> +		attr->sample_simd_vec_reg_intr = opts->sample_intr_vec_regs;
> +		attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
> +		attr->sample_simd_pred_reg_intr = opts->sample_intr_pred_regs;
> +		evsel__set_sample_bit(evsel, REGS_INTR);
> +	}
> +
>  	if (opts->sample_user_regs && !evsel->no_aux_samples &&
>  	    !evsel__is_dummy_event(evsel)) {
>  		attr->sample_regs_user |= opts->sample_user_regs;
>  		evsel__set_sample_bit(evsel, REGS_USER);
>  	}
>  
> +	if ((opts->sample_user_vec_regs || opts->sample_user_pred_regs) &&
> +	    !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
> +		if (opts->sample_pred_regs_qwords)
> +			attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
> +		else
> +			attr->sample_simd_pred_reg_qwords = 1;
> +		attr->sample_simd_vec_reg_user = opts->sample_user_vec_regs;
> +		attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
> +		attr->sample_simd_pred_reg_user = opts->sample_user_pred_regs;
> +		evsel__set_sample_bit(evsel, REGS_USER);
> +	}
> +
>  	if (target__has_cpu(&opts->target) || opts->sample_cpu)
>  		evsel__set_sample_bit(evsel, CPU);
>  
> diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
> index cda1c620968e..27266038352f 100644
> --- a/tools/perf/util/parse-regs-options.c
> +++ b/tools/perf/util/parse-regs-options.c
> @@ -4,20 +4,26 @@
>  #include <stdint.h>
>  #include <string.h>
>  #include <stdio.h>
> +#include <linux/bitops.h>
>  #include "util/debug.h"
>  #include <subcmd/parse-options.h>
>  #include "util/perf_regs.h"
>  #include "util/parse-regs-options.h"
> +#include "record.h"
>  
>  static int
>  __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>  {
>  	uint64_t *mode = (uint64_t *)opt->value;
>  	const struct sample_reg *r = NULL;
> +	u16 simd_qwords, pred_qwords;
> +	u64 simd_mask, pred_mask;
> +	struct record_opts *opts;
>  	char *s, *os = NULL, *p;
>  	int ret = -1;
>  	uint64_t mask;
>  
> +
>  	if (unset)
>  		return 0;
>  
> @@ -27,10 +33,17 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>  	if (*mode)
>  		return -1;
>  
> -	if (intr)
> +	if (intr) {
> +		opts = container_of(opt->value, struct record_opts, sample_intr_regs);
>  		mask = arch__intr_reg_mask();
> -	else
> +		simd_mask = arch__intr_simd_reg_mask(&simd_qwords);
> +		pred_mask = arch__intr_pred_reg_mask(&pred_qwords);
> +	} else {
> +		opts = container_of(opt->value, struct record_opts, sample_user_regs);
>  		mask = arch__user_reg_mask();
> +		simd_mask = arch__user_simd_reg_mask(&simd_qwords);
> +		pred_mask = arch__user_pred_reg_mask(&pred_qwords);
> +	}
>  
>  	/* str may be NULL in case no arg is passed to -I */
>  	if (str) {
> @@ -50,10 +63,51 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>  					if (r->mask & mask)
>  						fprintf(stderr, "%s ", r->name);
>  				}
> +				for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> +					if (pred_qwords == r->qwords.pred) {
> +						fprintf(stderr, "%s0-%d ", r->name, fls64(pred_mask) - 1);
> +						continue;
> +					}
> +					if (simd_qwords >= r->mask)
> +						fprintf(stderr, "%s0-%d ", r->name, fls64(simd_mask) - 1);
> +				}
> +
>  				fputc('\n', stderr);
>  				/* just printing available regs */
>  				goto error;
>  			}
> +
> +			if (simd_mask || pred_mask) {
> +				u16 vec_regs_qwords = 0, pred_regs_qwords = 0;
> +
> +				for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> +					if (!strcasecmp(s, r->name)) {
> +						vec_regs_qwords = r->qwords.vec;
> +						pred_regs_qwords = r->qwords.pred;
> +						break;
> +					}
> +				}
> +
> +				/* Just need the highest qwords */
> +				if (vec_regs_qwords > opts->sample_vec_regs_qwords) {
> +					opts->sample_vec_regs_qwords = vec_regs_qwords;
> +					if (intr)
> +						opts->sample_intr_vec_regs = simd_mask;
> +					else
> +						opts->sample_user_vec_regs = simd_mask;
> +				}
> +				if (pred_regs_qwords > opts->sample_pred_regs_qwords) {
> +					opts->sample_pred_regs_qwords = pred_regs_qwords;
> +					if (intr)
> +						opts->sample_intr_pred_regs = pred_mask;
> +					else
> +						opts->sample_user_pred_regs = pred_mask;
> +				}
> +
> +				if (r->name)
> +					goto next;
> +			}
> +
>  			for (r = arch__sample_reg_masks(); r->name; r++) {
>  				if ((r->mask & mask) && !strcasecmp(s, r->name))
>  					break;
> @@ -65,7 +119,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>  			}
>  
>  			*mode |= r->mask;
> -
> +next:
>  			if (!p)
>  				break;
>  
> diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
> index 66b666d9ce64..fb0366d050cf 100644
> --- a/tools/perf/util/perf_event_attr_fprintf.c
> +++ b/tools/perf/util/perf_event_attr_fprintf.c
> @@ -360,6 +360,12 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
>  	PRINT_ATTRf(aux_start_paused, p_unsigned);
>  	PRINT_ATTRf(aux_pause, p_unsigned);
>  	PRINT_ATTRf(aux_resume, p_unsigned);
> +	PRINT_ATTRf(sample_simd_pred_reg_qwords, p_unsigned);
> +	PRINT_ATTRf(sample_simd_pred_reg_intr, p_hex);
> +	PRINT_ATTRf(sample_simd_pred_reg_user, p_hex);
> +	PRINT_ATTRf(sample_simd_vec_reg_qwords, p_unsigned);
> +	PRINT_ATTRf(sample_simd_vec_reg_intr, p_hex);
> +	PRINT_ATTRf(sample_simd_vec_reg_user, p_hex);
>  
>  	return ret;
>  }
> diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
> index 44b90bbf2d07..0744c77b4ac8 100644
> --- a/tools/perf/util/perf_regs.c
> +++ b/tools/perf/util/perf_regs.c
> @@ -21,6 +21,30 @@ uint64_t __weak arch__user_reg_mask(void)
>  	return 0;
>  }
>  
> +uint64_t __weak arch__intr_simd_reg_mask(u16 *qwords)
> +{
> +	*qwords = 0;
> +	return 0;
> +}
> +
> +uint64_t __weak arch__user_simd_reg_mask(u16 *qwords)
> +{
> +	*qwords = 0;
> +	return 0;
> +}
> +
> +uint64_t __weak arch__intr_pred_reg_mask(u16 *qwords)
> +{
> +	*qwords = 0;
> +	return 0;
> +}
> +
> +uint64_t __weak arch__user_pred_reg_mask(u16 *qwords)
> +{
> +	*qwords = 0;
> +	return 0;
> +}
> +
>  static const struct sample_reg sample_reg_masks[] = {
>  	SMPL_REG_END
>  };
> @@ -30,6 +54,11 @@ const struct sample_reg * __weak arch__sample_reg_masks(void)
>  	return sample_reg_masks;
>  }
>  
> +const struct sample_reg * __weak arch__sample_simd_reg_masks(void)
> +{
> +	return sample_reg_masks;
> +}
> +
>  const char *perf_reg_name(int id, const char *arch)
>  {
>  	const char *reg_name = NULL;
> diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
> index f2d0736d65cc..b932caa73a8a 100644
> --- a/tools/perf/util/perf_regs.h
> +++ b/tools/perf/util/perf_regs.h
> @@ -9,7 +9,13 @@ struct regs_dump;
>  
>  struct sample_reg {
>  	const char *name;
> -	uint64_t mask;
> +	union {
> +		struct {
> +			uint32_t vec;
> +			uint32_t pred;
> +		} qwords;
> +		uint64_t mask;
> +	};
>  };
>  
>  #define SMPL_REG_MASK(b) (1ULL << (b))
> @@ -27,6 +33,11 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op);
>  uint64_t arch__intr_reg_mask(void);
>  uint64_t arch__user_reg_mask(void);
>  const struct sample_reg *arch__sample_reg_masks(void);
> +const struct sample_reg *arch__sample_simd_reg_masks(void);
> +uint64_t arch__intr_simd_reg_mask(u16 *qwords);
> +uint64_t arch__user_simd_reg_mask(u16 *qwords);
> +uint64_t arch__intr_pred_reg_mask(u16 *qwords);
> +uint64_t arch__user_pred_reg_mask(u16 *qwords);
>  
>  const char *perf_reg_name(int id, const char *arch);
>  int perf_reg_value(u64 *valp, struct regs_dump *regs, int id);
> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
> index ea3a6c4657ee..825ffb4cc53f 100644
> --- a/tools/perf/util/record.h
> +++ b/tools/perf/util/record.h
> @@ -59,7 +59,13 @@ struct record_opts {
>  	unsigned int  user_freq;
>  	u64	      branch_stack;
>  	u64	      sample_intr_regs;
> +	u64	      sample_intr_vec_regs;
>  	u64	      sample_user_regs;
> +	u64	      sample_user_vec_regs;
> +	u16	      sample_pred_regs_qwords;
> +	u16	      sample_vec_regs_qwords;
> +	u16	      sample_intr_pred_regs;
> +	u16	      sample_user_pred_regs;
>  	u64	      default_interval;
>  	u64	      user_interval;
>  	size_t	      auxtrace_snapshot_size;
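
FWIW, with the parsing above, the new register groups end up being
requested by name, e.g. (assuming the POC tool):

  $ perf record -I XMM,YMM ...        # collects all 16 YMMs (intr regs)
  $ perf record --user-regs=ZMM ...   # collects the ZMM set (user regs)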

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V3 05/17] perf/x86: Support XMM register for non-PEBS and REGS_USER
  2025-08-20  9:46       ` Mi, Dapeng
@ 2025-08-20 18:03         ` Liang, Kan
  2025-08-21  1:00           ` Mi, Dapeng
  0 siblings, 1 reply; 32+ messages in thread
From: Liang, Kan @ 2025-08-20 18:03 UTC (permalink / raw)
  To: Mi, Dapeng, Peter Zijlstra
  Cc: mingo, acme, namhyung, tglx, dave.hansen, irogers, adrian.hunter,
	jolsa, alexander.shishkin, linux-kernel, ak, zide.chen,
	mark.rutland, broonie, ravi.bangoria, eranian



On 2025-08-20 2:46 a.m., Mi, Dapeng wrote:
> 
> On 8/19/2025 11:55 PM, Liang, Kan wrote:
>>
>> On 2025-08-19 6:39 a.m., Peter Zijlstra wrote:
>>> On Fri, Aug 15, 2025 at 02:34:23PM -0700, kan.liang@linux.intel.com wrote:
>>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>>
>>>> Collecting the XMM registers in a PEBS record has been supported since
>>>> Ice Lake. But non-PEBS events don't support the feature. It's
>>>> possible to retrieve the XMM registers from the XSAVE for non-PEBS.
>>>> Add it to make the feature complete.
>>>>
>>>> To utilize the XSAVE, a 64-byte aligned buffer is required. Add a
>>>> per-CPU ext_regs_buf to store the vector registers. The size of the
>>>> buffer is ~2K. kzalloc_node() is used because there's a _guarantee_
>>>> that all kmalloc()'s with powers of 2 are naturally aligned and also
>>>> 64-byte aligned.
>>>>
>>>> Extend the support for both REGS_USER and REGS_INTR. For REGS_USER, the
>>>> perf_get_regs_user() returns the regs from the task_pt_regs(current),
>>>> which is a struct pt_regs. It needs to be moved to the local struct
>>>> x86_perf_regs x86_user_regs.
>>>> For PEBS, the HW support is still preferred. The XMM should be retrieved
>>>> from PEBS records.
>>>>
>>>> There could be more vector registers supported later. Add ext_regs_mask
>>>> to track the supported vector register group.
>>>
>>> I'm a little confused... *again* :-)
>>>
>>> Specifically, we should consider two sets of registers:
>>>
>>>  - the live set, as per the CPU (XSAVE)
>>>  - the stored set, as per x86_task_fpu()
>>>
>>> regs_intr should always get a copy of the live set; however
>>> regs_user should not. It might need a copy of the x86_task_fpu() instead
>>> of the live set, depending on TIF_NEED_FPU_LOAD (more or less, we need
>>> another variable set in kernel_fpu_begin_mask() *after*
>>> save_fpregs_to_fpstate() is completed).
>>>
>>> I don't see this code make this distinction.
>>>
>>> Consider getting a sample while the kernel is doing some avx enhanced
>>> crypto and such.
>> The regs_user only needs a register set when the NMI hits user mode
>> (user_mode(regs)) or a non-kernel thread (!(current->flags &
>> PF_KTHREAD)). The live set is good enough for both cases.
> 
> It's fine if the NMI hits user mode, but if the NMI hits kernel mode
> (!(current->flags & PF_KTHREAD)), won't the kernel space SIMD/eGPR regs
> be exposed to user space via the user-regs option? I'm not sure if the
> kernel really uses these SIMD/eGPR regs right now, but it seems risky.
> 
>

I don't think it's possible for the existing kernel. But I cannot
guarantee future usage.

If the kernel mode handling is still a concern, I think we should drop
the SIMD/eGPR regs for that case for now, because:
- To profile a userspace application which requires SIMD/eGPR regs, the
NMI usually hits userspace. It's not common to hit kernel mode.
- The SIMD/eGPR regs cannot be retrieved from task_pt_regs(). Although
it's possible to retrieve the values when the TIF_NEED_FPU_LOAD flag is
set, I don't think it's worth introducing such complexity to handle an
uncommon case in the critical path.
- Furthermore, only checking the TIF_NEED_FPU_LOAD flag cannot cure
everything. Some corner cases cannot be handled either. For example, an
NMI can happen right after the flag is switched, while still in kernel
mode.

We can always add the support later if someone thinks it's important to
retrieve the user SIMD/eGPR regs during the kernel syscall.
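
Roughly, the gating I have in mind looks like the below (an untested
sketch, not part of the posted patches; user_mode() and PF_KTHREAD are
the existing kernel helpers):

static bool sample_user_simd_regs_ok(struct pt_regs *regs)
{
	/* NMI landed in user mode: the live XSAVE state is the user state. */
	if (user_mode(regs))
		return true;

	/* Kernel thread: there is no user FPU state to report. */
	if (current->flags & PF_KTHREAD)
		return false;

	/*
	 * NMI landed in kernel mode within a user task: the live
	 * SIMD/eGPR values may belong to the kernel, so skip them
	 * for now.
	 */
	return false;
}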

Thanks,
Kan
>>
>> I think the kernel crypto should run in a kernel thread (current->flags &
>> PF_KTHREAD). If so, the regs_user should return NULL.
>>
>> Thanks,
>> Kan
>>
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V3 06/17] perf: Support SIMD registers
  2025-08-20  9:55   ` Mi, Dapeng
@ 2025-08-20 18:08     ` Liang, Kan
  0 siblings, 0 replies; 32+ messages in thread
From: Liang, Kan @ 2025-08-20 18:08 UTC (permalink / raw)
  To: Mi, Dapeng, peterz, mingo, acme, namhyung, tglx, dave.hansen,
	irogers, adrian.hunter, jolsa, alexander.shishkin, linux-kernel
  Cc: ak, zide.chen, mark.rutland, broonie, ravi.bangoria, eranian



On 2025-08-20 2:55 a.m., Mi, Dapeng wrote:
> 
> On 8/16/2025 5:34 AM, kan.liang@linux.intel.com wrote:
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> The users may be interested in the SIMD registers in a sample while
>> profiling. The current sample_regs_XXX doesn't have enough space for all
>> SIMD registers.
>>
>> Add sets of the sample_simd_{pred,vec}_reg_* in the
>> struct perf_event_attr to define a set of SIMD registers to dump on
>> samples.
>> The current X86 supports the XMM registers in sample_regs_XXX. To
>> utilize the new SIMD registers configuration method, the
>> sample_simd_regs_enabled should always be set. If so, the XMM space in
>> the sample_regs_XXX is reserved for other usage.
>>
>> The SIMD registers are wider than 64 bits. A new output format is
>> introduced. The number and width of SIMD registers will be dumped
>> first, followed by the register values. The number and width are the
>> same as the user's configuration now. If, for some reason (e.g., on
>> ARM), they are different,
>> an ARCH-specific perf_output_sample_simd_regs can be implemented later
>> separately.
>> Add a new ABI, PERF_SAMPLE_REGS_ABI_SIMD, to indicate the new format.
>> The enum perf_sample_regs_abi becomes a bitmap now. There should be no
>> impact on existing tools, since the version and bitmap encodings are
>> the same for values 1 and 2.
>>
>> Add three new __weak functions to retrieve the number of available
>> registers, validate the configuration of the SIMD registers, and
>> retrieve the SIMD registers. The ARCH-specific functions will be
>> implemented in the following patches.
>>
>> Add a new flag PERF_PMU_CAP_SIMD_REGS to indicate that the PMU has the
>> capability to support SIMD register dumping. Error out if the
>> sample_simd_{pred,vec}_reg_* fields are mistakenly set for a PMU that
>> doesn't have the capability.
>>
>> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>> ---
>>  include/linux/perf_event.h      |  13 ++++
>>  include/linux/perf_regs.h       |   9 +++
>>  include/uapi/linux/perf_event.h |  47 +++++++++++++--
>>  kernel/events/core.c            | 101 +++++++++++++++++++++++++++++++-
>>  4 files changed, 162 insertions(+), 8 deletions(-)
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index 444b162f3f92..205361b7de2e 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -305,6 +305,7 @@ struct perf_event_pmu_context;
>>  #define PERF_PMU_CAP_EXTENDED_HW_TYPE	0x0100
>>  #define PERF_PMU_CAP_AUX_PAUSE		0x0200
>>  #define PERF_PMU_CAP_AUX_PREFER_LARGE	0x0400
>> +#define PERF_PMU_CAP_SIMD_REGS		0x0800
>>  
>>  /**
>>   * pmu::scope
>> @@ -1526,6 +1527,18 @@ perf_event__output_id_sample(struct perf_event *event,
>>  extern void
>>  perf_log_lost_samples(struct perf_event *event, u64 lost);
>>  
>> +static inline bool event_has_simd_regs(struct perf_event *event)
>> +{
>> +	struct perf_event_attr *attr = &event->attr;
>> +
>> +	return attr->sample_simd_regs_enabled != 0 ||
>> +	       attr->sample_simd_pred_reg_intr != 0 ||
>> +	       attr->sample_simd_pred_reg_user != 0 ||
>> +	       attr->sample_simd_vec_reg_qwords != 0 ||
>> +	       attr->sample_simd_vec_reg_intr != 0 ||
>> +	       attr->sample_simd_vec_reg_user != 0;
>> +}
>> +
>>  static inline bool event_has_extended_regs(struct perf_event *event)
>>  {
>>  	struct perf_event_attr *attr = &event->attr;
>> diff --git a/include/linux/perf_regs.h b/include/linux/perf_regs.h
>> index f632c5725f16..0172682b18fd 100644
>> --- a/include/linux/perf_regs.h
>> +++ b/include/linux/perf_regs.h
>> @@ -9,6 +9,15 @@ struct perf_regs {
>>  	struct pt_regs	*regs;
>>  };
>>  
>> +int perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
>> +			   u16 pred_qwords, u32 pred_mask);
>> +u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
>> +			u16 qwords_idx, bool pred);
>> +void perf_simd_reg_check(struct pt_regs *regs,
>> +			 u64 mask, u16 *nr_vectors, u16 *vec_qwords,
>> +			 u16 pred_mask, u16 *nr_pred, u16 *pred_qwords);
>> +
>> +
>>  #ifdef CONFIG_HAVE_PERF_REGS
>>  #include <asm/perf_regs.h>
>>  
>> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
>> index 78a362b80027..2e9b16acbed6 100644
>> --- a/include/uapi/linux/perf_event.h
>> +++ b/include/uapi/linux/perf_event.h
>> @@ -313,9 +313,10 @@ enum {
>>   * Values to determine ABI of the registers dump.
>>   */
>>  enum perf_sample_regs_abi {
>> -	PERF_SAMPLE_REGS_ABI_NONE		= 0,
>> -	PERF_SAMPLE_REGS_ABI_32			= 1,
>> -	PERF_SAMPLE_REGS_ABI_64			= 2,
>> +	PERF_SAMPLE_REGS_ABI_NONE		= 0x00,
>> +	PERF_SAMPLE_REGS_ABI_32			= 0x01,
>> +	PERF_SAMPLE_REGS_ABI_64			= 0x02,
>> +	PERF_SAMPLE_REGS_ABI_SIMD		= 0x04,
> 
> Better change the definition to bitmap format, so it clearly indicates the
> ABI is a bitmap format.
> 
> enum perf_sample_regs_abi {
>     PERF_SAMPLE_REGS_ABI_NONE        = 0,
>     PERF_SAMPLE_REGS_ABI_32            = 1 << 0,
>     PERF_SAMPLE_REGS_ABI_64            = 1 << 1,
>     PERF_SAMPLE_REGS_ABI_SIMD        = 1 << 2,
> };
> 
> 

BIT_ULL() should be better.
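
I.e., roughly (a sketch; the values for 1 and 2 stay unchanged):

enum perf_sample_regs_abi {
	PERF_SAMPLE_REGS_ABI_NONE	= 0,
	PERF_SAMPLE_REGS_ABI_32		= BIT_ULL(0),
	PERF_SAMPLE_REGS_ABI_64		= BIT_ULL(1),
	PERF_SAMPLE_REGS_ABI_SIMD	= BIT_ULL(2),
};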

Thanks,
Kan

> 
>>  };
>>  
>>  /*
>> @@ -382,6 +383,7 @@ enum perf_event_read_format {
>>  #define PERF_ATTR_SIZE_VER6			120	/* Add: aux_sample_size */
>>  #define PERF_ATTR_SIZE_VER7			128	/* Add: sig_data */
>>  #define PERF_ATTR_SIZE_VER8			136	/* Add: config3 */
>> +#define PERF_ATTR_SIZE_VER9			168	/* Add: sample_simd_{pred,vec}_reg_* */
>>  
>>  /*
>>   * 'struct perf_event_attr' contains various attributes that define
>> @@ -543,6 +545,25 @@ struct perf_event_attr {
>>  	__u64	sig_data;
>>  
>>  	__u64	config3; /* extension of config2 */
>> +
>> +
>> +	/*
>> +	 * Defines set of SIMD registers to dump on samples.
>> +	 * The sample_simd_regs_enabled !=0 implies the
>> +	 * set of SIMD registers is used to config all SIMD registers.
>> +	 * If !sample_simd_regs_enabled, sample_regs_XXX may be used to
>> +	 * config some SIMD registers on X86.
>> +	 */
>> +	union {
>> +		__u16 sample_simd_regs_enabled;
>> +		__u16 sample_simd_pred_reg_qwords;
>> +	};
>> +	__u32 sample_simd_pred_reg_intr;
>> +	__u32 sample_simd_pred_reg_user;
>> +	__u16 sample_simd_vec_reg_qwords;
>> +	__u64 sample_simd_vec_reg_intr;
>> +	__u64 sample_simd_vec_reg_user;
>> +	__u32 __reserved_4;
>>  };
>>  
>>  /*
>> @@ -1016,7 +1037,15 @@ enum perf_event_type {
>>  	 *      } && PERF_SAMPLE_BRANCH_STACK
>>  	 *
>>  	 *	{ u64			abi; # enum perf_sample_regs_abi
>> -	 *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
>> +	 *	  u64			regs[weight(mask)];
>> +	 *	  struct {
>> +	 *		u16 nr_vectors;
>> +	 *		u16 vector_qwords;
>> +	 *		u16 nr_pred;
>> +	 *		u16 pred_qwords;
>> +	 *		u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
>> +	 *	  } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
>> +	 *	} && PERF_SAMPLE_REGS_USER
>>  	 *
>>  	 *	{ u64			size;
>>  	 *	  char			data[size];
>> @@ -1043,7 +1072,15 @@ enum perf_event_type {
>>  	 *	{ u64			data_src; } && PERF_SAMPLE_DATA_SRC
>>  	 *	{ u64			transaction; } && PERF_SAMPLE_TRANSACTION
>>  	 *	{ u64			abi; # enum perf_sample_regs_abi
>> -	 *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
>> +	 *	  u64			regs[weight(mask)];
>> +	 *	  struct {
>> +	 *		u16 nr_vectors;
>> +	 *		u16 vector_qwords;
>> +	 *		u16 nr_pred;
>> +	 *		u16 pred_qwords;
>> +	 *		u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
>> +	 *	  } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
>> +	 *	} && PERF_SAMPLE_REGS_INTR
>>  	 *	{ u64			phys_addr;} && PERF_SAMPLE_PHYS_ADDR
>>  	 *	{ u64			cgroup;} && PERF_SAMPLE_CGROUP
>>  	 *	{ u64			data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 95a7b6f5af09..dd8cf3c7fb7a 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -7408,6 +7408,47 @@ perf_output_sample_regs(struct perf_output_handle *handle,
>>  	}
>>  }
>>  
>> +static void
>> +perf_output_sample_simd_regs(struct perf_output_handle *handle,
>> +			     struct perf_event *event,
>> +			     struct pt_regs *regs,
>> +			     u64 mask, u16 pred_mask)
>> +{
>> +	u16 pred_qwords = event->attr.sample_simd_pred_reg_qwords;
>> +	u16 vec_qwords = event->attr.sample_simd_vec_reg_qwords;
>> +	u16 nr_pred = hweight16(pred_mask);
>> +	u16 nr_vectors = hweight64(mask);
>> +	int bit;
>> +	u64 val;
>> +	u16 i;
>> +
>> +	/* Get the number of available regs */
>> +	perf_simd_reg_check(regs, mask, &nr_vectors, &vec_qwords,
>> +			    pred_mask, &nr_pred, &pred_qwords);
>> +
>> +	perf_output_put(handle, nr_vectors);
>> +	perf_output_put(handle, vec_qwords);
>> +	perf_output_put(handle, nr_pred);
>> +	perf_output_put(handle, pred_qwords);
>> +
>> +	if (nr_vectors) {
>> +		for_each_set_bit(bit, (unsigned long *)&mask, sizeof(mask) * BITS_PER_BYTE) {
>> +			for (i = 0; i < vec_qwords; i++) {
>> +				val = perf_simd_reg_value(regs, bit, i, false);
>> +				perf_output_put(handle, val);
>> +			}
>> +		}
>> +	}
>> +	if (nr_pred) {
>> +		for_each_set_bit(bit, (unsigned long *)&pred_mask, sizeof(pred_mask) * BITS_PER_BYTE) {
>> +			for (i = 0; i < pred_qwords; i++) {
>> +				val = perf_simd_reg_value(regs, bit, i, true);
>> +				perf_output_put(handle, val);
>> +			}
>> +		}
>> +	}
>> +}
>> +
>>  static void perf_sample_regs_user(struct perf_regs *regs_user,
>>  				  struct pt_regs *regs)
>>  {
>> @@ -7429,6 +7470,25 @@ static void perf_sample_regs_intr(struct perf_regs *regs_intr,
>>  	regs_intr->abi  = perf_reg_abi(current);
>>  }
>>  
>> +int __weak perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
>> +				  u16 pred_qwords, u32 pred_mask)
>> +{
>> +	return vec_qwords || vec_mask || pred_qwords || pred_mask ? -ENOSYS : 0;
>> +}
>> +
>> +u64 __weak perf_simd_reg_value(struct pt_regs *regs, int idx,
>> +			       u16 qwords_idx, bool pred)
>> +{
>> +	return 0;
>> +}
>> +
>> +void __weak perf_simd_reg_check(struct pt_regs *regs,
>> +				u64 mask, u16 *nr_vectors, u16 *vec_qwords,
>> +				u16 pred_mask, u16 *nr_pred, u16 *pred_qwords)
>> +{
>> +	*nr_vectors = 0;
>> +	*nr_pred = 0;
>> +}
>>  
>>  /*
>>   * Get remaining task size from user stack pointer.
>> @@ -7961,10 +8021,17 @@ void perf_output_sample(struct perf_output_handle *handle,
>>  		perf_output_put(handle, abi);
>>  
>>  		if (abi) {
>> -			u64 mask = event->attr.sample_regs_user;
>> +			struct perf_event_attr *attr = &event->attr;
>> +			u64 mask = attr->sample_regs_user;
>>  			perf_output_sample_regs(handle,
>>  						data->regs_user.regs,
>>  						mask);
>> +			if (abi & PERF_SAMPLE_REGS_ABI_SIMD) {
>> +				perf_output_sample_simd_regs(handle, event,
>> +							     data->regs_user.regs,
>> +							     attr->sample_simd_vec_reg_user,
>> +							     attr->sample_simd_pred_reg_user);
>> +			}
>>  		}
>>  	}
>>  
>> @@ -7992,11 +8059,18 @@ void perf_output_sample(struct perf_output_handle *handle,
>>  		perf_output_put(handle, abi);
>>  
>>  		if (abi) {
>> -			u64 mask = event->attr.sample_regs_intr;
>> +			struct perf_event_attr *attr = &event->attr;
>> +			u64 mask = attr->sample_regs_intr;
>>  
>>  			perf_output_sample_regs(handle,
>>  						data->regs_intr.regs,
>>  						mask);
>> +			if (abi & PERF_SAMPLE_REGS_ABI_SIMD) {
>> +				perf_output_sample_simd_regs(handle, event,
>> +							     data->regs_intr.regs,
>> +							     attr->sample_simd_vec_reg_intr,
>> +							     attr->sample_simd_pred_reg_intr);
>> +			}
>>  		}
>>  	}
>>  
>> @@ -12560,6 +12634,12 @@ static int perf_try_init_event(struct pmu *pmu, struct perf_event *event)
>>  	if (ret)
>>  		goto err_pmu;
>>  
>> +	if (!(pmu->capabilities & PERF_PMU_CAP_SIMD_REGS) &&
>> +	    event_has_simd_regs(event)) {
>> +		ret = -EOPNOTSUPP;
>> +		goto err_destroy;
>> +	}
>> +
>>  	if (!(pmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS) &&
>>  	    event_has_extended_regs(event)) {
>>  		ret = -EOPNOTSUPP;
>> @@ -13101,6 +13181,12 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
>>  		ret = perf_reg_validate(attr->sample_regs_user);
>>  		if (ret)
>>  			return ret;
>> +		ret = perf_simd_reg_validate(attr->sample_simd_vec_reg_qwords,
>> +					     attr->sample_simd_vec_reg_user,
>> +					     attr->sample_simd_pred_reg_qwords,
>> +					     attr->sample_simd_pred_reg_user);
>> +		if (ret)
>> +			return ret;
>>  	}
>>  
>>  	if (attr->sample_type & PERF_SAMPLE_STACK_USER) {
>> @@ -13121,8 +13207,17 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
>>  	if (!attr->sample_max_stack)
>>  		attr->sample_max_stack = sysctl_perf_event_max_stack;
>>  
>> -	if (attr->sample_type & PERF_SAMPLE_REGS_INTR)
>> +	if (attr->sample_type & PERF_SAMPLE_REGS_INTR) {
>>  		ret = perf_reg_validate(attr->sample_regs_intr);
>> +		if (ret)
>> +			return ret;
>> +		ret = perf_simd_reg_validate(attr->sample_simd_vec_reg_qwords,
>> +					     attr->sample_simd_vec_reg_intr,
>> +					     attr->sample_simd_pred_reg_qwords,
>> +					     attr->sample_simd_pred_reg_intr);
>> +		if (ret)
>> +			return ret;
>> +	}
>>  
>>  #ifndef CONFIG_CGROUP_PERF
>>  	if (attr->sample_type & PERF_SAMPLE_CGROUP)


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V3 08/17] perf/x86: Add YMM into sample_simd_vec_regs
  2025-08-20  9:59   ` Mi, Dapeng
@ 2025-08-20 18:10     ` Liang, Kan
  0 siblings, 0 replies; 32+ messages in thread
From: Liang, Kan @ 2025-08-20 18:10 UTC (permalink / raw)
  To: Mi, Dapeng, peterz, mingo, acme, namhyung, tglx, dave.hansen,
	irogers, adrian.hunter, jolsa, alexander.shishkin, linux-kernel
  Cc: ak, zide.chen, mark.rutland, broonie, ravi.bangoria, eranian



On 2025-08-20 2:59 a.m., Mi, Dapeng wrote:
> 
> On 8/16/2025 5:34 AM, kan.liang@linux.intel.com wrote:
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> The YMM0-15 registers are composed of XMM and YMMH parts. It requires
>> two XSAVE components to get the complete value. Internally, the XMM and
>> YMMH are stored in different structures, which follow the XSAVE format.
>> But the output dumps the YMM as a whole.
>>
>> A qwords value of 4 implies YMM.
>>
>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>> ---
>>  arch/x86/events/core.c                | 13 +++++++++++++
>>  arch/x86/include/asm/perf_event.h     |  4 ++++
>>  arch/x86/include/uapi/asm/perf_regs.h |  4 +++-
>>  arch/x86/kernel/perf_regs.c           | 10 +++++++++-
>>  4 files changed, 29 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
>> index 1789b91c95c6..aebd4e56dff1 100644
>> --- a/arch/x86/events/core.c
>> +++ b/arch/x86/events/core.c
>> @@ -423,6 +423,9 @@ static void x86_pmu_get_ext_regs(struct x86_perf_regs *perf_regs, u64 mask)
>>  
>>  	if (valid_mask & XFEATURE_MASK_SSE)
>>  		perf_regs->xmm_space = xsave->i387.xmm_space;
>> +
>> +	if (valid_mask & XFEATURE_MASK_YMM)
>> +		perf_regs->ymmh = get_xsave_addr(xsave, XFEATURE_YMM);
>>  }
>>  
>>  static void release_ext_regs_buffers(void)
>> @@ -725,6 +728,9 @@ int x86_pmu_hw_config(struct perf_event *event)
>>  			if (event->attr.sample_simd_vec_reg_qwords >= PERF_X86_XMM_QWORDS &&
>>  			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_SSE))
>>  				return -EINVAL;
>> +			if (event->attr.sample_simd_vec_reg_qwords >= PERF_X86_YMM_QWORDS &&
>> +			    !(x86_pmu.ext_regs_mask & XFEATURE_MASK_YMM))
>> +				return -EINVAL;
>>  		}
>>  	}
>>  	return x86_setup_perfctr(event);
>> @@ -1837,6 +1843,13 @@ void x86_pmu_setup_regs_data(struct perf_event *event,
>>  		mask |= XFEATURE_MASK_SSE;
>>  	}
>>  
>> +	if (attr->sample_simd_regs_enabled) {
>> +		if (attr->sample_simd_vec_reg_qwords >= PERF_X86_YMM_QWORDS) {
>> +			perf_regs->ymmh_regs = NULL;
>> +			mask |= XFEATURE_MASK_YMM;
>> +		}
>> +	}
>> +
>>  	mask &= ~ignore_mask;
>>  	if (mask)
>>  		x86_pmu_get_ext_regs(perf_regs, mask);
>> diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
>> index 538219c59979..81e3143fd91a 100644
>> --- a/arch/x86/include/asm/perf_event.h
>> +++ b/arch/x86/include/asm/perf_event.h
>> @@ -597,6 +597,10 @@ struct x86_perf_regs {
>>  		u64	*xmm_regs;
>>  		u32	*xmm_space;	/* for xsaves */
>>  	};
>> +	union {
>> +		u64	*ymmh_regs;
>> +		struct ymmh_struct *ymmh;
>> +	};
>>  };
>>  
>>  extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
>> diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
>> index bd8af802f757..feb3e8f80761 100644
>> --- a/arch/x86/include/uapi/asm/perf_regs.h
>> +++ b/arch/x86/include/uapi/asm/perf_regs.h
>> @@ -59,6 +59,8 @@ enum perf_event_x86_regs {
>>  #define PERF_X86_SIMD_VEC_MASK		GENMASK_ULL(PERF_X86_SIMD_VEC_REGS_MAX - 1, 0)
>>  
>>  #define PERF_X86_XMM_QWORDS		2
>> -#define PERF_X86_SIMD_QWORDS_MAX	PERF_X86_XMM_QWORDS
>> +#define PERF_X86_YMM_QWORDS		4
>> +#define PERF_X86_YMMH_QWORDS		(PERF_X86_YMM_QWORDS / 2)
>> +#define PERF_X86_SIMD_QWORDS_MAX	PERF_X86_YMM_QWORDS
>>  
>>  #endif /* _ASM_X86_PERF_REGS_H */
>> diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
>> index 397357c5896b..d94bc687e4bf 100644
>> --- a/arch/x86/kernel/perf_regs.c
>> +++ b/arch/x86/kernel/perf_regs.c
>> @@ -66,6 +66,9 @@ void perf_simd_reg_check(struct pt_regs *regs,
>>  	if (*vec_qwords >= PERF_X86_XMM_QWORDS && !perf_regs->xmm_regs)
>>  		*nr_vectors = 0;
>>  
>> +	if (*vec_qwords >= PERF_X86_YMM_QWORDS && !perf_regs->xmm_regs)
> 
> should be "!perf_regs->ymmh_regs"?

Oops, good catch.
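
For reference, the intended qword mapping (an illustrative helper, not
from the patch, with the check fixed): qwords 0-1 of a YMM come from
xmm_space, qwords 2-3 from the YMMH xsave component.

static u64 ymm_qword(struct x86_perf_regs *perf_regs, int idx, u16 qi)
{
	if (qi < PERF_X86_XMM_QWORDS) {
		if (!perf_regs->xmm_regs)
			return 0;
		return perf_regs->xmm_regs[idx * PERF_X86_XMM_QWORDS + qi];
	}
	if (!perf_regs->ymmh_regs)
		return 0;
	return perf_regs->ymmh_regs[idx * PERF_X86_YMMH_QWORDS +
				    (qi - PERF_X86_XMM_QWORDS)];
}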

Thanks,
Kan

> 
> 
>> +		*vec_qwords = PERF_X86_XMM_QWORDS;
>> +
>>  	*nr_pred = 0;
>>  }
>>  
>> @@ -105,6 +108,10 @@ u64 perf_simd_reg_value(struct pt_regs *regs, int idx,
>>  		if (!perf_regs->xmm_regs)
>>  			return 0;
>>  		return perf_regs->xmm_regs[idx * PERF_X86_XMM_QWORDS + qwords_idx];
>> +	} else if (qwords_idx < PERF_X86_YMM_QWORDS) {
>> +		if (!perf_regs->ymmh_regs)
>> +			return 0;
>> +		return perf_regs->ymmh_regs[idx * PERF_X86_YMMH_QWORDS + qwords_idx - PERF_X86_XMM_QWORDS];
>>  	}
>>  
>>  	return 0;
>> @@ -121,7 +128,8 @@ int perf_simd_reg_validate(u16 vec_qwords, u64 vec_mask,
>>  		if (vec_mask)
>>  			return -EINVAL;
>>  	} else {
>> -		if (vec_qwords != PERF_X86_XMM_QWORDS)
>> +		if (vec_qwords != PERF_X86_XMM_QWORDS &&
>> +		    vec_qwords != PERF_X86_YMM_QWORDS)
>>  			return -EINVAL;
>>  		if (vec_mask & ~PERF_X86_SIMD_VEC_MASK)
>>  			return -EINVAL;


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [POC PATCH 16/17] perf parse-regs: Support the new SIMD format
  2025-08-20 10:04   ` Mi, Dapeng
@ 2025-08-20 18:18     ` Liang, Kan
  0 siblings, 0 replies; 32+ messages in thread
From: Liang, Kan @ 2025-08-20 18:18 UTC (permalink / raw)
  To: Mi, Dapeng, peterz, mingo, acme, namhyung, tglx, dave.hansen,
	irogers, adrian.hunter, jolsa, alexander.shishkin, linux-kernel
  Cc: ak, zide.chen, mark.rutland, broonie, ravi.bangoria, eranian



On 2025-08-20 3:04 a.m., Mi, Dapeng wrote:
> 
> On 8/16/2025 5:34 AM, kan.liang@linux.intel.com wrote:
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> Add has_cap_simd_regs() to check if the new SIMD format is available.
>> If yes, get the possible mask and qwords.
>>
>> Add several __weak functions to return qwords and mask for vector and
>> pred registers.
>>
>> Only support collecting the vector and pred registers as a whole, and
>> only the superset. For example, with -I XMM,YMM, only all 16 YMMs are
>> collected.
>>
>> Examples:
>>  $perf record -I?
>>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>>  R11 R12 R13 R14 R15 SSP XMM0-31 YMM0-31 ZMM0-31 OPMASK0-7
> 
> I haven't had time to fully review this patch yet, but the output on SPR
> seems incorrect.
> 
> ./perf record -I?
> available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10 R11
> R12 R13 R14 R15 XMM0--1 YMM0--1 ZMM0--1

I don't think it's a platform-specific issue. I will take a look
when posting the formal perf tool patches.
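
One plausible culprit from a quick look (not verified on SPR): if the
probing in get_simd_reg_mask() fails, the masks and qwords stay 0 while
has_cap_simd_regs() still returns true. The availability print then
matches the vec entries via "pred_qwords == r->qwords.pred" (0 == 0)
and does

	fprintf(stderr, "%s0-%d ", r->name, fls64(pred_mask) - 1);

where fls64(0) - 1 == -1 renders exactly as "XMM0--1".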

Thanks,
Kan
>
> 
>>
>>  $perf record --user-regs=?
>>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>>  R11 R12 R13 R14 R15 SSP XMM0-31 YMM0-31 ZMM0-31 OPMASK0-7
>>
>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>> ---
>>  tools/perf/arch/x86/util/perf_regs.c      | 257 +++++++++++++++++++++-
>>  tools/perf/util/evsel.c                   |  25 +++
>>  tools/perf/util/parse-regs-options.c      |  60 ++++-
>>  tools/perf/util/perf_event_attr_fprintf.c |   6 +
>>  tools/perf/util/perf_regs.c               |  29 +++
>>  tools/perf/util/perf_regs.h               |  13 +-
>>  tools/perf/util/record.h                  |   6 +
>>  7 files changed, 381 insertions(+), 15 deletions(-)
>>
>> diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
>> index 12fd93f04802..78027df1af9a 100644
>> --- a/tools/perf/arch/x86/util/perf_regs.c
>> +++ b/tools/perf/arch/x86/util/perf_regs.c
>> @@ -13,6 +13,49 @@
>>  #include "../../../util/pmu.h"
>>  #include "../../../util/pmus.h"
>>  
>> +static const struct sample_reg sample_reg_masks_ext[] = {
>> +	SMPL_REG(AX, PERF_REG_X86_AX),
>> +	SMPL_REG(BX, PERF_REG_X86_BX),
>> +	SMPL_REG(CX, PERF_REG_X86_CX),
>> +	SMPL_REG(DX, PERF_REG_X86_DX),
>> +	SMPL_REG(SI, PERF_REG_X86_SI),
>> +	SMPL_REG(DI, PERF_REG_X86_DI),
>> +	SMPL_REG(BP, PERF_REG_X86_BP),
>> +	SMPL_REG(SP, PERF_REG_X86_SP),
>> +	SMPL_REG(IP, PERF_REG_X86_IP),
>> +	SMPL_REG(FLAGS, PERF_REG_X86_FLAGS),
>> +	SMPL_REG(CS, PERF_REG_X86_CS),
>> +	SMPL_REG(SS, PERF_REG_X86_SS),
>> +#ifdef HAVE_ARCH_X86_64_SUPPORT
>> +	SMPL_REG(R8, PERF_REG_X86_R8),
>> +	SMPL_REG(R9, PERF_REG_X86_R9),
>> +	SMPL_REG(R10, PERF_REG_X86_R10),
>> +	SMPL_REG(R11, PERF_REG_X86_R11),
>> +	SMPL_REG(R12, PERF_REG_X86_R12),
>> +	SMPL_REG(R13, PERF_REG_X86_R13),
>> +	SMPL_REG(R14, PERF_REG_X86_R14),
>> +	SMPL_REG(R15, PERF_REG_X86_R15),
>> +	SMPL_REG(R16, PERF_REG_X86_R16),
>> +	SMPL_REG(R17, PERF_REG_X86_R17),
>> +	SMPL_REG(R18, PERF_REG_X86_R18),
>> +	SMPL_REG(R19, PERF_REG_X86_R19),
>> +	SMPL_REG(R20, PERF_REG_X86_R20),
>> +	SMPL_REG(R21, PERF_REG_X86_R21),
>> +	SMPL_REG(R22, PERF_REG_X86_R22),
>> +	SMPL_REG(R23, PERF_REG_X86_R23),
>> +	SMPL_REG(R24, PERF_REG_X86_R24),
>> +	SMPL_REG(R25, PERF_REG_X86_R25),
>> +	SMPL_REG(R26, PERF_REG_X86_R26),
>> +	SMPL_REG(R27, PERF_REG_X86_R27),
>> +	SMPL_REG(R28, PERF_REG_X86_R28),
>> +	SMPL_REG(R29, PERF_REG_X86_R29),
>> +	SMPL_REG(R30, PERF_REG_X86_R30),
>> +	SMPL_REG(R31, PERF_REG_X86_R31),
>> +	SMPL_REG(SSP, PERF_REG_X86_SSP),
>> +#endif
>> +	SMPL_REG_END
>> +};
>> +
>>  static const struct sample_reg sample_reg_masks[] = {
>>  	SMPL_REG(AX, PERF_REG_X86_AX),
>>  	SMPL_REG(BX, PERF_REG_X86_BX),
>> @@ -276,27 +319,159 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
>>  	return SDT_ARG_VALID;
>>  }
>>  
>> +static bool support_simd_reg(u64 sample_type, u16 qwords, u64 mask, bool pred)
>> +{
>> +	struct perf_event_attr attr = {
>> +		.type				= PERF_TYPE_HARDWARE,
>> +		.config				= PERF_COUNT_HW_CPU_CYCLES,
>> +		.sample_type			= sample_type,
>> +		.disabled 			= 1,
>> +		.exclude_kernel			= 1,
>> +		.sample_simd_regs_enabled	= 1,
>> +	};
>> +	int fd;
>> +
>> +	attr.sample_period = 1;
>> +
>> +	if (!pred) {
>> +		attr.sample_simd_vec_reg_qwords = qwords;
>> +		if (sample_type == PERF_SAMPLE_REGS_INTR)
>> +			attr.sample_simd_vec_reg_intr = mask;
>> +		else
>> +			attr.sample_simd_vec_reg_user = mask;
>> +	} else {
>> +		attr.sample_simd_pred_reg_qwords = PERF_X86_OPMASK_QWORDS;
>> +		if (sample_type == PERF_SAMPLE_REGS_INTR)
>> +			attr.sample_simd_pred_reg_intr = PERF_X86_SIMD_PRED_MASK;
>> +		else
>> +			attr.sample_simd_pred_reg_user = PERF_X86_SIMD_PRED_MASK;
>> +	}
>> +
>> +	if (perf_pmus__num_core_pmus() > 1) {
>> +		struct perf_pmu *pmu = NULL;
>> +		__u64 type = PERF_TYPE_RAW;
>> +
>> +		/*
>> +		 * The same register set is supported among different hybrid PMUs.
>> +		 * Only check the first available one.
>> +		 */
>> +		while ((pmu = perf_pmus__scan_core(pmu)) != NULL) {
>> +			type = pmu->type;
>> +			break;
>> +		}
>> +		attr.config |= type << PERF_PMU_TYPE_SHIFT;
>> +	}
>> +
>> +	event_attr_init(&attr);
>> +
>> +	fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
>> +	if (fd != -1) {
>> +		close(fd);
>> +		return true;
>> +	}
>> +
>> +	return false;
>> +}
>> +
>> +static uint64_t intr_simd_mask, user_simd_mask, pred_mask;
>> +static u16	intr_simd_qwords, user_simd_qwords, pred_qwords;
>> +
>> +static bool get_simd_reg_mask(u64 sample_type)
>> +{
>> +	u64 mask = GENMASK_ULL(PERF_X86_H16ZMM_BASE - 1, 0);
>> +	u16 qwords = PERF_X86_ZMM_QWORDS;
>> +
>> +	if (support_simd_reg(sample_type, qwords, mask, false)) {
>> +		if (support_simd_reg(sample_type, qwords, PERF_X86_SIMD_VEC_MASK, false))
>> +			mask = PERF_X86_SIMD_VEC_MASK;
>> +	} else {
>> +		qwords = PERF_X86_YMM_QWORDS;
>> +		if (!support_simd_reg(sample_type, qwords, mask, false)) {
>> +			qwords = PERF_X86_XMM_QWORDS;
>> +			if (!support_simd_reg(sample_type, qwords, mask, false)) {
>> +				qwords = 0;
>> +				mask = 0;
>> +			}
>> +		}
>> +	}
>> +
>> +	if (sample_type == PERF_SAMPLE_REGS_INTR) {
>> +		intr_simd_mask = mask;
>> +		intr_simd_qwords = qwords;
>> +	} else {
>> +		user_simd_mask = mask;
>> +		user_simd_qwords = qwords;
>> +	}
>> +
>> +	if (support_simd_reg(sample_type, qwords, mask, true)) {
>> +		pred_mask = PERF_X86_SIMD_PRED_MASK;
>> +		pred_qwords = PERF_X86_OPMASK_QWORDS;
>> +	}
>> +
>> +	return true;
>> +}
>> +
>> +static bool has_cap_simd_regs(void)
>> +{
>> +	static bool has_cap_simd_regs;
>> +	static bool cached;
>> +
>> +	if (cached)
>> +		return has_cap_simd_regs;
>> +
>> +	cached = true;
>> +	has_cap_simd_regs = get_simd_reg_mask(PERF_SAMPLE_REGS_INTR);
>> +	has_cap_simd_regs |= get_simd_reg_mask(PERF_SAMPLE_REGS_USER);
>> +
>> +	return has_cap_simd_regs;
>> +}
>> +
>>  const struct sample_reg *arch__sample_reg_masks(void)
>>  {
>> +	if (has_cap_simd_regs())
>> +		return sample_reg_masks_ext;
>>  	return sample_reg_masks;
>>  }
>>  
>> -uint64_t arch__intr_reg_mask(void)
>> +static const struct sample_reg sample_simd_reg_masks_empty[] = {
>> +	SMPL_REG_END
>> +};
>> +
>> +static const struct sample_reg sample_simd_reg_masks[] = {
>> +	SMPL_REG(XMM, 1),
>> +	SMPL_REG(YMM, 2),
>> +	SMPL_REG(ZMM, 3),
>> +	SMPL_REG(OPMASK, 32),
>> +	SMPL_REG_END
>> +};
>> +
>> +const struct sample_reg *arch__sample_simd_reg_masks(void)
>> +{
>> +	if (has_cap_simd_regs())
>> +		return sample_simd_reg_masks;
>> +	return sample_simd_reg_masks_empty;
>> +}
>> +
>> +static uint64_t __arch__reg_mask(u64 sample_type, u64 mask, bool has_simd_regs)
>>  {
>>  	struct perf_event_attr attr = {
>> -		.type			= PERF_TYPE_HARDWARE,
>> -		.config			= PERF_COUNT_HW_CPU_CYCLES,
>> -		.sample_type		= PERF_SAMPLE_REGS_INTR,
>> -		.sample_regs_intr	= PERF_REG_EXTENDED_MASK,
>> -		.precise_ip		= 1,
>> -		.disabled 		= 1,
>> -		.exclude_kernel		= 1,
>> +		.type				= PERF_TYPE_HARDWARE,
>> +		.config				= PERF_COUNT_HW_CPU_CYCLES,
>> +		.sample_type			= sample_type,
>> +		.precise_ip			= 1,
>> +		.disabled 			= 1,
>> +		.exclude_kernel			= 1,
>> +		.sample_simd_regs_enabled	= has_simd_regs,
>>  	};
>>  	int fd;
>>  	/*
>>  	 * In an unnamed union, init it here to build on older gcc versions
>>  	 */
>>  	attr.sample_period = 1;
>> +	if (sample_type == PERF_SAMPLE_REGS_INTR)
>> +		attr.sample_regs_intr = mask;
>> +	else
>> +		attr.sample_regs_user = mask;
>>  
>>  	if (perf_pmus__num_core_pmus() > 1) {
>>  		struct perf_pmu *pmu = NULL;
>> @@ -318,13 +493,73 @@ uint64_t arch__intr_reg_mask(void)
>>  	fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
>>  	if (fd != -1) {
>>  		close(fd);
>> -		return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
>> +		return mask;
>>  	}
>>  
>> -	return PERF_REGS_MASK;
>> +	return 0;
>> +}
>> +
>> +uint64_t arch__intr_reg_mask(void)
>> +{
>> +	uint64_t mask = PERF_REGS_MASK;
>> +
>> +	if (has_cap_simd_regs()) {
>> +		mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
>> +					 GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
>> +					 true);
>> +		mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
>> +					 BIT_ULL(PERF_REG_X86_SSP),
>> +					 true);
>> +	} else
>> +		mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR, PERF_REG_EXTENDED_MASK, false);
>> +
>> +	return mask;
>>  }
>>  
>>  uint64_t arch__user_reg_mask(void)
>>  {
>> -	return PERF_REGS_MASK;
>> +	uint64_t mask = PERF_REGS_MASK;
>> +
>> +	if (has_cap_simd_regs()) {
>> +		mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
>> +					 GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
>> +					 true);
>> +		mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
>> +					 BIT_ULL(PERF_REG_X86_SSP),
>> +					 true);
>> +	}
>> +
>> +	return mask;
>> +}
>> +
>> +uint64_t arch__intr_simd_reg_mask(u16 *qwords)
>> +{
>> +	if (!has_cap_simd_regs())
>> +		return 0;
>> +	*qwords = intr_simd_qwords;
>> +	return intr_simd_mask;
>> +}
>> +
>> +uint64_t arch__user_simd_reg_mask(u16 *qwords)
>> +{
>> +	if (!has_cap_simd_regs())
>> +		return 0;
>> +	*qwords = user_simd_qwords;
>> +	return user_simd_mask;
>> +}
>> +
>> +uint64_t arch__intr_pred_reg_mask(u16 *qwords)
>> +{
>> +	if (!has_cap_simd_regs())
>> +		return 0;
>> +	*qwords = pred_qwords;
>> +	return pred_mask;
>> +}
>> +
>> +uint64_t arch__user_pred_reg_mask(u16 *qwords)
>> +{
>> +	if (!has_cap_simd_regs())
>> +		return 0;
>> +	*qwords = pred_qwords;
>> +	return pred_mask;
>>  }
>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>> index d55482f094bf..af6e1c843fc5 100644
>> --- a/tools/perf/util/evsel.c
>> +++ b/tools/perf/util/evsel.c
>> @@ -1402,12 +1402,37 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
>>  		evsel__set_sample_bit(evsel, REGS_INTR);
>>  	}
>>  
>> +	if ((opts->sample_intr_vec_regs || opts->sample_intr_pred_regs) &&
>> +	    !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
>> +		/* A nonzero pred qwords implies the new set of SIMD registers is used */
>> +		if (opts->sample_pred_regs_qwords)
>> +			attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
>> +		else
>> +			attr->sample_simd_pred_reg_qwords = 1;
>> +		attr->sample_simd_vec_reg_intr = opts->sample_intr_vec_regs;
>> +		attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
>> +		attr->sample_simd_pred_reg_intr = opts->sample_intr_pred_regs;
>> +		evsel__set_sample_bit(evsel, REGS_INTR);
>> +	}
>> +
>>  	if (opts->sample_user_regs && !evsel->no_aux_samples &&
>>  	    !evsel__is_dummy_event(evsel)) {
>>  		attr->sample_regs_user |= opts->sample_user_regs;
>>  		evsel__set_sample_bit(evsel, REGS_USER);
>>  	}
>>  
>> +	if ((opts->sample_user_vec_regs || opts->sample_user_pred_regs) &&
>> +	    !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
>> +		if (opts->sample_pred_regs_qwords)
>> +			attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
>> +		else
>> +			attr->sample_simd_pred_reg_qwords = 1;
>> +		attr->sample_simd_vec_reg_user = opts->sample_user_vec_regs;
>> +		attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
>> +		attr->sample_simd_pred_reg_user = opts->sample_user_pred_regs;
>> +		evsel__set_sample_bit(evsel, REGS_USER);
>> +	}
>> +
>>  	if (target__has_cpu(&opts->target) || opts->sample_cpu)
>>  		evsel__set_sample_bit(evsel, CPU);
>>  
>> diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
>> index cda1c620968e..27266038352f 100644
>> --- a/tools/perf/util/parse-regs-options.c
>> +++ b/tools/perf/util/parse-regs-options.c
>> @@ -4,20 +4,26 @@
>>  #include <stdint.h>
>>  #include <string.h>
>>  #include <stdio.h>
>> +#include <linux/bitops.h>
>>  #include "util/debug.h"
>>  #include <subcmd/parse-options.h>
>>  #include "util/perf_regs.h"
>>  #include "util/parse-regs-options.h"
>> +#include "record.h"
>>  
>>  static int
>>  __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>  {
>>  	uint64_t *mode = (uint64_t *)opt->value;
>>  	const struct sample_reg *r = NULL;
>> +	u16 simd_qwords, pred_qwords;
>> +	u64 simd_mask, pred_mask;
>> +	struct record_opts *opts;
>>  	char *s, *os = NULL, *p;
>>  	int ret = -1;
>>  	uint64_t mask;
>>  
>> +
>>  	if (unset)
>>  		return 0;
>>  
>> @@ -27,10 +33,17 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>  	if (*mode)
>>  		return -1;
>>  
>> -	if (intr)
>> +	if (intr) {
>> +		opts = container_of(opt->value, struct record_opts, sample_intr_regs);
>>  		mask = arch__intr_reg_mask();
>> -	else
>> +		simd_mask = arch__intr_simd_reg_mask(&simd_qwords);
>> +		pred_mask = arch__intr_pred_reg_mask(&pred_qwords);
>> +	} else {
>> +		opts = container_of(opt->value, struct record_opts, sample_user_regs);
>>  		mask = arch__user_reg_mask();
>> +		simd_mask = arch__user_simd_reg_mask(&simd_qwords);
>> +		pred_mask = arch__user_pred_reg_mask(&pred_qwords);
>> +	}
>>  
>>  	/* str may be NULL in case no arg is passed to -I */
>>  	if (str) {
>> @@ -50,10 +63,51 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>  					if (r->mask & mask)
>>  						fprintf(stderr, "%s ", r->name);
>>  				}
>> +				for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>> +					if (pred_qwords == r->qwords.pred) {
>> +						fprintf(stderr, "%s0-%d ", r->name, fls64(pred_mask) - 1);
>> +						continue;
>> +					}
>> +					if (simd_qwords >= r->mask)
>> +						fprintf(stderr, "%s0-%d ", r->name, fls64(simd_mask) - 1);
>> +				}
>> +
>>  				fputc('\n', stderr);
>>  				/* just printing available regs */
>>  				goto error;
>>  			}
>> +
>> +			if (simd_mask || pred_mask) {
>> +				u16 vec_regs_qwords = 0, pred_regs_qwords = 0;
>> +
>> +				for (r = arch__sample_simd_reg_masks(); r->name; r++) {
>> +					if (!strcasecmp(s, r->name)) {
>> +						vec_regs_qwords = r->qwords.vec;
>> +						pred_regs_qwords = r->qwords.pred;
>> +						break;
>> +					}
>> +				}
>> +
>> +				/* Just need the highest qwords */
>> +				if (vec_regs_qwords > opts->sample_vec_regs_qwords) {
>> +					opts->sample_vec_regs_qwords = vec_regs_qwords;
>> +					if (intr)
>> +						opts->sample_intr_vec_regs = simd_mask;
>> +					else
>> +						opts->sample_user_vec_regs = simd_mask;
>> +				}
>> +				if (pred_regs_qwords > opts->sample_pred_regs_qwords) {
>> +					opts->sample_pred_regs_qwords = pred_regs_qwords;
>> +					if (intr)
>> +						opts->sample_intr_pred_regs = pred_mask;
>> +					else
>> +						opts->sample_user_pred_regs = pred_mask;
>> +				}
>> +
>> +				if (r->name)
>> +					goto next;
>> +			}
>> +
>>  			for (r = arch__sample_reg_masks(); r->name; r++) {
>>  				if ((r->mask & mask) && !strcasecmp(s, r->name))
>>  					break;
>> @@ -65,7 +119,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>>  			}
>>  
>>  			*mode |= r->mask;
>> -
>> +next:
>>  			if (!p)
>>  				break;
>>  
>> diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
>> index 66b666d9ce64..fb0366d050cf 100644
>> --- a/tools/perf/util/perf_event_attr_fprintf.c
>> +++ b/tools/perf/util/perf_event_attr_fprintf.c
>> @@ -360,6 +360,12 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
>>  	PRINT_ATTRf(aux_start_paused, p_unsigned);
>>  	PRINT_ATTRf(aux_pause, p_unsigned);
>>  	PRINT_ATTRf(aux_resume, p_unsigned);
>> +	PRINT_ATTRf(sample_simd_pred_reg_qwords, p_unsigned);
>> +	PRINT_ATTRf(sample_simd_pred_reg_intr, p_hex);
>> +	PRINT_ATTRf(sample_simd_pred_reg_user, p_hex);
>> +	PRINT_ATTRf(sample_simd_vec_reg_qwords, p_unsigned);
>> +	PRINT_ATTRf(sample_simd_vec_reg_intr, p_hex);
>> +	PRINT_ATTRf(sample_simd_vec_reg_user, p_hex);
>>  
>>  	return ret;
>>  }
>> diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
>> index 44b90bbf2d07..0744c77b4ac8 100644
>> --- a/tools/perf/util/perf_regs.c
>> +++ b/tools/perf/util/perf_regs.c
>> @@ -21,6 +21,30 @@ uint64_t __weak arch__user_reg_mask(void)
>>  	return 0;
>>  }
>>  
>> +uint64_t __weak arch__intr_simd_reg_mask(u16 *qwords)
>> +{
>> +	*qwords = 0;
>> +	return 0;
>> +}
>> +
>> +uint64_t __weak arch__user_simd_reg_mask(u16 *qwords)
>> +{
>> +	*qwords = 0;
>> +	return 0;
>> +}
>> +
>> +uint64_t __weak arch__intr_pred_reg_mask(u16 *qwords)
>> +{
>> +	*qwords = 0;
>> +	return 0;
>> +}
>> +
>> +uint64_t __weak arch__user_pred_reg_mask(u16 *qwords)
>> +{
>> +	*qwords = 0;
>> +	return 0;
>> +}
>> +
>>  static const struct sample_reg sample_reg_masks[] = {
>>  	SMPL_REG_END
>>  };
>> @@ -30,6 +54,11 @@ const struct sample_reg * __weak arch__sample_reg_masks(void)
>>  	return sample_reg_masks;
>>  }
>>  
>> +const struct sample_reg * __weak arch__sample_simd_reg_masks(void)
>> +{
>> +	return sample_reg_masks;
>> +}
>> +
>>  const char *perf_reg_name(int id, const char *arch)
>>  {
>>  	const char *reg_name = NULL;
>> diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
>> index f2d0736d65cc..b932caa73a8a 100644
>> --- a/tools/perf/util/perf_regs.h
>> +++ b/tools/perf/util/perf_regs.h
>> @@ -9,7 +9,13 @@ struct regs_dump;
>>  
>>  struct sample_reg {
>>  	const char *name;
>> -	uint64_t mask;
>> +	union {
>> +		struct {
>> +			uint32_t vec;
>> +			uint32_t pred;
>> +		} qwords;
>> +		uint64_t mask;
>> +	};
>>  };
>>  
>>  #define SMPL_REG_MASK(b) (1ULL << (b))
>> @@ -27,6 +33,11 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op);
>>  uint64_t arch__intr_reg_mask(void);
>>  uint64_t arch__user_reg_mask(void);
>>  const struct sample_reg *arch__sample_reg_masks(void);
>> +const struct sample_reg *arch__sample_simd_reg_masks(void);
>> +uint64_t arch__intr_simd_reg_mask(u16 *qwords);
>> +uint64_t arch__user_simd_reg_mask(u16 *qwords);
>> +uint64_t arch__intr_pred_reg_mask(u16 *qwords);
>> +uint64_t arch__user_pred_reg_mask(u16 *qwords);
>>  
>>  const char *perf_reg_name(int id, const char *arch);
>>  int perf_reg_value(u64 *valp, struct regs_dump *regs, int id);
>> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
>> index ea3a6c4657ee..825ffb4cc53f 100644
>> --- a/tools/perf/util/record.h
>> +++ b/tools/perf/util/record.h
>> @@ -59,7 +59,13 @@ struct record_opts {
>>  	unsigned int  user_freq;
>>  	u64	      branch_stack;
>>  	u64	      sample_intr_regs;
>> +	u64	      sample_intr_vec_regs;
>>  	u64	      sample_user_regs;
>> +	u64	      sample_user_vec_regs;
>> +	u16	      sample_pred_regs_qwords;
>> +	u16	      sample_vec_regs_qwords;
>> +	u16	      sample_intr_pred_regs;
>> +	u16	      sample_user_pred_regs;
>>  	u64	      default_interval;
>>  	u64	      user_interval;
>>  	size_t	      auxtrace_snapshot_size;


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V3 05/17] perf/x86: Support XMM register for non-PEBS and REGS_USER
  2025-08-20 18:03         ` Liang, Kan
@ 2025-08-21  1:00           ` Mi, Dapeng
  0 siblings, 0 replies; 32+ messages in thread
From: Mi, Dapeng @ 2025-08-21  1:00 UTC (permalink / raw)
  To: Liang, Kan, Peter Zijlstra
  Cc: mingo, acme, namhyung, tglx, dave.hansen, irogers, adrian.hunter,
	jolsa, alexander.shishkin, linux-kernel, ak, zide.chen,
	mark.rutland, broonie, ravi.bangoria, eranian


On 8/21/2025 2:03 AM, Liang, Kan wrote:
>
> On 2025-08-20 2:46 a.m., Mi, Dapeng wrote:
>> On 8/19/2025 11:55 PM, Liang, Kan wrote:
>>> On 2025-08-19 6:39 a.m., Peter Zijlstra wrote:
>>>> On Fri, Aug 15, 2025 at 02:34:23PM -0700, kan.liang@linux.intel.com wrote:
>>>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>>>
>>>>> Collecting the XMM registers in a PEBS record has been supported since
>>>>> Ice Lake. But non-PEBS events don't support the feature. It's
>>>>> possible to retrieve the XMM registers from the XSAVE for non-PEBS.
>>>>> Add it to make the feature complete.
>>>>>
>>>>> To utilize the XSAVE, a 64-byte aligned buffer is required. Add a
>>>>> per-CPU ext_regs_buf to store the vector registers. The size of the
>>>>> buffer is ~2K. kzalloc_node() is used because there's a _guarantee_
>>>>> that all kmalloc()'s with powers of 2 are naturally aligned and also
>>>>> 64-byte aligned.
>>>>>
>>>>> Extend the support for both REGS_USER and REGS_INTR. For REGS_USER, the
>>>>> perf_get_regs_user() returns the regs from the task_pt_regs(current),
>>>>> which is a struct pt_regs. It needs to be moved to the local struct
>>>>> x86_perf_regs x86_user_regs.
>>>>> For PEBS, the HW support is still preferred. The XMM should be retrieved
>>>>> from PEBS records.
>>>>>
>>>>> There could be more vector registers supported later. Add ext_regs_mask
>>>>> to track the supported vector register group.
>>>> I'm a little confused... *again* :-)
>>>>
>>>> Specifically, we should consider two sets of registers:
>>>>
>>>>  - the live set, as per the CPU (XSAVE)
>>>>  - the stored set, as per x86_task_fpu()
>>>>
>>>> regs_intr should always get a copy of the live set; however
>>>> regs_user should not. It might need a copy of the x86_task_fpu() instead
>>>> of the live set, depending on TIF_NEED_FPU_LOAD (more or less, we need
>>>> another variable set in kernel_fpu_begin_mask() *after*
>>>> save_fpregs_to_fpstate() is completed).
>>>>
>>>> I don't see this code make this distinction.
>>>>
>>>> Consider getting a sample while the kernel is doing some avx enhanced
>>>> crypto and such.
>>> The regs_user only needs a register set when the NMI hits user mode
>>> (user_mode(regs)) or a non-kernel thread (!(current->flags &
>>> PF_KTHREAD)). The live set is good enough for both cases.
>> It's fine if the NMI hits user mode, but if the NMI hits kernel mode
>> (!(current->flags & PF_KTHREAD)), won't the kernel space SIMD/eGPR regs
>> be exposed to user space via the user-regs option? I'm not sure if the
>> kernel really uses these SIMD/eGPR regs right now, but it seems risky.
>>
>>
> I don't think it's possible for the existing kernel. But I cannot
> guarantee future usage.
>
> If the kernel mode handling is still a concern, I think we should drop
> the SIMD/eGPR regs for that case for now, because:
> - To profile a userspace application which requires SIMD/eGPR regs, the
> NMI usually hits userspace. It's not common to hit kernel mode.
> - The SIMD/eGPR regs cannot be retrieved from task_pt_regs(). Although
> it's possible to retrieve the values when the TIF_NEED_FPU_LOAD flag is
> set, I don't think it's worth introducing such complexity to handle an
> uncommon case in the critical path.
> - Furthermore, only checking the TIF_NEED_FPU_LOAD flag cannot cure
> everything. Some corner cases cannot be handled either. For example, an
> NMI can happen right after the flag is switched, while still in kernel
> mode.
>
> We can always add the support later if someone thinks it's important to
> retrieve the user SIMD/eGPR regs during the kernel syscall.

+1


>
> Thanks,
> Kan
>>> I think the kernel crypto should run in a kernel thread (current->flags &
>>> PF_KTHREAD). If so, the regs_user should return NULL.
>>>
>>> Thanks,
>>> Kan
>>>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [POC PATCH 16/17] perf parse-regs: Support the new SIMD format
  2025-08-15 21:34 ` [POC PATCH 16/17] perf parse-regs: Support the new SIMD format kan.liang
  2025-08-20 10:04   ` Mi, Dapeng
@ 2025-08-21  3:35   ` Mi, Dapeng
  1 sibling, 0 replies; 32+ messages in thread
From: Mi, Dapeng @ 2025-08-21  3:35 UTC (permalink / raw)
  To: kan.liang, peterz, mingo, acme, namhyung, tglx, dave.hansen,
	irogers, adrian.hunter, jolsa, alexander.shishkin, linux-kernel
  Cc: ak, zide.chen, mark.rutland, broonie, ravi.bangoria, eranian


On 8/16/2025 5:34 AM, kan.liang@linux.intel.com wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
>
> Add has_cap_simd_regs() to check if the new SIMD format is available.
> If yes, get the possible mask and qwords.
>
> Add several __weak functions to return qwords and mask for vector and
> pred registers.
>
> Only support collecting the vector and pred registers as a whole, and
> only the superset. For example, with -I XMM,YMM, only all 16 YMMs are
> collected.
>
> Examples:
>  $perf record -I?
>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>  R11 R12 R13 R14 R15 SSP XMM0-31 YMM0-31 ZMM0-31 OPMASK0-7
>
>  $perf record --user-regs=?
>  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
>  R11 R12 R13 R14 R15 SSP XMM0-31 YMM0-31 ZMM0-31 OPMASK0-7
>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> ---
>  tools/perf/arch/x86/util/perf_regs.c      | 257 +++++++++++++++++++++-
>  tools/perf/util/evsel.c                   |  25 +++
>  tools/perf/util/parse-regs-options.c      |  60 ++++-
>  tools/perf/util/perf_event_attr_fprintf.c |   6 +
>  tools/perf/util/perf_regs.c               |  29 +++
>  tools/perf/util/perf_regs.h               |  13 +-
>  tools/perf/util/record.h                  |   6 +
>  7 files changed, 381 insertions(+), 15 deletions(-)
>
> diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
> index 12fd93f04802..78027df1af9a 100644
> --- a/tools/perf/arch/x86/util/perf_regs.c
> +++ b/tools/perf/arch/x86/util/perf_regs.c
> @@ -13,6 +13,49 @@
>  #include "../../../util/pmu.h"
>  #include "../../../util/pmus.h"
>  
> +static const struct sample_reg sample_reg_masks_ext[] = {
> +	SMPL_REG(AX, PERF_REG_X86_AX),
> +	SMPL_REG(BX, PERF_REG_X86_BX),
> +	SMPL_REG(CX, PERF_REG_X86_CX),
> +	SMPL_REG(DX, PERF_REG_X86_DX),
> +	SMPL_REG(SI, PERF_REG_X86_SI),
> +	SMPL_REG(DI, PERF_REG_X86_DI),
> +	SMPL_REG(BP, PERF_REG_X86_BP),
> +	SMPL_REG(SP, PERF_REG_X86_SP),
> +	SMPL_REG(IP, PERF_REG_X86_IP),
> +	SMPL_REG(FLAGS, PERF_REG_X86_FLAGS),
> +	SMPL_REG(CS, PERF_REG_X86_CS),
> +	SMPL_REG(SS, PERF_REG_X86_SS),
> +#ifdef HAVE_ARCH_X86_64_SUPPORT
> +	SMPL_REG(R8, PERF_REG_X86_R8),
> +	SMPL_REG(R9, PERF_REG_X86_R9),
> +	SMPL_REG(R10, PERF_REG_X86_R10),
> +	SMPL_REG(R11, PERF_REG_X86_R11),
> +	SMPL_REG(R12, PERF_REG_X86_R12),
> +	SMPL_REG(R13, PERF_REG_X86_R13),
> +	SMPL_REG(R14, PERF_REG_X86_R14),
> +	SMPL_REG(R15, PERF_REG_X86_R15),
> +	SMPL_REG(R16, PERF_REG_X86_R16),
> +	SMPL_REG(R17, PERF_REG_X86_R17),
> +	SMPL_REG(R18, PERF_REG_X86_R18),
> +	SMPL_REG(R19, PERF_REG_X86_R19),
> +	SMPL_REG(R20, PERF_REG_X86_R20),
> +	SMPL_REG(R21, PERF_REG_X86_R21),
> +	SMPL_REG(R22, PERF_REG_X86_R22),
> +	SMPL_REG(R23, PERF_REG_X86_R23),
> +	SMPL_REG(R24, PERF_REG_X86_R24),
> +	SMPL_REG(R25, PERF_REG_X86_R25),
> +	SMPL_REG(R26, PERF_REG_X86_R26),
> +	SMPL_REG(R27, PERF_REG_X86_R27),
> +	SMPL_REG(R28, PERF_REG_X86_R28),
> +	SMPL_REG(R29, PERF_REG_X86_R29),
> +	SMPL_REG(R30, PERF_REG_X86_R30),
> +	SMPL_REG(R31, PERF_REG_X86_R31),
> +	SMPL_REG(SSP, PERF_REG_X86_SSP),
> +#endif
> +	SMPL_REG_END
> +};
> +
>  static const struct sample_reg sample_reg_masks[] = {
>  	SMPL_REG(AX, PERF_REG_X86_AX),
>  	SMPL_REG(BX, PERF_REG_X86_BX),
> @@ -276,27 +319,159 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
>  	return SDT_ARG_VALID;
>  }
>  
> +static bool support_simd_reg(u64 sample_type, u16 qwords, u64 mask, bool pred)
> +{
> +	struct perf_event_attr attr = {
> +		.type				= PERF_TYPE_HARDWARE,
> +		.config				= PERF_COUNT_HW_CPU_CYCLES,
> +		.sample_type			= sample_type,
> +		.disabled 			= 1,
> +		.exclude_kernel			= 1,
> +		.sample_simd_regs_enabled	= 1,
> +	};
> +	int fd;
> +
> +	attr.sample_period = 1;
> +
> +	if (!pred) {
> +		attr.sample_simd_vec_reg_qwords = qwords;
> +		if (sample_type == PERF_SAMPLE_REGS_INTR)
> +			attr.sample_simd_vec_reg_intr = mask;
> +		else
> +			attr.sample_simd_vec_reg_user = mask;
> +	} else {
> +		attr.sample_simd_pred_reg_qwords = PERF_X86_OPMASK_QWORDS;
> +		if (sample_type == PERF_SAMPLE_REGS_INTR)
> +			attr.sample_simd_pred_reg_intr = PERF_X86_SIMD_PRED_MASK;
> +		else
> +			attr.sample_simd_pred_reg_user = PERF_X86_SIMD_PRED_MASK;
> +	}
> +
> +	if (perf_pmus__num_core_pmus() > 1) {
> +		struct perf_pmu *pmu = NULL;
> +		__u64 type = PERF_TYPE_RAW;
> +
> +		/*
> +		 * The same register set is supported among different hybrid PMUs.
> +		 * Only check the first available one.
> +		 */
> +		while ((pmu = perf_pmus__scan_core(pmu)) != NULL) {
> +			type = pmu->type;
> +			break;
> +		}
> +		attr.config |= type << PERF_PMU_TYPE_SHIFT;
> +	}
> +
> +	event_attr_init(&attr);
> +
> +	fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
> +	if (fd != -1) {
> +		close(fd);
> +		return true;
> +	}
> +
> +	return false;
> +}
> +
> +static uint64_t intr_simd_mask, user_simd_mask, pred_mask;
> +static u16	intr_simd_qwords, user_simd_qwords, pred_qwords;
> +
> +static bool get_simd_reg_mask(u64 sample_type)
> +{
> +	u64 mask = GENMASK_ULL(PERF_X86_H16ZMM_BASE - 1, 0);
> +	u16 qwords = PERF_X86_ZMM_QWORDS;
> +
> +	if (support_simd_reg(sample_type, qwords, mask, false)) {
> +		if (support_simd_reg(sample_type, qwords, PERF_X86_SIMD_VEC_MASK, false))
> +			mask = PERF_X86_SIMD_VEC_MASK;
> +	} else {
> +		qwords = PERF_X86_YMM_QWORDS;
> +		if (!support_simd_reg(sample_type, qwords, mask, false)) {
> +			qwords = PERF_X86_XMM_QWORDS;
> +			if (!support_simd_reg(sample_type, qwords, mask, false)) {
> +				qwords = 0;
> +				mask = 0;
> +			}
> +		}
> +	}
> +
> +	if (sample_type == PERF_SAMPLE_REGS_INTR) {
> +		intr_simd_mask = mask;
> +		intr_simd_qwords = qwords;
> +	} else {
> +		user_simd_mask = mask;
> +		user_simd_qwords = qwords;
> +	}

It looks like we only use a global variable to save the SIMD regs mask,
but different SIMD regs have different masks; e.g., ZMM has 32 regs but
XMM/YMM only have 16. So if the HW supports ZMM16-ZMM31, the SIMD regs
mask would always be set to 0xffffffff. Is that correct for the YMM/XMM
regs?
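
For example, assuming PERF_X86_H16ZMM_BASE is 16 and
PERF_X86_SIMD_VEC_MASK covers all 32 vector registers (illustrative
values only, not taken from the patch):

	mask = GENMASK_ULL(31, 0);	/* probe succeeded with ZMM qwords */
	/*
	 * But XMM/YMM only have 16 architectural registers, so bits
	 * 16-31 would be meaningless for a later -I XMM request.
	 */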


> +
> +	if (support_simd_reg(sample_type, qwords, mask, true)) {
> +		pred_mask = PERF_X86_SIMD_PRED_MASK;
> +		pred_qwords = PERF_X86_OPMASK_QWORDS;
> +	}
> +
> +	return true;

It seems this function always returns true, which feels incorrect.
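
A sketch of one possible fix (untested; it relies on mask and pred_mask
being zero when probing fails, as in the code above):

	-	return true;
	+	/* Report success only if some register set was probed. */
	+	return mask || pred_mask;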


> +}
> +
> +static bool has_cap_simd_regs(void)
> +{
> +	static bool has_cap_simd_regs;
> +	static bool cached;
> +
> +	if (cached)
> +		return has_cap_simd_regs;
> +
> +	cached = true;
> +	has_cap_simd_regs = get_simd_reg_mask(PERF_SAMPLE_REGS_INTR);
> +	has_cap_simd_regs |= get_simd_reg_mask(PERF_SAMPLE_REGS_USER);
> +
> +	return has_cap_simd_regs;
> +}
> +
>  const struct sample_reg *arch__sample_reg_masks(void)
>  {
> +	if (has_cap_simd_regs())
> +		return sample_reg_masks_ext;
>  	return sample_reg_masks;
>  }
>  
> -uint64_t arch__intr_reg_mask(void)
> +static const struct sample_reg sample_simd_reg_masks_empty[] = {
> +	SMPL_REG_END
> +};
> +
> +static const struct sample_reg sample_simd_reg_masks[] = {
> +	SMPL_REG(XMM, 1),
> +	SMPL_REG(YMM, 2),
> +	SMPL_REG(ZMM, 3),
> +	SMPL_REG(OPMASK, 32),
> +	SMPL_REG_END
> +};

We extend the ".mask" field to represent the SIMD mask and the qword size
simultaneously. It works, but it's really hard to understand and increases
the complexity. Could we just define some global variables or helpers to
get the mask and qword size for each kind of SIMD reg (XMM/YMM/ZMM)?
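
For instance, a hypothetical, more explicit layout (field names invented
here, not part of the patch):

	struct sample_simd_reg {
		const char	*name;
		uint64_t	mask;		/* register bitmap         */
		uint16_t	vec_qwords;	/* vector width, in qwords */
		uint16_t	pred_qwords;	/* predicate width         */
	};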


> +
> +const struct sample_reg *arch__sample_simd_reg_masks(void)
> +{
> +	if (has_cap_simd_regs())
> +		return sample_simd_reg_masks;
> +	return sample_simd_reg_masks_empty;
> +}
> +
> +static uint64_t __arch__reg_mask(u64 sample_type, u64 mask, bool has_simd_regs)
>  {
>  	struct perf_event_attr attr = {
> -		.type			= PERF_TYPE_HARDWARE,
> -		.config			= PERF_COUNT_HW_CPU_CYCLES,
> -		.sample_type		= PERF_SAMPLE_REGS_INTR,
> -		.sample_regs_intr	= PERF_REG_EXTENDED_MASK,
> -		.precise_ip		= 1,
> -		.disabled 		= 1,
> -		.exclude_kernel		= 1,
> +		.type				= PERF_TYPE_HARDWARE,
> +		.config				= PERF_COUNT_HW_CPU_CYCLES,
> +		.sample_type			= sample_type,
> +		.precise_ip			= 1,
> +		.disabled 			= 1,
> +		.exclude_kernel			= 1,
> +		.sample_simd_regs_enabled	= has_simd_regs,
>  	};
>  	int fd;
>  	/*
>  	 * In an unnamed union, init it here to build on older gcc versions
>  	 */
>  	attr.sample_period = 1;
> +	if (sample_type == PERF_SAMPLE_REGS_INTR)
> +		attr.sample_regs_intr = mask;
> +	else
> +		attr.sample_regs_user = mask;
>  
>  	if (perf_pmus__num_core_pmus() > 1) {
>  		struct perf_pmu *pmu = NULL;
> @@ -318,13 +493,73 @@ uint64_t arch__intr_reg_mask(void)
>  	fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
>  	if (fd != -1) {
>  		close(fd);
> -		return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
> +		return mask;
>  	}
>  
> -	return PERF_REGS_MASK;
> +	return 0;
> +}
> +
> +uint64_t arch__intr_reg_mask(void)
> +{
> +	uint64_t mask = PERF_REGS_MASK;
> +
> +	if (has_cap_simd_regs()) {
> +		mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
> +					 GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
> +					 true);
> +		mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR,
> +					 BIT_ULL(PERF_REG_X86_SSP),
> +					 true);
> +	} else
> +		mask |= __arch__reg_mask(PERF_SAMPLE_REGS_INTR, PERF_REG_EXTENDED_MASK, false);
> +
> +	return mask;
>  }
>  
>  uint64_t arch__user_reg_mask(void)
>  {
> -	return PERF_REGS_MASK;
> +	uint64_t mask = PERF_REGS_MASK;
> +
> +	if (has_cap_simd_regs()) {
> +		mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
> +					 GENMASK_ULL(PERF_REG_X86_R31, PERF_REG_X86_R16),
> +					 true);
> +		mask |= __arch__reg_mask(PERF_SAMPLE_REGS_USER,
> +					 BIT_ULL(PERF_REG_X86_SSP),
> +					 true);
> +	}
> +
> +	return mask;
> +}
> +
> +uint64_t arch__intr_simd_reg_mask(u16 *qwords)
> +{
> +	if (!has_cap_simd_regs())
> +		return 0;
> +	*qwords = intr_simd_qwords;
> +	return intr_simd_mask;
> +}
> +
> +uint64_t arch__user_simd_reg_mask(u16 *qwords)
> +{
> +	if (!has_cap_simd_regs())
> +		return 0;
> +	*qwords = user_simd_qwords;
> +	return user_simd_mask;
> +}
> +
> +uint64_t arch__intr_pred_reg_mask(u16 *qwords)
> +{
> +	if (!has_cap_simd_regs())
> +		return 0;
> +	*qwords = pred_qwords;
> +	return pred_mask;
> +}
> +
> +uint64_t arch__user_pred_reg_mask(u16 *qwords)
> +{
> +	if (!has_cap_simd_regs())
> +		return 0;
> +	*qwords = pred_qwords;
> +	return pred_mask;
>  }
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index d55482f094bf..af6e1c843fc5 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -1402,12 +1402,37 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
>  		evsel__set_sample_bit(evsel, REGS_INTR);
>  	}
>  
> +	if ((opts->sample_intr_vec_regs || opts->sample_intr_pred_regs) &&
> +	    !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
> +		/* A nonzero pred qwords implies the SIMD register set is in use */
> +		if (opts->sample_pred_regs_qwords)
> +			attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
> +		else
> +			attr->sample_simd_pred_reg_qwords = 1;
> +		attr->sample_simd_vec_reg_intr = opts->sample_intr_vec_regs;
> +		attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
> +		attr->sample_simd_pred_reg_intr = opts->sample_intr_pred_regs;
> +		evsel__set_sample_bit(evsel, REGS_INTR);
> +	}
> +
>  	if (opts->sample_user_regs && !evsel->no_aux_samples &&
>  	    !evsel__is_dummy_event(evsel)) {
>  		attr->sample_regs_user |= opts->sample_user_regs;
>  		evsel__set_sample_bit(evsel, REGS_USER);
>  	}
>  
> +	if ((opts->sample_user_vec_regs || opts->sample_user_pred_regs) &&
> +	    !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
> +		if (opts->sample_pred_regs_qwords)
> +			attr->sample_simd_pred_reg_qwords = opts->sample_pred_regs_qwords;
> +		else
> +			attr->sample_simd_pred_reg_qwords = 1;
> +		attr->sample_simd_vec_reg_user = opts->sample_user_vec_regs;
> +		attr->sample_simd_vec_reg_qwords = opts->sample_vec_regs_qwords;
> +		attr->sample_simd_pred_reg_user = opts->sample_user_pred_regs;
> +		evsel__set_sample_bit(evsel, REGS_USER);
> +	}
> +
>  	if (target__has_cpu(&opts->target) || opts->sample_cpu)
>  		evsel__set_sample_bit(evsel, CPU);
>  
> diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
> index cda1c620968e..27266038352f 100644
> --- a/tools/perf/util/parse-regs-options.c
> +++ b/tools/perf/util/parse-regs-options.c
> @@ -4,20 +4,26 @@
>  #include <stdint.h>
>  #include <string.h>
>  #include <stdio.h>
> +#include <linux/bitops.h>
>  #include "util/debug.h"
>  #include <subcmd/parse-options.h>
>  #include "util/perf_regs.h"
>  #include "util/parse-regs-options.h"
> +#include "record.h"
>  
>  static int
>  __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>  {
>  	uint64_t *mode = (uint64_t *)opt->value;
>  	const struct sample_reg *r = NULL;
> +	u16 simd_qwords, pred_qwords;
> +	u64 simd_mask, pred_mask;
> +	struct record_opts *opts;
>  	char *s, *os = NULL, *p;
>  	int ret = -1;
>  	uint64_t mask;
>  
> +
>  	if (unset)
>  		return 0;
>  
> @@ -27,10 +33,17 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>  	if (*mode)
>  		return -1;
>  
> -	if (intr)
> +	if (intr) {
> +		opts = container_of(opt->value, struct record_opts, sample_intr_regs);
>  		mask = arch__intr_reg_mask();
> -	else
> +		simd_mask = arch__intr_simd_reg_mask(&simd_qwords);
> +		pred_mask = arch__intr_pred_reg_mask(&pred_qwords);
> +	} else {
> +		opts = container_of(opt->value, struct record_opts, sample_user_regs);
>  		mask = arch__user_reg_mask();
> +		simd_mask = arch__user_simd_reg_mask(&simd_qwords);
> +		pred_mask = arch__user_pred_reg_mask(&pred_qwords);
> +	}
>  
>  	/* str may be NULL in case no arg is passed to -I */
>  	if (str) {
> @@ -50,10 +63,51 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>  					if (r->mask & mask)
>  						fprintf(stderr, "%s ", r->name);
>  				}
> +				for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> +					if (pred_qwords == r->qwords.pred) {
> +						fprintf(stderr, "%s0-%d ", r->name, fls64(pred_mask) - 1);
> +						continue;
> +					}
> +					if (simd_qwords >= r->mask)
> +						fprintf(stderr, "%s0-%d ", r->name, fls64(simd_mask) - 1);
> +				}
> +
>  				fputc('\n', stderr);
>  				/* just printing available regs */
>  				goto error;
>  			}
> +
> +			if (simd_mask || pred_mask) {
> +				u16 vec_regs_qwords = 0, pred_regs_qwords = 0;
> +
> +				for (r = arch__sample_simd_reg_masks(); r->name; r++) {
> +					if (!strcasecmp(s, r->name)) {
> +						vec_regs_qwords = r->qwords.vec;
> +						pred_regs_qwords = r->qwords.pred;
> +						break;
> +					}
> +				}
> +
> +				/* Just need the highest qwords */
> +				if (vec_regs_qwords > opts->sample_vec_regs_qwords) {
> +					opts->sample_vec_regs_qwords = vec_regs_qwords;
> +					if (intr)
> +						opts->sample_intr_vec_regs = simd_mask;
> +					else
> +						opts->sample_user_vec_regs = simd_mask;
> +				}
> +				if (pred_regs_qwords > opts->sample_pred_regs_qwords) {
> +					opts->sample_pred_regs_qwords = pred_regs_qwords;
> +					if (intr)
> +						opts->sample_intr_pred_regs = pred_mask;
> +					else
> +						opts->sample_user_pred_regs = pred_mask;
> +				}
> +
> +				if (r->name)
> +					goto next;
> +			}
> +
>  			for (r = arch__sample_reg_masks(); r->name; r++) {
>  				if ((r->mask & mask) && !strcasecmp(s, r->name))
>  					break;
> @@ -65,7 +119,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
>  			}
>  
>  			*mode |= r->mask;
> -
> +next:
>  			if (!p)
>  				break;
>  
> diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
> index 66b666d9ce64..fb0366d050cf 100644
> --- a/tools/perf/util/perf_event_attr_fprintf.c
> +++ b/tools/perf/util/perf_event_attr_fprintf.c
> @@ -360,6 +360,12 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
>  	PRINT_ATTRf(aux_start_paused, p_unsigned);
>  	PRINT_ATTRf(aux_pause, p_unsigned);
>  	PRINT_ATTRf(aux_resume, p_unsigned);
> +	PRINT_ATTRf(sample_simd_pred_reg_qwords, p_unsigned);
> +	PRINT_ATTRf(sample_simd_pred_reg_intr, p_hex);
> +	PRINT_ATTRf(sample_simd_pred_reg_user, p_hex);
> +	PRINT_ATTRf(sample_simd_vec_reg_qwords, p_unsigned);
> +	PRINT_ATTRf(sample_simd_vec_reg_intr, p_hex);
> +	PRINT_ATTRf(sample_simd_vec_reg_user, p_hex);
>  
>  	return ret;
>  }
> diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
> index 44b90bbf2d07..0744c77b4ac8 100644
> --- a/tools/perf/util/perf_regs.c
> +++ b/tools/perf/util/perf_regs.c
> @@ -21,6 +21,30 @@ uint64_t __weak arch__user_reg_mask(void)
>  	return 0;
>  }
>  
> +uint64_t __weak arch__intr_simd_reg_mask(u16 *qwords)
> +{
> +	*qwords = 0;
> +	return 0;
> +}
> +
> +uint64_t __weak arch__user_simd_reg_mask(u16 *qwords)
> +{
> +	*qwords = 0;
> +	return 0;
> +}
> +
> +uint64_t __weak arch__intr_pred_reg_mask(u16 *qwords)
> +{
> +	*qwords = 0;
> +	return 0;
> +}
> +
> +uint64_t __weak arch__user_pred_reg_mask(u16 *qwords)
> +{
> +	*qwords = 0;
> +	return 0;
> +}
> +
>  static const struct sample_reg sample_reg_masks[] = {
>  	SMPL_REG_END
>  };
> @@ -30,6 +54,11 @@ const struct sample_reg * __weak arch__sample_reg_masks(void)
>  	return sample_reg_masks;
>  }
>  
> +const struct sample_reg * __weak arch__sample_simd_reg_masks(void)
> +{
> +	return sample_reg_masks;
> +}
> +
>  const char *perf_reg_name(int id, const char *arch)
>  {
>  	const char *reg_name = NULL;
> diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
> index f2d0736d65cc..b932caa73a8a 100644
> --- a/tools/perf/util/perf_regs.h
> +++ b/tools/perf/util/perf_regs.h
> @@ -9,7 +9,13 @@ struct regs_dump;
>  
>  struct sample_reg {
>  	const char *name;
> -	uint64_t mask;
> +	union {
> +		struct {
> +			uint32_t vec;
> +			uint32_t pred;
> +		} qwords;
> +		uint64_t mask;
> +	};
>  };
>  
>  #define SMPL_REG_MASK(b) (1ULL << (b))
> @@ -27,6 +33,11 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op);
>  uint64_t arch__intr_reg_mask(void);
>  uint64_t arch__user_reg_mask(void);
>  const struct sample_reg *arch__sample_reg_masks(void);
> +const struct sample_reg *arch__sample_simd_reg_masks(void);
> +uint64_t arch__intr_simd_reg_mask(u16 *qwords);
> +uint64_t arch__user_simd_reg_mask(u16 *qwords);
> +uint64_t arch__intr_pred_reg_mask(u16 *qwords);
> +uint64_t arch__user_pred_reg_mask(u16 *qwords);
>  
>  const char *perf_reg_name(int id, const char *arch);
>  int perf_reg_value(u64 *valp, struct regs_dump *regs, int id);
> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
> index ea3a6c4657ee..825ffb4cc53f 100644
> --- a/tools/perf/util/record.h
> +++ b/tools/perf/util/record.h
> @@ -59,7 +59,13 @@ struct record_opts {
>  	unsigned int  user_freq;
>  	u64	      branch_stack;
>  	u64	      sample_intr_regs;
> +	u64	      sample_intr_vec_regs;
>  	u64	      sample_user_regs;
> +	u64	      sample_user_vec_regs;
> +	u16	      sample_pred_regs_qwords;
> +	u16	      sample_vec_regs_qwords;
> +	u16	      sample_intr_pred_regs;
> +	u16	      sample_user_pred_regs;
>  	u64	      default_interval;
>  	u64	      user_interval;
>  	size_t	      auxtrace_snapshot_size;

^ permalink raw reply	[flat|nested] 32+ messages in thread
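
For context, a usage sketch based on the examples in the changelog above
(the workload name is made up; per the changelog, requesting XMM and YMM
collects only the superset, i.e. all 16 YMMs):

	# probe what this kernel/tool combination exposes
	$ perf record -I?
	# sample the superset group and inspect the dumped registers
	$ perf record -I AX,XMM,YMM -- ./workload
	$ perf script -F ip,iregs | head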

* Re: [POC PATCH 14/17] perf/x86/regs: Only support legacy regs for the PT and PERF_REGS_MASK for now
  2025-08-15 21:34 ` [POC PATCH 14/17] perf/x86/regs: Only support legacy regs for the PT and PERF_REGS_MASK for now kan.liang
@ 2025-08-25  9:07   ` Adrian Hunter
  0 siblings, 0 replies; 32+ messages in thread
From: Adrian Hunter @ 2025-08-25  9:07 UTC (permalink / raw)
  To: kan.liang, peterz, mingo, acme, namhyung, tglx, dave.hansen,
	irogers, jolsa, alexander.shishkin, linux-kernel
  Cc: dapeng1.mi, ak, zide.chen, mark.rutland, broonie, ravi.bangoria,
	eranian

On 16/08/2025 00:34, kan.liang@linux.intel.com wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
> 
> The PERF_REG_X86_64_MAX is going to be updated to support more regs,
> e.g., eGPRs.
> However, the PT and PERF_REGS_MASK will not be touched in the POC.
> Using the PERF_REG_X86_R15 + 1 to replace PERF_REG_X86_64_MAX.
> 
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>

For intel-pt.c

Acked-by: Adrian Hunter <adrian.hunter@intel.com>

> ---
>  tools/perf/arch/x86/include/perf_regs.h | 2 +-
>  tools/perf/util/intel-pt.c              | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/perf/arch/x86/include/perf_regs.h b/tools/perf/arch/x86/include/perf_regs.h
> index f209ce2c1dd9..793fb597b03f 100644
> --- a/tools/perf/arch/x86/include/perf_regs.h
> +++ b/tools/perf/arch/x86/include/perf_regs.h
> @@ -17,7 +17,7 @@ void perf_regs_load(u64 *regs);
>  		       (1ULL << PERF_REG_X86_ES) | \
>  		       (1ULL << PERF_REG_X86_FS) | \
>  		       (1ULL << PERF_REG_X86_GS))
> -#define PERF_REGS_MASK (((1ULL << PERF_REG_X86_64_MAX) - 1) & ~REG_NOSUPPORT)
> +#define PERF_REGS_MASK (((1ULL << (PERF_REG_X86_R15 + 1)) - 1) & ~REG_NOSUPPORT)
>  #define PERF_SAMPLE_REGS_ABI PERF_SAMPLE_REGS_ABI_64
>  #endif
>  
> diff --git a/tools/perf/util/intel-pt.c b/tools/perf/util/intel-pt.c
> index 9b1011fe4826..a9585524f2e1 100644
> --- a/tools/perf/util/intel-pt.c
> +++ b/tools/perf/util/intel-pt.c
> @@ -2181,7 +2181,7 @@ static u64 *intel_pt_add_gp_regs(struct regs_dump *intr_regs, u64 *pos,
>  	u32 bit;
>  	int i;
>  
> -	for (i = 0, bit = 1; i < PERF_REG_X86_64_MAX; i++, bit <<= 1) {
> +	for (i = 0, bit = 1; i < PERF_REG_X86_R15 + 1; i++, bit <<= 1) {
>  		/* Get the PEBS gp_regs array index */
>  		int n = pebs_gp_regs[i] - 1;
>  


^ permalink raw reply	[flat|nested] 32+ messages in thread
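
For reference, the replacement keeps the value PERF_REGS_MASK had before
this series (a sketch of the arithmetic, assuming the current uapi enum
where PERF_REG_X86_R15 is 23 and PERF_REG_X86_64_MAX was R15 + 1):

	/*
	 * Both the old and the new expression evaluate to bits 0..23
	 * minus the unsupported segment registers:
	 */
	mask = ((1ULL << 24) - 1) & ~REG_NOSUPPORT;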

end of thread, other threads:[~2025-08-25  9:08 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-08-15 21:34 [PATCH V3 00/17] Support vector and more extended registers in perf kan.liang
2025-08-15 21:34 ` [PATCH V3 01/17] perf/x86: Use x86_perf_regs in the x86 nmi handler kan.liang
2025-08-15 21:34 ` [PATCH V3 02/17] perf/x86: Setup the regs data kan.liang
2025-08-15 21:34 ` [PATCH V3 03/17] x86/fpu/xstate: Add xsaves_nmi kan.liang
2025-08-15 21:34 ` [PATCH V3 04/17] perf: Move has_extended_regs() to header file kan.liang
2025-08-15 21:34 ` [PATCH V3 05/17] perf/x86: Support XMM register for non-PEBS and REGS_USER kan.liang
2025-08-19 13:39   ` Peter Zijlstra
2025-08-19 15:55     ` Liang, Kan
2025-08-20  9:46       ` Mi, Dapeng
2025-08-20 18:03         ` Liang, Kan
2025-08-21  1:00           ` Mi, Dapeng
2025-08-15 21:34 ` [PATCH V3 06/17] perf: Support SIMD registers kan.liang
2025-08-20  9:55   ` Mi, Dapeng
2025-08-20 18:08     ` Liang, Kan
2025-08-15 21:34 ` [PATCH V3 07/17] perf/x86: Move XMM to sample_simd_vec_regs kan.liang
2025-08-15 21:34 ` [PATCH V3 08/17] perf/x86: Add YMM into sample_simd_vec_regs kan.liang
2025-08-20  9:59   ` Mi, Dapeng
2025-08-20 18:10     ` Liang, Kan
2025-08-15 21:34 ` [PATCH V3 09/17] perf/x86: Add ZMM " kan.liang
2025-08-15 21:34 ` [PATCH V3 10/17] perf/x86: Add OPMASK into sample_simd_pred_reg kan.liang
2025-08-15 21:34 ` [PATCH V3 11/17] perf/x86: Add eGPRs into sample_regs kan.liang
2025-08-20 10:01   ` Mi, Dapeng
2025-08-15 21:34 ` [PATCH V3 12/17] perf/x86: Add SSP " kan.liang
2025-08-15 21:34 ` [PATCH V3 13/17] perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS kan.liang
2025-08-15 21:34 ` [POC PATCH 14/17] perf/x86/regs: Only support legacy regs for the PT and PERF_REGS_MASK for now kan.liang
2025-08-25  9:07   ` Adrian Hunter
2025-08-15 21:34 ` [POC PATCH 15/17] tools headers: Sync with the kernel sources kan.liang
2025-08-15 21:34 ` [POC PATCH 16/17] perf parse-regs: Support the new SIMD format kan.liang
2025-08-20 10:04   ` Mi, Dapeng
2025-08-20 18:18     ` Liang, Kan
2025-08-21  3:35   ` Mi, Dapeng
2025-08-15 21:34 ` [POC PATCH 17/17] perf regs: Support the PERF_SAMPLE_REGS_ABI_SIMD kan.liang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).