* [PATCH 0/5] powerpc/htm: Add interface to expose HTM trace data via perf
From: Athira Rajeev @ 2026-07-01 8:38 UTC
To: linuxppc-dev, maddy
Cc: linux-perf-users, atrajeev, hbathini, tejas05, venkat88, tshah
The H_HTM hypervisor call (hcall) exports data from the Hardware Trace
Macro (HTM) function. This patchset adds support for setup, configuration
and control of HTM functions, as well as trace data collection, via the
perf PMU interface.
H_HTM is used as an interface for executing Hardware Trace Macro (HTM)
functions, including setup, configuration, control and dumping of the
HTM trace data. HTM operations can be controlled using the H_HTM hcall.
The hcall can be invoked for any core/chip of the system from within a
partition itself.
HTM perf interface usage:
The HTM (Hardware Trace Macro) perf interface enables collection and
analysis of hardware trace data from PowerPC systems. This interface
allows users to capture detailed execution traces for performance
analysis and debugging. The interface uses the perf AUX buffer
infrastructure to capture trace data.
This patchset contains the powerpc kernel-side changes.
Patch 1 introduces the HTM PMU interface with event configuration
support for specifying target hardware (node, chip, core) and trace
type. It integrates H_HTM hcall wrappers for starting, stopping, and
controlling HTM trace collection.
Patch 2 implements AUX buffer infrastructure by adding setup_aux and
free_aux callbacks, enabling perf to allocate and manage auxiliary
buffers for HTM data.
Patch 3 extends trace collection to capture system memory
configuration alongside trace data, using PERF_SAMPLE_RAW records to
mark data boundaries in the AUX buffer for post-processing.
Patches 4 and 5 provide comprehensive documentation for the HTM PMU
interface, including ABI documentation for sysfs attributes and
user-facing documentation with usage examples and workflows.
The perf tools side patches will be posted separately to the
linux-perf-users mailing list.
Event Configuration:
Use "perf record" with the htm PMU event. The event is configured using
named parameters that specify the target hardware location and trace type:
- htm_type: Type of HTM trace to collect (bits 0-3)
- nodeindex: Node index in the system topology (bits 4-11)
- nodalchipindex: Chip index within the specified node (bits 12-19)
- coreindexonchip: Core index on the specified chip (bits 20-27)
event: "config:0-27"
htm_type: "config:0-3"
nodeindex: "config:4-11"
nodalchipindex: "config:12-19"
coreindexonchip: "config:20-27"
1) nodeindex, nodalchipindex, coreindexonchip: together these specify
which partition to configure the HTM for.
2) htm_type: specifies the type of HTM trace.
Event Syntax:
The event configuration uses named parameters::
htm/nodeindex=N,nodalchipindex=C,coreindexonchip=R,htm_type=T/
Where:
- N = node index
- C = chip index within the node
- R = core index on the chip
- T = HTM type
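As a rough illustration of the bit layout listed above, the following userspace sketch packs the four named parameters into the 28-bit config value. The helper name htm_pack_config and the HTM_*_SHIFT macros are hypothetical and are not part of this patchset; only the bit ranges come from the format attributes.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions taken from the format attributes listed above. */
#define HTM_TYPE_SHIFT	0	/* htm_type:        config:0-3   */
#define HTM_NODE_SHIFT	4	/* nodeindex:       config:4-11  */
#define HTM_CHIP_SHIFT	12	/* nodalchipindex:  config:12-19 */
#define HTM_CORE_SHIFT	20	/* coreindexonchip: config:20-27 */

/*
 * Hypothetical helper (not in the patchset): build the config value
 * that perf would derive from the named event parameters.
 */
uint64_t htm_pack_config(uint32_t htm_type, uint32_t nodeindex,
			 uint32_t nodalchipindex, uint32_t coreindexonchip)
{
	/* Each field must fit in its bit range. */
	assert(htm_type <= 0xf);
	assert(nodeindex <= 0xff);
	assert(nodalchipindex <= 0xff);
	assert(coreindexonchip <= 0xff);

	return ((uint64_t)htm_type << HTM_TYPE_SHIFT) |
	       ((uint64_t)nodeindex << HTM_NODE_SHIFT) |
	       ((uint64_t)nodalchipindex << HTM_CHIP_SHIFT) |
	       ((uint64_t)coreindexonchip << HTM_CORE_SHIFT);
}
```

With this layout, an event such as htm/nodalchipindex=2,nodeindex=0,htm_type=1/ corresponds to a config value of 0x2001.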
Basic Usage Example:
To collect HTM trace data for a specific chip:
# perf record -C 1 -e htm/nodalchipindex=2,nodeindex=0,htm_type=1/ <workload>
In this example:
- nodeindex=0: Target node 0
- nodalchipindex=2: Target chip 2 within node 0
- htm_type=1: HTM trace type 1
Output Files:
After running "perf record", the following files are generated:
# ls htm.bin.*
htm.bin.n0.p2.c0 htm.bin.n1.p3.c0 # Binary trace files
# ls translation.*
translation.n0.p2.c0 translation.n1.p3.c0 # Memory configuration files
These files contain:
- **htm.bin.*** - Raw HTM trace data in binary format
- **translation.*** - Memory address translation information for decoding
Trace Data Processing:
Process the collected trace data using perf script:
# perf script -D
This command:
1. Reads the perf.data file
2. Decodes HTM trace data using translation files
3. Displays human-readable trace output
The decoder automatically:
- Translates physical addresses to logical addresses
- Creates decoded output files for analysis
- Correlates trace data with memory mappings
Here's a complete example of collecting and analyzing HTM traces:
# Step 1: Collect trace data
perf record -C 1 -e htm/nodalchipindex=2,nodeindex=0,htm_type=1/ sleep 5
# Step 2: Verify output files
ls htm.bin.* # Binary trace files
ls translation.* # Memory configuration files
ls perf.data # Perf data file
# Step 3: Decode and view traces
perf script -D > decoded_trace.txt
# Step 4: Analyze with perf report to see the hot logical addresses
perf report
Thanks
Athira
Athira Rajeev (5):
powerpc/htm: Add interface to expose HTM trace data via perf
powerpc/htm: Add support to setup and free aux buffer for capturing
HTM data
powerpc/perf: Capture the HTM memory configuration as part of perf
data
docs: ABI: sysfs-bus-event_source-devices-htm: Document sysfs event
format entries for htm pmu
powerpc/perf/htm: Add documentation for Hardware Trace Macro PMU
.../sysfs-bus-event_source-devices-htm | 21 +
Documentation/arch/powerpc/htm.rst | 137 ++++-
arch/powerpc/perf/Makefile | 2 +-
arch/powerpc/perf/htm-perf.c | 571 ++++++++++++++++++
4 files changed, 727 insertions(+), 4 deletions(-)
create mode 100644 Documentation/ABI/testing/sysfs-bus-event_source-devices-htm
create mode 100644 arch/powerpc/perf/htm-perf.c
--
2.52.0
* [PATCH 1/5] powerpc/htm: Add interface to expose HTM trace data via perf
From: Athira Rajeev @ 2026-07-01 8:38 UTC
To: linuxppc-dev, maddy
Cc: linux-perf-users, atrajeev, hbathini, tejas05, venkat88, tshah
The H_HTM hypervisor call (hcall) exports data from the Hardware Trace
Macro (HTM) function. Add support for setup, configuration and control
of the HTM function via a PMU.
H_HTM is used as an interface for executing Hardware Trace Macro (HTM)
functions, including setup, configuration, control and dumping of the
HTM data. HTM operations can be controlled using the H_HTM hcall. The
hcall can be invoked for any core/chip of the system from within a
partition itself.
To use this, expose an event as part of the "htm" PMU. The event code
(config) is a 28-bit value in which the user specifies the following fields:
event: "config:0-27"
htm_type: "config:0-3"
nodeindex: "config:4-11"
nodalchipindex: "config:12-19"
coreindexonchip: "config:20-27"
1) nodeindex, nodalchipindex, coreindexonchip: together these specify
which partition to configure the HTM for.
2) htm_type: specifies the type of HTM trace.
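For reference, the field extraction can be sketched in userspace C as below. The masks and shifts mirror those used in htm_event_add in this patch, and the HTM_NEST/HTM_CORE codes mirror the EVENT() definitions; the struct htm_target type and both function names are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Event codes as defined via EVENT() in htm-perf.c in this patch. */
enum { HTM_NEST = 0x1, HTM_CORE = 0x2 };

/* Hypothetical container for the decoded fields. */
struct htm_target {
	uint32_t htmtype;
	uint32_t nodeindex;
	uint32_t nodalchipindex;
	uint32_t coreindexonchip;
};

/* Split a config value into its fields, using the same masks as the driver. */
struct htm_target htm_unpack_config(uint64_t config)
{
	struct htm_target t;

	t.htmtype = config & 0xf;
	t.nodeindex = (config >> 4) & 0xff;
	t.nodalchipindex = (config >> 12) & 0xff;
	t.coreindexonchip = (config >> 20) & 0xff;
	return t;
}

/* Only the two HTM types known to the driver are accepted. */
bool htm_type_is_valid(uint32_t htmtype)
{
	return htmtype == HTM_NEST || htmtype == HTM_CORE;
}
```

The validity check corresponds to the switch in htm_event_init, which rejects any other type value with -EINVAL.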
In htm_event_add: configure and start tracing using htm_hcall_wrapper,
which is defined in the plpar_wrappers.h header file.
In htm_event_del: stop and deconfigure tracing using
htm_hcall_wrapper.
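The start and stop paths in the diff below retry the hcall while it reports busy, up to 100 attempts. A minimal userspace sketch of that bounded-retry pattern follows. The htm_op_fn callback type, htm_retry_while_busy and the htm_fake_op stub are hypothetical stand-ins for one htm_hcall_wrapper call followed by htm_return_check; note that the configure and deconfigure loops in the patch use a different condition (ret <= 0), which this sketch does not model.

```c
#include <errno.h>

#define HTM_MAX_RETRIES	100	/* retry bound used by the loops in this patch */

/*
 * Hypothetical stand-in for one hcall attempt plus return-code check.
 * Returns 1 on success, -EBUSY if the caller should retry, or another
 * negative errno on failure.
 */
typedef int (*htm_op_fn)(void *ctx);

/* Retry the operation while it reports -EBUSY, giving up after the bound. */
int htm_retry_while_busy(htm_op_fn op, void *ctx)
{
	int ret;
	int retries = 0;

	do {
		ret = op(ctx);
	} while (ret == -EBUSY && ++retries < HTM_MAX_RETRIES);

	return ret;
}

/*
 * Demo stub: ctx points to an int holding how many more times the
 * operation should report busy before it succeeds.
 */
int htm_fake_op(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return -EBUSY;
	}
	return 1;
}
```

Keeping the bound in one place makes it straightforward to see that a persistently busy hypervisor cannot stall event add or delete indefinitely.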
With the changes:
# ls /sys/bus/event_source/devices/ |grep htm
htm
# ls /sys/bus/event_source/devices/htm/
events format perf_event_mux_interval_ms power subsystem type uevent
Signed-off-by: Athira Rajeev <atrajeev@linux.ibm.com>
---
arch/powerpc/perf/Makefile | 2 +-
arch/powerpc/perf/htm-perf.c | 307 +++++++++++++++++++++++++++++++++++
2 files changed, 308 insertions(+), 1 deletion(-)
create mode 100644 arch/powerpc/perf/htm-perf.c
diff --git a/arch/powerpc/perf/Makefile b/arch/powerpc/perf/Makefile
index 78dd7e25219e..26ef30c0693c 100644
--- a/arch/powerpc/perf/Makefile
+++ b/arch/powerpc/perf/Makefile
@@ -14,7 +14,7 @@ obj-$(CONFIG_PPC_POWERNV) += imc-pmu.o
obj-$(CONFIG_FSL_EMB_PERF_EVENT) += core-fsl-emb.o
obj-$(CONFIG_FSL_EMB_PERF_EVENT_E500) += e500-pmu.o e6500-pmu.o
-obj-$(CONFIG_HV_PERF_CTRS) += hv-24x7.o hv-gpci.o hv-common.o vpa-dtl.o
+obj-$(CONFIG_HV_PERF_CTRS) += hv-24x7.o hv-gpci.o hv-common.o vpa-dtl.o htm-perf.o
obj-$(CONFIG_VPA_PMU) += vpa-pmu.o
diff --git a/arch/powerpc/perf/htm-perf.c b/arch/powerpc/perf/htm-perf.c
new file mode 100644
index 000000000000..e22a7fdce2f5
--- /dev/null
+++ b/arch/powerpc/perf/htm-perf.c
@@ -0,0 +1,307 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Perf interface to expose HTM Trace data.
+ *
+ * Copyright (C) 2025 Athira Rajeev, IBM Corporation
+ */
+
+#define pr_fmt(fmt) "htm: " fmt
+
+#include <asm/dtl.h>
+#include <linux/perf_event.h>
+#include <asm/plpar_wrappers.h>
+#include <linux/vmalloc.h>
+
+extern void perf_event_wakeup(struct perf_event *event);
+#define EVENT(_name, _code) enum{_name = _code}
+
+/*
+ * H_HTM (Hardware Trace Macro) hypervisor call is an HCALL to export
+ * data from Hardware Trace Macro (HTM) function.
+ *
+ * Event codes based on HTM type.
+ */
+EVENT(HTM_CORE, 0x2);
+EVENT(HTM_NEST, 0x1);
+
+GENERIC_EVENT_ATTR(htm_core, HTM_CORE);
+GENERIC_EVENT_ATTR(htm_nest, HTM_NEST);
+
+PMU_FORMAT_ATTR(event, "config:0-27");
+PMU_FORMAT_ATTR(htm_type, "config:0-3");
+PMU_FORMAT_ATTR(nodeindex, "config:4-11");
+PMU_FORMAT_ATTR(nodalchipindex, "config:12-19");
+PMU_FORMAT_ATTR(coreindexonchip, "config:20-27");
+
+static struct attribute *events_attr[] = {
+ GENERIC_EVENT_PTR(HTM_NEST),
+ GENERIC_EVENT_PTR(HTM_CORE),
+ NULL
+};
+
+static struct attribute_group event_group = {
+ .name = "events",
+ .attrs = events_attr,
+};
+
+static struct attribute *format_attrs[] = {
+ &format_attr_event.attr,
+ &format_attr_htm_type.attr,
+ &format_attr_nodeindex.attr,
+ &format_attr_nodalchipindex.attr,
+ &format_attr_coreindexonchip.attr,
+ NULL,
+};
+
+static const struct attribute_group format_group = {
+ .name = "format",
+ .attrs = format_attrs,
+};
+
+static const struct attribute_group *attr_groups[] = {
+ &format_group,
+ &event_group,
+ NULL,
+};
+
+static u64 htmflags = H_HTM_FLAGS_NOWRAP;
+
+/*
+ * Check the return code for H_HTM hcall.
+ * Return non-zero value (1) if either H_PARTIAL or H_SUCCESS
+ * is returned. For other return codes:
+ * Return zero if H_NOT_AVAILABLE.
+ * Return -EBUSY if hcall return busy.
+ * Return -EINVAL if any parameter or operation is not valid.
+ * Return -EPERM if HTM Virtualization Engine Technology code
+ * is not applied.
+ * Return -EIO if the HTM state is not valid.
+ */
+static ssize_t htm_return_check(int rc)
+{
+ switch (rc) {
+ case H_SUCCESS:
+ break;
+ /* H_PARTIAL for the case where all available data can't be
+ * returned due to buffer size constraint.
+ */
+ case H_PARTIAL:
+ break;
+ /* H_NOT_AVAILABLE indicates reading from an offset outside the range,
+ * i.e. past end of file.
+ */
+ case H_NOT_AVAILABLE:
+ return 0;
+ case H_BUSY:
+ case H_LONG_BUSY_ORDER_1_MSEC:
+ case H_LONG_BUSY_ORDER_10_MSEC:
+ case H_LONG_BUSY_ORDER_100_MSEC:
+ case H_LONG_BUSY_ORDER_1_SEC:
+ case H_LONG_BUSY_ORDER_10_SEC:
+ case H_LONG_BUSY_ORDER_100_SEC:
+ return -EBUSY;
+ case H_PARAMETER:
+ goto out;
+ case H_P2:
+ goto out;
+ case H_P3:
+ goto out;
+ case H_P4:
+ goto out;
+ case H_P5:
+ goto out;
+ case H_P6:
+ return -EINVAL;
+ case H_STATE:
+ return -EIO;
+ case H_AUTHORITY:
+ return -EPERM;
+ }
+
+ /*
+ * Return 1 for H_SUCCESS/H_PARTIAL
+ */
+ return 1;
+out:
+ return -EINVAL;
+}
+
+static int htm_event_init(struct perf_event *event)
+{
+ struct hw_perf_event *hwc = &event->hw;
+ u64 config = event->attr.config;
+ u32 htmtype;
+
+ if (event->attr.inherit)
+ return -EOPNOTSUPP;
+
+ /* test the event attr type for PMU enumeration */
+ if (event->attr.type != event->pmu->type)
+ return -ENOENT;
+
+ if (!perfmon_capable())
+ return -EACCES;
+
+ /* Return if this is a counting event */
+ if (!is_sampling_event(event))
+ return -EOPNOTSUPP;
+
+ /* no branch sampling */
+ if (has_branch_stack(event))
+ return -EOPNOTSUPP;
+
+ htmtype = config & 0xf;
+ /* Invalid eventcode */
+ switch (htmtype) {
+ case HTM_CORE:
+ case HTM_NEST:
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ htmflags = H_HTM_FLAGS_NOWRAP;
+
+ if (event->attr.freq) {
+ hwc->sample_period = event->attr.sample_period;
+ local64_set(&hwc->period_left, hwc->sample_period);
+ hwc->last_period = hwc->sample_period;
+ event->attr.freq = 0;
+ }
+
+ return 0;
+}
+
+static int htm_event_add(struct perf_event *event, int flags)
+{
+ int rc, ret;
+ unsigned long param1 = -1, param2 = -1;
+ int retries = 0;
+ u64 config = event->attr.config;
+ u32 htmtype, nodeindex, nodalchipindex, coreindexonchip;
+
+ /*
+ * Invoke H_HTM call with:
+ * operation as htm configure (H_HTM_OP_CONFIGURE)
+ * last three values are unused, hence set to zero
+ */
+ htmtype = config & 0xf;
+ nodeindex = (config >> 4) & 0xff;
+ nodalchipindex = (config >> 12) & 0xff;
+ coreindexonchip = (config >> 20) & 0xff;
+ do {
+ rc = htm_hcall_wrapper(htmflags, nodeindex, nodalchipindex, coreindexonchip,
+ htmtype, H_HTM_OP_CONFIGURE, param1, param2, 0);
+ ret = htm_return_check(rc);
+ } while (ret <= 0 && ++retries < 100);
+ if (ret <= 0)
+ return -1;
+
+ /* Reset retries */
+ retries = 0;
+
+ /*
+ * Invoke H_HTM call with:
+ * operation as htm start (H_HTM_OP_START)
+ * last three values are unused, hence set to zero
+ */
+ do {
+ rc = htm_hcall_wrapper(htmflags, nodeindex, nodalchipindex, coreindexonchip,
+ htmtype, H_HTM_OP_START, 0, 0, 0);
+ ret = htm_return_check(rc);
+ } while (ret == -EBUSY && ++retries < 100);
+
+ if (htm_return_check(rc) <= 0)
+ return -1;
+
+ return 0;
+}
+
+static void htm_event_del(struct perf_event *event, int flags)
+{
+ long rc;
+ int ret;
+ int retries = 0;
+ u64 config = event->attr.config;
+ u32 htmtype, nodeindex, nodalchipindex, coreindexonchip;
+
+ /*
+ * Invoke H_HTM call with:
+ * operation as htm stop (H_HTM_OP_STOP)
+ * last three values are unused, hence set to zero
+ */
+ htmtype = config & 0xf;
+ nodeindex = (config >> 4) & 0xff;
+ nodalchipindex = (config >> 12) & 0xff;
+ coreindexonchip = (config >> 20) & 0xff;
+ do {
+ rc = htm_hcall_wrapper(htmflags, nodeindex, nodalchipindex, coreindexonchip,
+ htmtype, H_HTM_OP_STOP, 0, 0, 0);
+ ret = htm_return_check(rc);
+ } while (ret == -EBUSY && ++retries < 100);
+
+ /* Reset retries */
+ retries = 0;
+
+ /*
+ * Invoke H_HTM call with:
+ * operation as htm deconfigure (H_HTM_OP_DECONFIGURE)
+ * last three values are unused, hence set to zero
+ */
+ do {
+ rc = htm_hcall_wrapper(htmflags, nodeindex, nodalchipindex, coreindexonchip,
+ htmtype, H_HTM_OP_DECONFIGURE, 0, 0, 0);
+ ret = htm_return_check(rc);
+ } while (ret <= 0 && ++retries < 100);
+}
+
+/*
+ * This function definition is empty as htm_dump_sample_data
+ * is used to parse and dump the HTM trace data,
+ * to perf data.
+ */
+static void htm_event_read(struct perf_event *event)
+{
+ return;
+}
+
+static void htm_event_start(struct perf_event *event, int flags)
+{
+}
+
+static void htm_event_stop(struct perf_event *event, int flags)
+{
+}
+
+static struct pmu htm_pmu = {
+ .task_ctx_nr = perf_invalid_context,
+
+ .name = "htm",
+ .attr_groups = attr_groups,
+ .event_init = htm_event_init,
+ .add = htm_event_add,
+ .del = htm_event_del,
+ .read = htm_event_read,
+ .start = htm_event_start,
+ .stop = htm_event_stop,
+ .capabilities = PERF_PMU_CAP_NO_EXCLUDE | PERF_PMU_CAP_EXCLUSIVE,
+};
+
+static int htm_init(void)
+{
+ int r;
+
+ /* This driver is intended only for L1 host. */
+ if (is_kvm_guest()) {
+ pr_debug("Only supported for L1 host system\n");
+ return -ENODEV;
+ }
+
+ r = perf_pmu_register(&htm_pmu, htm_pmu.name, -1);
+ if (r)
+ return r;
+
+ return 0;
+}
+
+device_initcall(htm_init);
--
2.52.0
* [PATCH 2/5] powerpc/htm: Add support to setup and free aux buffer for capturing HTM data
From: Athira Rajeev @ 2026-07-01 8:38 UTC
To: linuxppc-dev, maddy
Cc: linux-perf-users, atrajeev, hbathini, tejas05, venkat88, tshah
HTM trace data is saved to perf.data when monitoring completes.
The trace data is copied directly into the auxiliary (AUX) buffer and
post-processed later. To enable AUX buffer support, add the setup_aux
and free_aux PMU callbacks.
In setup_aux, set up the PMU-private data structures for an AUX
area. rb_alloc_aux uses "alloc_pages_node" and returns a pointer to each
page address. "struct htm_pmu_buf" mainly saves:
1. buf->base: AUX buffer base address
2. buf->head: offset from the base address where data will be written
3. buf->size: size of the allocated memory
free_aux frees the PMU-private AUX data structures.
Signed-off-by: Athira Rajeev <atrajeev@linux.ibm.com>
---
arch/powerpc/perf/htm-perf.c | 162 ++++++++++++++++++++++++++++++++++-
1 file changed, 160 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/perf/htm-perf.c b/arch/powerpc/perf/htm-perf.c
index e22a7fdce2f5..ae7f469b6840 100644
--- a/arch/powerpc/perf/htm-perf.c
+++ b/arch/powerpc/perf/htm-perf.c
@@ -66,6 +66,23 @@ static const struct attribute_group *attr_groups[] = {
static u64 htmflags = H_HTM_FLAGS_NOWRAP;
+struct htm_pmu_buf {
+ int nr_pages;
+ bool snapshot;
+ void *base;
+ u64 size;
+ u64 head;
+ u64 head_size;
+ bool full;
+ int htm_stopped;
+ int collect_htm_trace;
+};
+
+struct htm_pmu_ctx {
+ struct perf_output_handle handle;
+};
+
+static DEFINE_PER_CPU(struct htm_pmu_ctx, htm_pmu_ctx);
/*
* Check the return code for H_HTM hcall.
* Return non-zero value (1) if either H_PARTIAL or H_SUCCESS
@@ -126,6 +143,74 @@ static ssize_t htm_return_check(int rc)
return -EINVAL;
}
+static int htm_dump_sample_data(struct perf_event *event)
+{
+ struct htm_pmu_ctx *htm_ctx = this_cpu_ptr(&htm_pmu_ctx);
+ struct htm_pmu_buf *aux_buf;
+ u64 config = event->attr.config;
+ u32 htmtype, nodeindex, nodalchipindex, coreindexonchip;
+ long rc;
+ int ret = 0;
+ int retries = 0;
+
+ htmtype = config & 0xf;
+ nodeindex = (config >> 4) & 0xff;
+ nodalchipindex = (config >> 12) & 0xff;
+ coreindexonchip = (config >> 20) & 0xff;
+
+ aux_buf = perf_aux_output_begin(&htm_ctx->handle, event);
+ if (!aux_buf)
+ return -1;
+
+ if (!aux_buf->collect_htm_trace) {
+ perf_aux_output_end(&htm_ctx->handle, 0);
+ return 0;
+ }
+
+ if (!aux_buf->htm_stopped) {
+ do {
+ rc = htm_hcall_wrapper(htmflags, nodeindex, nodalchipindex, coreindexonchip,
+ htmtype, H_HTM_OP_STOP, 0, 0, 0);
+ ret = htm_return_check(rc);
+ } while (ret == -EBUSY && ++retries < 100);
+
+ if (ret > 0) {
+ /* HTM stopped trace collection */
+ aux_buf->htm_stopped = 1;
+ } else {
+ /* Failed to stop tracing, don't proceed to trace collection */
+ perf_aux_output_end(&htm_ctx->handle, 0);
+ return ret;
+ }
+ /* Reset the retries */
+ retries = 0;
+ }
+
+ /*
+ * Invoke H_HTM call with:
+ * - operation as htm dump (H_HTM_OP_DUMP_DATA)
+ * - last three values are address, size and offset
+ */
+ if (aux_buf->collect_htm_trace) {
+ do {
+ rc = htm_hcall_wrapper(htmflags, nodeindex, nodalchipindex, coreindexonchip,
+ htmtype, H_HTM_OP_DUMP_DATA, virt_to_phys(aux_buf->base),
+ (aux_buf->nr_pages * PAGE_SIZE), aux_buf->head);
+ ret = htm_return_check(rc);
+ } while (ret == -EBUSY && ++retries < 100);
+
+ if (ret > 0) {
+ aux_buf->head += (aux_buf->nr_pages * PAGE_SIZE);
+ perf_aux_output_end(&htm_ctx->handle, (aux_buf->nr_pages * PAGE_SIZE));
+ } else {
+ aux_buf->collect_htm_trace = 0;
+ perf_aux_output_end(&htm_ctx->handle, 0);
+ }
+ }
+
+ return ret;
+}
+
static int htm_event_init(struct perf_event *event)
{
struct hw_perf_event *hwc = &event->hw;
@@ -262,7 +347,77 @@ static void htm_event_del(struct perf_event *event, int flags)
*/
static void htm_event_read(struct perf_event *event)
{
- return;
+ int ret;
+
+ if (event->state != PERF_EVENT_STATE_ACTIVE)
+ return;
+
+ ret = htm_dump_sample_data(event);
+
+ if (ret <= 0)
+ local64_set(&event->count, 0);
+ else
+ local64_set(&event->count, 1);
+}
+
+/*
+ * Set up pmu-private data structures for an AUX area
+ * **pages contains the aux buffer allocated for this event
+ * for the corresponding cpu. rb_alloc_aux uses "alloc_pages_node"
+ * and returns pointer to each page address. Map these pages to
+ * contiguous space using vmap and use that as base address.
+ *
+ * The aux private data structure ie, "struct htm_pmu_buf" mainly
+ * saves
+ * - buf->base: aux buffer base address
+ * - buf->head: offset from base address where data will be written to.
+ * - buf->size: Size of allocated memory
+ */
+static void *htm_setup_aux(struct perf_event *event, void **pages,
+ int nr_pages, bool snapshot)
+{
+ int cpu = event->cpu;
+ struct htm_pmu_buf *buf;
+
+ /* We need at least one page for this to work. */
+ if (!nr_pages)
+ return NULL;
+
+ if (cpu == -1)
+ cpu = raw_smp_processor_id();
+
+ buf = kzalloc_node(sizeof(*buf), GFP_KERNEL, cpu_to_node(cpu));
+ if (!buf)
+ return NULL;
+
+ buf->base = pages[0];
+
+ if (!buf->base) {
+ kfree(buf);
+ return NULL;
+ }
+
+ buf->nr_pages = nr_pages;
+ buf->snapshot = false;
+ buf->size = nr_pages << PAGE_SHIFT;
+ buf->head = 0;
+ buf->head_size = 0;
+ buf->htm_stopped = 0;
+ buf->collect_htm_trace = 1;
+ return buf;
+}
+
+/*
+ * free pmu-private AUX data structures
+ */
+static void htm_free_aux(void *aux)
+{
+ struct htm_pmu_buf *buf = aux;
+
+ if (!buf)
+ return;
+
+ kfree(buf);
}
static void htm_event_start(struct perf_event *event, int flags)
@@ -284,7 +439,10 @@ static struct pmu htm_pmu = {
.read = htm_event_read,
.start = htm_event_start,
.stop = htm_event_stop,
- .capabilities = PERF_PMU_CAP_NO_EXCLUDE | PERF_PMU_CAP_EXCLUSIVE,
+ .setup_aux = htm_setup_aux,
+ .free_aux = htm_free_aux,
+ .capabilities = PERF_PMU_CAP_NO_EXCLUDE | PERF_PMU_CAP_EXCLUSIVE
+ | PERF_PMU_CAP_AUX_NO_SG | PERF_PMU_CAP_AUX_PREFER_LARGE,
};
static int htm_init(void)
--
2.52.0
* [PATCH 3/5] powerpc/perf: Capture the HTM memory configuration as part of perf data
From: Athira Rajeev @ 2026-07-01 8:38 UTC
To: linuxppc-dev, maddy
Cc: linux-perf-users, atrajeev, hbathini, tejas05, venkat88, tshah
The H_HTM (Hardware Trace Macro) hypervisor call can also capture
the system memory configuration. This information helps to
understand the physical to logical real address mapping for the
logical partitions in the system.
Along with saving HTM trace data, add support to capture
the memory mapping information using the same hcall.
This patch adds support in the perf driver to expose the HTM memory
configuration as part of perf.data.
When monitoring the htm PMU, the auxiliary (AUX) buffer captures
both the trace data and the system memory configuration, which
are post-processed later using perf. The size of the memory
mapping data depends on how large the system is and how much
memory is allocated. To help locate the memory mapping data in
the AUX buffer, insert two PERF_SAMPLE_RAW records in the
ring buffer. The first PERF_SAMPLE_RAW record marks the
beginning of the system memory mapping data in the AUX buffer. The
second PERF_SAMPLE_RAW record is written at the end to mark the
end of the data in the AUX buffer and also contains the total size
of the memory map data. These raw sample records
are used during post-processing in perf report.
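The size of each memory configuration chunk is derived in the diff below from a header in the hcall output buffer: a big-endian 64-bit value at offset 0x8 gives the next offset to read from, a big-endian 64-bit entry count sits at offset 0x10, and the chunk occupies 32 bytes plus 32 bytes per entry. A small userspace sketch of that arithmetic follows; the offsets and sizes are copied from htm_collect_memory_config, while the function and macro names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Layout constants as used by htm_collect_memory_config in this patch. */
#define HTM_MEM_NEXT_OFFSET	0x8	/* next offset to read from */
#define HTM_MEM_NUM_ENTRIES	0x10	/* number of entries in this chunk */
#define HTM_MEM_HEADER_SIZE	32
#define HTM_MEM_ENTRY_SIZE	32

/* Read a big-endian 64-bit value regardless of host endianness. */
uint64_t htm_read_be64(const unsigned char *p)
{
	uint64_t v = 0;

	for (size_t i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Bytes to copy into the AUX buffer for this chunk: header plus entries. */
uint64_t htm_mem_chunk_size(const unsigned char *buf)
{
	uint64_t entries = htm_read_be64(buf + HTM_MEM_NUM_ENTRIES);

	return HTM_MEM_HEADER_SIZE + entries * HTM_MEM_ENTRY_SIZE;
}

/* Offset to pass to the next hcall when more data remains. */
uint64_t htm_mem_next_offset(const unsigned char *buf)
{
	return htm_read_be64(buf + HTM_MEM_NEXT_OFFSET);
}
```

For example, a chunk reporting three entries occupies 32 + 3 * 32 = 128 bytes.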
Signed-off-by: Athira Rajeev <atrajeev@linux.ibm.com>
---
arch/powerpc/perf/htm-perf.c | 110 ++++++++++++++++++++++++++++++++++-
1 file changed, 108 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/perf/htm-perf.c b/arch/powerpc/perf/htm-perf.c
index ae7f469b6840..fe458bc3ec05 100644
--- a/arch/powerpc/perf/htm-perf.c
+++ b/arch/powerpc/perf/htm-perf.c
@@ -76,6 +76,10 @@ struct htm_pmu_buf {
bool full;
int htm_stopped;
int collect_htm_trace;
+ u64 mem_head;
+ void *htm_mem_buf;
+ u64 mem_start;
+ int collect_htm_mem;
};
struct htm_pmu_ctx {
@@ -143,6 +147,86 @@ static ssize_t htm_return_check(int rc)
return -EINVAL;
}
+static int htm_collect_memory_config(struct perf_event *event,
+ struct htm_pmu_buf *aux_buf)
+{
+ struct perf_sample_data data;
+ struct perf_raw_record raw;
+ struct pt_regs regs;
+ u64 *num_entries;
+ u64 to_copy = 0;
+ int htm_val;
+ long rc;
+ int ret;
+ int retries = 0;
+ size_t size;
+ size_t space_to_end = aux_buf->size - aux_buf->mem_head;
+
+ /* Capture HTM system memory configuration in aux buffer */
+ do {
+ rc = htm_hcall_wrapper(htmflags, 0, 0, 0,
+ 0, H_HTM_OP_DUMP_SYSMEM_CONF, virt_to_phys(aux_buf->htm_mem_buf),
+ PAGE_SIZE, aux_buf->mem_start);
+ ret = htm_return_check(rc);
+ } while (ret == -EBUSY && ++retries < 100);
+
+ /* Return once there is no more data in HTM buffer */
+ if (ret <= 0) {
+ perf_sample_data_init(&data, 0, event->hw.last_period);
+ memset(&raw, 0, sizeof(raw));
+ memset(&regs, 0, sizeof(regs));
+
+ htm_val = (aux_buf->head/((aux_buf->nr_pages * PAGE_SIZE)));
+ raw.frag.data = &htm_val;
+ raw.frag.size = sizeof(htm_val);
+
+ aux_buf->collect_htm_mem = 0;
+ perf_sample_save_raw_data(&data, event, &raw);
+ perf_event_overflow(event, &data, &regs);
+ return 0;
+ }
+
+ /*
+ * Find how much data to copy to aux buffer
+ * If hcall returned H_PARTIAL, set mem_start to
+ * indicate next offset of memory to read from
+ */
+ num_entries = aux_buf->htm_mem_buf + 0x10;
+ aux_buf->mem_start = be64_to_cpu(*(u64 *)(aux_buf->htm_mem_buf + 0x8));
+
+ to_copy = 32 + (be64_to_cpu(*num_entries) * 32);
+
+ if (to_copy <= space_to_end) {
+ if ((to_copy + aux_buf->mem_head) >= ((aux_buf->nr_pages * PAGE_SIZE)/2)) {
+ /*
+ * Crossing 50% threshold - flush and wrap.
+ * Write current chunk, then pad to end of buffer.
+ * This ensures next write starts at beginning with
+ * perf head also at beginning (synchronized).
+ */
+ memcpy(aux_buf->base + aux_buf->mem_head, aux_buf->htm_mem_buf, to_copy);
+ aux_buf->mem_head = 0;
+
+ /*
+ * Return space_to_end to include padding.
+ * Perf will advance head to end (wrapping to 0),
+ * matching our mem_head position.
+ */
+ size = space_to_end;
+ } else {
+ /* Normal case - chunk fits without crossing threshold */
+ memcpy(aux_buf->base + aux_buf->mem_head, aux_buf->htm_mem_buf, to_copy);
+ aux_buf->mem_head += to_copy;
+ size = to_copy;
+ }
+ } else {
+ return 0;
+ }
+
+ /* Return non-zero to indicate that one record is written to aux buffer */
+ return size;
+}
+
static int htm_dump_sample_data(struct perf_event *event)
{
struct htm_pmu_ctx *htm_ctx = this_cpu_ptr(&htm_pmu_ctx);
@@ -162,7 +246,7 @@ static int htm_dump_sample_data(struct perf_event *event)
if (!aux_buf)
return -1;
- if (!aux_buf->collect_htm_trace) {
+ if (!aux_buf->collect_htm_mem && !aux_buf->collect_htm_trace) {
perf_aux_output_end(&htm_ctx->handle, 0);
return 0;
}
@@ -202,12 +286,17 @@ static int htm_dump_sample_data(struct perf_event *event)
if (ret > 0) {
aux_buf->head += (aux_buf->nr_pages * PAGE_SIZE);
perf_aux_output_end(&htm_ctx->handle, (aux_buf->nr_pages * PAGE_SIZE));
+ return ret;
} else {
aux_buf->collect_htm_trace = 0;
- perf_aux_output_end(&htm_ctx->handle, 0);
}
}
+ if (aux_buf->collect_htm_mem) {
+ ret = htm_collect_memory_config(event, aux_buf);
+ perf_aux_output_end(&htm_ctx->handle, ret);
+ }
+
return ret;
}
@@ -397,6 +486,13 @@ static void *htm_setup_aux(struct perf_event *event, void **pages,
return NULL;
}
+ buf->htm_mem_buf = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu));
+ if (!buf->htm_mem_buf) {
+ kfree(buf);
+ pr_err("Failed to allocate htm mem buf\n");
+ return NULL;
+ }
+
buf->nr_pages = nr_pages;
buf->snapshot = false;
buf->size = nr_pages << PAGE_SHIFT;
@@ -404,6 +500,9 @@ static void *htm_setup_aux(struct perf_event *event, void **pages,
buf->head_size = 0;
buf->htm_stopped = 0;
buf->collect_htm_trace = 1;
+ buf->mem_head = 0;
+ buf->collect_htm_mem = 1;
+ buf->mem_start = 0;
return buf;
}
@@ -413,10 +512,17 @@ static void *htm_setup_aux(struct perf_event *event, void **pages,
static void htm_free_aux(void *aux)
{
struct htm_pmu_buf *buf = aux;
+ void *free_mem;
if (!buf)
return;
+ free_mem = buf->htm_mem_buf;
+ buf->htm_mem_buf = NULL;
+
+ smp_mb();
+
+ kfree(free_mem);
kfree(buf);
}
--
2.52.0
* [PATCH 4/5] docs: ABI: sysfs-bus-event_source-devices-htm: Document sysfs event format entries for htm pmu
From: Athira Rajeev @ 2026-07-01 8:38 UTC
To: linuxppc-dev, maddy
Cc: linux-perf-users, atrajeev, hbathini, tejas05, venkat88, tshah
Document the htm PMU event and format attributes in the ABI
documentation.
Signed-off-by: Athira Rajeev <atrajeev@linux.ibm.com>
---
.../sysfs-bus-event_source-devices-htm | 21 +++++++++++++++++++
1 file changed, 21 insertions(+)
create mode 100644 Documentation/ABI/testing/sysfs-bus-event_source-devices-htm
diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-htm b/Documentation/ABI/testing/sysfs-bus-event_source-devices-htm
new file mode 100644
index 000000000000..784ba7c31b89
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-htm
@@ -0,0 +1,21 @@
+What: /sys/bus/event_source/devices/htm/format
+Date: June 2026
+Contact: Linux on PowerPC Developer List <linuxppc-dev at lists.ozlabs.org>
+Description: Read-only. Attribute group to describe the magic bits
+ that go into perf_event_attr.config for a particular pmu.
+ (See ABI/testing/sysfs-bus-event_source-devices-format).
+
+ Each attribute under this group defines a bit range of the
+ perf_event_attr.config. Supported attributes are listed
+ below::
+
+ event = "config:0-27" - event ID
+
+What: /sys/bus/event_source/devices/htm/events
+Date: June 2026
+Contact: Linux on PowerPC Developer List <linuxppc-dev at lists.ozlabs.org>
+Description: (RO) Attribute group to describe performance monitoring events
+ for the Hardware Trace Macro (HTM) trace. Each attribute in
+ this group describes a single performance monitoring event
+ supported by htm pmu. The name of the file is the name of
+ the event (See ABI/testing/sysfs-bus-event_source-devices-events).
--
2.52.0
* [PATCH 5/5] powerpc/perf/htm: Add documentation for Hardware Trace Macro PMU
From: Athira Rajeev @ 2026-07-01 8:38 UTC
To: linuxppc-dev, maddy
Cc: linux-perf-users, atrajeev, hbathini, tejas05, venkat88, tshah
Add documentation for the htm (Hardware Trace Macro) PMU interface:
how to use it to collect HTM trace entries in perf data, and how to
process and report them with perf report and perf script.
Signed-off-by: Athira Rajeev <atrajeev@linux.ibm.com>
---
Documentation/arch/powerpc/htm.rst | 137 ++++++++++++++++++++++++++++-
1 file changed, 134 insertions(+), 3 deletions(-)
diff --git a/Documentation/arch/powerpc/htm.rst b/Documentation/arch/powerpc/htm.rst
index fcb4eb6306b1..f9dceffb93c6 100644
--- a/Documentation/arch/powerpc/htm.rst
+++ b/Documentation/arch/powerpc/htm.rst
@@ -18,9 +18,10 @@ H_HTM is used as an interface for executing Hardware Trace Macro (HTM)
functions, including setup, configuration, control and dumping of the HTM data.
For using HTM, it is required to setup HTM buffers and HTM operations can
be controlled using the H_HTM hcall. The hcall can be invoked for any core/chip
-of the system from within a partition itself. To use this feature, a debugfs
-folder called "htmdump" is present under /sys/kernel/debug/powerpc.
+of the system from within a partition itself.
+To use this feature, a debugfs folder called "htmdump" is present under
+/sys/kernel/debug/powerpc. Another interface is via perf.
HTM debugfs example usage
=========================
@@ -94,7 +95,137 @@ This trace file will contain the relevant instruction traces
collected during the workload execution. And can be used as
input file for trace decoders to understand data.
-Benefits of using HTM debugfs interface
+HTM perf interface usage
+========================
+
+The HTM (Hardware Trace Macro) perf interface enables collection and analysis
+of hardware trace data from PowerPC systems. This interface allows users to
+capture detailed execution traces for performance analysis and debugging.
+
+Event Configuration
+-------------------
+
+Use ``perf record`` with the htm PMU event. The event is configured using
+named parameters that specify the target hardware location and trace type:
+
+.. list-table::
+ :header-rows: 1
+ :widths: 25 75
+
+ * - Parameter
+ - Description
+ * - htm_type
+ - Type of HTM trace to collect (bits 0-3)
+ * - nodeindex
+ - Node index in the system topology (bits 4-11)
+ * - nodalchipindex
+ - Chip index within the specified node (bits 12-19)
+ * - coreindexonchip
+ - Core index on the specified chip (bits 20-27)
+
+- event: "config:0-27"
+- htm_type: "config:0-3"
+- nodeindex: "config:4-11"
+- nodalchipindex: "config:12-19"
+- coreindexonchip: "config:20-27"
+
+1) nodeindex, nodalchipindex, coreindexonchip: this specifies
+ which partition to configure the HTM for.
+2) htm_type: specifies the type of HTM.
+
+Event Syntax
+------------
+
+The event configuration uses named parameters::
+
+ htm/nodeindex=N,nodalchipindex=C,coreindexonchip=R,htm_type=T/
+
+Where:
+
+- N = node index
+- C = chip index within the node
+- R = core index on the chip
+- T = HTM type
+
+Basic Usage Example
+-------------------
+
+To collect HTM trace data for a specific chip:
+
+.. code-block:: sh
+
+ # perf record -C 1 -e htm/nodalchipindex=2,nodeindex=0,htm_type=1/ <workload>
+
+In this example:
+
+- ``-C 1``: Collect on CPU 1
+- ``nodeindex=0``: Target node 0
+- ``nodalchipindex=2``: Target chip 2 within node 0
+- ``htm_type=1``: HTM trace type 1
+
+Output Files
+------------
+
+After running ``perf record``, the following files are generated:
+
+.. code-block:: sh
+
+ # ls htm.bin.*
+ htm.bin.n0.p2.c0 htm.bin.n1.p3.c0 # Binary trace files
+
+ # ls translation.*
+ translation.n0.p2.c0 translation.n1.p3.c0 # Memory configuration files
+
+These files contain:
+
+- **htm.bin.*** - Raw HTM trace data in binary format
+- **translation.*** - Memory address translation information for decoding
+
+Trace Data Processing
+---------------------
+
+Process the collected trace data using perf script:
+
+.. code-block:: sh
+
+ # perf script -D
+
+This command:
+
+1. Reads the perf.data file
+2. Decodes HTM trace data using translation files
+3. Displays human-readable trace output
+4. Shows instruction addresses and execution flow
+
+The decoder automatically:
+
+- Translates physical addresses to logical addresses
+- Creates decoded output files for analysis
+- Correlates trace data with memory mappings
+
+Complete Workflow Example
+--------------------------
+
+Here's a complete example of collecting and analyzing HTM traces:
+
+.. code-block:: sh
+
+ # Step 1: Collect trace data
+ perf record -C 1 -e htm/nodalchipindex=2,nodeindex=0,htm_type=1/ sleep 5
+
+ # Step 2: Verify output files
+ ls htm.bin.* # Binary trace files
+ ls translation.* # Memory configuration files
+ ls perf.data # Perf data file
+
+ # Step 3: Decode and view traces
+ perf script -D > decoded_trace.txt
+
+ # Step 4: Analyze with perf report to see the hot logical address
+ perf report
+
+
+Benefits of using HTM interface
=======================================
It is now possible to collect traces for a particular core/chip
--
2.52.0
* Re: [PATCH 1/5] powerpc/htm: Add interface to expose HTM trace data via perf
2026-07-01 8:38 ` [PATCH 1/5] " Athira Rajeev
@ 2026-07-01 8:50 ` sashiko-bot
0 siblings, 0 replies; 9+ messages in thread
From: sashiko-bot @ 2026-07-01 8:50 UTC (permalink / raw)
To: Athira Rajeev; +Cc: linux-perf-users
Thank you for your contribution! Sashiko AI review found 7 potential issue(s) to consider:
- [High] Missing Firmware Feature Check allows PMU to register on bare-metal systems.
- [High] Improper Error Handling treats unhandled hypervisor errors as success.
- [High] Resource leak when `H_HTM_OP_START` fails.
- [High] Missing cpumask attribute leads to concurrent identical hypercalls.
- [Medium] Retry loops spin endlessly on deterministic errors.
- [Medium] PMU API violation for start/stop operations.
- [Medium] Improper handling of `sample_freq` corrupts the sampling logic.
--
commit 400fb092e994f5871c30e25b796b6ab855238ea4
Author: Athira Rajeev <atrajeev@linux.ibm.com>
powerpc/htm: Add interface to expose HTM trace data via perf
H_HTM (Hardware Trace Macro) hypervisor call is an HCALL to export data
from Hardware Trace Macro (HTM) function. Add support for setup,
configuration and control of HTM function via PMU.
> diff --git a/arch/powerpc/perf/htm-perf.c b/arch/powerpc/perf/htm-perf.c
> new file mode 100644
> --- /dev/null
> +++ b/arch/powerpc/perf/htm-perf.c
> [ ... ]
> +static ssize_t htm_return_check(int rc)
> +{
> + switch (rc) {
> + case H_SUCCESS:
> + break;
> [ ... ]
> + case H_AUTHORITY:
> + return -EPERM;
> + }
> +
> + /*
> + * Return 1 for H_SUCCESS/H_PARTIAL
> + */
> + return 1;
> +out:
> + return -EINVAL;
> +}
[Severity: High]
Will unhandled hypercall return codes fall through the switch statement
and return 1 (success)? If the hypervisor returns an unhandled error code,
this code might silently interpret critical failures as successes.
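One way to close this gap is a default arm that maps any unrecognized
code to an error. The user-space sketch below only illustrates the idea:
the HV_* constants and hv_return_check() are hypothetical stand-ins, not
the PAPR return values or the driver's actual function.

```c
#include <errno.h>

/* Hypothetical stand-ins for hypervisor return codes (not the PAPR values). */
enum hv_rc {
	HV_SUCCESS = 0,
	HV_PARTIAL = 1,
	HV_BUSY = 2,
	HV_PARAMETER = -4,
	HV_AUTHORITY = -9,
};

/*
 * Map a hypervisor return code to 1 (call succeeded) or a negative errno.
 * Any code that is not listed explicitly is treated as a failure.
 */
static long hv_return_check(long rc)
{
	switch (rc) {
	case HV_SUCCESS:
	case HV_PARTIAL:
		return 1;
	case HV_BUSY:
		return -EBUSY;
	case HV_PARAMETER:
		return -EINVAL;
	case HV_AUTHORITY:
		return -EPERM;
	default:
		/* Unknown codes must not be reported as success. */
		return -EIO;
	}
}
```

With this shape, success is returned only from the arms that name a
success code, so a newly introduced error code cannot fall through.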
> [ ... ]
> +static int htm_event_init(struct perf_event *event)
> +{
> + struct hw_perf_event *hwc = &event->hw;
> + u64 config = event->attr.config;
> + u32 htmtype;
> [ ... ]
> + if (event->attr.freq) {
> + hwc->sample_period = event->attr.sample_period;
> + local64_set(&hwc->period_left, hwc->sample_period);
> + hwc->last_period = hwc->sample_period;
> + event->attr.freq = 0;
> + }
> +
> + return 0;
> +}
[Severity: Medium]
Does copying the target frequency into the sample period break the sampling
logic? Since frequency-based sampling does not appear to be supported, should
this return -EINVAL instead of disabling dynamic frequency adjustment?
> [ ... ]
> +static int htm_event_add(struct perf_event *event, int flags)
> +{
> + int rc, ret;
> + unsigned long param1 = -1, param2 = -1;
> + int retries = 0;
> + u64 config = event->attr.config;
> + u32 htmtype, nodeindex, nodalchipindex, coreindexonchip;
> +
> + /*
> + * Invoke H_HTM call with:
> + * operation as htm configure (H_HTM_OP_CONFIGURE)
> + * last three values are unused, hence set to zero
> + */
> + htmtype = config & 0xf;
> + nodeindex = (config >> 4) & 0xff;
> + nodalchipindex = (config >> 12) & 0xff;
> + coreindexonchip = (config >> 20) & 0xff;
> + do {
> + rc = htm_hcall_wrapper(htmflags, nodeindex, nodalchipindex, coreindexonchip,
> + htmtype, H_HTM_OP_CONFIGURE, param1, param2, 0);
> + ret = htm_return_check(rc);
> + } while (ret <= 0 && ++retries < 100);
[Severity: Medium]
Does this loop retry 100 times for deterministic errors like -EINVAL and
-EPERM? The condition ret <= 0 causes it to spin needlessly on errors that
will not succeed on retry.
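A possible shape for the loop is to retry only while the call reports
busy and to return every other result at once. The sketch below is a
user-space model under that assumption; hv_retry_on_busy(), hv_op_fn and
the two example operations are hypothetical names, not part of the patch.

```c
#include <errno.h>
#include <stddef.h>

#define MAX_RETRIES 100

/* Hypothetical operation type: returns 1 on success or a negative errno. */
typedef long (*hv_op_fn)(void *ctx);

/*
 * Retry only while the operation reports -EBUSY. Any other result,
 * success or a deterministic error, is returned immediately.
 * The number of calls made is stored in *attempts when it is non-NULL.
 */
static long hv_retry_on_busy(hv_op_fn op, void *ctx, int *attempts)
{
	long ret;
	int tries = 0;

	do {
		ret = op(ctx);
		tries++;
	} while (ret == -EBUSY && tries < MAX_RETRIES);

	if (attempts)
		*attempts = tries;
	return ret;
}

/* Example operation: fails deterministically, so it must not be retried. */
static long always_einval(void *ctx)
{
	(void)ctx;
	return -EINVAL;
}

/* Example operation: reports busy on the first two calls, then succeeds. */
static long busy_twice_then_ok(void *ctx)
{
	int *calls = ctx;

	return (*calls)++ < 2 ? -EBUSY : 1;
}
```

In this model always_einval() is called once, while busy_twice_then_ok()
is called three times before the helper returns success.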
> + if (ret <= 0)
> + return -1;
> +
> + /* Reset retries */
> + retries = 0;
> +
> + /*
> + * Invoke H_HTM call with:
> + * operation as htm start (H_HTM_OP_START)
> + * last three values are unused, hence set to zero
> + */
> + do {
> + rc = htm_hcall_wrapper(htmflags, nodeindex, nodalchipindex, coreindexonchip,
> + htmtype, H_HTM_OP_START, 0, 0, 0);
> + ret = htm_return_check(rc);
> + } while (ret == -EBUSY && ++retries < 100);
> +
> + if (htm_return_check(rc) <= 0)
> + return -1;
[Severity: High]
If H_HTM_OP_START fails, does this return without cleaning up the
configuration? Because event_add fails, the perf core aborts event creation
and won't call event_del, potentially leaving the hypervisor state permanently
configured and leaking tracing resources.
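A common way to avoid such a leak is to undo the configure step when the
start step fails. The sketch below models that unwind in user space;
struct fake_htm and the fake_* helpers are hypothetical and only stand in
for the H_HTM configure, start and deconfigure operations.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the hypervisor-side HTM state. */
struct fake_htm {
	bool configured;
	bool started;
	bool start_fails;	/* force the start step to fail */
};

static int fake_configure(struct fake_htm *h)
{
	h->configured = true;
	return 0;
}

static void fake_deconfigure(struct fake_htm *h)
{
	h->configured = false;
}

static int fake_start(struct fake_htm *h)
{
	if (h->start_fails)
		return -EIO;
	h->started = true;
	return 0;
}

/* Configure, then start; roll back the configuration if start fails. */
static int fake_event_add(struct fake_htm *h)
{
	int ret;

	ret = fake_configure(h);
	if (ret)
		return ret;

	ret = fake_start(h);
	if (ret) {
		/* No del callback will run, so clean up here. */
		fake_deconfigure(h);
		return ret;
	}

	return 0;
}
```

After a failed fake_event_add() the model is left unconfigured, which is
the state the real hypervisor would need to be returned to.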
[Severity: Medium]
Is it expected to execute the start operation unconditionally during event_add
instead of using the standard pmu->start callback? If the event is created in
a disabled state, it will immediately start tracing anyway.
> + return 0;
> +}
> +
> +static void htm_event_del(struct perf_event *event, int flags)
> +{
> + long rc;
> + int ret;
> + int retries = 0;
> + u64 config = event->attr.config;
> + u32 htmtype, nodeindex, nodalchipindex, coreindexonchip;
> [ ... ]
> + /* Reset retries */
> + retries = 0;
> +
> + /*
> + * Invoke H_HTM call with:
> + * operation as htm configure (H_HTM_OP_DECONFIGURE)
> + * last three values are unused, hence set to zero
> + */
> + do {
> + rc = htm_hcall_wrapper(htmflags, nodeindex, nodalchipindex, coreindexonchip,
> + htmtype, H_HTM_OP_DECONFIGURE, 0, 0, 0);
> + ret = htm_return_check(rc);
> + } while (ret <= 0 && ++retries < 100);
> +}
[Severity: Medium]
Similar to the configure path, does this loop needlessly spin on deterministic
errors like -EINVAL and -EPERM?
> [ ... ]
> +static void htm_event_stop(struct perf_event *event, int flags)
> +{
> +}
[Severity: Medium]
Does providing an empty stop callback prevent users from disabling tracing via
the standard perf API? ioctl(PERF_EVENT_IOC_DISABLE) relies on this callback
to pause the hardware.
> +static struct pmu htm_pmu = {
> + .task_ctx_nr = perf_invalid_context,
> +
> + .name = "htm",
> + .attr_groups = attr_groups,
> + .event_init = htm_event_init,
> + .add = htm_event_add,
> + .del = htm_event_del,
> + .read = htm_event_read,
> + .start = htm_event_start,
> + .stop = htm_event_stop,
> + .capabilities = PERF_PMU_CAP_NO_EXCLUDE | PERF_PMU_CAP_EXCLUSIVE,
> +};
[Severity: High]
Does this system-wide PMU require a cpumask attribute to prevent concurrent
access? Without a cpumask, running perf stat -a might attempt to open and
start the HTM event on every CPU simultaneously, leading to race conditions
on the hypervisor side for the same node/chip target.
> +static int htm_init(void)
> +{
> + int r;
> +
> + /* This driver is intended only for L1 host. */
> + if (is_kvm_guest()) {
> + pr_debug("Only supported for L1 host system\n");
> + return -ENODEV;
> + }
> +
> + r = perf_pmu_register(&htm_pmu, htm_pmu.name, -1);
[Severity: High]
Can this PMU register on bare-metal systems (PowerNV/OPAL) where the PAPR
hypervisor isn't present? It might be necessary to verify the presence of the
hypervisor using firmware_has_feature(FW_FEATURE_LPAR) to prevent system
traps or panics when executing H_HTM.
> + if (r)
> + return r;
> +
> + return 0;
> +}
> +
> +device_initcall(htm_init);
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260701083806.79358-1-atrajeev@linux.ibm.com?part=1
* Re: [PATCH 2/5] powerpc/htm: Add support to setup and free aux buffer for capturing HTM data
2026-07-01 8:38 ` [PATCH 2/5] powerpc/htm: Add support to setup and free aux buffer for capturing HTM data Athira Rajeev
@ 2026-07-01 8:50 ` sashiko-bot
0 siblings, 0 replies; 9+ messages in thread
From: sashiko-bot @ 2026-07-01 8:50 UTC (permalink / raw)
To: Athira Rajeev; +Cc: linux-perf-users
Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [High] The driver violates the perf AUX ring buffer API by writing trace data to independent offsets, ignoring the core perf ring buffer head.
- [Medium] Signed integer overflow when calculating the AUX buffer size leads to a massive 64-bit value.
--
commit 10ce4df3e9571041276c72fea1e10e3955d5eea1
Author: Athira Rajeev <atrajeev@linux.ibm.com>
Subject: powerpc/htm: Add support to setup and free aux buffer for capturing HTM data
This commit adds PMU callbacks for setup_aux and free_aux to enable
capturing HTM trace data into a perf auxiliary buffer. It allocates
memory and records trace data directly into the provided pages for
later post-processing.
> diff --git a/arch/powerpc/perf/htm-perf.c b/arch/powerpc/perf/htm-perf.c
> index e22a7fdce2f5d..ae7f469b6840d 100644
> --- a/arch/powerpc/perf/htm-perf.c
> +++ b/arch/powerpc/perf/htm-perf.c
[ ... ]
> @@ -126,6 +143,74 @@ static ssize_t htm_return_check(int rc)
> return -EINVAL;
> }
>
> +static int htm_dump_sample_data(struct perf_event *event)
> +{
[ ... ]
> + /*
> + * Invoke H_HTM call with:
> + * - operation as htm dump (H_HTM_OP_DUMP_DATA)
> + * - last three values are address, size and offset
> + */
> + if (aux_buf->collect_htm_trace) {
> + do {
> + rc = htm_hcall_wrapper(htmflags, nodeindex, nodalchipindex, coreindexonchip,
> + htmtype, H_HTM_OP_DUMP_DATA, virt_to_phys(aux_buf->base),
> + (aux_buf->nr_pages * PAGE_SIZE), aux_buf->head);
> + ret = htm_return_check(rc);
> + } while (ret == -EBUSY && ++retries < 100);
> +
> + if (ret > 0) {
> + aux_buf->head += (aux_buf->nr_pages * PAGE_SIZE);
> + perf_aux_output_end(&htm_ctx->handle, (aux_buf->nr_pages * PAGE_SIZE));
[Severity: High]
Does this code write trace data to independent offsets, ignoring the core
perf ring buffer head?
If the user does not consume data fast enough, perf_aux_output_end() will
truncate the write to the available space. Because the driver ignores
htm_ctx->handle.head (the true write offset within the ring buffer) and
tracks its own offsets via aux_buf->base and aux_buf->head, could any
truncation desynchronize the driver from the perf core?
This could lead to subsequent writes going to the wrong offset while
userspace is told the data is at a different offset, resulting in silent
corruption of the perf AUX trace data.
> + } else {
> + aux_buf->collect_htm_trace = 0;
> + perf_aux_output_end(&htm_ctx->handle, 0);
> + }
> + }
> +
> + return ret;
> +}
[ ... ]
> +static void *htm_setup_aux(struct perf_event *event, void **pages,
> + int nr_pages, bool snapshot)
> +{
> + int cpu = event->cpu;
> + struct htm_pmu_buf *buf;
> +
> + /* We need at least one page for this to work. */
> + if (!nr_pages)
> + return NULL;
> +
> + if (cpu == -1)
> + cpu = raw_smp_processor_id();
> +
> + buf = kzalloc_node(sizeof(*buf), GFP_KERNEL, cpu_to_node(cpu));
> + if (!buf)
> + return NULL;
> +
> + buf->base = pages[0];
> +
> + if (!buf->base) {
> + kfree(buf);
> + return NULL;
> + }
> +
> + buf->nr_pages = nr_pages;
> + buf->snapshot = false;
> + buf->size = nr_pages << PAGE_SHIFT;
[Severity: Medium]
Can the size calculation overflow for large values of nr_pages?
Since both nr_pages and PAGE_SHIFT evaluate as 32-bit signed integers,
a large AUX buffer allocation (e.g., nr_pages = 32768 with 64K pages, where
PAGE_SHIFT is 16) could overflow into the sign bit. When assigned to the u64
size field, this would sign-extend to a massive 64-bit value.
Could this result in bypassing internal bounds checks when the size is
later returned to the perf subsystem?
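To make the arithmetic concrete: with a page shift of 16, 32768 << 16
equals 2^31, one more than the largest 32-bit signed value. Widening the
operand before the shift avoids the problem. The sketch below assumes
64K pages; EXAMPLE_PAGE_SHIFT and aux_buf_size() are hypothetical names
used only for illustration.

```c
#include <stdint.h>

/* Assumed for illustration: 64K pages, i.e. a page shift of 16. */
#define EXAMPLE_PAGE_SHIFT 16

/*
 * Widen nr_pages to 64 bits before shifting so that the result
 * cannot overflow a 32-bit int.
 */
static uint64_t aux_buf_size(int nr_pages)
{
	return (uint64_t)nr_pages << EXAMPLE_PAGE_SHIFT;
}
```

Here aux_buf_size(32768) evaluates to 2147483648 (2 GiB) rather than a
sign-extended value.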
> + buf->head = 0;
> + buf->head_size = 0;
> + buf->htm_stopped = 0;
> + buf->collect_htm_trace = 1;
> + return buf;
> +}
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260701083806.79358-1-atrajeev@linux.ibm.com?part=2
* Re: [PATCH 3/5] powerpc/perf: Capture the HTM memory configuration as part of perf data
2026-07-01 8:38 ` [PATCH 3/5] powerpc/perf: Capture the HTM memory configuration as part of perf data Athira Rajeev
@ 2026-07-01 9:02 ` sashiko-bot
0 siblings, 0 replies; 9+ messages in thread
From: sashiko-bot @ 2026-07-01 9:02 UTC (permalink / raw)
To: Athira Rajeev; +Cc: linux-perf-users
Thank you for your contribution! Sashiko AI review found 6 potential issue(s) to consider:
New issues:
- [High] Missing bounds check on the hypervisor-provided `num_entries` allows `to_copy` to exceed `PAGE_SIZE`, leading to an out-of-bounds read from `htm_mem_buf`.
- [High] The PMU driver assumes the AUX buffer pages are virtually contiguous and uses `memcpy` to write sequentially into them, causing an out-of-bounds write that corrupts kernel memory.
- [High] Chunks of memory configuration data are permanently lost if there is insufficient space in the buffer (`to_copy > space_to_end`).
- [Medium] The implementation fails to write the starting `PERF_SAMPLE_RAW` record and exports incorrect size data in the ending record, violating the documented API contract.
- [Medium] When wrapping the memory map trace data at the 50% threshold, the remaining padding space in the buffer is left uninitialized but committed to userspace.
Pre-existing issues:
- [Critical] Pre-existing logic in `htm_dump_sample_data()` passes a non-contiguous physical address base to the hypervisor, causing massive physical memory corruption during the trace dump.
--
commit 651e2e981cb409f60d68fe643ec0cd48a6edd6e7
Author: Athira Rajeev <atrajeev@linux.ibm.com>
Subject: powerpc/perf: Capture the HTM memory configuration as part of perf data
This commit modifies the HTM PMU driver to capture system memory configuration
data in the perf AUX buffer. It uses the H_HTM_OP_DUMP_SYSMEM_CONF hypervisor
call to retrieve the data and demarcates it using PERF_SAMPLE_RAW records.
> When monitoring the HTM pmu, auxiliary buffer captures
> the "trace" data and SystemMemory Configuration. This
> will be post processed later using perf. The size of memory
> mapping data captured depends on how large is the system
> and how much memory is allocated. To help with relating
> and identifying the start of memory mapping data in the
> auxiliary buffer, insert two PERF_SAMPLE_RAW records in the
> ring buffer. First PERF_SAMPLE_RAW record will mark the
> beginning of system memory mapping data in aux buffer. And second
> PERF_SAMPLE_RAW record will be written at the end to make the
> end of the data in aux buffer and also contains the total size
> of the memory map data. These sample raw records
> will be used during post processing in perf report.
[Severity: Medium]
Is the start PERF_SAMPLE_RAW record missing entirely from the code, contrary
to this commit message?
> diff --git a/arch/powerpc/perf/htm-perf.c b/arch/powerpc/perf/htm-perf.c
> index ae7f469b6840d..fe458bc3ec05d 100644
> --- a/arch/powerpc/perf/htm-perf.c
> +++ b/arch/powerpc/perf/htm-perf.c
[ ... ]
> @@ -143,6 +147,86 @@ static ssize_t htm_return_check(int rc)
> return -EINVAL;
> }
>
> +static int htm_collect_memory_config(struct perf_event *event,
> + struct htm_pmu_buf *aux_buf)
> +{
> + struct perf_sample_data data;
> + struct perf_raw_record raw;
> + struct pt_regs regs;
> + u64 *num_entries;
> + u64 to_copy = 0;
> + int htm_val;
> + long rc;
> + int ret;
> + int retries = 0;
> + size_t size;
> + size_t space_to_end = aux_buf->size - aux_buf->mem_head;
> +
> + /* Capture HTM system memory configuration in aux buffer */
> + do {
> + rc = htm_hcall_wrapper(htmflags, 0, 0, 0,
> + 0, H_HTM_OP_DUMP_SYSMEM_CONF, virt_to_phys(aux_buf->htm_mem_buf),
> + PAGE_SIZE, aux_buf->mem_start);
> + ret = htm_return_check(rc);
> + } while (ret == -EBUSY && ++retries < 100);
> +
> + /* Return once there is no more data in HTM buffer */
> + if (ret <= 0) {
> + perf_sample_data_init(&data, 0, event->hw.last_period);
> + memset(&raw, 0, sizeof(raw));
> + memset(&regs, 0, sizeof(regs));
> +
> + htm_val = (aux_buf->head/((aux_buf->nr_pages * PAGE_SIZE)));
> + raw.frag.data = &htm_val;
> + raw.frag.size = sizeof(htm_val);
[Severity: Medium]
Does this export the buffer wrap count instead of the total size of the
memory map data in the end record?
> +
> + aux_buf->collect_htm_mem = 0;
> + perf_sample_save_raw_data(&data, event, &raw);
> + perf_event_overflow(event, &data, &regs);
> + return 0;
> + }
> +
> + /*
> + * Find how much data to copy to aux buffer
> + * If hcall returned H_PARTIAL, set mem_start to
> + * indicate next offset of memory to read from
> + */
> + num_entries = aux_buf->htm_mem_buf + 0x10;
> + aux_buf->mem_start = be64_to_cpu(*(u64 *)(aux_buf->htm_mem_buf + 0x8));
[Severity: High]
Are chunks of memory configuration data permanently lost here?
aux_buf->mem_start is unconditionally advanced before checking if to_copy
fits in space_to_end.
If the else branch is taken at the end of this function, does the next PMU
interrupt fetch data from the already-advanced offset, dropping the current
chunk?
> +
> + to_copy = 32 + (be64_to_cpu(*num_entries) * 32);
> +
> + if (to_copy <= space_to_end) {
[Severity: High]
Does this code allow an out-of-bounds read of aux_buf->htm_mem_buf?
Since to_copy is calculated from the hypervisor-provided num_entries, if the
hypervisor returns a large value, could to_copy exceed the PAGE_SIZE
allocation of aux_buf->htm_mem_buf when it is passed to memcpy() later?
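One way to guard against this is to validate the entry count against the
size of the bounce buffer before computing the copy length. The sketch
below assumes a 4096-byte buffer and takes the 32-byte header and
32-byte entry sizes from the quoted code; sysmem_copy_len() and the
EXAMPLE_* names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define EXAMPLE_BUF_SIZE   4096u	/* assumed size of the bounce buffer */
#define EXAMPLE_HDR_SIZE   32u		/* header bytes, per the quoted code */
#define EXAMPLE_ENTRY_SIZE 32u		/* bytes per entry, per the quoted code */

/*
 * Compute header plus entries in bytes. Return false, leaving *to_copy
 * untouched, if the result would exceed the bounce buffer. Checking the
 * entry count first also avoids overflow in the multiplication.
 */
static bool sysmem_copy_len(uint64_t num_entries, uint64_t *to_copy)
{
	uint64_t max_entries =
		(EXAMPLE_BUF_SIZE - EXAMPLE_HDR_SIZE) / EXAMPLE_ENTRY_SIZE;

	if (num_entries > max_entries)
		return false;

	*to_copy = EXAMPLE_HDR_SIZE + num_entries * EXAMPLE_ENTRY_SIZE;
	return true;
}
```

With these sizes at most 127 entries fit, so an entry count of 128 or
more is rejected before any copy takes place.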
> + if ((to_copy + aux_buf->mem_head) >= ((aux_buf->nr_pages * PAGE_SIZE)/2)) {
> + /*
> + * Crossing 50% threshold - flush and wrap.
> + * Write current chunk, then pad to end of buffer.
> + * This ensures next write starts at beginning with
> + * perf head also at beginning (synchronized).
> + */
> + memcpy(aux_buf->base + aux_buf->mem_head, aux_buf->htm_mem_buf, to_copy);
> + aux_buf->mem_head = 0;
> +
> + /*
> + * Return space_to_end to include padding.
> + * Perf will advance head to end (wrapping to 0),
> + * matching our mem_head position.
> + */
> + size = space_to_end;
[Severity: Medium]
Does this leak uninitialized padding to userspace?
When returning space_to_end, perf will advance the ring buffer head to the
end, but the gap of space_to_end - to_copy bytes is not zeroed out before
being committed.
> + } else {
> + /* Normal case - chunk fits without crossing threshold */
> + memcpy(aux_buf->base + aux_buf->mem_head, aux_buf->htm_mem_buf, to_copy);
[Severity: High]
Does this memcpy() write out of bounds?
Since aux_buf->base is only initialized to the first page in htm_setup_aux(),
the AUX buffer pages may not be contiguous in virtual memory.
Could this write past the end of the first page?
> + aux_buf->mem_head += to_copy;
> + size = to_copy;
> + }
> + } else {
> + return 0;
> + }
> +
> + /* Return non-zero to indicate that one record is written to aux buffer */
> + return size;
> +}
[ ... ]
> @@ -202,12 +286,17 @@ static int htm_dump_sample_data(struct perf_event *event)
[Severity: Critical]
This is a pre-existing issue, but does the htm_hcall_wrapper() call just
above this corrupt physical memory?
Because aux_buf->base is just pages[0] from rb_alloc_aux(), the memory is
only physically contiguous for the length of that single allocation chunk.
Passing virt_to_phys(aux_buf->base) with the full
aux_buf->nr_pages * PAGE_SIZE size could cause the hypervisor to write past
the allocated chunk into adjacent physical frames.
> if (ret > 0) {
> aux_buf->head += (aux_buf->nr_pages * PAGE_SIZE);
> perf_aux_output_end(&htm_ctx->handle, (aux_buf->nr_pages * PAGE_SIZE));
> + return ret;
> } else {
> aux_buf->collect_htm_trace = 0;
> - perf_aux_output_end(&htm_ctx->handle, 0);
> }
> }
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260701083806.79358-1-atrajeev@linux.ibm.com?part=3
end of thread, other threads:[~2026-07-01 9:02 UTC | newest]
Thread overview: 9+ messages
2026-07-01 8:38 [PATCH 0/5] powerpc/htm: Add interface to expose HTM trace data via perf Athira Rajeev
2026-07-01 8:38 ` [PATCH 1/5] " Athira Rajeev
2026-07-01 8:50 ` sashiko-bot
2026-07-01 8:38 ` [PATCH 2/5] powerpc/htm: Add support to setup and free aux buffer for capturing HTM data Athira Rajeev
2026-07-01 8:50 ` sashiko-bot
2026-07-01 8:38 ` [PATCH 3/5] powerpc/perf: Capture the HTM memory configuration as part of perf data Athira Rajeev
2026-07-01 9:02 ` sashiko-bot
2026-07-01 8:38 ` [PATCH 4/5] docs: ABI: sysfs-bus-event_source-devices-htm: Document sysfs event format entries for htm pmu Athira Rajeev
2026-07-01 8:38 ` [PATCH 5/5] powerpc/perf/htm: Add documentation for Hardware Trace Macro PMU Athira Rajeev