public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [RFC v2 0/8] drm/panthor: Add performance counters with manual sampling mode
@ 2024-12-11 16:50 Lukas Zapolskas
  2024-12-11 16:50 ` [RFC v2 1/8] drm/panthor: Add performance counter uAPI Lukas Zapolskas
                   ` (7 more replies)
  0 siblings, 8 replies; 28+ messages in thread
From: Lukas Zapolskas @ 2024-12-11 16:50 UTC (permalink / raw)
  To: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	Adrián Larumbe
  Cc: dri-devel, linux-kernel, Mihail Atanassov, nd, Lukas Zapolskas

Hello,

This patch set implements initial support for performance counter
sampling in Panthor, as a follow-up for Adrián Larumbe's patch
set [1].

Existing performance counter workflows, such as those in game
engines, and user-space power models/governor implementations
require the ability to simultaneously obtain counter data. The
hardware and firmware interfaces support a single global
configuration, meaning the kernel must allow for the multiplexing.
It is also in the best position to supplement the counter data
with contextual information about elapsed sampling periods,
information on the power state transitions undergone during
the sampling period, and cycles elapsed on specific clocks chosen
by the integrator.

Each userspace client creates a session, providing an enable
mask of counter values it requires, a BO for a ring buffer,
and a separate BO for the insert and extract indices, along with
an eventfd to signal counter capture, all of which are kept fixed
for the lifetime of the session. When emitting a sample for a
session, counters that were not requested are stripped out,
and non-counter information needed to interpret counter values
is added to either the sample header, or the block header,
which are stored in-line with the counter values in the sample.

The proposed uAPI specifies two major sources of supplemental
information:
- coarse-grained block state transitions are provided on newer
  FW versions which support the metadata block, a FW-provided
  counter block which indicates the reason a sample was taken
  when entering or existing a non-counting region, or when a
  shader core has powered down.
- the clock assignments to individual blocks is done by
  integrators, and in order to normalize counter values
  which count cycles, userspace must know both the clock
  cycles elapsed over the sampling period, and which
  of the clocks that particular block is associated
  with.

All of the sessions are then aggregated by the sampler, which
handles the programming of the FW interface and subsequent
handling of the samples coming from FW.

[1]: https://lore.kernel.org/lkml/20240305165820.585245-1-adrian.larumbe@collabora.com/T/#m67d1f89614fe35dc0560e8304d6731eb1a6942b6

Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
Co-developed-by: Mihail Atanassov <mihail.atanassov@arm.com>
Signed-off-by: Mihail Atanassov <mihail.atanassov@arm.com>
Co-developed-by: Lukas Zapolskas <lukas.zapolskas@arm.com>
Signed-off-by: Lukas Zapolskas <lukas.zapolskas@arm.com>

Adrián Larumbe (1):
  drm/panthor: Implement the counter sampler and sample handling

Lukas Zapolskas (7):
  drm/panthor: Add performance counter uAPI
  drm/panthor: Add DEV_QUERY.PERF_INFO handling for Gx10
  drm/panthor: Add panthor_perf_init and panthor_perf_unplug
  drm/panthor: Add panthor perf ioctls
  drm/panthor: Introduce sampling sessions to handle userspace clients
  drm/panthor: Add suspend/resume handling for the performance counters
  drm/panthor: Expose the panthor perf ioctls

 drivers/gpu/drm/panthor/Makefile         |    1 +
 drivers/gpu/drm/panthor/panthor_device.c |   10 +
 drivers/gpu/drm/panthor/panthor_device.h |   11 +-
 drivers/gpu/drm/panthor/panthor_drv.c    |  167 +-
 drivers/gpu/drm/panthor/panthor_fw.c     |    9 +
 drivers/gpu/drm/panthor/panthor_fw.h     |   11 +-
 drivers/gpu/drm/panthor/panthor_perf.c   | 1773 ++++++++++++++++++++++
 drivers/gpu/drm/panthor/panthor_perf.h   |   38 +
 include/uapi/drm/panthor_drm.h           |  538 +++++++
 9 files changed, 2553 insertions(+), 5 deletions(-)
 create mode 100644 drivers/gpu/drm/panthor/panthor_perf.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_perf.h

--
2.25.1


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC v2 1/8] drm/panthor: Add performance counter uAPI
  2024-12-11 16:50 [RFC v2 0/8] drm/panthor: Add performance counters with manual sampling mode Lukas Zapolskas
@ 2024-12-11 16:50 ` Lukas Zapolskas
  2025-01-27  9:47   ` Adrián Larumbe
  2024-12-11 16:50 ` [RFC v2 2/8] drm/panthor: Add DEV_QUERY.PERF_INFO handling for Gx10 Lukas Zapolskas
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Lukas Zapolskas @ 2024-12-11 16:50 UTC (permalink / raw)
  To: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	Adrián Larumbe
  Cc: dri-devel, linux-kernel, Mihail Atanassov, nd, Lukas Zapolskas

This patch extends the DEV_QUERY ioctl to return information about the
performance counter setup for userspace, and introduces the new
ioctl DRM_PANTHOR_PERF_CONTROL in order to allow for the sampling of
performance counters.

The new design is inspired by the perf aux ringbuffer, with the insert
and extract indices being mapped to userspace, allowing multiple samples
to be exposed at any given time. To avoid pointer chasing, the sample
metadata and block metadata are inline with the elements they
describe.

Userspace is responsible for passing in resources for samples to be
exposed, including the event file descriptor for notification of new
sample availability, the ringbuffer BO to store samples, and the control
BO along with the offset for mapping the insert and extract indices.
Though these indices are only a total of 8 bytes, userspace can then
reuse the same physical page for tracking the state of multiple buffers
by giving different offsets from the BO start to map them.

Co-developed-by: Mihail Atanassov <mihail.atanassov@arm.com>
Signed-off-by: Mihail Atanassov <mihail.atanassov@arm.com>
Signed-off-by: Lukas Zapolskas <lukas.zapolskas@arm.com>
---
 include/uapi/drm/panthor_drm.h | 487 +++++++++++++++++++++++++++++++++
 1 file changed, 487 insertions(+)

diff --git a/include/uapi/drm/panthor_drm.h b/include/uapi/drm/panthor_drm.h
index 87c9cb555dd1..8a431431da6b 100644
--- a/include/uapi/drm/panthor_drm.h
+++ b/include/uapi/drm/panthor_drm.h
@@ -127,6 +127,9 @@ enum drm_panthor_ioctl_id {
 
 	/** @DRM_PANTHOR_TILER_HEAP_DESTROY: Destroy a tiler heap. */
 	DRM_PANTHOR_TILER_HEAP_DESTROY,
+
+	/** @DRM_PANTHOR_PERF_CONTROL: Control a performance counter session. */
+	DRM_PANTHOR_PERF_CONTROL,
 };
 
 /**
@@ -170,6 +173,8 @@ enum drm_panthor_ioctl_id {
 	DRM_IOCTL_PANTHOR(WR, TILER_HEAP_CREATE, tiler_heap_create)
 #define DRM_IOCTL_PANTHOR_TILER_HEAP_DESTROY \
 	DRM_IOCTL_PANTHOR(WR, TILER_HEAP_DESTROY, tiler_heap_destroy)
+#define DRM_IOCTL_PANTHOR_PERF_CONTROL \
+	DRM_IOCTL_PANTHOR(WR, PERF_CONTROL, perf_control)
 
 /**
  * DOC: IOCTL arguments
@@ -268,6 +273,9 @@ enum drm_panthor_dev_query_type {
 	 * @DRM_PANTHOR_DEV_QUERY_GROUP_PRIORITIES_INFO: Query allowed group priorities information.
 	 */
 	DRM_PANTHOR_DEV_QUERY_GROUP_PRIORITIES_INFO,
+
+	/** @DRM_PANTHOR_DEV_QUERY_PERF_INFO: Query performance counter interface information. */
+	DRM_PANTHOR_DEV_QUERY_PERF_INFO,
 };
 
 /**
@@ -421,6 +429,120 @@ struct drm_panthor_group_priorities_info {
 	__u8 pad[3];
 };
 
+/**
+ * enum drm_panthor_perf_feat_flags - Performance counter configuration feature flags.
+ */
+enum drm_panthor_perf_feat_flags {
+	/** @DRM_PANTHOR_PERF_BLOCK_STATES_SUPPORT: Coarse-grained block states are supported. */
+	DRM_PANTHOR_PERF_BLOCK_STATES_SUPPORT = 1 << 0,
+};
+
+/**
+ * enum drm_panthor_perf_block_type - Performance counter supported block types.
+ */
+enum drm_panthor_perf_block_type {
+	/** @DRM_PANTHOR_PERF_BLOCK_FW: The FW counter block. */
+	DRM_PANTHOR_PERF_BLOCK_FW = 1,
+
+	/** @DRM_PANTHOR_PERF_BLOCK_CSG: A CSG counter block. */
+	DRM_PANTHOR_PERF_BLOCK_CSG,
+
+	/** @DRM_PANTHOR_PERF_BLOCK_CSHW: The CSHW counter block. */
+	DRM_PANTHOR_PERF_BLOCK_CSHW,
+
+	/** @DRM_PANTHOR_PERF_BLOCK_TILER: The tiler counter block. */
+	DRM_PANTHOR_PERF_BLOCK_TILER,
+
+	/** @DRM_PANTHOR_PERF_BLOCK_MEMSYS: A memsys counter block. */
+	DRM_PANTHOR_PERF_BLOCK_MEMSYS,
+
+	/** @DRM_PANTHOR_PERF_BLOCK_SHADER: A shader core counter block. */
+	DRM_PANTHOR_PERF_BLOCK_SHADER,
+};
+
+/**
+ * enum drm_panthor_perf_clock - Identifier of the clock used to produce the cycle count values
+ * in a given block.
+ *
+ * Since the integrator has the choice of using one or more clocks, there may be some confusion
+ * as to which blocks are counted by which clock values unless this information is explicitly
+ * provided as part of every block sample. Not every single clock here can be used: in the simplest
+ * case, all cycle counts will be associated with the top-level clock.
+ */
+enum drm_panthor_perf_clock {
+	/** @DRM_PANTHOR_PERF_CLOCK_TOPLEVEL: Top-level CSF clock. */
+	DRM_PANTHOR_PERF_CLOCK_TOPLEVEL,
+
+	/**
+	 * @DRM_PANTHOR_PERF_CLOCK_COREGROUP: Core group clock, responsible for the MMU, L2
+	 * caches and the tiler.
+	 */
+	DRM_PANTHOR_PERF_CLOCK_COREGROUP,
+
+	/** @DRM_PANTHOR_PERF_CLOCK_SHADER: Clock for the shader cores. */
+	DRM_PANTHOR_PERF_CLOCK_SHADER,
+};
+
+/**
+ * struct drm_panthor_perf_info - Performance counter interface information
+ *
+ * Structure grouping all queryable information relating to the performance counter
+ * interfaces.
+ */
+struct drm_panthor_perf_info {
+	/**
+	 * @counters_per_block: The number of 8-byte counters available in a block.
+	 */
+	__u32 counters_per_block;
+
+	/**
+	 * @sample_header_size: The size of the header struct available at the beginning
+	 * of every sample.
+	 */
+	__u32 sample_header_size;
+
+	/**
+	 * @block_header_size: The size of the header struct inline with the counters for a
+	 * single block.
+	 */
+	__u32 block_header_size;
+
+	/** @flags: Combination of drm_panthor_perf_feat_flags flags. */
+	__u32 flags;
+
+	/**
+	 * @supported_clocks: Bitmask of the clocks supported by the GPU.
+	 *
+	 * Each bit represents a variant of the enum drm_panthor_perf_clock.
+	 *
+	 * For the same GPU, different implementers may have different clocks for the same hardware
+	 * block. At the moment, up to four clocks are supported, and any clocks that are present
+	 * will be reported here.
+	 */
+	__u32 supported_clocks;
+
+	/** @fw_blocks: Number of FW blocks available. */
+	__u32 fw_blocks;
+
+	/** @csg_blocks: Number of CSG blocks available. */
+	__u32 csg_blocks;
+
+	/** @cshw_blocks: Number of CSHW blocks available. */
+	__u32 cshw_blocks;
+
+	/** @tiler_blocks: Number of tiler blocks available. */
+	__u32 tiler_blocks;
+
+	/** @memsys_blocks: Number of memsys blocks available. */
+	__u32 memsys_blocks;
+
+	/** @shader_blocks: Number of shader core blocks available. */
+	__u32 shader_blocks;
+
+	/** @pad: MBZ. */
+	__u32 pad;
+};
+
 /**
  * struct drm_panthor_dev_query - Arguments passed to DRM_PANTHOR_IOCTL_DEV_QUERY
  */
@@ -1010,6 +1132,371 @@ struct drm_panthor_tiler_heap_destroy {
 	__u32 pad;
 };
 
+/**
+ * DOC: Performance counter decoding in userspace.
+ *
+ * Each sample will be exposed to userspace in the following manner:
+ *
+ * +--------+--------+------------------------+--------+-------------------------+-----+
+ * | Sample | Block  |        Block           | Block  |         Block           | ... |
+ * | header | header |        counters        | header |         counters        |     |
+ * +--------+--------+------------------------+--------+-------------------------+-----+
+ *
+ * Each sample will start with a sample header of type @struct drm_panthor_perf_sample header,
+ * providing sample-wide information like the start and end timestamps, the counter set currently
+ * configured, and any errors that may have occurred during sampling.
+ *
+ * After the fixed size header, the sample will consist of blocks of
+ * 64-bit @drm_panthor_dev_query_perf_info::counters_per_block counters, each prefaced with a
+ * header of its own, indicating source block type, as well as the cycle count needed to normalize
+ * cycle values within that block, and a clock source identifier.
+ */
+
+/**
+ * enum drm_panthor_perf_block_state - Bitmask of the power and execution states that an individual
+ * hardware block went through in a sampling period.
+ *
+ * Because the sampling period is controlled from userspace, the block may undergo multiple
+ * state transitions, so this must be interpreted as one or more such transitions occurring.
+ */
+enum drm_panthor_perf_block_state {
+	/**
+	 * @DRM_PANTHOR_PERF_BLOCK_STATE_UNKNOWN: The state of this block was unknown during
+	 * the sampling period.
+	 */
+	DRM_PANTHOR_PERF_BLOCK_STATE_UNKNOWN = 0,
+
+	/**
+	 * @DRM_PANTHOR_PERF_BLOCK_STATE_ON: This block was powered on for some or all of
+	 * the sampling period.
+	 */
+	DRM_PANTHOR_PERF_BLOCK_STATE_ON = 1 << 0,
+
+	/**
+	 * @DRM_PANTHOR_PERF_BLOCK_STATE_OFF: This block was powered off for some or all of the
+	 * sampling period.
+	 */
+	DRM_PANTHOR_PERF_BLOCK_STATE_OFF = 1 << 1,
+
+	/**
+	 * @DRM_PANTHOR_PERF_BLOCK_STATE_AVAILABLE: This block was available for execution for
+	 * some or all of the sampling period.
+	 */
+	DRM_PANTHOR_PERF_BLOCK_STATE_AVAILABLE = 1 << 2,
+	/**
+	 * @DRM_PANTHOR_PERF_BLOCK_STATE_UNAVAILABLE: This block was unavailable for execution for
+	 * some or all of the sampling period.
+	 */
+	DRM_PANTHOR_PERF_BLOCK_STATE_UNAVAILABLE = 1 << 3,
+
+	/**
+	 * @DRM_PANTHOR_PERF_BLOCK_STATE_NORMAL: This block was executing in normal mode
+	 * for some or all of the sampling period.
+	 */
+	DRM_PANTHOR_PERF_BLOCK_STATE_NORMAL = 1 << 4,
+
+	/**
+	 * @DRM_PANTHOR_PERF_BLOCK_STATE_PROTECTED: This block was executing in protected mode
+	 * for some or all of the sampling period.
+	 */
+	DRM_PANTHOR_PERF_BLOCK_STATE_PROTECTED = 1 << 5,
+};
+
+/**
+ * struct drm_panthor_perf_block_header - Header present before every block in the
+ * sample ringbuffer.
+ */
+struct drm_panthor_perf_block_header {
+	/** @block_type: Type of the block. */
+	__u8 block_type;
+
+	/** @block_idx: Block index. */
+	__u8 block_idx;
+
+	/**
+	 * @block_states: Coarse-grained block transitions, bitmask of enum
+	 * drm_panthor_perf_block_states.
+	 */
+	__u8 block_states;
+
+	/**
+	 * @clock: Clock used to produce the cycle count for this block, taken from
+	 * enum drm_panthor_perf_clock. The cycle counts are stored in the sample header.
+	 */
+	__u8 clock;
+
+	/** @pad: MBZ. */
+	__u8 pad[4];
+
+	/** @enable_mask: Bitmask of counters requested during the session setup. */
+	__u64 enable_mask[2];
+};
+
+/**
+ * enum drm_panthor_perf_sample_flags - Sample-wide events that occurred over the sampling
+ * period.
+ */
+enum drm_panthor_perf_sample_flags {
+	/**
+	 * @DRM_PANTHOR_PERF_SAMPLE_OVERFLOW: This sample contains overflows due to the duration
+	 * of the sampling period.
+	 */
+	DRM_PANTHOR_PERF_SAMPLE_OVERFLOW = 1 << 0,
+
+	/**
+	 * @DRM_PANTHOR_PERF_SAMPLE_ERROR: This sample encountered an error condition during
+	 * the sample duration.
+	 */
+	DRM_PANTHOR_PERF_SAMPLE_ERROR = 1 << 1,
+};
+
+/**
+ * struct drm_panthor_perf_sample_header - Header present before every sample.
+ */
+struct drm_panthor_perf_sample_header {
+	/**
+	 * @timestamp_start_ns: Earliest timestamp that values in this sample represent, in
+	 * nanoseconds. Derived from CLOCK_MONOTONIC_RAW.
+	 */
+	__u64 timestamp_start_ns;
+
+	/**
+	 * @timestamp_end_ns: Latest timestamp that values in this sample represent, in
+	 * nanoseconds. Derived from CLOCK_MONOTONIC_RAW.
+	 */
+	__u64 timestamp_end_ns;
+
+	/** @block_set: Set of performance counter blocks. */
+	__u8 block_set;
+
+	/** @pad: MBZ. */
+	__u8 pad[3];
+
+	/** @flags: Current sample flags, combination of drm_panthor_perf_sample_flags. */
+	__u32 flags;
+
+	/**
+	 * @user_data: User data provided as part of the command that triggered this sample.
+	 *
+	 * - Automatic samples (periodic ones or those around non-counting periods or power state
+	 * transitions) will be tagged with the user_data provided as part of the
+	 * DRM_PANTHOR_PERF_COMMAND_START call.
+	 * - Manual samples will be tagged with the user_data provided with the
+	 * DRM_PANTHOR_PERF_COMMAND_SAMPLE call.
+	 * - A session's final automatic sample will be tagged with the user_data provided with the
+	 * DRM_PANTHOR_PERF_COMMAND_STOP call.
+	 */
+	__u64 user_data;
+
+	/**
+	 * @toplevel_clock_cycles: The number of cycles elapsed between
+	 * drm_panthor_perf_sample_header::timestamp_start_ns and
+	 * drm_panthor_perf_sample_header::timestamp_end_ns on the top-level clock if the
+	 * corresponding bit is set in drm_panthor_perf_info::supported_clocks.
+	 */
+	__u64 toplevel_clock_cycles;
+
+	/**
+	 * @coregroup_clock_cycles: The number of cycles elapsed between
+	 * drm_panthor_perf_sample_header::timestamp_start_ns and
+	 * drm_panthor_perf_sample_header::timestamp_end_ns on the coregroup clock if the
+	 * corresponding bit is set in drm_panthor_perf_info::supported_clocks.
+	 */
+	__u64 coregroup_clock_cycles;
+
+	/**
+	 * @shader_clock_cycles: The number of cycles elapsed between
+	 * drm_panthor_perf_sample_header::timestamp_start_ns and
+	 * drm_panthor_perf_sample_header::timestamp_end_ns on the shader core clock if the
+	 * corresponding bit is set in drm_panthor_perf_info::supported_clocks.
+	 */
+	__u64 shader_clock_cycles;
+};
+
+/**
+ * enum drm_panthor_perf_command - Command type passed to the DRM_PANTHOR_PERF_CONTROL
+ * IOCTL.
+ */
+enum drm_panthor_perf_command {
+	/** @DRM_PANTHOR_PERF_COMMAND_SETUP: Create a new performance counter sampling context. */
+	DRM_PANTHOR_PERF_COMMAND_SETUP,
+
+	/** @DRM_PANTHOR_PERF_COMMAND_TEARDOWN: Teardown a performance counter sampling context. */
+	DRM_PANTHOR_PERF_COMMAND_TEARDOWN,
+
+	/** @DRM_PANTHOR_PERF_COMMAND_START: Start a sampling session on the indicated context. */
+	DRM_PANTHOR_PERF_COMMAND_START,
+
+	/** @DRM_PANTHOR_PERF_COMMAND_STOP: Stop the sampling session on the indicated context. */
+	DRM_PANTHOR_PERF_COMMAND_STOP,
+
+	/**
+	 * @DRM_PANTHOR_PERF_COMMAND_SAMPLE: Request a manual sample on the indicated context.
+	 *
+	 * When the sampling session is configured with a non-zero sampling frequency, any
+	 * DRM_PANTHOR_PERF_CONTROL calls with this command will be ignored and return an
+	 * -EINVAL.
+	 */
+	DRM_PANTHOR_PERF_COMMAND_SAMPLE,
+};
+
+/**
+ * struct drm_panthor_perf_control - Arguments passed to DRM_PANTHOR_IOCTL_PERF_CONTROL.
+ */
+struct drm_panthor_perf_control {
+	/** @cmd: Command from enum drm_panthor_perf_command. */
+	__u32 cmd;
+
+	/**
+	 * @handle: session handle.
+	 *
+	 * Returned by the DRM_PANTHOR_PERF_COMMAND_SETUP call.
+	 * It must be used in subsequent commands for the same context.
+	 */
+	__u32 handle;
+
+	/**
+	 * @size: size of the command structure.
+	 *
+	 * If the pointer is NULL, the size is updated by the driver to provide the size of the
+	 * output structure. If the pointer is not NULL, the driver will only copy min(size,
+	 * struct_size) to the pointer and update the size accordingly.
+	 */
+	__u64 size;
+
+	/** @pointer: user pointer to a command type struct. */
+	__u64 pointer;
+};
+
+
+/**
+ * struct drm_panthor_perf_cmd_setup - Arguments passed to DRM_PANTHOR_IOCTL_PERF_CONTROL
+ * when the DRM_PANTHOR_PERF_COMMAND_SETUP command is specified.
+ */
+struct drm_panthor_perf_cmd_setup {
+	/**
+	 * @block_set: Set of performance counter blocks.
+	 *
+	 * This is a global configuration and only one set can be active at a time. If
+	 * another client has already requested a counter set, any further requests
+	 * for a different counter set will fail and return an -EBUSY.
+	 *
+	 * If the requested set does not exist, the request will fail and return an -EINVAL.
+	 */
+	__u8 block_set;
+
+	/** @pad: MBZ. */
+	__u8 pad[7];
+
+	/** @fd: eventfd for signalling the availability of a new sample. */
+	__u32 fd;
+
+	/** @ringbuf_handle: Handle to the BO to write perf counter sample to. */
+	__u32 ringbuf_handle;
+
+	/**
+	 * @control_handle: Handle to the BO containing a contiguous 16 byte range, used for the
+	 * insert and extract indices for the ringbuffer.
+	 */
+	__u32 control_handle;
+
+	/**
+	 * @sample_slots: The number of slots available in the userspace-provided BO. Must be
+	 * a power of 2.
+	 *
+	 * If sample_slots * sample_size does not match the BO size, the setup request will fail.
+	 */
+	__u32 sample_slots;
+
+	/**
+	 * @control_offset: Offset into the control BO where the insert and extract indices are
+	 * located.
+	 */
+	__u64 control_offset;
+
+	/**
+	 * @sample_freq_ns: Period between automatic counter sample collection in nanoseconds. Zero
+	 * disables automatic collection and all collection must be done through explicit calls
+	 * to DRM_PANTHOR_PERF_CONTROL.SAMPLE. Non-zero values will disable manual counter sampling
+	 * via the DRM_PANTHOR_PERF_COMMAND_SAMPLE command.
+	 *
+	 * This disables software-triggered periodic sampling, but hardware will still trigger
+	 * automatic samples on certain events, including shader core power transitions, and
+	 * entries to and exits from non-counting periods. The final stop command will also
+	 * trigger a sample to ensure no data is lost.
+	 */
+	__u64 sample_freq_ns;
+
+	/**
+	 * @fw_enable_mask: Bitmask of counters to request from the FW counter block. Any bits
+	 * past the first drm_panthor_perf_info.counters_per_block bits will be ignored.
+	 */
+	__u64 fw_enable_mask[2];
+
+	/**
+	 * @csg_enable_mask: Bitmask of counters to request from the CSG counter blocks. Any bits
+	 * past the first drm_panthor_perf_info.counters_per_block bits will be ignored.
+	 */
+	__u64 csg_enable_mask[2];
+
+	/**
+	 * @cshw_enable_mask: Bitmask of counters to request from the CSHW counter block. Any bits
+	 * past the first drm_panthor_perf_info.counters_per_block bits will be ignored.
+	 */
+	__u64 cshw_enable_mask[2];
+
+	/**
+	 * @tiler_enable_mask: Bitmask of counters to request from the tiler counter block. Any
+	 * bits past the first drm_panthor_perf_info.counters_per_block bits will be ignored.
+	 */
+	__u64 tiler_enable_mask[2];
+
+	/**
+	 * @memsys_enable_mask: Bitmask of counters to request from the memsys counter blocks. Any
+	 * bits past the first drm_panthor_perf_info.counters_per_block bits will be ignored.
+	 */
+	__u64 memsys_enable_mask[2];
+
+	/**
+	 * @shader_enable_mask: Bitmask of counters to request from the shader core counter blocks.
+	 * Any bits past the first drm_panthor_perf_info.counters_per_block bits will be ignored.
+	 */
+	__u64 shader_enable_mask[2];
+};
+
+/**
+ * struct drm_panthor_perf_cmd_start - Arguments passed to DRM_PANTHOR_IOCTL_PERF_CONTROL
+ * when the DRM_PANTHOR_PERF_COMMAND_START command is specified.
+ */
+struct drm_panthor_perf_cmd_start {
+	/**
+	 * @user_data: User provided data that will be attached to automatic samples collected
+	 * until the next DRM_PANTHOR_PERF_COMMAND_STOP.
+	 */
+	__u64 user_data;
+};
+
+/**
+ * struct drm_panthor_perf_cmd_stop - Arguments passed to DRM_PANTHOR_IOCTL_PERF_CONTROL
+ * when the DRM_PANTHOR_PERF_COMMAND_STOP command is specified.
+ */
+struct drm_panthor_perf_cmd_stop {
+	/**
+	 * @user_data: User provided data that will be attached to the automatic sample collected
+	 * at the end of this sampling session.
+	 */
+	__u64 user_data;
+};
+
+/**
+ * struct drm_panthor_perf_cmd_sample - Arguments passed to DRM_PANTHOR_IOCTL_PERF_CONTROL
+ * when the DRM_PANTHOR_PERF_COMMAND_SAMPLE command is specified.
+ */
+struct drm_panthor_perf_cmd_sample {
+	/** @user_data: User provided data that will be attached to the sample.*/
+	__u64 user_data;
+};
+
 #if defined(__cplusplus)
 }
 #endif
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC v2 2/8] drm/panthor: Add DEV_QUERY.PERF_INFO handling for Gx10
  2024-12-11 16:50 [RFC v2 0/8] drm/panthor: Add performance counters with manual sampling mode Lukas Zapolskas
  2024-12-11 16:50 ` [RFC v2 1/8] drm/panthor: Add performance counter uAPI Lukas Zapolskas
@ 2024-12-11 16:50 ` Lukas Zapolskas
  2025-01-27  9:56   ` Adrián Larumbe
  2025-01-27 22:17   ` Adrián Larumbe
  2024-12-11 16:50 ` [RFC v2 3/8] drm/panthor: Add panthor_perf_init and panthor_perf_unplug Lukas Zapolskas
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 28+ messages in thread
From: Lukas Zapolskas @ 2024-12-11 16:50 UTC (permalink / raw)
  To: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	Adrián Larumbe
  Cc: dri-devel, linux-kernel, Mihail Atanassov, nd, Lukas Zapolskas

This change adds the IOCTL to query data about the performance counter
setup. Some of this data was available via previous DEV_QUERY calls,
for instance for GPU info, but exposing it via PERF_INFO
minimizes the overhead of creating a single session to just the one
aggregate IOCTL.

To better align the FW interfaces with the arch spec, the patch also
renames perfcnt to prfcnt.

Signed-off-by: Lukas Zapolskas <lukas.zapolskas@arm.com>
---
 drivers/gpu/drm/panthor/Makefile         |  1 +
 drivers/gpu/drm/panthor/panthor_device.h |  3 ++
 drivers/gpu/drm/panthor/panthor_drv.c    | 11 +++++-
 drivers/gpu/drm/panthor/panthor_fw.c     |  4 ++
 drivers/gpu/drm/panthor/panthor_fw.h     |  4 ++
 drivers/gpu/drm/panthor/panthor_perf.c   | 47 ++++++++++++++++++++++++
 drivers/gpu/drm/panthor/panthor_perf.h   | 12 ++++++
 7 files changed, 81 insertions(+), 1 deletion(-)
 create mode 100644 drivers/gpu/drm/panthor/panthor_perf.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_perf.h

diff --git a/drivers/gpu/drm/panthor/Makefile b/drivers/gpu/drm/panthor/Makefile
index 15294719b09c..0df9947f3575 100644
--- a/drivers/gpu/drm/panthor/Makefile
+++ b/drivers/gpu/drm/panthor/Makefile
@@ -9,6 +9,7 @@ panthor-y := \
 	panthor_gpu.o \
 	panthor_heap.o \
 	panthor_mmu.o \
+	panthor_perf.o \
 	panthor_sched.o
 
 obj-$(CONFIG_DRM_PANTHOR) += panthor.o
diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
index 0e68f5a70d20..636542c1dcbd 100644
--- a/drivers/gpu/drm/panthor/panthor_device.h
+++ b/drivers/gpu/drm/panthor/panthor_device.h
@@ -119,6 +119,9 @@ struct panthor_device {
 	/** @csif_info: Command stream interface information. */
 	struct drm_panthor_csif_info csif_info;
 
+	/** @perf_info: Performance counter interface information. */
+	struct drm_panthor_perf_info perf_info;
+
 	/** @gpu: GPU management data. */
 	struct panthor_gpu *gpu;
 
diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
index ad46a40ed9e1..e0ac3107c69e 100644
--- a/drivers/gpu/drm/panthor/panthor_drv.c
+++ b/drivers/gpu/drm/panthor/panthor_drv.c
@@ -175,7 +175,9 @@ panthor_get_uobj_array(const struct drm_panthor_obj_array *in, u32 min_stride,
 		 PANTHOR_UOBJ_DECL(struct drm_panthor_sync_op, timeline_value), \
 		 PANTHOR_UOBJ_DECL(struct drm_panthor_queue_submit, syncs), \
 		 PANTHOR_UOBJ_DECL(struct drm_panthor_queue_create, ringbuf_size), \
-		 PANTHOR_UOBJ_DECL(struct drm_panthor_vm_bind_op, syncs))
+		 PANTHOR_UOBJ_DECL(struct drm_panthor_vm_bind_op, syncs), \
+		 PANTHOR_UOBJ_DECL(struct drm_panthor_perf_info, shader_blocks))
+
 
 /**
  * PANTHOR_UOBJ_SET() - Copy a kernel object to a user object.
@@ -834,6 +836,10 @@ static int panthor_ioctl_dev_query(struct drm_device *ddev, void *data, struct d
 			args->size = sizeof(priorities_info);
 			return 0;
 
+		case DRM_PANTHOR_DEV_QUERY_PERF_INFO:
+			args->size = sizeof(ptdev->perf_info);
+			return 0;
+
 		default:
 			return -EINVAL;
 		}
@@ -858,6 +864,9 @@ static int panthor_ioctl_dev_query(struct drm_device *ddev, void *data, struct d
 		panthor_query_group_priorities_info(file, &priorities_info);
 		return PANTHOR_UOBJ_SET(args->pointer, args->size, priorities_info);
 
+	case DRM_PANTHOR_DEV_QUERY_PERF_INFO:
+		return PANTHOR_UOBJ_SET(args->pointer, args->size, ptdev->perf_info);
+
 	default:
 		return -EINVAL;
 	}
diff --git a/drivers/gpu/drm/panthor/panthor_fw.c b/drivers/gpu/drm/panthor/panthor_fw.c
index 4a2e36504fea..e9530d1d9781 100644
--- a/drivers/gpu/drm/panthor/panthor_fw.c
+++ b/drivers/gpu/drm/panthor/panthor_fw.c
@@ -21,6 +21,7 @@
 #include "panthor_gem.h"
 #include "panthor_gpu.h"
 #include "panthor_mmu.h"
+#include "panthor_perf.h"
 #include "panthor_regs.h"
 #include "panthor_sched.h"
 
@@ -1417,6 +1418,9 @@ int panthor_fw_init(struct panthor_device *ptdev)
 		goto err_unplug_fw;
 
 	panthor_fw_init_global_iface(ptdev);
+
+	panthor_perf_info_init(ptdev);
+
 	return 0;
 
 err_unplug_fw:
diff --git a/drivers/gpu/drm/panthor/panthor_fw.h b/drivers/gpu/drm/panthor/panthor_fw.h
index 22448abde992..db10358e24bb 100644
--- a/drivers/gpu/drm/panthor/panthor_fw.h
+++ b/drivers/gpu/drm/panthor/panthor_fw.h
@@ -5,6 +5,7 @@
 #define __PANTHOR_MCU_H__
 
 #include <linux/types.h>
+#include <linux/spinlock.h>
 
 struct panthor_device;
 struct panthor_kernel_bo;
@@ -197,8 +198,11 @@ struct panthor_fw_global_control_iface {
 	u32 output_va;
 	u32 group_num;
 	u32 group_stride;
+#define GLB_PERFCNT_FW_SIZE(x) ((((x) >> 16) << 8))
 	u32 perfcnt_size;
 	u32 instr_features;
+#define PERFCNT_FEATURES_MD_SIZE(x) ((x) & GENMASK(3, 0))
+	u32 perfcnt_features;
 };
 
 struct panthor_fw_global_input_iface {
diff --git a/drivers/gpu/drm/panthor/panthor_perf.c b/drivers/gpu/drm/panthor/panthor_perf.c
new file mode 100644
index 000000000000..0e3d769c1805
--- /dev/null
+++ b/drivers/gpu/drm/panthor/panthor_perf.c
@@ -0,0 +1,47 @@
+// SPDX-License-Identifier: GPL-2.0 or MIT
+/* Copyright 2023 Collabora Ltd */
+/* Copyright 2024 Arm ltd. */
+
+#include <drm/drm_file.h>
+#include <drm/drm_gem_shmem_helper.h>
+#include <drm/drm_managed.h>
+#include <drm/panthor_drm.h>
+
+#include "panthor_device.h"
+#include "panthor_fw.h"
+#include "panthor_gpu.h"
+#include "panthor_perf.h"
+#include "panthor_regs.h"
+
+/**
+ * PANTHOR_PERF_COUNTERS_PER_BLOCK - On CSF architectures pre-11.x, the number of counters
+ * per block was hardcoded to be 64. Arch 11.0 onwards supports the PRFCNT_FEATURES GPU register,
+ * which indicates the same information.
+ */
+#define PANTHOR_PERF_COUNTERS_PER_BLOCK (64)
+
+void panthor_perf_info_init(struct panthor_device *ptdev)
+{
+	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
+	struct drm_panthor_perf_info *const perf_info = &ptdev->perf_info;
+
+	if (PERFCNT_FEATURES_MD_SIZE(glb_iface->control->perfcnt_features))
+		perf_info->flags |= DRM_PANTHOR_PERF_BLOCK_STATES_SUPPORT;
+
+	if (GPU_ARCH_MAJOR(ptdev->gpu_info.gpu_id) < 11)
+		perf_info->counters_per_block = PANTHOR_PERF_COUNTERS_PER_BLOCK;
+
+	perf_info->sample_header_size = sizeof(struct drm_panthor_perf_sample_header);
+	perf_info->block_header_size = sizeof(struct drm_panthor_perf_block_header);
+
+	if (GLB_PERFCNT_FW_SIZE(glb_iface->control->perfcnt_size)) {
+		perf_info->fw_blocks = 1;
+		perf_info->csg_blocks = glb_iface->control->group_num;
+	}
+
+	perf_info->cshw_blocks = 1;
+	perf_info->tiler_blocks = 1;
+	perf_info->memsys_blocks = hweight64(ptdev->gpu_info.l2_present);
+	perf_info->shader_blocks = hweight64(ptdev->gpu_info.shader_present);
+}
+
diff --git a/drivers/gpu/drm/panthor/panthor_perf.h b/drivers/gpu/drm/panthor/panthor_perf.h
new file mode 100644
index 000000000000..cff537a370c9
--- /dev/null
+++ b/drivers/gpu/drm/panthor/panthor_perf.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 or MIT */
+/* Copyright 2024 Collabora Ltd */
+/* Copyright 2024 Arm ltd. */
+
+#ifndef __PANTHOR_PERF_H__
+#define __PANTHOR_PERF_H__
+
+struct panthor_device;
+
+void panthor_perf_info_init(struct panthor_device *ptdev);
+
+#endif /* __PANTHOR_PERF_H__ */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC v2 3/8] drm/panthor: Add panthor_perf_init and panthor_perf_unplug
  2024-12-11 16:50 [RFC v2 0/8] drm/panthor: Add performance counters with manual sampling mode Lukas Zapolskas
  2024-12-11 16:50 ` [RFC v2 1/8] drm/panthor: Add performance counter uAPI Lukas Zapolskas
  2024-12-11 16:50 ` [RFC v2 2/8] drm/panthor: Add DEV_QUERY.PERF_INFO handling for Gx10 Lukas Zapolskas
@ 2024-12-11 16:50 ` Lukas Zapolskas
  2025-01-27 12:46   ` Adrián Larumbe
  2025-01-27 15:50   ` adrian.larumbe
  2024-12-11 16:50 ` [RFC v2 4/8] drm/panthor: Add panthor perf ioctls Lukas Zapolskas
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 28+ messages in thread
From: Lukas Zapolskas @ 2024-12-11 16:50 UTC (permalink / raw)
  To: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	Adrián Larumbe
  Cc: dri-devel, linux-kernel, Mihail Atanassov, nd, Lukas Zapolskas

Added the panthor_perf system initialization and unplug code to allow
for the handling of userspace sessions to be added in follow-up patches.

Signed-off-by: Lukas Zapolskas <lukas.zapolskas@arm.com>
---
 drivers/gpu/drm/panthor/panthor_device.c |  7 +++
 drivers/gpu/drm/panthor/panthor_device.h |  5 +-
 drivers/gpu/drm/panthor/panthor_perf.c   | 77 ++++++++++++++++++++++++
 drivers/gpu/drm/panthor/panthor_perf.h   |  3 +
 4 files changed, 91 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
index 00f7b8ce935a..1a81a436143b 100644
--- a/drivers/gpu/drm/panthor/panthor_device.c
+++ b/drivers/gpu/drm/panthor/panthor_device.c
@@ -19,6 +19,7 @@
 #include "panthor_fw.h"
 #include "panthor_gpu.h"
 #include "panthor_mmu.h"
+#include "panthor_perf.h"
 #include "panthor_regs.h"
 #include "panthor_sched.h"
 
@@ -97,6 +98,7 @@ void panthor_device_unplug(struct panthor_device *ptdev)
 	/* Now, try to cleanly shutdown the GPU before the device resources
 	 * get reclaimed.
 	 */
+	panthor_perf_unplug(ptdev);
 	panthor_sched_unplug(ptdev);
 	panthor_fw_unplug(ptdev);
 	panthor_mmu_unplug(ptdev);
@@ -262,6 +264,10 @@ int panthor_device_init(struct panthor_device *ptdev)
 	if (ret)
 		goto err_unplug_fw;
 
+	ret = panthor_perf_init(ptdev);
+	if (ret)
+		goto err_unplug_fw;
+
 	/* ~3 frames */
 	pm_runtime_set_autosuspend_delay(ptdev->base.dev, 50);
 	pm_runtime_use_autosuspend(ptdev->base.dev);
@@ -275,6 +281,7 @@ int panthor_device_init(struct panthor_device *ptdev)
 
 err_disable_autosuspend:
 	pm_runtime_dont_use_autosuspend(ptdev->base.dev);
+	panthor_perf_unplug(ptdev);
 	panthor_sched_unplug(ptdev);
 
 err_unplug_fw:
diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
index 636542c1dcbd..aca33d03036c 100644
--- a/drivers/gpu/drm/panthor/panthor_device.h
+++ b/drivers/gpu/drm/panthor/panthor_device.h
@@ -26,7 +26,7 @@ struct panthor_heap_pool;
 struct panthor_job;
 struct panthor_mmu;
 struct panthor_fw;
-struct panthor_perfcnt;
+struct panthor_perf;
 struct panthor_vm;
 struct panthor_vm_pool;
 
@@ -137,6 +137,9 @@ struct panthor_device {
 	/** @devfreq: Device frequency scaling management data. */
 	struct panthor_devfreq *devfreq;
 
+	/** @perf: Performance counter management data. */
+	struct panthor_perf *perf;
+
 	/** @unplug: Device unplug related fields. */
 	struct {
 		/** @lock: Lock used to serialize unplug operations. */
diff --git a/drivers/gpu/drm/panthor/panthor_perf.c b/drivers/gpu/drm/panthor/panthor_perf.c
index 0e3d769c1805..e0dc6c4b0cf1 100644
--- a/drivers/gpu/drm/panthor/panthor_perf.c
+++ b/drivers/gpu/drm/panthor/panthor_perf.c
@@ -13,6 +13,24 @@
 #include "panthor_perf.h"
 #include "panthor_regs.h"
 
+struct panthor_perf {
+	/**
+	 * @block_set: The global counter set configured onto the HW.
+	 */
+	u8 block_set;
+
+	/** @next_session: The ID of the next session. */
+	u32 next_session;
+
+	/** @session_range: The number of sessions supported at a time. */
+	struct xa_limit session_range;
+
+	/**
+	 * @sessions: Global map of sessions, accessed by their ID.
+	 */
+	struct xarray sessions;
+};
+
 /**
  * PANTHOR_PERF_COUNTERS_PER_BLOCK - On CSF architectures pre-11.x, the number of counters
  * per block was hardcoded to be 64. Arch 11.0 onwards supports the PRFCNT_FEATURES GPU register,
@@ -45,3 +63,62 @@ void panthor_perf_info_init(struct panthor_device *ptdev)
 	perf_info->shader_blocks = hweight64(ptdev->gpu_info.shader_present);
 }
 
+/**
+ * panthor_perf_init - Initialize the performance counter subsystem.
+ * @ptdev: Panthor device
+ *
+ * The performance counters require the FW interface to be available to setup the
+ * sampling ringbuffers, so this must be called only after FW is initialized.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int panthor_perf_init(struct panthor_device *ptdev)
+{
+	struct panthor_perf *perf;
+
+	if (!ptdev)
+		return -EINVAL;
+
+	perf = devm_kzalloc(ptdev->base.dev, sizeof(*perf), GFP_KERNEL);
+	if (ZERO_OR_NULL_PTR(perf))
+		return -ENOMEM;
+
+	xa_init_flags(&perf->sessions, XA_FLAGS_ALLOC);
+
+	/* Currently, we only support a single session at a time. */
+	perf->session_range = (struct xa_limit) {
+		.min = 0,
+		.max = 1,
+	};
+
+	drm_info(&ptdev->base, "Performance counter subsystem initialized");
+
+	ptdev->perf = perf;
+
+	return 0;
+}
+
+/**
+ * panthor_perf_unplug - Terminate the performance counter subsystem.
+ * @ptdev: Panthor device.
+ *
+ * This function will terminate the performance counter control structures and any remaining
+ * sessions, after waiting for any pending interrupts.
+ */
+void panthor_perf_unplug(struct panthor_device *ptdev)
+{
+	struct panthor_perf *perf = ptdev->perf;
+
+	if (!perf)
+		return;
+
+	if (!xa_empty(&perf->sessions))
+		drm_err(&ptdev->base,
+				"Performance counter sessions active when unplugging the driver!");
+
+	xa_destroy(&perf->sessions);
+
+	devm_kfree(ptdev->base.dev, ptdev->perf);
+
+	ptdev->perf = NULL;
+}
diff --git a/drivers/gpu/drm/panthor/panthor_perf.h b/drivers/gpu/drm/panthor/panthor_perf.h
index cff537a370c9..90af8b18358c 100644
--- a/drivers/gpu/drm/panthor/panthor_perf.h
+++ b/drivers/gpu/drm/panthor/panthor_perf.h
@@ -9,4 +9,7 @@ struct panthor_device;
 
 void panthor_perf_info_init(struct panthor_device *ptdev);
 
+int panthor_perf_init(struct panthor_device *ptdev);
+void panthor_perf_unplug(struct panthor_device *ptdev);
+
 #endif /* __PANTHOR_PERF_H__ */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC v2 4/8] drm/panthor: Add panthor perf ioctls
  2024-12-11 16:50 [RFC v2 0/8] drm/panthor: Add performance counters with manual sampling mode Lukas Zapolskas
                   ` (2 preceding siblings ...)
  2024-12-11 16:50 ` [RFC v2 3/8] drm/panthor: Add panthor_perf_init and panthor_perf_unplug Lukas Zapolskas
@ 2024-12-11 16:50 ` Lukas Zapolskas
  2025-01-27 14:06   ` Adrián Larumbe
  2024-12-11 16:50 ` [RFC v2 5/8] drm/panthor: Introduce sampling sessions to handle userspace clients Lukas Zapolskas
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Lukas Zapolskas @ 2024-12-11 16:50 UTC (permalink / raw)
  To: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	Adrián Larumbe
  Cc: dri-devel, linux-kernel, Mihail Atanassov, nd, Lukas Zapolskas

This patch implements the PANTHOR_PERF_CONTROL ioctl series, and
a PANTHOR_GET_UOBJ wrapper to deal with the backwards and forwards
compatibility of the uAPI.

Stub function definitions are added to ensure the patch builds on its own,
and will be removed later in the series.

Signed-off-by: Lukas Zapolskas <lukas.zapolskas@arm.com>
---
 drivers/gpu/drm/panthor/panthor_drv.c  | 155 ++++++++++++++++++++++++-
 drivers/gpu/drm/panthor/panthor_perf.c |  34 ++++++
 drivers/gpu/drm/panthor/panthor_perf.h |  19 +++
 3 files changed, 206 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
index e0ac3107c69e..458175f58b15 100644
--- a/drivers/gpu/drm/panthor/panthor_drv.c
+++ b/drivers/gpu/drm/panthor/panthor_drv.c
@@ -7,6 +7,7 @@
 #include <asm/arch_timer.h>
 #endif
 
+#include <linux/cleanup.h>
 #include <linux/list.h>
 #include <linux/module.h>
 #include <linux/of_platform.h>
@@ -31,6 +32,7 @@
 #include "panthor_gpu.h"
 #include "panthor_heap.h"
 #include "panthor_mmu.h"
+#include "panthor_perf.h"
 #include "panthor_regs.h"
 #include "panthor_sched.h"
 
@@ -73,6 +75,39 @@ panthor_set_uobj(u64 usr_ptr, u32 usr_size, u32 min_size, u32 kern_size, const v
 	return 0;
 }
 
+/**
+ * panthor_get_uobj() - Copy kernel object to user object.
+ * @usr_ptr: Users pointer.
+ * @usr_size: Size of the user object.
+ * @min_size: Minimum size for this object.
+ *
+ * Helper automating kernel -> user object copies.
+ *
+ * Don't use this function directly, use PANTHOR_UOBJ_GET() instead.
+ *
+ * Return: valid pointer on success, an encoded error code otherwise.
+ */
+static void*
+panthor_get_uobj(u64 usr_ptr, u32 usr_size, u32 min_size)
+{
+	int ret;
+	void *out_alloc __free(kvfree) = NULL;
+
+	/* User size shouldn't be smaller than the minimal object size. */
+	if (usr_size < min_size)
+		return ERR_PTR(-EINVAL);
+
+	out_alloc = kvmalloc(min_size, GFP_KERNEL);
+	if (!out_alloc)
+		return ERR_PTR(-ENOMEM);
+
+	ret = copy_struct_from_user(out_alloc, min_size, u64_to_user_ptr(usr_ptr), usr_size);
+	if (ret)
+		return ERR_PTR(ret);
+
+	return_ptr(out_alloc);
+}
+
 /**
  * panthor_get_uobj_array() - Copy a user object array into a kernel accessible object array.
  * @in: The object array to copy.
@@ -176,8 +211,11 @@ panthor_get_uobj_array(const struct drm_panthor_obj_array *in, u32 min_stride,
 		 PANTHOR_UOBJ_DECL(struct drm_panthor_queue_submit, syncs), \
 		 PANTHOR_UOBJ_DECL(struct drm_panthor_queue_create, ringbuf_size), \
 		 PANTHOR_UOBJ_DECL(struct drm_panthor_vm_bind_op, syncs), \
-		 PANTHOR_UOBJ_DECL(struct drm_panthor_perf_info, shader_blocks))
-
+		 PANTHOR_UOBJ_DECL(struct drm_panthor_perf_info, shader_blocks), \
+		 PANTHOR_UOBJ_DECL(struct drm_panthor_perf_cmd_setup, shader_enable_mask), \
+		 PANTHOR_UOBJ_DECL(struct drm_panthor_perf_cmd_start, user_data), \
+		 PANTHOR_UOBJ_DECL(struct drm_panthor_perf_cmd_stop, user_data), \
+		 PANTHOR_UOBJ_DECL(struct drm_panthor_perf_cmd_sample, user_data))
 
 /**
  * PANTHOR_UOBJ_SET() - Copy a kernel object to a user object.
@@ -192,6 +230,24 @@ panthor_get_uobj_array(const struct drm_panthor_obj_array *in, u32 min_stride,
 			 PANTHOR_UOBJ_MIN_SIZE(_src_obj), \
 			 sizeof(_src_obj), &(_src_obj))
 
+/**
+ * PANTHOR_UOBJ_GET() - Copies a user object from _usr_ptr to a kernel accessible _dest_ptr.
+ * @_dest_ptr: Local varialbe
+ * @_usr_size: Size of the user object.
+ * @_usr_ptr: The pointer of the object in userspace.
+ *
+ * Return: Error code. See panthor_get_uobj().
+ */
+#define PANTHOR_UOBJ_GET(_dest_ptr, _usr_size, _usr_ptr) \
+	({ \
+		typeof(_dest_ptr) _tmp; \
+		_tmp = panthor_get_uobj(_usr_ptr, _usr_size, \
+				PANTHOR_UOBJ_MIN_SIZE(_tmp[0])); \
+		if (!IS_ERR(_tmp)) \
+			_dest_ptr = _tmp; \
+		PTR_ERR_OR_ZERO(_tmp); \
+	})
+
 /**
  * PANTHOR_UOBJ_GET_ARRAY() - Copy a user object array to a kernel accessible
  * object array.
@@ -1339,6 +1395,99 @@ static int panthor_ioctl_vm_get_state(struct drm_device *ddev, void *data,
 	return 0;
 }
 
+static int panthor_ioctl_perf_control(struct drm_device *ddev, void *data,
+		struct drm_file *file)
+{
+	struct panthor_device *ptdev = container_of(ddev, struct panthor_device, base);
+	struct panthor_file *pfile = file->driver_priv;
+	struct drm_panthor_perf_control *args = data;
+	int ret;
+
+	if (!args->pointer) {
+		switch (args->cmd) {
+		case DRM_PANTHOR_PERF_COMMAND_SETUP:
+			args->size = sizeof(struct drm_panthor_perf_cmd_setup);
+			return 0;
+
+		case DRM_PANTHOR_PERF_COMMAND_TEARDOWN:
+			args->size = 0;
+			return 0;
+
+		case DRM_PANTHOR_PERF_COMMAND_START:
+			args->size = sizeof(struct drm_panthor_perf_cmd_start);
+			return 0;
+
+		case DRM_PANTHOR_PERF_COMMAND_STOP:
+			args->size = sizeof(struct drm_panthor_perf_cmd_stop);
+			return 0;
+
+		case DRM_PANTHOR_PERF_COMMAND_SAMPLE:
+			args->size = sizeof(struct drm_panthor_perf_cmd_sample);
+			return 0;
+
+		default:
+			return -EINVAL;
+		}
+	}
+
+	switch (args->cmd) {
+	case DRM_PANTHOR_PERF_COMMAND_SETUP:
+	{
+		struct drm_panthor_perf_cmd_setup *setup_args __free(kvfree) = NULL;
+
+		ret = PANTHOR_UOBJ_GET(setup_args, args->size, args->pointer);
+		if (ret)
+			return -EINVAL;
+
+		if (setup_args->pad[0])
+			return -EINVAL;
+
+		ret = panthor_perf_session_setup(ptdev, ptdev->perf, setup_args, pfile);
+
+		return ret;
+	}
+	case DRM_PANTHOR_PERF_COMMAND_TEARDOWN:
+	{
+		return panthor_perf_session_teardown(pfile, ptdev->perf, args->handle);
+	}
+	case DRM_PANTHOR_PERF_COMMAND_START:
+	{
+		struct drm_panthor_perf_cmd_start *start_args __free(kvfree) = NULL;
+
+		ret = PANTHOR_UOBJ_GET(start_args, args->size, args->pointer);
+		if (ret)
+			return -EINVAL;
+
+		return panthor_perf_session_start(pfile, ptdev->perf, args->handle,
+				start_args->user_data);
+	}
+	case DRM_PANTHOR_PERF_COMMAND_STOP:
+	{
+		struct drm_panthor_perf_cmd_stop *stop_args __free(kvfree) = NULL;
+
+		ret = PANTHOR_UOBJ_GET(stop_args, args->size, args->pointer);
+		if (ret)
+			return -EINVAL;
+
+		return panthor_perf_session_stop(pfile, ptdev->perf, args->handle,
+				stop_args->user_data);
+	}
+	case DRM_PANTHOR_PERF_COMMAND_SAMPLE:
+	{
+		struct drm_panthor_perf_cmd_sample *sample_args __free(kvfree) = NULL;
+
+		ret = PANTHOR_UOBJ_GET(sample_args, args->size, args->pointer);
+		if (ret)
+			return -EINVAL;
+
+		return panthor_perf_session_sample(pfile, ptdev->perf, args->handle,
+					sample_args->user_data);
+	}
+	default:
+		return -EINVAL;
+	}
+}
+
 static int
 panthor_open(struct drm_device *ddev, struct drm_file *file)
 {
@@ -1386,6 +1535,7 @@ panthor_postclose(struct drm_device *ddev, struct drm_file *file)
 
 	panthor_group_pool_destroy(pfile);
 	panthor_vm_pool_destroy(pfile);
+	panthor_perf_session_destroy(pfile, pfile->ptdev->perf);
 
 	kfree(pfile);
 	module_put(THIS_MODULE);
@@ -1408,6 +1558,7 @@ static const struct drm_ioctl_desc panthor_drm_driver_ioctls[] = {
 	PANTHOR_IOCTL(TILER_HEAP_CREATE, tiler_heap_create, DRM_RENDER_ALLOW),
 	PANTHOR_IOCTL(TILER_HEAP_DESTROY, tiler_heap_destroy, DRM_RENDER_ALLOW),
 	PANTHOR_IOCTL(GROUP_SUBMIT, group_submit, DRM_RENDER_ALLOW),
+	PANTHOR_IOCTL(PERF_CONTROL, perf_control, DRM_RENDER_ALLOW),
 };
 
 static int panthor_mmap(struct file *filp, struct vm_area_struct *vma)
diff --git a/drivers/gpu/drm/panthor/panthor_perf.c b/drivers/gpu/drm/panthor/panthor_perf.c
index e0dc6c4b0cf1..6498279ec036 100644
--- a/drivers/gpu/drm/panthor/panthor_perf.c
+++ b/drivers/gpu/drm/panthor/panthor_perf.c
@@ -63,6 +63,40 @@ void panthor_perf_info_init(struct panthor_device *ptdev)
 	perf_info->shader_blocks = hweight64(ptdev->gpu_info.shader_present);
 }
 
+int panthor_perf_session_setup(struct panthor_device *ptdev, struct panthor_perf *perf,
+		struct drm_panthor_perf_cmd_setup *setup_args,
+		struct panthor_file *pfile)
+{
+	return -EOPNOTSUPP;
+}
+
+int panthor_perf_session_teardown(struct panthor_file *pfile, struct panthor_perf *perf,
+		u32 sid)
+{
+	return -EOPNOTSUPP;
+}
+
+int panthor_perf_session_start(struct panthor_file *pfile, struct panthor_perf *perf,
+		u32 sid, u64 user_data)
+{
+	return -EOPNOTSUPP;
+}
+
+int panthor_perf_session_stop(struct panthor_file *pfile, struct panthor_perf *perf,
+		u32 sid, u64 user_data)
+{
+		return -EOPNOTSUPP;
+}
+
+int panthor_perf_session_sample(struct panthor_file *pfile, struct panthor_perf *perf,
+		u32 sid, u64 user_data)
+{
+	return -EOPNOTSUPP;
+
+}
+
+void panthor_perf_session_destroy(struct panthor_file *pfile, struct panthor_perf *perf) { }
+
 /**
  * panthor_perf_init - Initialize the performance counter subsystem.
  * @ptdev: Panthor device
diff --git a/drivers/gpu/drm/panthor/panthor_perf.h b/drivers/gpu/drm/panthor/panthor_perf.h
index 90af8b18358c..bfef8874068b 100644
--- a/drivers/gpu/drm/panthor/panthor_perf.h
+++ b/drivers/gpu/drm/panthor/panthor_perf.h
@@ -5,11 +5,30 @@
 #ifndef __PANTHOR_PERF_H__
 #define __PANTHOR_PERF_H__
 
+#include <linux/types.h>
+
+struct drm_gem_object;
+struct drm_panthor_perf_cmd_setup;
 struct panthor_device;
+struct panthor_file;
+struct panthor_perf;
 
 void panthor_perf_info_init(struct panthor_device *ptdev);
 
 int panthor_perf_init(struct panthor_device *ptdev);
 void panthor_perf_unplug(struct panthor_device *ptdev);
 
+int panthor_perf_session_setup(struct panthor_device *ptdev, struct panthor_perf *perf,
+		struct drm_panthor_perf_cmd_setup *setup_args,
+		struct panthor_file *pfile);
+int panthor_perf_session_teardown(struct panthor_file *pfile, struct panthor_perf *perf,
+		u32 sid);
+int panthor_perf_session_start(struct panthor_file *pfile, struct panthor_perf *perf,
+		u32 sid, u64 user_data);
+int panthor_perf_session_stop(struct panthor_file *pfile, struct panthor_perf *perf,
+		u32 sid, u64 user_data);
+int panthor_perf_session_sample(struct panthor_file *pfile, struct panthor_perf *perf,
+		u32 sid, u64 user_data);
+void panthor_perf_session_destroy(struct panthor_file *pfile, struct panthor_perf *perf);
+
 #endif /* __PANTHOR_PERF_H__ */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC v2 5/8] drm/panthor: Introduce sampling sessions to handle userspace clients
  2024-12-11 16:50 [RFC v2 0/8] drm/panthor: Add performance counters with manual sampling mode Lukas Zapolskas
                   ` (3 preceding siblings ...)
  2024-12-11 16:50 ` [RFC v2 4/8] drm/panthor: Add panthor perf ioctls Lukas Zapolskas
@ 2024-12-11 16:50 ` Lukas Zapolskas
  2025-01-27 15:43   ` Adrián Larumbe
  2025-01-27 21:39   ` Adrián Larumbe
  2024-12-11 16:50 ` [RFC v2 6/8] drm/panthor: Implement the counter sampler and sample handling Lukas Zapolskas
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 28+ messages in thread
From: Lukas Zapolskas @ 2024-12-11 16:50 UTC (permalink / raw)
  To: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	Adrián Larumbe
  Cc: dri-devel, linux-kernel, Mihail Atanassov, nd, Lukas Zapolskas

To allow for combining the requests from multiple userspace clients, an
intermediary layer between the HW/FW interfaces and userspace is
created, containing the information for the counter requests and
tracking of insert and extract indices. Each session starts inactive and
must be explicitly activated via PERF_CONTROL.START, and explicitly
stopped via PERF_CONTROL.STOP. Userspace identifies a single client with
its session ID and the panthor file it is associated with.

The SAMPLE and STOP commands both produce a single sample when called,
and these samples can be disambiguated via the opaque user data field
passed in the PERF_CONTROL uAPI. If this functionality is not desired,
these fields can be kept as zero, as the kernel copies this value into
the corresponding sample without attempting to interpret it.

Currently, only manual sampling sessions are supported, providing
samples when userspace calls PERF_CONTROL.SAMPLE, and only a single
session is allowed at a time. Multiple sessions and periodic sampling
will be enabled in following patches.

No protected is provided against the 32-bit hardware counter overflows,
so for the moment it is up to userspace to ensure that the counters are
sampled at a reasonable frequency.

The counter set enum is added to the uapi to clarify the restrictions on
calling the interface.

Signed-off-by: Lukas Zapolskas <lukas.zapolskas@arm.com>
---
 drivers/gpu/drm/panthor/panthor_device.h |   3 +
 drivers/gpu/drm/panthor/panthor_drv.c    |   1 +
 drivers/gpu/drm/panthor/panthor_perf.c   | 697 ++++++++++++++++++++++-
 include/uapi/drm/panthor_drm.h           |  50 +-
 4 files changed, 732 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
index aca33d03036c..9ed1e9aed521 100644
--- a/drivers/gpu/drm/panthor/panthor_device.h
+++ b/drivers/gpu/drm/panthor/panthor_device.h
@@ -210,6 +210,9 @@ struct panthor_file {
 	/** @ptdev: Device attached to this file. */
 	struct panthor_device *ptdev;
 
+	/** @drm_file: Corresponding drm_file */
+	struct drm_file *drm_file;
+
 	/** @vms: VM pool attached to this file. */
 	struct panthor_vm_pool *vms;
 
diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
index 458175f58b15..2848ab442d10 100644
--- a/drivers/gpu/drm/panthor/panthor_drv.c
+++ b/drivers/gpu/drm/panthor/panthor_drv.c
@@ -1505,6 +1505,7 @@ panthor_open(struct drm_device *ddev, struct drm_file *file)
 	}
 
 	pfile->ptdev = ptdev;
+	pfile->drm_file = file;
 
 	ret = panthor_vm_pool_create(pfile);
 	if (ret)
diff --git a/drivers/gpu/drm/panthor/panthor_perf.c b/drivers/gpu/drm/panthor/panthor_perf.c
index 6498279ec036..42d8b6f8c45d 100644
--- a/drivers/gpu/drm/panthor/panthor_perf.c
+++ b/drivers/gpu/drm/panthor/panthor_perf.c
@@ -3,16 +3,162 @@
 /* Copyright 2024 Arm ltd. */
 
 #include <drm/drm_file.h>
+#include <drm/drm_gem.h>
 #include <drm/drm_gem_shmem_helper.h>
 #include <drm/drm_managed.h>
+#include <drm/drm_print.h>
 #include <drm/panthor_drm.h>
 
+#include <linux/circ_buf.h>
+#include <linux/iosys-map.h>
+#include <linux/pm_runtime.h>
+
 #include "panthor_device.h"
 #include "panthor_fw.h"
 #include "panthor_gpu.h"
 #include "panthor_perf.h"
 #include "panthor_regs.h"
 
+/**
+ * PANTHOR_PERF_EM_BITS - Number of bits in a user-facing enable mask. This must correspond
+ *                        to the maximum number of counters available for selection on the newest
+ *                        Mali GPUs (128 as of the Mali-Gx15).
+ */
+#define PANTHOR_PERF_EM_BITS (BITS_PER_TYPE(u64) * 2)
+
+/**
+ * enum panthor_perf_session_state - Session state bits.
+ */
+enum panthor_perf_session_state {
+	/** @PANTHOR_PERF_SESSION_ACTIVE: The session is active and can be used for sampling. */
+	PANTHOR_PERF_SESSION_ACTIVE = 0,
+
+	/**
+	 * @PANTHOR_PERF_SESSION_OVERFLOW: The session encountered an overflow in one of the
+	 *                                 counters during the last sampling period. This flag
+	 *                                 gets propagated as part of samples emitted for this
+	 *                                 session, to ensure the userspace client can gracefully
+	 *                                 handle this data corruption.
+	 */
+	PANTHOR_PERF_SESSION_OVERFLOW,
+
+	/** @PANTHOR_PERF_SESSION_MAX: Bits needed to represent the state. Must be last.*/
+	PANTHOR_PERF_SESSION_MAX,
+};
+
+struct panthor_perf_enable_masks {
+	/**
+	 * @link: List node used to keep track of the enable masks aggregated by the sampler.
+	 */
+	struct list_head link;
+
+	/** @refs: Number of references taken out on an instantiated enable mask. */
+	struct kref refs;
+
+	/**
+	 * @mask: Array of bitmasks indicating the counters userspace requested, where
+	 *        one bit represents a single counter. Used to build the firmware configuration
+	 *        and ensure that userspace clients obtain only the counters they requested.
+	 */
+	DECLARE_BITMAP(mask, PANTHOR_PERF_EM_BITS)[DRM_PANTHOR_PERF_BLOCK_MAX];
+};
+
+struct panthor_perf_counter_block {
+	struct drm_panthor_perf_block_header header;
+	u64 counters[];
+};
+
+struct panthor_perf_session {
+	DECLARE_BITMAP(state, PANTHOR_PERF_SESSION_MAX);
+
+	/**
+	 * @user_sample_size: The size of a single sample as exposed to userspace. For the sake of
+	 *                    simplicity, the current implementation exposes the same structure
+	 *                    as provided by firmware, after annotating the sample and the blocks,
+	 *                    and zero-extending the counters themselves (to account for in-kernel
+	 *                    accumulation).
+	 *
+	 *                    This may also allow further memory-optimizations of compressing the
+	 *                    sample to provide only requested blocks, if deemed to be worth the
+	 *                    additional complexity.
+	 */
+	size_t user_sample_size;
+
+	/**
+	 * @sample_freq_ns: Period between subsequent sample requests. Zero indicates that
+	 *                  userspace will be responsible for requesting samples.
+	 */
+	u64 sample_freq_ns;
+
+	/** @sample_start_ns: Sample request time, obtained from a monotonic raw clock. */
+	u64 sample_start_ns;
+
+	/**
+	 * @user_data: Opaque handle passed in when starting a session, requesting a sample (for
+	 *             manual sampling sessions only) and when stopping a session. This handle
+	 *             allows the disambiguation of a sample in the ringbuffer.
+	 */
+	u64 user_data;
+
+	/**
+	 * @eventfd: Event file descriptor context used to signal userspace of a new sample
+	 *           being emitted.
+	 */
+	struct eventfd_ctx *eventfd;
+
+	/**
+	 * @enabled_counters: This session's requested counters. Note that these cannot change
+	 *                    for the lifetime of the session.
+	 */
+	struct panthor_perf_enable_masks *enabled_counters;
+
+	/** @ringbuf_slots: Slots in the user-facing ringbuffer. */
+	size_t ringbuf_slots;
+
+	/** @ring_buf: BO for the userspace ringbuffer. */
+	struct drm_gem_object *ring_buf;
+
+	/**
+	 * @control_buf: BO for the insert and extract indices.
+	 */
+	struct drm_gem_object *control_buf;
+
+	/**
+	 * @extract_idx: The extract index is used by userspace to indicate the position of the
+	 *               consumer in the ringbuffer.
+	 */
+	u32 *extract_idx;
+
+	/**
+	 * @insert_idx: The insert index is used by the kernel to indicate the position of the
+	 *              latest sample exposed to userspace.
+	 */
+	u32 *insert_idx;
+
+	/** @samples: The mapping of the @ring_buf into the kernel's VA space. */
+	u8 *samples;
+
+	/**
+	 * @waiting: The list node used by the sampler to track the sessions waiting for a sample.
+	 */
+	struct list_head waiting;
+
+	/**
+	 * @pfile: The panthor file which was used to create a session, used for the postclose
+	 *         handling and to prevent a misconfigured userspace from closing unrelated
+	 *         sessions.
+	 */
+	struct panthor_file *pfile;
+
+	/**
+	 * @ref: Session reference count. The sample delivery to userspace is asynchronous, meaning
+	 *       the lifetime of the session must extend at least until the sample is exposed to
+	 *       userspace.
+	 */
+	struct kref ref;
+};
+
+
 struct panthor_perf {
 	/**
 	 * @block_set: The global counter set configured onto the HW.
@@ -63,39 +209,154 @@ void panthor_perf_info_init(struct panthor_device *ptdev)
 	perf_info->shader_blocks = hweight64(ptdev->gpu_info.shader_present);
 }
 
-int panthor_perf_session_setup(struct panthor_device *ptdev, struct panthor_perf *perf,
-		struct drm_panthor_perf_cmd_setup *setup_args,
-		struct panthor_file *pfile)
+static struct panthor_perf_enable_masks *panthor_perf_em_new(void)
 {
-	return -EOPNOTSUPP;
+	struct panthor_perf_enable_masks *em = kmalloc(sizeof(*em), GFP_KERNEL);
+
+	if (!em)
+		return ERR_PTR(-ENOMEM);
+
+	INIT_LIST_HEAD(&em->link);
+
+	kref_init(&em->refs);
+
+	return em;
 }
 
-int panthor_perf_session_teardown(struct panthor_file *pfile, struct panthor_perf *perf,
-		u32 sid)
+static struct panthor_perf_enable_masks *panthor_perf_create_em(struct drm_panthor_perf_cmd_setup
+		*setup_args)
 {
-	return -EOPNOTSUPP;
+	struct panthor_perf_enable_masks *em = panthor_perf_em_new();
+
+	if (IS_ERR_OR_NULL(em))
+		return em;
+
+	bitmap_from_arr64(em->mask[DRM_PANTHOR_PERF_BLOCK_FW],
+			setup_args->fw_enable_mask, PANTHOR_PERF_EM_BITS);
+	bitmap_from_arr64(em->mask[DRM_PANTHOR_PERF_BLOCK_CSG],
+			setup_args->csg_enable_mask, PANTHOR_PERF_EM_BITS);
+	bitmap_from_arr64(em->mask[DRM_PANTHOR_PERF_BLOCK_CSHW],
+			setup_args->cshw_enable_mask, PANTHOR_PERF_EM_BITS);
+	bitmap_from_arr64(em->mask[DRM_PANTHOR_PERF_BLOCK_TILER],
+			setup_args->tiler_enable_mask, PANTHOR_PERF_EM_BITS);
+	bitmap_from_arr64(em->mask[DRM_PANTHOR_PERF_BLOCK_MEMSYS],
+			setup_args->memsys_enable_mask, PANTHOR_PERF_EM_BITS);
+	bitmap_from_arr64(em->mask[DRM_PANTHOR_PERF_BLOCK_SHADER],
+			setup_args->shader_enable_mask, PANTHOR_PERF_EM_BITS);
+
+	return em;
 }
 
-int panthor_perf_session_start(struct panthor_file *pfile, struct panthor_perf *perf,
-		u32 sid, u64 user_data)
+static void panthor_perf_destroy_em_kref(struct kref *em_kref)
 {
-	return -EOPNOTSUPP;
+	struct panthor_perf_enable_masks *em = container_of(em_kref, typeof(*em), refs);
+
+	if (!list_empty(&em->link))
+		return;
+
+	kfree(em);
 }
 
-int panthor_perf_session_stop(struct panthor_file *pfile, struct panthor_perf *perf,
-		u32 sid, u64 user_data)
+static size_t get_annotated_block_size(size_t counters_per_block)
 {
-		return -EOPNOTSUPP;
+	return struct_size_t(struct panthor_perf_counter_block, counters, counters_per_block);
 }
 
-int panthor_perf_session_sample(struct panthor_file *pfile, struct panthor_perf *perf,
-		u32 sid, u64 user_data)
+static u32 session_read_extract_idx(struct panthor_perf_session *session)
+{
+	/* Userspace will update their own extract index to indicate that a sample is consumed
+	 * from the ringbuffer, and we must ensure we read the latest value.
+	 */
+	return smp_load_acquire(session->extract_idx);
+}
+
+static u32 session_read_insert_idx(struct panthor_perf_session *session)
+{
+	return *session->insert_idx;
+}
+
+static void session_get(struct panthor_perf_session *session)
+{
+	kref_get(&session->ref);
+}
+
+static void session_free(struct kref *ref)
+{
+	struct panthor_perf_session *session = container_of(ref, typeof(*session), ref);
+
+	if (session->samples) {
+		struct iosys_map map = IOSYS_MAP_INIT_VADDR(session->samples);
+
+		drm_gem_vunmap_unlocked(session->ring_buf, &map);
+		drm_gem_object_put(session->ring_buf);
+	}
+
+	if (session->insert_idx && session->extract_idx) {
+		struct iosys_map map = IOSYS_MAP_INIT_VADDR(session->extract_idx);
+
+		drm_gem_vunmap_unlocked(session->control_buf, &map);
+		drm_gem_object_put(session->control_buf);
+	}
+
+	kref_put(&session->enabled_counters->refs, panthor_perf_destroy_em_kref);
+	eventfd_ctx_put(session->eventfd);
+
+	devm_kfree(session->pfile->ptdev->base.dev, session);
+}
+
+static void session_put(struct panthor_perf_session *session)
+{
+	kref_put(&session->ref, session_free);
+}
+
+/**
+ * session_find - Find a session associated with the given session ID and
+ *                panthor_file.
+ * @pfile: Panthor file.
+ * @perf: Panthor perf.
+ * @sid: Session ID.
+ *
+ * The reference count of a valid session is increased to ensure it does not disappear
+ * in the window between the XA lock being dropped and the internal session functions
+ * being called.
+ *
+ * Return: valid session pointer or an ERR_PTR.
+ */
+static struct panthor_perf_session *session_find(struct panthor_file *pfile,
+		struct panthor_perf *perf, u32 sid)
 {
-	return -EOPNOTSUPP;
+	struct panthor_perf_session *session;
 
+	if (!perf)
+		return ERR_PTR(-EINVAL);
+
+	xa_lock(&perf->sessions);
+	session = xa_load(&perf->sessions, sid);
+
+	if (!session || xa_is_err(session)) {
+		xa_unlock(&perf->sessions);
+		return ERR_PTR(-EBADF);
+	}
+
+	if (session->pfile != pfile) {
+		xa_unlock(&perf->sessions);
+		return ERR_PTR(-EINVAL);
+	}
+
+	session_get(session);
+	xa_unlock(&perf->sessions);
+
+	return session;
 }
 
-void panthor_perf_session_destroy(struct panthor_file *pfile, struct panthor_perf *perf) { }
+static size_t session_get_max_sample_size(const struct drm_panthor_perf_info *const info)
+{
+	const size_t block_size = get_annotated_block_size(info->counters_per_block);
+	const size_t block_nr = info->cshw_blocks + info->csg_blocks + info->fw_blocks +
+		info->tiler_blocks + info->memsys_blocks + info->shader_blocks;
+
+	return sizeof(struct drm_panthor_perf_sample_header) + (block_size * block_nr);
+}
 
 /**
  * panthor_perf_init - Initialize the performance counter subsystem.
@@ -130,6 +391,399 @@ int panthor_perf_init(struct panthor_device *ptdev)
 	ptdev->perf = perf;
 
 	return 0;
+
+}
+
+static int session_validate_set(u8 set)
+{
+	if (set > DRM_PANTHOR_PERF_SET_TERTIARY)
+		return -EINVAL;
+
+	if (set == DRM_PANTHOR_PERF_SET_PRIMARY)
+		return 0;
+
+	if (set > DRM_PANTHOR_PERF_SET_PRIMARY)
+		return capable(CAP_PERFMON) ? 0 : -EACCES;
+
+	return -EINVAL;
+}
+
+/**
+ * panthor_perf_session_setup - Create a user-visible session.
+ *
+ * @ptdev: Handle to the panthor device.
+ * @perf: Handle to the perf control structure.
+ * @setup_args: Setup arguments passed in via ioctl.
+ * @pfile: Panthor file associated with the request.
+ *
+ * Creates a new session associated with the session ID returned. When initialized, the
+ * session must explicitly request sampling to start with a successive call to PERF_CONTROL.START.
+ *
+ * Return: non-negative session identifier on success or negative error code on failure.
+ */
+int panthor_perf_session_setup(struct panthor_device *ptdev, struct panthor_perf *perf,
+		struct drm_panthor_perf_cmd_setup *setup_args,
+		struct panthor_file *pfile)
+{
+	struct panthor_perf_session *session;
+	struct drm_gem_object *ringbuffer;
+	struct drm_gem_object *control;
+	const size_t slots = setup_args->sample_slots;
+	struct panthor_perf_enable_masks *em;
+	struct iosys_map rb_map, ctrl_map;
+	size_t user_sample_size;
+	int session_id;
+	int ret;
+
+	ret = session_validate_set(setup_args->block_set);
+	if (ret)
+		return ret;
+
+	session = devm_kzalloc(ptdev->base.dev, sizeof(*session), GFP_KERNEL);
+	if (ZERO_OR_NULL_PTR(session))
+		return -ENOMEM;
+
+	ringbuffer = drm_gem_object_lookup(pfile->drm_file, setup_args->ringbuf_handle);
+	if (!ringbuffer) {
+		ret = -EINVAL;
+		goto cleanup_session;
+	}
+
+	control = drm_gem_object_lookup(pfile->drm_file, setup_args->control_handle);
+	if (!control) {
+		ret = -EINVAL;
+		goto cleanup_ringbuf;
+	}
+
+	user_sample_size = session_get_max_sample_size(&ptdev->perf_info) * slots;
+
+	if (ringbuffer->size != PFN_ALIGN(user_sample_size)) {
+		ret = -ENOMEM;
+		goto cleanup_control;
+	}
+
+	ret = drm_gem_vmap_unlocked(ringbuffer, &rb_map);
+	if (ret)
+		goto cleanup_control;
+
+
+	ret = drm_gem_vmap_unlocked(control, &ctrl_map);
+	if (ret)
+		goto cleanup_ring_map;
+
+	session->eventfd = eventfd_ctx_fdget(setup_args->fd);
+	if (IS_ERR_OR_NULL(session->eventfd)) {
+		ret = PTR_ERR_OR_ZERO(session->eventfd) ?: -EINVAL;
+		goto cleanup_control_map;
+	}
+
+	em = panthor_perf_create_em(setup_args);
+	if (IS_ERR_OR_NULL(em)) {
+		ret = -ENOMEM;
+		goto cleanup_eventfd;
+	}
+
+	INIT_LIST_HEAD(&session->waiting);
+	session->extract_idx = ctrl_map.vaddr;
+	*session->extract_idx = 0;
+	session->insert_idx = session->extract_idx + 1;
+	*session->insert_idx = 0;
+
+	session->samples = rb_map.vaddr;
+
+	/* TODO This will need validation when we support periodic sampling sessions */
+	if (setup_args->sample_freq_ns) {
+		ret = -EOPNOTSUPP;
+		goto cleanup_em;
+	}
+
+	session->sample_freq_ns = setup_args->sample_freq_ns;
+	session->user_sample_size = user_sample_size;
+	session->enabled_counters = em;
+	session->ring_buf = ringbuffer;
+	session->control_buf = control;
+	session->pfile = pfile;
+
+	ret = xa_alloc_cyclic(&perf->sessions, &session_id, session, perf->session_range,
+			&perf->next_session, GFP_KERNEL);
+	if (ret < 0)
+		goto cleanup_em;
+
+	kref_init(&session->ref);
+
+	return session_id;
+
+cleanup_em:
+	kref_put(&em->refs, panthor_perf_destroy_em_kref);
+
+cleanup_eventfd:
+	eventfd_ctx_put(session->eventfd);
+
+cleanup_control_map:
+	drm_gem_vunmap_unlocked(control, &ctrl_map);
+
+cleanup_ring_map:
+	drm_gem_vunmap_unlocked(ringbuffer, &rb_map);
+
+cleanup_control:
+	drm_gem_object_put(control);
+
+cleanup_ringbuf:
+	drm_gem_object_put(ringbuffer);
+
+cleanup_session:
+	devm_kfree(ptdev->base.dev, session);
+
+	return ret;
+}
+
+static int session_stop(struct panthor_perf *perf, struct panthor_perf_session *session,
+		u64 user_data)
+{
+	if (!test_bit(PANTHOR_PERF_SESSION_ACTIVE, session->state))
+		return 0;
+
+	const u32 extract_idx = session_read_extract_idx(session);
+	const u32 insert_idx = session_read_insert_idx(session);
+
+	/* Must have at least one slot remaining in the ringbuffer to sample. */
+	if (WARN_ON_ONCE(!CIRC_SPACE_TO_END(insert_idx, extract_idx, session->ringbuf_slots)))
+		return -EBUSY;
+
+	session->user_data = user_data;
+
+	clear_bit(PANTHOR_PERF_SESSION_ACTIVE, session->state);
+
+	/* TODO Calls to the FW interface will go here in later patches. */
+	return 0;
+}
+
+static int session_start(struct panthor_perf *perf, struct panthor_perf_session *session,
+		u64 user_data)
+{
+	if (test_bit(PANTHOR_PERF_SESSION_ACTIVE, session->state))
+		return 0;
+
+	set_bit(PANTHOR_PERF_SESSION_ACTIVE, session->state);
+
+	/*
+	 * For manual sampling sessions, a start command does not correspond to a sample,
+	 * and so the user data gets discarded.
+	 */
+	if (session->sample_freq_ns)
+		session->user_data = user_data;
+
+	/* TODO Calls to the FW interface will go here in later patches. */
+	return 0;
+}
+
+static int session_sample(struct panthor_perf *perf, struct panthor_perf_session *session,
+		u64 user_data)
+{
+	if (!test_bit(PANTHOR_PERF_SESSION_ACTIVE, session->state))
+		return -EACCES;
+
+	const u32 extract_idx = session_read_extract_idx(session);
+	const u32 insert_idx = session_read_insert_idx(session);
+
+	/* Manual sampling for periodic sessions is forbidden. */
+	if (session->sample_freq_ns)
+		return -EINVAL;
+
+	/*
+	 * Must have at least two slots remaining in the ringbuffer to sample: one for
+	 * the current sample, and one for a stop sample, since a stop command should
+	 * always be acknowledged by taking a final sample and stopping the session.
+	 */
+	if (CIRC_SPACE_TO_END(insert_idx, extract_idx, session->ringbuf_slots) < 2)
+		return -EBUSY;
+
+	session->sample_start_ns = ktime_get_raw_ns();
+	session->user_data = user_data;
+
+	/* TODO Calls to the FW interface will go here in later patches. */
+	return 0;
+}
+
+static int session_destroy(struct panthor_perf *perf, struct panthor_perf_session *session)
+{
+	session_put(session);
+
+	return 0;
+}
+
+static int session_teardown(struct panthor_perf *perf, struct panthor_perf_session *session)
+{
+	if (test_bit(PANTHOR_PERF_SESSION_ACTIVE, session->state))
+		return -EINVAL;
+
+	if (!list_empty(&session->waiting))
+		return -EBUSY;
+
+	return session_destroy(perf, session);
+}
+
+/**
+ * panthor_perf_session_teardown - Teardown the session associated with the @sid.
+ * @pfile: Open panthor file.
+ * @perf: Handle to the perf control structure.
+ * @sid: Session identifier.
+ *
+ * Destroys a stopped session where the last sample has been explicitly consumed
+ * or discarded. Active sessions will be ignored.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int panthor_perf_session_teardown(struct panthor_file *pfile, struct panthor_perf *perf, u32 sid)
+{
+	int err;
+	struct panthor_perf_session *session;
+
+	xa_lock(&perf->sessions);
+	session = __xa_store(&perf->sessions, sid, NULL, GFP_KERNEL);
+
+	if (xa_is_err(session)) {
+		err = xa_err(session);
+		goto restore;
+	}
+
+	if (session->pfile != pfile) {
+		err = -EINVAL;
+		goto restore;
+	}
+
+	session_get(session);
+	xa_unlock(&perf->sessions);
+
+	err = session_teardown(perf, session);
+
+	session_put(session);
+
+	return err;
+
+restore:
+	__xa_store(&perf->sessions, sid, session, GFP_KERNEL);
+	xa_unlock(&perf->sessions);
+
+	return err;
+}
+
+/**
+ * panthor_perf_session_start - Start sampling on a stopped session.
+ * @pfile: Open panthor file.
+ * @perf: Handle to the panthor perf control structure.
+ * @sid: Session identifier for the desired session.
+ * @user_data: An opaque value passed in from userspace.
+ *
+ * A session counts as stopped when it is created or when it is explicitly stopped after being
+ * started. Starting an active session is treated as a no-op.
+ *
+ * The @user_data parameter will be associated with all subsequent samples for a periodic
+ * sampling session and will be ignored for manual sampling ones in favor of the user data
+ * passed in the PERF_CONTROL.SAMPLE ioctl call.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int panthor_perf_session_start(struct panthor_file *pfile, struct panthor_perf *perf,
+		u32 sid, u64 user_data)
+{
+	struct panthor_perf_session *session = session_find(pfile, perf, sid);
+	int err;
+
+	if (IS_ERR_OR_NULL(session))
+		return IS_ERR(session) ? PTR_ERR(session) : -EINVAL;
+
+	err = session_start(perf, session, user_data);
+
+	session_put(session);
+
+	return err;
+}
+
+/**
+ * panthor_perf_session_stop - Stop sampling on an active session.
+ * @pfile: Open panthor file.
+ * @perf: Handle to the panthor perf control structure.
+ * @sid: Session identifier for the desired session.
+ * @user_data: An opaque value passed in from userspace.
+ *
+ * A session counts as active when it has been explicitly started via the PERF_CONTROL.START
+ * ioctl. Stopping a stopped session is treated as a no-op.
+ *
+ * To ensure data is not lost when sampling is stopping, there must always be at least one slot
+ * available for the final automatic sample, and the stop command will be rejected if there is not.
+ *
+ * The @user_data will always be associated with the final sample.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int panthor_perf_session_stop(struct panthor_file *pfile, struct panthor_perf *perf,
+		u32 sid, u64 user_data)
+{
+	struct panthor_perf_session *session = session_find(pfile, perf, sid);
+	int err;
+
+	if (IS_ERR_OR_NULL(session))
+		return IS_ERR(session) ? PTR_ERR(session) : -EINVAL;
+
+	err = session_stop(perf, session, user_data);
+
+	session_put(session);
+
+	return err;
+}
+
+/**
+ * panthor_perf_session_sample - Request a sample on a manual sampling session.
+ * @pfile: Open panthor file.
+ * @perf: Handle to the panthor perf control structure.
+ * @sid: Session identifier for the desired session.
+ * @user_data: An opaque value passed in from userspace.
+ *
+ * Only an active manual sampler is permitted to request samples directly. Failing to meet either
+ * of these conditions will cause the sampling request to be rejected. Requesting a manual sample
+ * with a full ringbuffer will see the request being rejected.
+ *
+ * The @user_data will always be unambiguously associated one-to-one with the resultant sample.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int panthor_perf_session_sample(struct panthor_file *pfile, struct panthor_perf *perf,
+		u32 sid, u64 user_data)
+{
+	struct panthor_perf_session *session = session_find(pfile, perf, sid);
+	int err;
+
+	if (IS_ERR_OR_NULL(session))
+		return IS_ERR(session) ? PTR_ERR(session) : -EINVAL;
+
+	err = session_sample(perf, session, user_data);
+
+	session_put(session);
+
+	return err;
+}
+
+/**
+ * panthor_perf_session_destroy - Destroy a sampling session associated with the @pfile.
+ * @perf: Handle to the panthor perf control structure.
+ * @pfile: The file being closed.
+ *
+ * Must be called when the corresponding userspace process is destroyed and cannot close its
+ * own sessions. As such, we offer no guarantees about data delivery.
+ */
+void panthor_perf_session_destroy(struct panthor_file *pfile, struct panthor_perf *perf)
+{
+	unsigned long sid;
+	struct panthor_perf_session *session;
+
+	xa_for_each(&perf->sessions, sid, session)
+	{
+		if (session->pfile == pfile) {
+			session_destroy(perf, session);
+			xa_erase(&perf->sessions, sid);
+		}
+	}
 }
 
 /**
@@ -146,10 +800,17 @@ void panthor_perf_unplug(struct panthor_device *ptdev)
 	if (!perf)
 		return;
 
-	if (!xa_empty(&perf->sessions))
+	if (!xa_empty(&perf->sessions)) {
+		unsigned long sid;
+		struct panthor_perf_session *session;
+
 		drm_err(&ptdev->base,
 				"Performance counter sessions active when unplugging the driver!");
 
+		xa_for_each(&perf->sessions, sid, session)
+			session_destroy(perf, session);
+	}
+
 	xa_destroy(&perf->sessions);
 
 	devm_kfree(ptdev->base.dev, ptdev->perf);
diff --git a/include/uapi/drm/panthor_drm.h b/include/uapi/drm/panthor_drm.h
index 8a431431da6b..576d3ad46e6d 100644
--- a/include/uapi/drm/panthor_drm.h
+++ b/include/uapi/drm/panthor_drm.h
@@ -458,6 +458,12 @@ enum drm_panthor_perf_block_type {
 
 	/** @DRM_PANTHOR_PERF_BLOCK_SHADER: A shader core counter block. */
 	DRM_PANTHOR_PERF_BLOCK_SHADER,
+
+	/** @DRM_PANTHOR_PERF_BLOCK_LAST: Internal use only. */
+	DRM_PANTHOR_PERF_BLOCK_LAST = DRM_PANTHOR_PERF_BLOCK_SHADER,
+
+	/** @DRM_PANTHOR_PERF_BLOCK_MAX: Internal use only. */
+	DRM_PANTHOR_PERF_BLOCK_MAX = DRM_PANTHOR_PERF_BLOCK_LAST + 1,
 };
 
 /**
@@ -1368,6 +1374,44 @@ struct drm_panthor_perf_control {
 	__u64 pointer;
 };
 
+/**
+ * enum drm_panthor_perf_counter_set - The counter set to be requested from the hardware.
+ *
+ * The hardware supports a single performance counter set at a time, so requesting any set other
+ * than the primary may fail if another process is sampling at the same time.
+ *
+ * If in doubt, the primary counter set has the most commonly used counters and requires no
+ * additional permissions to open.
+ */
+enum drm_panthor_perf_counter_set {
+	/**
+	 * @DRM_PANTHOR_PERF_SET_PRIMARY: The default set configured on the hardware.
+	 *
+	 * This is the only set for which all counters in all blocks are defined.
+	 */
+	DRM_PANTHOR_PERF_SET_PRIMARY,
+
+	/**
+	 * @DRM_PANTHOR_PERF_SET_SECONDARY: The secondary performance counter set.
+	 *
+	 * Some blocks may not have any defined counters for this set, and the block will
+	 * have the UNAVAILABLE block state permanently set in the block header.
+	 *
+	 * Accessing this set requires the calling process to have the CAP_PERFMON capability.
+	 */
+	DRM_PANTHOR_PERF_SET_SECONDARY,
+
+	/**
+	 * @DRM_PANTHOR_PERF_SET_TERTIARY: The tertiary performance counter set.
+	 *
+	 * Some blocks may not have any defined counters for this set, and the block will have
+	 * the UNAVAILABLE block state permanently set in the block header. Note that the
+	 * tertiary set has the fewest defined counter blocks.
+	 *
+	 * Accessing this set requires the calling process to have the CAP_PERFMON capability.
+	 */
+	DRM_PANTHOR_PERF_SET_TERTIARY,
+};
 
 /**
  * struct drm_panthor_perf_cmd_setup - Arguments passed to DRM_PANTHOR_IOCTL_PERF_CONTROL
@@ -1375,13 +1419,17 @@ struct drm_panthor_perf_control {
  */
 struct drm_panthor_perf_cmd_setup {
 	/**
-	 * @block_set: Set of performance counter blocks.
+	 * @block_set: Set of performance counter blocks, member of
+	 *             enum drm_panthor_perf_block_set.
 	 *
 	 * This is a global configuration and only one set can be active at a time. If
 	 * another client has already requested a counter set, any further requests
 	 * for a different counter set will fail and return an -EBUSY.
 	 *
 	 * If the requested set does not exist, the request will fail and return an -EINVAL.
+	 *
+	 * Some sets have additional requirements to be enabled, and the setup request will
+	 * fail with an -EACCES if these requirements are not satisfied.
 	 */
 	__u8 block_set;
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC v2 6/8] drm/panthor: Implement the counter sampler and sample handling
  2024-12-11 16:50 [RFC v2 0/8] drm/panthor: Add performance counters with manual sampling mode Lukas Zapolskas
                   ` (4 preceding siblings ...)
  2024-12-11 16:50 ` [RFC v2 5/8] drm/panthor: Introduce sampling sessions to handle userspace clients Lukas Zapolskas
@ 2024-12-11 16:50 ` Lukas Zapolskas
  2025-01-27 16:53   ` Adrián Larumbe
  2025-01-27 21:09   ` Adrián Larumbe
  2024-12-11 16:50 ` [RFC v2 7/8] drm/panthor: Add suspend/resume handling for the performance counters Lukas Zapolskas
  2024-12-11 16:50 ` [RFC v2 8/8] drm/panthor: Expose the panthor perf ioctls Lukas Zapolskas
  7 siblings, 2 replies; 28+ messages in thread
From: Lukas Zapolskas @ 2024-12-11 16:50 UTC (permalink / raw)
  To: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	Adrián Larumbe
  Cc: dri-devel, linux-kernel, Mihail Atanassov, nd, Lukas Zapolskas

From: Adrián Larumbe <adrian.larumbe@collabora.com>

The sampler aggregates counter and set requests coming from userspace
and mediates interactions with the FW interface, to ensure that user
sessions cannot override the global configuration.

From the top-level interface, the sampler supports two different types
of samples: clearing samples and regular samples. Clearing samples are
a special sample type that allow for the creation of a sampling
baseline, to ensure that a session does not obtain counter data from
before its creation.

Upon receipt of a relevant interrupt, corresponding to one of the three
relevant bits of the GLB_ACK register, the sampler takes any samples
that occurred, and, based on the insert and extract indices, accumulates
them to an internal storage buffer after zero-extending the counters
from the 32-bit counters emitted by the hardware to 64-bit counters
for internal accumulation.

When the performance counters are enabled, the FW ensures no counter
data is lost when entering and leaving non-counting regions by producing
automatic samples that do not correspond to a GLB_REQ.PRFCNT_SAMPLE
request. Such regions may be per hardware unit, such as when a shader
core powers down, or global. Most of these events do not directly
correspond to session sample requests, so any intermediary counter data
must be stored into a temporary accumulation buffer.

If there are sessions waiting for a sample, this accumulated buffer will
be taken, and emitted for each waiting client. During this phase,
information like the timestamps of sample request and sample emission,
type of the counter block and block index annotations are added to the
sample header and block headers. If no sessions are waiting for
a sample, this accumulation buffer is kept until the next time a sample
is requested.

Special handling is needed for the PRFCNT_OVERFLOW interrupt, which is
an indication that the internal sample handling rate was insufficient.

The sampler also maintains a buffer descriptor indicating the structure
of a firmware sample, since neither the firmware nor the hardware give
any indication of the sample structure, only that it is composed out of
three parts:
 - the metadata is an optional initial counter block on supporting
   firmware versions that contains a single counter, indicating the
   reason a sample was taken when entering global non-counting regions.
   This is used to provide coarse-grained information about why a sample
   was taken to userspace, to help userspace interpret variations in
   counter magnitude.
 - the firmware component of the sample is composed out of a global
   firmware counter block on supporting firmware versions.
 - the hardware component is the most sizeable of the three and contains
   a block of counters for each of the underlying hardware resources. It
   has a fixed structure that is described in the architecture
   specification, and contains the command stream hardware block(s), the
   tiler block(s), the MMU and L2 blocks (collectively named the memsys
   blocks) and the shader core blocks, in that order.
The structure of this buffer changes based on the firmware and hardware
combination, but is constant on a single system.

Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
Co-developed-by: Lukas Zapolskas <lukas.zapolskas@arm.com>
Signed-off-by: Lukas Zapolskas <lukas.zapolskas@arm.com>
---
 drivers/gpu/drm/panthor/panthor_fw.c   |   5 +
 drivers/gpu/drm/panthor/panthor_fw.h   |   9 +-
 drivers/gpu/drm/panthor/panthor_perf.c | 882 ++++++++++++++++++++++++-
 drivers/gpu/drm/panthor/panthor_perf.h |   2 +
 include/uapi/drm/panthor_drm.h         |   5 +-
 5 files changed, 892 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/panthor/panthor_fw.c b/drivers/gpu/drm/panthor/panthor_fw.c
index e9530d1d9781..cd68870ced18 100644
--- a/drivers/gpu/drm/panthor/panthor_fw.c
+++ b/drivers/gpu/drm/panthor/panthor_fw.c
@@ -1000,9 +1000,12 @@ static void panthor_fw_init_global_iface(struct panthor_device *ptdev)
 
 	/* Enable interrupts we care about. */
 	glb_iface->input->ack_irq_mask = GLB_CFG_ALLOC_EN |
+					 GLB_PERFCNT_SAMPLE |
 					 GLB_PING |
 					 GLB_CFG_PROGRESS_TIMER |
 					 GLB_CFG_POWEROFF_TIMER |
+					 GLB_PERFCNT_THRESHOLD |
+					 GLB_PERFCNT_OVERFLOW |
 					 GLB_IDLE_EN |
 					 GLB_IDLE;
 
@@ -1031,6 +1034,8 @@ static void panthor_job_irq_handler(struct panthor_device *ptdev, u32 status)
 		return;
 
 	panthor_sched_report_fw_events(ptdev, status);
+
+	panthor_perf_report_irq(ptdev, status);
 }
 PANTHOR_IRQ_HANDLER(job, JOB, panthor_job_irq_handler);
 
diff --git a/drivers/gpu/drm/panthor/panthor_fw.h b/drivers/gpu/drm/panthor/panthor_fw.h
index db10358e24bb..7ed34d2de8b4 100644
--- a/drivers/gpu/drm/panthor/panthor_fw.h
+++ b/drivers/gpu/drm/panthor/panthor_fw.h
@@ -199,9 +199,10 @@ struct panthor_fw_global_control_iface {
 	u32 group_num;
 	u32 group_stride;
 #define GLB_PERFCNT_FW_SIZE(x) ((((x) >> 16) << 8))
+#define GLB_PERFCNT_HW_SIZE(x) (((x) & GENMASK(15, 0)) << 8)
 	u32 perfcnt_size;
 	u32 instr_features;
-#define PERFCNT_FEATURES_MD_SIZE(x) ((x) & GENMASK(3, 0))
+#define PERFCNT_FEATURES_MD_SIZE(x) (((x) & GENMASK(3, 0)) << 8)
 	u32 perfcnt_features;
 };
 
@@ -211,7 +212,7 @@ struct panthor_fw_global_input_iface {
 #define GLB_CFG_ALLOC_EN			BIT(2)
 #define GLB_CFG_POWEROFF_TIMER			BIT(3)
 #define GLB_PROTM_ENTER				BIT(4)
-#define GLB_PERFCNT_EN				BIT(5)
+#define GLB_PERFCNT_ENABLE			BIT(5)
 #define GLB_PERFCNT_SAMPLE			BIT(6)
 #define GLB_COUNTER_EN				BIT(7)
 #define GLB_PING				BIT(8)
@@ -234,7 +235,6 @@ struct panthor_fw_global_input_iface {
 	u32 doorbell_req;
 	u32 reserved1;
 	u32 progress_timer;
-
 #define GLB_TIMER_VAL(x)			((x) & GENMASK(30, 0))
 #define GLB_TIMER_SOURCE_GPU_COUNTER		BIT(31)
 	u32 poweroff_timer;
@@ -244,6 +244,9 @@ struct panthor_fw_global_input_iface {
 	u64 perfcnt_base;
 	u32 perfcnt_extract;
 	u32 reserved3[3];
+#define GLB_PRFCNT_CONFIG_SIZE(x) ((x) & GENMASK(7, 0))
+#define GLB_PRFCNT_CONFIG_SET(x) (((x) & GENMASK(1, 0)) << 8)
+#define GLB_PRFCNT_METADATA_ENABLE BIT(10)
 	u32 perfcnt_config;
 	u32 perfcnt_csg_select;
 	u32 perfcnt_fw_enable;
diff --git a/drivers/gpu/drm/panthor/panthor_perf.c b/drivers/gpu/drm/panthor/panthor_perf.c
index 42d8b6f8c45d..d62d97c448da 100644
--- a/drivers/gpu/drm/panthor/panthor_perf.c
+++ b/drivers/gpu/drm/panthor/panthor_perf.c
@@ -15,7 +15,9 @@
 
 #include "panthor_device.h"
 #include "panthor_fw.h"
+#include "panthor_gem.h"
 #include "panthor_gpu.h"
+#include "panthor_mmu.h"
 #include "panthor_perf.h"
 #include "panthor_regs.h"
 
@@ -26,6 +28,41 @@
  */
 #define PANTHOR_PERF_EM_BITS (BITS_PER_TYPE(u64) * 2)
 
+/**
+ * PANTHOR_PERF_FW_RINGBUF_SLOTS - Number of slots allocated for individual samples when configuring
+ *                                 the performance counter ring buffer to firmware. This can be
+ *                                 used to reduce memory consumption on low memory systems.
+ */
+#define PANTHOR_PERF_FW_RINGBUF_SLOTS (32)
+
+/**
+ * PANTHOR_CTR_TIMESTAMP_LO - The first architecturally mandated counter of every block type
+ *                            contains the low 32-bits of the TIMESTAMP value.
+ */
+#define PANTHOR_CTR_TIMESTAMP_LO (0)
+
+/**
+ * PANTHOR_CTR_TIMESTAMP_HI - The register offset containinig the high 32-bits of the TIMESTAMP
+ *                            value.
+ */
+#define PANTHOR_CTR_TIMESTAMP_HI (1)
+
+/**
+ * PANTHOR_CTR_PRFCNT_EN - The register offset containing the enable mask for the enabled counters
+ *                         that were written to memory.
+ */
+#define PANTHOR_CTR_PRFCNT_EN (2)
+
+/**
+ * PANTHOR_HEADER_COUNTERS - The first four counters of every block type are architecturally
+ *                           defined to be equivalent. The fourth counter is always reserved,
+ *                           and should be zero and as such, does not have a separate define.
+ *
+ *                           These are the only four counters that are the same between different
+ *                           blocks and are consistent between different architectures.
+ */
+#define PANTHOR_HEADER_COUNTERS (4)
+
 /**
  * enum panthor_perf_session_state - Session state bits.
  */
@@ -158,6 +195,135 @@ struct panthor_perf_session {
 	struct kref ref;
 };
 
+struct panthor_perf_buffer_descriptor {
+	/**
+	 * @block_size: The size of a single block in the FW ring buffer, equal to
+	 *              sizeof(u32) * counters_per_block.
+	 */
+	size_t block_size;
+
+	/**
+	 * @buffer_size: The total size of the buffer, equal to (#hardware blocks +
+	 *               #firmware blocks) * block_size.
+	 */
+	size_t buffer_size;
+
+	/**
+	 * @available_blocks: Bitmask indicating the blocks supported by the hardware and firmware
+	 *                    combination. Note that this can also include blocks that will not
+	 *                    be exposed to the user.
+	 */
+	DECLARE_BITMAP(available_blocks, DRM_PANTHOR_PERF_BLOCK_MAX);
+	struct {
+		/** @offset: Starting offset of a block of type @type in the FW ringbuffer. */
+		size_t offset;
+
+		/** @type: Type of the blocks between @blocks[i].offset and @blocks[i+1].offset. */
+		enum drm_panthor_perf_block_type type;
+
+		/** @block_count: Number of blocks of the given @type, starting at @offset. */
+		size_t block_count;
+	} blocks[DRM_PANTHOR_PERF_BLOCK_MAX];
+};
+
+
+/**
+ * struct panthor_perf_sampler - Interface to de-multiplex firmware interaction and handle
+ *                               global interactions.
+ */
+struct panthor_perf_sampler {
+	/** @sample_requested: A sample has been requested. */
+	bool sample_requested;
+
+	/**
+	 * @last_ack: Temporarily storing the last GLB_ACK status. Without storing this data,
+	 *            we do not know whether a toggle bit has been handled.
+	 */
+	u32 last_ack;
+
+	/**
+	 * @enabled_clients: The number of clients concurrently requesting samples. To ensure that
+	 *                   one client cannot deny samples to another, we must ensure that clients
+	 *                   are effectively reference counted.
+	 */
+	atomic_t enabled_clients;
+
+	/**
+	 * @sample_handled: Synchronization point between the interrupt bottom half and the
+	 *                  main sampler interface. Must be re-armed solely on a new request
+	 *                  coming to the sampler.
+	 */
+	struct completion sample_handled;
+
+	/** @rb: Kernel BO in the FW AS containing the sample ringbuffer. */
+	struct panthor_kernel_bo *rb;
+
+	/**
+	 * @sample_size: The size of a single sample in the FW ringbuffer. This is computed using
+	 *               the hardware configuration according to the architecture specification,
+	 *               and cross-validated against the sample size reported by FW to ensure
+	 *               a consistent view of the buffer size.
+	 */
+	size_t sample_size;
+
+	/**
+	 * @sample_slots: Number of slots for samples in the FW ringbuffer. Could be static,
+	 *		  but may be useful to customize for low-memory devices.
+	 */
+	size_t sample_slots;
+
+	/**
+	 * @config_lock: Lock serializing changes to the global counter configuration, including
+	 *               requested counter set and the counters themselves.
+	 */
+	struct mutex config_lock;
+
+	/**
+	 * @ems: List of enable maps of the active sessions. When removing a session, the number
+	 *       of requested counters may decrease, and the union of enable masks from the multiple
+	 *       sessions does not provide sufficient information to reconstruct the previous
+	 *       enable mask.
+	 */
+	struct list_head ems;
+
+	/** @em: Combined enable mask for all of the active sessions. */
+	struct panthor_perf_enable_masks *em;
+
+	/**
+	 * @desc: Buffer descriptor for a sample in the FW ringbuffer. Note that this buffer
+	 *        at current time does some interesting things with the zeroth block type. On
+	 *        newer FW revisions, the first counter block of the sample is the METADATA block,
+	 *        which contains a single value indicating the reason the sample was taken (if
+	 *        any). This block must not be exposed to userspace, as userspace does not
+	 *        have sufficient context to interpret it. As such, this block type is not
+	 *        added to the uAPI, but we still use it in the kernel.
+	 */
+	struct panthor_perf_buffer_descriptor desc;
+
+	/**
+	 * @sample: Pointer to an upscaled and annotated sample that may be emitted to userspace.
+	 *          This is used both as an intermediate buffer to do the zero-extension of the
+	 *          32-bit counters to 64-bits and as a storage buffer in case the sampler
+	 *          requests an additional sample that was not requested by any of the top-level
+	 *          sessions (for instance, when changing the enable masks).
+	 */
+	u8 *sample;
+
+	/** @sampler_lock: Lock used to guard the list of sessions requesting samples. */
+	struct mutex sampler_lock;
+
+	/** @sampler_list: List of sessions requesting samples. */
+	struct list_head sampler_list;
+
+	/** @set_config: The set that will be configured onto the hardware. */
+	u8 set_config;
+
+	/**
+	 * @ptdev: Backpointer to the Panthor device, needed to ring the global doorbell and
+	 *         interface with FW.
+	 */
+	struct panthor_device *ptdev;
+};
 
 struct panthor_perf {
 	/**
@@ -175,6 +341,9 @@ struct panthor_perf {
 	 * @sessions: Global map of sessions, accessed by their ID.
 	 */
 	struct xarray sessions;
+
+	/** @sampler: FW control interface. */
+	struct panthor_perf_sampler sampler;
 };
 
 /**
@@ -247,6 +416,23 @@ static struct panthor_perf_enable_masks *panthor_perf_create_em(struct drm_panth
 	return em;
 }
 
+static void panthor_perf_em_add(struct panthor_perf_enable_masks *dst_em,
+		const struct panthor_perf_enable_masks *const src_em)
+{
+	size_t i = 0;
+
+	for (i = DRM_PANTHOR_PERF_BLOCK_FW; i <= DRM_PANTHOR_PERF_BLOCK_LAST; i++)
+		bitmap_or(dst_em->mask[i], dst_em->mask[i], src_em->mask[i], PANTHOR_PERF_EM_BITS);
+}
+
+static void panthor_perf_em_zero(struct panthor_perf_enable_masks *em)
+{
+	size_t i = 0;
+
+	for (i = DRM_PANTHOR_PERF_BLOCK_FW; i <= DRM_PANTHOR_PERF_BLOCK_LAST; i++)
+		bitmap_zero(em->mask[i], PANTHOR_PERF_EM_BITS);
+}
+
 static void panthor_perf_destroy_em_kref(struct kref *em_kref)
 {
 	struct panthor_perf_enable_masks *em = container_of(em_kref, typeof(*em), refs);
@@ -270,6 +456,12 @@ static u32 session_read_extract_idx(struct panthor_perf_session *session)
 	return smp_load_acquire(session->extract_idx);
 }
 
+static void session_write_insert_idx(struct panthor_perf_session *session, u32 idx)
+{
+	/* Userspace needs the insert index to know where to look for the sample. */
+	smp_store_release(session->insert_idx, idx);
+}
+
 static u32 session_read_insert_idx(struct panthor_perf_session *session)
 {
 	return *session->insert_idx;
@@ -349,6 +541,70 @@ static struct panthor_perf_session *session_find(struct panthor_file *pfile,
 	return session;
 }
 
+static u32 compress_enable_mask(unsigned long *const src)
+{
+	size_t i;
+	u32 result = 0;
+	unsigned long clump;
+
+	for_each_set_clump8(i, clump, src, PANTHOR_PERF_EM_BITS) {
+		const unsigned long shift = div_u64(i, 4);
+
+		result |= !!(clump & GENMASK(3, 0)) << shift;
+		result |= !!(clump & GENMASK(7, 4)) << (shift + 1);
+	}
+
+	return result;
+}
+
+static void expand_enable_mask(u32 em, unsigned long *const dst)
+{
+	size_t i;
+	DECLARE_BITMAP(emb, BITS_PER_TYPE(u32));
+
+	bitmap_from_arr32(emb, &em, BITS_PER_TYPE(u32));
+
+	for_each_set_bit(i, emb, BITS_PER_TYPE(u32))
+		bitmap_set(dst, i * 4, 4);
+}
+
+/**
+ * panthor_perf_block_data - Identify the block index and type based on the offset.
+ *
+ * @desc:   FW buffer descriptor.
+ * @offset: The current offset being examined.
+ * @idx:    Pointer to an output index.
+ * @type:   Pointer to an output block type.
+ *
+ * To disambiguate different types of blocks as well as different blocks of the same type,
+ * the offset into the FW ringbuffer is used to uniquely identify the block being considered.
+ *
+ * In the future, this is a good time to identify whether a block will be empty,
+ * allowing us to short-circuit its processing after emitting header information.
+ */
+static void panthor_perf_block_data(struct panthor_perf_buffer_descriptor *const desc,
+		size_t offset, u32 *idx, enum drm_panthor_perf_block_type *type)
+{
+	unsigned long id;
+
+	for_each_set_bit(id, desc->available_blocks, DRM_PANTHOR_PERF_BLOCK_LAST) {
+		const size_t block_start = desc->blocks[id].offset;
+		const size_t block_count = desc->blocks[id].block_count;
+		const size_t block_end = desc->blocks[id].offset +
+			desc->block_size * block_count;
+
+		if (!block_count)
+			continue;
+
+		if ((offset >= block_start) && (offset < block_end)) {
+			*type = desc->blocks[id].type;
+			*idx = div_u64(offset - desc->blocks[id].offset, desc->block_size);
+
+			return;
+		}
+	}
+}
+
 static size_t session_get_max_sample_size(const struct drm_panthor_perf_info *const info)
 {
 	const size_t block_size = get_annotated_block_size(info->counters_per_block);
@@ -358,6 +614,520 @@ static size_t session_get_max_sample_size(const struct drm_panthor_perf_info *co
 	return sizeof(struct drm_panthor_perf_sample_header) + (block_size * block_nr);
 }
 
+static u32 panthor_perf_handle_sample(struct panthor_device *ptdev, u32 extract_idx, u32 insert_idx)
+{
+	struct panthor_perf *perf = ptdev->perf;
+	struct panthor_perf_sampler *sampler = &ptdev->perf->sampler;
+	const size_t ann_block_size =
+		get_annotated_block_size(ptdev->perf_info.counters_per_block);
+	u32 i;
+
+	for (i = extract_idx; i != insert_idx; i = (i + 1) % sampler->sample_slots) {
+		u8 *fw_sample = (u8 *)sampler->rb->kmap + i * sampler->sample_size;
+
+		for (size_t fw_off = 0, ann_off = sizeof(struct drm_panthor_perf_sample_header);
+				fw_off < sampler->desc.buffer_size;
+				fw_off += sampler->desc.block_size)
+
+		{
+			u32 idx;
+			enum drm_panthor_perf_block_type type;
+			DECLARE_BITMAP(expanded_em, PANTHOR_PERF_EM_BITS);
+			struct panthor_perf_counter_block *blk =
+				(typeof(blk))(perf->sampler.sample + ann_off);
+			const u32 prfcnt_en = blk->counters[PANTHOR_CTR_PRFCNT_EN];
+
+			panthor_perf_block_data(&sampler->desc, fw_off, &idx, &type);
+
+			/**
+			 * TODO Data from the metadata block must be used to populate the
+			 * block state information.
+			 */
+			if (type == DRM_PANTHOR_PERF_BLOCK_METADATA)
+				continue;
+
+			expand_enable_mask(prfcnt_en, expanded_em);
+
+			blk->header = (struct drm_panthor_perf_block_header) {
+				.clock = 0,
+				.block_idx = idx,
+				.block_type = type,
+				.block_states = DRM_PANTHOR_PERF_BLOCK_STATE_UNKNOWN
+			};
+			bitmap_to_arr64(blk->header.enable_mask, expanded_em, PANTHOR_PERF_EM_BITS);
+
+			u32 *block = (u32 *)(fw_sample + fw_off);
+
+			/*
+			 * The four header counters must be treated differently, because they are
+			 * not additive. For the fourth, the assignment does not matter, as it
+			 * is reserved and should be zero.
+			 */
+			blk->counters[PANTHOR_CTR_TIMESTAMP_LO] = block[PANTHOR_CTR_TIMESTAMP_LO];
+			blk->counters[PANTHOR_CTR_TIMESTAMP_HI] = block[PANTHOR_CTR_TIMESTAMP_HI];
+			blk->counters[PANTHOR_CTR_PRFCNT_EN] = block[PANTHOR_CTR_PRFCNT_EN];
+
+			for (size_t k = PANTHOR_HEADER_COUNTERS;
+					k < ptdev->perf_info.counters_per_block;
+					k++)
+				blk->counters[k] += block[k];
+
+			ann_off += ann_block_size;
+		}
+	}
+
+	return i;
+}
+
+static size_t panthor_perf_get_fw_reported_size(struct panthor_device *ptdev)
+{
+	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
+
+	size_t fw_size = GLB_PERFCNT_FW_SIZE(glb_iface->control->perfcnt_size);
+	size_t hw_size = GLB_PERFCNT_HW_SIZE(glb_iface->control->perfcnt_size);
+	size_t md_size = PERFCNT_FEATURES_MD_SIZE(glb_iface->control->perfcnt_features);
+
+	return md_size + fw_size + hw_size;
+}
+
+#define PANTHOR_PERF_SET_BLOCK_DESC_DATA(desc, typ, blk_count, offset) \
+	({ \
+		(desc)->blocks[(typ)].type = (typ); \
+		(desc)->blocks[(typ)].offset = (offset); \
+		(desc)->blocks[(typ)].block_count = (blk_count);  \
+		if ((blk_count))                                    \
+			set_bit((typ), (desc)->available_blocks); \
+		(offset) + ((desc)->block_size) * (blk_count); \
+	 })
+
+static int panthor_perf_setup_fw_buffer_desc(struct panthor_device *ptdev,
+		struct panthor_perf_sampler *sampler)
+{
+	const struct drm_panthor_perf_info *const info = &ptdev->perf_info;
+	const size_t block_size = info->counters_per_block * sizeof(u32);
+	struct panthor_perf_buffer_descriptor *desc = &sampler->desc;
+	const size_t fw_sample_size = panthor_perf_get_fw_reported_size(ptdev);
+	size_t offset = 0;
+
+	desc->block_size = block_size;
+
+	for (enum drm_panthor_perf_block_type type = 0; type < DRM_PANTHOR_PERF_BLOCK_MAX; type++) {
+		switch (type) {
+		case DRM_PANTHOR_PERF_BLOCK_METADATA:
+			if (info->flags & DRM_PANTHOR_PERF_BLOCK_STATES_SUPPORT)
+				offset = PANTHOR_PERF_SET_BLOCK_DESC_DATA(desc,
+					DRM_PANTHOR_PERF_BLOCK_METADATA, 1, offset);
+			break;
+		case DRM_PANTHOR_PERF_BLOCK_FW:
+			offset = PANTHOR_PERF_SET_BLOCK_DESC_DATA(desc, type, info->fw_blocks,
+					offset);
+			break;
+		case DRM_PANTHOR_PERF_BLOCK_CSG:
+			offset = PANTHOR_PERF_SET_BLOCK_DESC_DATA(desc, type, info->csg_blocks,
+					offset);
+			break;
+		case DRM_PANTHOR_PERF_BLOCK_CSHW:
+			offset = PANTHOR_PERF_SET_BLOCK_DESC_DATA(desc, type, info->cshw_blocks,
+					offset);
+			break;
+		case DRM_PANTHOR_PERF_BLOCK_TILER:
+			offset = PANTHOR_PERF_SET_BLOCK_DESC_DATA(desc, type, info->tiler_blocks,
+					offset);
+			break;
+		case DRM_PANTHOR_PERF_BLOCK_MEMSYS:
+			offset = PANTHOR_PERF_SET_BLOCK_DESC_DATA(desc, type, info->memsys_blocks,
+					offset);
+			break;
+		case DRM_PANTHOR_PERF_BLOCK_SHADER:
+			offset = PANTHOR_PERF_SET_BLOCK_DESC_DATA(desc, type, info->shader_blocks,
+					offset);
+			break;
+		case DRM_PANTHOR_PERF_BLOCK_MAX:
+			drm_WARN_ON_ONCE(&ptdev->base,
+					"DRM_PANTHOR_PERF_BLOCK_MAX should be unreachable!");
+			break;
+		}
+	}
+
+	/* Computed size is not the same as the reported size, so we should not proceed in
+	 * initializing the sampling session.
+	 */
+	if (offset != fw_sample_size)
+		return -EINVAL;
+
+	desc->buffer_size = offset;
+
+	return 0;
+}
+
+static int panthor_perf_fw_stop_sampling(struct panthor_device *ptdev)
+{
+	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
+	u32 acked;
+	int ret;
+
+	if (~READ_ONCE(glb_iface->input->req) & GLB_PERFCNT_ENABLE)
+		return 0;
+
+	panthor_fw_update_reqs(glb_iface, req, 0, GLB_PERFCNT_ENABLE);
+	gpu_write(ptdev, CSF_DOORBELL(CSF_GLB_DOORBELL_ID), 1);
+	ret = panthor_fw_glb_wait_acks(ptdev, GLB_PERFCNT_ENABLE, &acked, 100);
+	if (ret)
+		drm_warn(&ptdev->base, "Could not disable performance counters");
+
+	return ret;
+}
+
+static int panthor_perf_fw_start_sampling(struct panthor_device *ptdev)
+{
+	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
+	u32 acked;
+	int ret;
+
+	if (READ_ONCE(glb_iface->input->req) & GLB_PERFCNT_ENABLE)
+		return 0;
+
+	panthor_fw_update_reqs(glb_iface, req, GLB_PERFCNT_ENABLE, GLB_PERFCNT_ENABLE);
+	gpu_write(ptdev, CSF_DOORBELL(CSF_GLB_DOORBELL_ID), 1);
+	ret = panthor_fw_glb_wait_acks(ptdev, GLB_PERFCNT_ENABLE, &acked, 100);
+	if (ret)
+		drm_warn(&ptdev->base, "Could not enable performance counters");
+
+	return ret;
+}
+
+static void panthor_perf_fw_write_em(struct panthor_perf_sampler *sampler,
+		struct panthor_perf_enable_masks *em)
+{
+	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(sampler->ptdev);
+	u32 perfcnt_config;
+
+	glb_iface->input->perfcnt_csf_enable =
+		compress_enable_mask(em->mask[DRM_PANTHOR_PERF_BLOCK_CSHW]);
+	glb_iface->input->perfcnt_shader_enable =
+		compress_enable_mask(em->mask[DRM_PANTHOR_PERF_BLOCK_SHADER]);
+	glb_iface->input->perfcnt_mmu_l2_enable =
+		compress_enable_mask(em->mask[DRM_PANTHOR_PERF_BLOCK_MEMSYS]);
+	glb_iface->input->perfcnt_tiler_enable =
+		compress_enable_mask(em->mask[DRM_PANTHOR_PERF_BLOCK_TILER]);
+	glb_iface->input->perfcnt_fw_enable =
+		compress_enable_mask(em->mask[DRM_PANTHOR_PERF_BLOCK_FW]);
+	glb_iface->input->perfcnt_csg_enable =
+		compress_enable_mask(em->mask[DRM_PANTHOR_PERF_BLOCK_CSG]);
+
+	perfcnt_config = GLB_PRFCNT_CONFIG_SIZE(PANTHOR_PERF_FW_RINGBUF_SLOTS);
+	perfcnt_config |= GLB_PRFCNT_CONFIG_SET(sampler->set_config);
+	glb_iface->input->perfcnt_config = perfcnt_config;
+
+	/**
+	 * The spec mandates that the host zero the PRFCNT_EXTRACT register before an enable
+	 * operation, and each (re-)enable will require an enable-disable pair to program
+	 * the new changes onto the FW interface.
+	 */
+	WRITE_ONCE(glb_iface->input->perfcnt_extract, 0);
+}
+
+static void session_populate_sample_header(struct panthor_perf_session *session,
+		struct drm_panthor_perf_sample_header *hdr)
+{
+	hdr->block_set = 0;
+	hdr->user_data = session->user_data;
+	hdr->timestamp_start_ns = session->sample_start_ns;
+	/**
+	 * TODO This should be changed to use the GPU clocks and the TIMESTAMP register,
+	 * when support is added.
+	 */
+	hdr->timestamp_end_ns = ktime_get_raw_ns();
+}
+
+/**
+ * session_patch_sample - Update the PRFCNT_EN header counter and the counters exposed to the
+ *                        userspace client to only contain requested counters.
+ *
+ * @ptdev: Panthor device
+ * @session: Perf session
+ * @sample: Starting offset of the sample in the userspace mapping.
+ *
+ * The hardware supports counter selection at the granularity of 1 bit per 4 counters, and there
+ * is a single global FW frontend to program the counter requests from multiple sessions. This may
+ * lead to a large disparity between the requested and provided counters for an individual client.
+ * To remove this cross-talk, we patch out the counters that have not been requested by this
+ * session and update the PRFCNT_EN, the header counter containing a bitmask of enabled counters,
+ * accordingly.
+ */
+static void session_patch_sample(struct panthor_device *ptdev,
+		struct panthor_perf_session *session, u8 *sample)
+{
+	const struct drm_panthor_perf_info *const perf_info = &ptdev->perf_info;
+
+	const size_t block_size = get_annotated_block_size(perf_info->counters_per_block);
+	const size_t sample_size = session_get_max_sample_size(perf_info);
+
+	for (size_t i = 0; i < sample_size; i += block_size) {
+		size_t ctr_idx;
+		DECLARE_BITMAP(em_diff, PANTHOR_PERF_EM_BITS);
+		struct panthor_perf_counter_block *blk = (typeof(blk))(sample + block_size);
+		enum drm_panthor_perf_block_type type = blk->header.block_type;
+		unsigned long *blk_em = session->enabled_counters->mask[type];
+
+		bitmap_from_arr64(em_diff, blk->header.enable_mask, PANTHOR_PERF_EM_BITS);
+
+		bitmap_andnot(em_diff, em_diff, blk_em, PANTHOR_PERF_EM_BITS);
+
+		for_each_set_bit(ctr_idx, em_diff, PANTHOR_PERF_EM_BITS)
+			blk->counters[ctr_idx] = 0;
+
+		bitmap_to_arr64(blk->header.enable_mask, blk_em, PANTHOR_PERF_EM_BITS);
+	}
+}
+
+static int session_copy_sample(struct panthor_device *ptdev,
+		struct panthor_perf_session *session)
+{
+	struct panthor_perf *perf = ptdev->perf;
+	const size_t sample_size = session_get_max_sample_size(&ptdev->perf_info);
+	const u32 insert_idx = session_read_insert_idx(session);
+	const u32 extract_idx = session_read_extract_idx(session);
+	u8 *new_sample;
+
+	if (!CIRC_SPACE_TO_END(insert_idx, extract_idx, session->ringbuf_slots))
+		return -ENOSPC;
+
+	new_sample = session->samples + extract_idx * sample_size;
+
+	memcpy(new_sample, perf->sampler.sample, sample_size);
+
+	session_populate_sample_header(session,
+			(struct drm_panthor_perf_sample_header *)new_sample);
+
+	session_patch_sample(ptdev, session, new_sample +
+			sizeof(struct drm_panthor_perf_sample_header));
+
+	session_write_insert_idx(session, (insert_idx + 1) % session->ringbuf_slots);
+
+	/* Since we are about to notify userspace, we must ensure that all changes to memory
+	 * are visible.
+	 */
+	wmb();
+
+	eventfd_signal(session->eventfd);
+
+	return 0;
+}
+
+#define PERFCNT_IRQS (GLB_PERFCNT_OVERFLOW | GLB_PERFCNT_SAMPLE | GLB_PERFCNT_THRESHOLD)
+
+void panthor_perf_report_irq(struct panthor_device *ptdev, u32 status)
+{
+	struct panthor_perf *const perf = ptdev->perf;
+	struct panthor_perf_sampler *sampler;
+	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
+
+	if (!(status & JOB_INT_GLOBAL_IF))
+		return;
+
+	if (!perf)
+		return;
+
+	sampler = &perf->sampler;
+
+	/* TODO This needs locking. */
+	const u32 ack = READ_ONCE(glb_iface->output->ack);
+	const u32 fw_events = sampler->last_ack ^ ack;
+
+	sampler->last_ack = ack;
+
+	if (!(fw_events & PERFCNT_IRQS))
+		return;
+
+	/* TODO Fix up the error handling for overflow. */
+	if (fw_events & GLB_PERFCNT_OVERFLOW)
+		return;
+
+	if (fw_events & (GLB_PERFCNT_SAMPLE | GLB_PERFCNT_THRESHOLD)) {
+		const u32 extract_idx = READ_ONCE(glb_iface->input->perfcnt_extract);
+		const u32 insert_idx = READ_ONCE(glb_iface->output->perfcnt_insert);
+
+		WRITE_ONCE(glb_iface->input->perfcnt_extract,
+				panthor_perf_handle_sample(ptdev, extract_idx, insert_idx));
+	}
+
+	scoped_guard(mutex, &sampler->sampler_lock)
+	{
+		struct list_head *pos, *temp;
+
+		list_for_each_safe(pos, temp, &sampler->sampler_list) {
+			struct panthor_perf_session *session = list_entry(pos,
+					struct panthor_perf_session, waiting);
+
+			session_copy_sample(ptdev, session);
+			list_del_init(pos);
+
+			session_put(session);
+		}
+	}
+
+	memset(sampler->sample, 0, session_get_max_sample_size(&ptdev->perf_info));
+	sampler->sample_requested = false;
+	complete(&sampler->sample_handled);
+}
+
+
+static int panthor_perf_sampler_init(struct panthor_perf_sampler *sampler,
+		struct panthor_device *ptdev)
+{
+	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
+	struct panthor_kernel_bo *bo;
+	u8 *sample;
+	int ret;
+
+	ret = panthor_perf_setup_fw_buffer_desc(ptdev, sampler);
+	if (ret) {
+		drm_err(&ptdev->base,
+				"Failed to setup descriptor for FW ring buffer, err = %d", ret);
+		return ret;
+	}
+
+	bo = panthor_kernel_bo_create(ptdev, panthor_fw_vm(ptdev),
+			sampler->desc.buffer_size * PANTHOR_PERF_FW_RINGBUF_SLOTS,
+			DRM_PANTHOR_BO_NO_MMAP,
+			DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC | DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
+			PANTHOR_VM_KERNEL_AUTO_VA);
+
+	if (IS_ERR_OR_NULL(bo))
+		return IS_ERR(bo) ? PTR_ERR(bo) : -ENOMEM;
+
+	ret = panthor_kernel_bo_vmap(bo);
+	if (ret)
+		goto cleanup_bo;
+
+	sample = devm_kzalloc(ptdev->base.dev,
+			session_get_max_sample_size(&ptdev->perf_info), GFP_KERNEL);
+	if (ZERO_OR_NULL_PTR(sample)) {
+		ret = -ENOMEM;
+		goto cleanup_vmap;
+	}
+
+	glb_iface->input->perfcnt_as = panthor_vm_as(panthor_fw_vm(ptdev));
+	glb_iface->input->perfcnt_base = panthor_kernel_bo_gpuva(bo);
+	glb_iface->input->perfcnt_extract = 0;
+	glb_iface->input->perfcnt_csg_select = GENMASK(glb_iface->control->group_num, 0);
+
+	sampler->rb = bo;
+	sampler->sample = sample;
+	sampler->sample_slots = PANTHOR_PERF_FW_RINGBUF_SLOTS;
+
+	sampler->em = panthor_perf_em_new();
+
+	mutex_init(&sampler->sampler_lock);
+	mutex_init(&sampler->config_lock);
+	INIT_LIST_HEAD(&sampler->sampler_list);
+	INIT_LIST_HEAD(&sampler->ems);
+	init_completion(&sampler->sample_handled);
+
+	sampler->ptdev = ptdev;
+
+	return 0;
+
+cleanup_vmap:
+	panthor_kernel_bo_vunmap(bo);
+
+cleanup_bo:
+	panthor_kernel_bo_destroy(bo);
+
+	return ret;
+}
+
+static void panthor_perf_sampler_term(struct panthor_perf_sampler *sampler)
+{
+	int ret;
+
+	if (sampler->sample_requested)
+		wait_for_completion_killable(&sampler->sample_handled);
+
+	panthor_perf_fw_write_em(sampler, &(struct panthor_perf_enable_masks) {});
+
+	ret = panthor_perf_fw_stop_sampling(sampler->ptdev);
+	if (ret)
+		drm_warn_once(&sampler->ptdev->base, "Sampler termination failed, ret = %d", ret);
+
+	devm_kfree(sampler->ptdev->base.dev, sampler->sample);
+
+	panthor_kernel_bo_destroy(sampler->rb);
+}
+
+static int panthor_perf_sampler_add(struct panthor_perf_sampler *sampler,
+		struct panthor_perf_enable_masks *const new_em,
+		u8 set)
+{
+	int ret = 0;
+
+	guard(mutex)(&sampler->config_lock);
+
+	/* Early check for whether a new set can be configured. */
+	if (!atomic_read(&sampler->enabled_clients))
+		sampler->set_config = set;
+	else
+		if (sampler->set_config != set)
+			return -EBUSY;
+
+	kref_get(&new_em->refs);
+	list_add_tail(&sampler->ems, &new_em->link);
+
+	panthor_perf_em_add(sampler->em, new_em);
+	pm_runtime_get_sync(sampler->ptdev->base.dev);
+
+	if (atomic_read(&sampler->enabled_clients)) {
+		ret = panthor_perf_fw_stop_sampling(sampler->ptdev);
+		if (ret)
+			return ret;
+	}
+
+	panthor_perf_fw_write_em(sampler, sampler->em);
+
+	ret = panthor_perf_fw_start_sampling(sampler->ptdev);
+	if (ret)
+		return ret;
+
+	atomic_inc(&sampler->enabled_clients);
+
+	return 0;
+}
+
+static int panthor_perf_sampler_remove(struct panthor_perf_sampler *sampler,
+		struct panthor_perf_enable_masks *session_em)
+{
+	int ret;
+	struct list_head *em_node;
+
+	guard(mutex)(&sampler->config_lock);
+
+	list_del_init(&session_em->link);
+	kref_put(&session_em->refs, panthor_perf_destroy_em_kref);
+
+	panthor_perf_em_zero(sampler->em);
+	list_for_each(em_node, &sampler->ems)
+	{
+		struct panthor_perf_enable_masks *curr_em =
+			container_of(em_node, typeof(*curr_em), link);
+
+		panthor_perf_em_add(sampler->em, curr_em);
+	}
+
+	ret = panthor_perf_fw_stop_sampling(sampler->ptdev);
+	if (ret)
+		return ret;
+
+	atomic_dec(&sampler->enabled_clients);
+	pm_runtime_put_sync(sampler->ptdev->base.dev);
+
+	panthor_perf_fw_write_em(sampler, sampler->em);
+
+	if (atomic_read(&sampler->enabled_clients))
+		return panthor_perf_fw_start_sampling(sampler->ptdev);
+	return 0;
+}
+
 /**
  * panthor_perf_init - Initialize the performance counter subsystem.
  * @ptdev: Panthor device
@@ -370,6 +1140,7 @@ static size_t session_get_max_sample_size(const struct drm_panthor_perf_info *co
 int panthor_perf_init(struct panthor_device *ptdev)
 {
 	struct panthor_perf *perf;
+	int ret;
 
 	if (!ptdev)
 		return -EINVAL;
@@ -386,12 +1157,93 @@ int panthor_perf_init(struct panthor_device *ptdev)
 		.max = 1,
 	};
 
+	ret = panthor_perf_sampler_init(&perf->sampler, ptdev);
+	if (ret)
+		goto cleanup_perf;
+
 	drm_info(&ptdev->base, "Performance counter subsystem initialized");
 
 	ptdev->perf = perf;
 
-	return 0;
+	return ret;
+
+cleanup_perf:
+	devm_kfree(ptdev->base.dev, perf);
+
+	return ret;
+}
+
+
+static void panthor_perf_fw_request_sample(struct panthor_perf_sampler *sampler)
+{
+	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(sampler->ptdev);
+
+	panthor_fw_toggle_reqs(glb_iface, req, ack, GLB_PERFCNT_SAMPLE);
+	gpu_write(sampler->ptdev, CSF_DOORBELL(CSF_GLB_DOORBELL_ID), 1);
+}
+
+/**
+ * panthor_perf_sampler_request_clearing - Request a clearing sample.
+ * @sampler: Panthor sampler
+ *
+ * Perform a synchronous sample that gets immediately discarded. This sets a baseline at the point
+ * of time a new session is started, to avoid having counters from before the session.
+ *
+ */
+static int panthor_perf_sampler_request_clearing(struct panthor_perf_sampler *sampler)
+{
+	scoped_guard(mutex, &sampler->sampler_lock) {
+		if (!sampler->sample_requested) {
+			panthor_perf_fw_request_sample(sampler);
+			sampler->sample_requested = true;
+		}
+	}
+
+	return wait_for_completion_timeout(&sampler->sample_handled,
+			msecs_to_jiffies(1000));
+}
+
+/**
+ * panthor_perf_sampler_request_sample - Request a counter sample for the userspace client.
+ * @sampler: Panthor sampler
+ * @session: Target session
+ *
+ * A session that has already requested a sample cannot request another one until the previous
+ * sample has been delivered.
+ *
+ * Return:
+ * * %0       - The sample has been requested successfully.
+ * * %-EBUSY  - The target session has already requested a sample and has not received it yet.
+ */
+static int panthor_perf_sampler_request_sample(struct panthor_perf_sampler *sampler,
+		struct panthor_perf_session *session)
+{
+	struct list_head *head;
+
+	reinit_completion(&sampler->sample_handled);
+
+	guard(mutex)(&sampler->sampler_lock);
+
+	/*
+	 * If a previous sample has not been handled yet, the session cannot request another
+	 * sample. If this happens too often, the requested sample rate is too high.
+	 */
+	list_for_each(head, &sampler->sampler_list) {
+		struct panthor_perf_session *cur_session = list_entry(head,
+				typeof(*cur_session), waiting);
+
+		if (session == cur_session)
+			return -EBUSY;
+	}
+
+	if (list_empty(&sampler->sampler_list) && !sampler->sample_requested)
+		panthor_perf_fw_request_sample(sampler);
 
+	sampler->sample_requested = true;
+	list_add_tail(&session->waiting, &sampler->sampler_list);
+	session_get(session);
+
+	return 0;
 }
 
 static int session_validate_set(u8 set)
@@ -483,7 +1335,12 @@ int panthor_perf_session_setup(struct panthor_device *ptdev, struct panthor_perf
 		goto cleanup_eventfd;
 	}
 
+	ret = panthor_perf_sampler_add(&perf->sampler, em, setup_args->block_set);
+	if (ret)
+		goto cleanup_em;
+
 	INIT_LIST_HEAD(&session->waiting);
+
 	session->extract_idx = ctrl_map.vaddr;
 	*session->extract_idx = 0;
 	session->insert_idx = session->extract_idx + 1;
@@ -507,12 +1364,15 @@ int panthor_perf_session_setup(struct panthor_device *ptdev, struct panthor_perf
 	ret = xa_alloc_cyclic(&perf->sessions, &session_id, session, perf->session_range,
 			&perf->next_session, GFP_KERNEL);
 	if (ret < 0)
-		goto cleanup_em;
+		goto cleanup_sampler_add;
 
 	kref_init(&session->ref);
 
 	return session_id;
 
+cleanup_sampler_add:
+	panthor_perf_sampler_remove(&perf->sampler, em);
+
 cleanup_em:
 	kref_put(&em->refs, panthor_perf_destroy_em_kref);
 
@@ -540,6 +1400,8 @@ int panthor_perf_session_setup(struct panthor_device *ptdev, struct panthor_perf
 static int session_stop(struct panthor_perf *perf, struct panthor_perf_session *session,
 		u64 user_data)
 {
+	int ret;
+
 	if (!test_bit(PANTHOR_PERF_SESSION_ACTIVE, session->state))
 		return 0;
 
@@ -552,6 +1414,10 @@ static int session_stop(struct panthor_perf *perf, struct panthor_perf_session *
 
 	session->user_data = user_data;
 
+	ret = panthor_perf_sampler_request_sample(&perf->sampler, session);
+	if (ret)
+		return ret;
+
 	clear_bit(PANTHOR_PERF_SESSION_ACTIVE, session->state);
 
 	/* TODO Calls to the FW interface will go here in later patches. */
@@ -573,8 +1439,7 @@ static int session_start(struct panthor_perf *perf, struct panthor_perf_session
 	if (session->sample_freq_ns)
 		session->user_data = user_data;
 
-	/* TODO Calls to the FW interface will go here in later patches. */
-	return 0;
+	return panthor_perf_sampler_request_clearing(&perf->sampler);
 }
 
 static int session_sample(struct panthor_perf *perf, struct panthor_perf_session *session,
@@ -601,15 +1466,16 @@ static int session_sample(struct panthor_perf *perf, struct panthor_perf_session
 	session->sample_start_ns = ktime_get_raw_ns();
 	session->user_data = user_data;
 
-	/* TODO Calls to the FW interface will go here in later patches. */
-	return 0;
+	return panthor_perf_sampler_request_sample(&perf->sampler, session);
 }
 
 static int session_destroy(struct panthor_perf *perf, struct panthor_perf_session *session)
 {
+	int ret = panthor_perf_sampler_remove(&perf->sampler, session->enabled_counters);
+
 	session_put(session);
 
-	return 0;
+	return ret;
 }
 
 static int session_teardown(struct panthor_perf *perf, struct panthor_perf_session *session)
@@ -813,6 +1679,8 @@ void panthor_perf_unplug(struct panthor_device *ptdev)
 
 	xa_destroy(&perf->sessions);
 
+	panthor_perf_sampler_term(&perf->sampler);
+
 	devm_kfree(ptdev->base.dev, ptdev->perf);
 
 	ptdev->perf = NULL;
diff --git a/drivers/gpu/drm/panthor/panthor_perf.h b/drivers/gpu/drm/panthor/panthor_perf.h
index bfef8874068b..3485e4a55e15 100644
--- a/drivers/gpu/drm/panthor/panthor_perf.h
+++ b/drivers/gpu/drm/panthor/panthor_perf.h
@@ -31,4 +31,6 @@ int panthor_perf_session_sample(struct panthor_file *pfile, struct panthor_perf
 		u32 sid, u64 user_data);
 void panthor_perf_session_destroy(struct panthor_file *pfile, struct panthor_perf *perf);
 
+void panthor_perf_report_irq(struct panthor_device *ptdev, u32 status);
+
 #endif /* __PANTHOR_PERF_H__ */
diff --git a/include/uapi/drm/panthor_drm.h b/include/uapi/drm/panthor_drm.h
index 576d3ad46e6d..a29b755d6556 100644
--- a/include/uapi/drm/panthor_drm.h
+++ b/include/uapi/drm/panthor_drm.h
@@ -441,8 +441,11 @@ enum drm_panthor_perf_feat_flags {
  * enum drm_panthor_perf_block_type - Performance counter supported block types.
  */
 enum drm_panthor_perf_block_type {
+	/** DRM_PANTHOR_PERF_BLOCK_METADATA: Internal use only. */
+	DRM_PANTHOR_PERF_BLOCK_METADATA = 0,
+
 	/** @DRM_PANTHOR_PERF_BLOCK_FW: The FW counter block. */
-	DRM_PANTHOR_PERF_BLOCK_FW = 1,
+	DRM_PANTHOR_PERF_BLOCK_FW,
 
 	/** @DRM_PANTHOR_PERF_BLOCK_CSG: A CSG counter block. */
 	DRM_PANTHOR_PERF_BLOCK_CSG,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC v2 7/8] drm/panthor: Add suspend/resume handling for the performance counters
  2024-12-11 16:50 [RFC v2 0/8] drm/panthor: Add performance counters with manual sampling mode Lukas Zapolskas
                   ` (5 preceding siblings ...)
  2024-12-11 16:50 ` [RFC v2 6/8] drm/panthor: Implement the counter sampler and sample handling Lukas Zapolskas
@ 2024-12-11 16:50 ` Lukas Zapolskas
  2025-01-27 20:06   ` Adrián Larumbe
  2024-12-11 16:50 ` [RFC v2 8/8] drm/panthor: Expose the panthor perf ioctls Lukas Zapolskas
  7 siblings, 1 reply; 28+ messages in thread
From: Lukas Zapolskas @ 2024-12-11 16:50 UTC (permalink / raw)
  To: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	Adrián Larumbe
  Cc: dri-devel, linux-kernel, Mihail Atanassov, nd, Lukas Zapolskas

Signed-off-by: Lukas Zapolskas <lukas.zapolskas@arm.com>
---
 drivers/gpu/drm/panthor/panthor_device.c |  3 +
 drivers/gpu/drm/panthor/panthor_perf.c   | 86 ++++++++++++++++++++++++
 drivers/gpu/drm/panthor/panthor_perf.h   |  2 +
 3 files changed, 91 insertions(+)

diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
index 1a81a436143b..69536fbdb5ef 100644
--- a/drivers/gpu/drm/panthor/panthor_device.c
+++ b/drivers/gpu/drm/panthor/panthor_device.c
@@ -475,6 +475,7 @@ int panthor_device_resume(struct device *dev)
 		ret = drm_WARN_ON(&ptdev->base, panthor_fw_resume(ptdev));
 		if (!ret) {
 			panthor_sched_resume(ptdev);
+			panthor_perf_resume(ptdev);
 		} else {
 			panthor_mmu_suspend(ptdev);
 			panthor_gpu_suspend(ptdev);
@@ -543,6 +544,7 @@ int panthor_device_suspend(struct device *dev)
 	    drm_dev_enter(&ptdev->base, &cookie)) {
 		cancel_work_sync(&ptdev->reset.work);
 
+		panthor_perf_suspend(ptdev);
 		/* We prepare everything as if we were resetting the GPU.
 		 * The end of the reset will happen in the resume path though.
 		 */
@@ -561,6 +563,7 @@ int panthor_device_suspend(struct device *dev)
 			panthor_mmu_resume(ptdev);
 			drm_WARN_ON(&ptdev->base, panthor_fw_resume(ptdev));
 			panthor_sched_resume(ptdev);
+			panthor_perf_resume(ptdev);
 			drm_dev_exit(cookie);
 		}
 
diff --git a/drivers/gpu/drm/panthor/panthor_perf.c b/drivers/gpu/drm/panthor/panthor_perf.c
index d62d97c448da..727e66074eab 100644
--- a/drivers/gpu/drm/panthor/panthor_perf.c
+++ b/drivers/gpu/drm/panthor/panthor_perf.c
@@ -433,6 +433,17 @@ static void panthor_perf_em_zero(struct panthor_perf_enable_masks *em)
 		bitmap_zero(em->mask[i], PANTHOR_PERF_EM_BITS);
 }
 
+static bool panthor_perf_em_empty(const struct panthor_perf_enable_masks *const em)
+{
+	bool empty = true;
+	size_t i = 0;
+
+	for (i = DRM_PANTHOR_PERF_BLOCK_FW; i <= DRM_PANTHOR_PERF_BLOCK_LAST; i++)
+		empty &= bitmap_empty(em->mask[i], PANTHOR_PERF_EM_BITS);
+
+	return empty;
+}
+
 static void panthor_perf_destroy_em_kref(struct kref *em_kref)
 {
 	struct panthor_perf_enable_masks *em = container_of(em_kref, typeof(*em), refs);
@@ -1652,6 +1663,81 @@ void panthor_perf_session_destroy(struct panthor_file *pfile, struct panthor_per
 	}
 }
 
+static int panthor_perf_sampler_resume(struct panthor_perf_sampler *sampler)
+{
+	int ret;
+
+	if (!atomic_read(&sampler->enabled_clients))
+		return 0;
+
+	if (!panthor_perf_em_empty(sampler->em)) {
+		guard(mutex)(&sampler->config_lock);
+		panthor_perf_fw_write_em(sampler, sampler->em);
+	}
+
+	ret = panthor_perf_fw_start_sampling(sampler->ptdev);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
+static int panthor_perf_sampler_suspend(struct panthor_perf_sampler *sampler)
+{
+	int ret;
+
+	if (!atomic_read(&sampler->enabled_clients))
+		return 0;
+
+	ret = panthor_perf_fw_stop_sampling(sampler->ptdev);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
+/**
+ * panthor_perf_suspend - Prepare the performance counter subsystem for system suspend.
+ * @ptdev: Panthor device.
+ *
+ * Indicate to the performance counters that the system is suspending.
+ *
+ * This function must not be used to handle MCU power state transitions: just before MCU goes
+ * from on to any inactive state, an automatic sample will be performed by the firmware, and
+ * the performance counter firmware state will be restored on warm boot.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int panthor_perf_suspend(struct panthor_device *ptdev)
+{
+	struct panthor_perf *perf = ptdev->perf;
+
+	if (!perf)
+		return 0;
+
+	return panthor_perf_sampler_suspend(&perf->sampler);
+}
+
+/**
+ * panthor_perf_resume - Resume the performance counter subsystem after system resumption.
+ * @ptdev: Panthor device.
+ *
+ * Indicate to the performance counters that the system has resumed. This must not be used
+ * to handle MCU state transitions, for the same reasons as detailed in the kerneldoc for
+ * @panthor_perf_suspend.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int panthor_perf_resume(struct panthor_device *ptdev)
+{
+	struct panthor_perf *perf = ptdev->perf;
+
+	if (!perf)
+		return 0;
+
+	return panthor_perf_sampler_resume(&perf->sampler);
+}
+
 /**
  * panthor_perf_unplug - Terminate the performance counter subsystem.
  * @ptdev: Panthor device.
diff --git a/drivers/gpu/drm/panthor/panthor_perf.h b/drivers/gpu/drm/panthor/panthor_perf.h
index 3485e4a55e15..a22a511a0809 100644
--- a/drivers/gpu/drm/panthor/panthor_perf.h
+++ b/drivers/gpu/drm/panthor/panthor_perf.h
@@ -16,6 +16,8 @@ struct panthor_perf;
 void panthor_perf_info_init(struct panthor_device *ptdev);
 
 int panthor_perf_init(struct panthor_device *ptdev);
+int panthor_perf_suspend(struct panthor_device *ptdev);
+int panthor_perf_resume(struct panthor_device *ptdev);
 void panthor_perf_unplug(struct panthor_device *ptdev);
 
 int panthor_perf_session_setup(struct panthor_device *ptdev, struct panthor_perf *perf,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC v2 8/8] drm/panthor: Expose the panthor perf ioctls
  2024-12-11 16:50 [RFC v2 0/8] drm/panthor: Add performance counters with manual sampling mode Lukas Zapolskas
                   ` (6 preceding siblings ...)
  2024-12-11 16:50 ` [RFC v2 7/8] drm/panthor: Add suspend/resume handling for the performance counters Lukas Zapolskas
@ 2024-12-11 16:50 ` Lukas Zapolskas
  2025-01-27 20:14   ` Adrián Larumbe
  7 siblings, 1 reply; 28+ messages in thread
From: Lukas Zapolskas @ 2024-12-11 16:50 UTC (permalink / raw)
  To: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	Adrián Larumbe
  Cc: dri-devel, linux-kernel, Mihail Atanassov, nd, Lukas Zapolskas

Signed-off-by: Lukas Zapolskas <lukas.zapolskas@arm.com>
---
 drivers/gpu/drm/panthor/panthor_drv.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
index 2848ab442d10..ef081a383fa9 100644
--- a/drivers/gpu/drm/panthor/panthor_drv.c
+++ b/drivers/gpu/drm/panthor/panthor_drv.c
@@ -1654,6 +1654,8 @@ static void panthor_debugfs_init(struct drm_minor *minor)
  * - 1.1 - adds DEV_QUERY_TIMESTAMP_INFO query
  * - 1.2 - adds DEV_QUERY_GROUP_PRIORITIES_INFO query
  *       - adds PANTHOR_GROUP_PRIORITY_REALTIME priority
+ * - 1.3 - adds DEV_QUERY_PERF_INFO query
+ *         adds PERF_CONTROL ioctl
  */
 static const struct drm_driver panthor_drm_driver = {
 	.driver_features = DRIVER_RENDER | DRIVER_GEM | DRIVER_SYNCOBJ |
@@ -1667,7 +1669,7 @@ static const struct drm_driver panthor_drm_driver = {
 	.name = "panthor",
 	.desc = "Panthor DRM driver",
 	.major = 1,
-	.minor = 2,
+	.minor = 3,
 
 	.gem_create_object = panthor_gem_create_object,
 	.gem_prime_import_sg_table = drm_gem_shmem_prime_import_sg_table,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC v2 1/8] drm/panthor: Add performance counter uAPI
  2024-12-11 16:50 ` [RFC v2 1/8] drm/panthor: Add performance counter uAPI Lukas Zapolskas
@ 2025-01-27  9:47   ` Adrián Larumbe
  2025-03-26 14:24     ` Lukas Zapolskas
  0 siblings, 1 reply; 28+ messages in thread
From: Adrián Larumbe @ 2025-01-27  9:47 UTC (permalink / raw)
  To: Lukas Zapolskas
  Cc: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	dri-devel, linux-kernel, Mihail Atanassov, nd

Hi Lukas,

On 11.12.2024 16:50, Lukas Zapolskas wrote:
> This patch extends the DEV_QUERY ioctl to return information about the
> performance counter setup for userspace, and introduces the new
> ioctl DRM_PANTHOR_PERF_CONTROL in order to allow for the sampling of
> performance counters.
> 
> The new design is inspired by the perf aux ringbuffer, with the insert
> and extract indices being mapped to userspace, allowing multiple samples
> to be exposed at any given time. To avoid pointer chasing, the sample
> metadata and block metadata are inline with the elements they
> describe.
> 
> Userspace is responsible for passing in resources for samples to be
> exposed, including the event file descriptor for notification of new
> sample availability, the ringbuffer BO to store samples, and the control
> BO along with the offset for mapping the insert and extract indices.
> Though these indices are only a total of 8 bytes, userspace can then
> reuse the same physical page for tracking the state of multiple buffers
> by giving different offsets from the BO start to map them.
> 
> Co-developed-by: Mihail Atanassov <mihail.atanassov@arm.com>
> Signed-off-by: Mihail Atanassov <mihail.atanassov@arm.com>
> Signed-off-by: Lukas Zapolskas <lukas.zapolskas@arm.com>
> ---
>  include/uapi/drm/panthor_drm.h | 487 +++++++++++++++++++++++++++++++++
>  1 file changed, 487 insertions(+)
> 
> diff --git a/include/uapi/drm/panthor_drm.h b/include/uapi/drm/panthor_drm.h
> index 87c9cb555dd1..8a431431da6b 100644
> --- a/include/uapi/drm/panthor_drm.h
> +++ b/include/uapi/drm/panthor_drm.h
> @@ -127,6 +127,9 @@ enum drm_panthor_ioctl_id {
>  
>  	/** @DRM_PANTHOR_TILER_HEAP_DESTROY: Destroy a tiler heap. */
>  	DRM_PANTHOR_TILER_HEAP_DESTROY,
> +
> +	/** @DRM_PANTHOR_PERF_CONTROL: Control a performance counter session. */
> +	DRM_PANTHOR_PERF_CONTROL,
>  };
>  
>  /**
> @@ -170,6 +173,8 @@ enum drm_panthor_ioctl_id {
>  	DRM_IOCTL_PANTHOR(WR, TILER_HEAP_CREATE, tiler_heap_create)
>  #define DRM_IOCTL_PANTHOR_TILER_HEAP_DESTROY \
>  	DRM_IOCTL_PANTHOR(WR, TILER_HEAP_DESTROY, tiler_heap_destroy)
> +#define DRM_IOCTL_PANTHOR_PERF_CONTROL \
> +	DRM_IOCTL_PANTHOR(WR, PERF_CONTROL, perf_control)
>  
>  /**
>   * DOC: IOCTL arguments
> @@ -268,6 +273,9 @@ enum drm_panthor_dev_query_type {
>  	 * @DRM_PANTHOR_DEV_QUERY_GROUP_PRIORITIES_INFO: Query allowed group priorities information.
>  	 */
>  	DRM_PANTHOR_DEV_QUERY_GROUP_PRIORITIES_INFO,
> +
> +	/** @DRM_PANTHOR_DEV_QUERY_PERF_INFO: Query performance counter interface information. */
> +	DRM_PANTHOR_DEV_QUERY_PERF_INFO,
>  };
>  
>  /**
> @@ -421,6 +429,120 @@ struct drm_panthor_group_priorities_info {
>  	__u8 pad[3];
>  };
>  
> +/**
> + * enum drm_panthor_perf_feat_flags - Performance counter configuration feature flags.
> + */
> +enum drm_panthor_perf_feat_flags {
> +	/** @DRM_PANTHOR_PERF_BLOCK_STATES_SUPPORT: Coarse-grained block states are supported. */
> +	DRM_PANTHOR_PERF_BLOCK_STATES_SUPPORT = 1 << 0,
> +};
> +
> +/**
> + * enum drm_panthor_perf_block_type - Performance counter supported block types.
> + */
> +enum drm_panthor_perf_block_type {
> +	/** @DRM_PANTHOR_PERF_BLOCK_FW: The FW counter block. */
> +	DRM_PANTHOR_PERF_BLOCK_FW = 1,
> +
> +	/** @DRM_PANTHOR_PERF_BLOCK_CSG: A CSG counter block. */
> +	DRM_PANTHOR_PERF_BLOCK_CSG,
> +
> +	/** @DRM_PANTHOR_PERF_BLOCK_CSHW: The CSHW counter block. */
> +	DRM_PANTHOR_PERF_BLOCK_CSHW,
> +
> +	/** @DRM_PANTHOR_PERF_BLOCK_TILER: The tiler counter block. */
> +	DRM_PANTHOR_PERF_BLOCK_TILER,
> +
> +	/** @DRM_PANTHOR_PERF_BLOCK_MEMSYS: A memsys counter block. */
> +	DRM_PANTHOR_PERF_BLOCK_MEMSYS,
> +
> +	/** @DRM_PANTHOR_PERF_BLOCK_SHADER: A shader core counter block. */
> +	DRM_PANTHOR_PERF_BLOCK_SHADER,
> +};
> +
> +/**
> + * enum drm_panthor_perf_clock - Identifier of the clock used to produce the cycle count values
> + * in a given block.
> + *
> + * Since the integrator has the choice of using one or more clocks, there may be some confusion
> + * as to which blocks are counted by which clock values unless this information is explicitly
> + * provided as part of every block sample. Not every single clock here can be used: in the simplest
> + * case, all cycle counts will be associated with the top-level clock.
> + */
> +enum drm_panthor_perf_clock {
> +	/** @DRM_PANTHOR_PERF_CLOCK_TOPLEVEL: Top-level CSF clock. */
> +	DRM_PANTHOR_PERF_CLOCK_TOPLEVEL,
> +
> +	/**
> +	 * @DRM_PANTHOR_PERF_CLOCK_COREGROUP: Core group clock, responsible for the MMU, L2
> +	 * caches and the tiler.
> +	 */
> +	DRM_PANTHOR_PERF_CLOCK_COREGROUP,
> +
> +	/** @DRM_PANTHOR_PERF_CLOCK_SHADER: Clock for the shader cores. */
> +	DRM_PANTHOR_PERF_CLOCK_SHADER,
> +};
> +
> +/**
> + * struct drm_panthor_perf_info - Performance counter interface information
> + *
> + * Structure grouping all queryable information relating to the performance counter
> + * interfaces.
> + */
> +struct drm_panthor_perf_info {
> +	/**
> +	 * @counters_per_block: The number of 8-byte counters available in a block.
> +	 */
> +	__u32 counters_per_block;
> +
> +	/**
> +	 * @sample_header_size: The size of the header struct available at the beginning
> +	 * of every sample.
> +	 */
> +	__u32 sample_header_size;
> +
> +	/**
> +	 * @block_header_size: The size of the header struct inline with the counters for a
> +	 * single block.
> +	 */
> +	__u32 block_header_size;
> +
> +	/** @flags: Combination of drm_panthor_perf_feat_flags flags. */
> +	__u32 flags;
> +
> +	/**
> +	 * @supported_clocks: Bitmask of the clocks supported by the GPU.
> +	 *
> +	 * Each bit represents a variant of the enum drm_panthor_perf_clock.
> +	 *
> +	 * For the same GPU, different implementers may have different clocks for the same hardware
> +	 * block. At the moment, up to four clocks are supported, and any clocks that are present
> +	 * will be reported here.
> +	 */
> +	__u32 supported_clocks;
> +
> +	/** @fw_blocks: Number of FW blocks available. */
> +	__u32 fw_blocks;
> +
> +	/** @csg_blocks: Number of CSG blocks available. */
> +	__u32 csg_blocks;
> +
> +	/** @cshw_blocks: Number of CSHW blocks available. */
> +	__u32 cshw_blocks;
> +
> +	/** @tiler_blocks: Number of tiler blocks available. */
> +	__u32 tiler_blocks;
> +
> +	/** @memsys_blocks: Number of memsys blocks available. */
> +	__u32 memsys_blocks;
> +
> +	/** @shader_blocks: Number of shader core blocks available. */
> +	__u32 shader_blocks;
> +
> +	/** @pad: MBZ. */
> +	__u32 pad;
> +};
> +
>  /**
>   * struct drm_panthor_dev_query - Arguments passed to DRM_PANTHOR_IOCTL_DEV_QUERY
>   */
> @@ -1010,6 +1132,371 @@ struct drm_panthor_tiler_heap_destroy {
>  	__u32 pad;
>  };
>  
> +/**
> + * DOC: Performance counter decoding in userspace.
> + *
> + * Each sample will be exposed to userspace in the following manner:
> + *
> + * +--------+--------+------------------------+--------+-------------------------+-----+
> + * | Sample | Block  |        Block           | Block  |         Block           | ... |
> + * | header | header |        counters        | header |         counters        |     |
> + * +--------+--------+------------------------+--------+-------------------------+-----+
> + *
> + * Each sample will start with a sample header of type @struct drm_panthor_perf_sample header,
> + * providing sample-wide information like the start and end timestamps, the counter set currently
> + * configured, and any errors that may have occurred during sampling.
> + *
> + * After the fixed size header, the sample will consist of blocks of
> + * 64-bit @drm_panthor_dev_query_perf_info::counters_per_block counters, each prefaced with a
> + * header of its own, indicating source block type, as well as the cycle count needed to normalize
> + * cycle values within that block, and a clock source identifier.
> + */

At first I was a bit confused about this header, because I could not find it anywhere in the spec.
Then I realised it's been devised specifically for user samples. Is it really impossible for
user space to be able to pick up these values from the FW sample itself, other than the
timestamp and cycles values? I think as of lately some of these can also be queried from UM.

> +/**
> + * enum drm_panthor_perf_block_state - Bitmask of the power and execution states that an individual
> + * hardware block went through in a sampling period.
> + *
> + * Because the sampling period is controlled from userspace, the block may undergo multiple
> + * state transitions, so this must be interpreted as one or more such transitions occurring.
> + */
> +enum drm_panthor_perf_block_state {
> +	/**
> +	 * @DRM_PANTHOR_PERF_BLOCK_STATE_UNKNOWN: The state of this block was unknown during
> +	 * the sampling period.
> +	 */
> +	DRM_PANTHOR_PERF_BLOCK_STATE_UNKNOWN = 0,
> +
> +	/**
> +	 * @DRM_PANTHOR_PERF_BLOCK_STATE_ON: This block was powered on for some or all of
> +	 * the sampling period.
> +	 */
> +	DRM_PANTHOR_PERF_BLOCK_STATE_ON = 1 << 0,
> +
> +	/**
> +	 * @DRM_PANTHOR_PERF_BLOCK_STATE_OFF: This block was powered off for some or all of the
> +	 * sampling period.
> +	 */
> +	DRM_PANTHOR_PERF_BLOCK_STATE_OFF = 1 << 1,
> +
> +	/**
> +	 * @DRM_PANTHOR_PERF_BLOCK_STATE_AVAILABLE: This block was available for execution for
> +	 * some or all of the sampling period.
> +	 */
> +	DRM_PANTHOR_PERF_BLOCK_STATE_AVAILABLE = 1 << 2,
> +	/**
> +	 * @DRM_PANTHOR_PERF_BLOCK_STATE_UNAVAILABLE: This block was unavailable for execution for
> +	 * some or all of the sampling period.
> +	 */
> +	DRM_PANTHOR_PERF_BLOCK_STATE_UNAVAILABLE = 1 << 3,
> +
> +	/**
> +	 * @DRM_PANTHOR_PERF_BLOCK_STATE_NORMAL: This block was executing in normal mode
> +	 * for some or all of the sampling period.
> +	 */
> +	DRM_PANTHOR_PERF_BLOCK_STATE_NORMAL = 1 << 4,
> +
> +	/**
> +	 * @DRM_PANTHOR_PERF_BLOCK_STATE_PROTECTED: This block was executing in protected mode
> +	 * for some or all of the sampling period.
> +	 */
> +	DRM_PANTHOR_PERF_BLOCK_STATE_PROTECTED = 1 << 5,
> +};
> +
> +/**
> + * struct drm_panthor_perf_block_header - Header present before every block in the
> + * sample ringbuffer.
> + */
> +struct drm_panthor_perf_block_header {
> +	/** @block_type: Type of the block. */
> +	__u8 block_type;
> +
> +	/** @block_idx: Block index. */
> +	__u8 block_idx;
> +
> +	/**
> +	 * @block_states: Coarse-grained block transitions, bitmask of enum
> +	 * drm_panthor_perf_block_states.
> +	 */
> +	__u8 block_states;
> +
> +	/**
> +	 * @clock: Clock used to produce the cycle count for this block, taken from
> +	 * enum drm_panthor_perf_clock. The cycle counts are stored in the sample header.
> +	 */
> +	__u8 clock;
> +
> +	/** @pad: MBZ. */
> +	__u8 pad[4];
> +
> +	/** @enable_mask: Bitmask of counters requested during the session setup. */
> +	__u64 enable_mask[2];
> +};
> +
> +/**
> + * enum drm_panthor_perf_sample_flags - Sample-wide events that occurred over the sampling
> + * period.
> + */
> +enum drm_panthor_perf_sample_flags {
> +	/**
> +	 * @DRM_PANTHOR_PERF_SAMPLE_OVERFLOW: This sample contains overflows due to the duration
> +	 * of the sampling period.
> +	 */
> +	DRM_PANTHOR_PERF_SAMPLE_OVERFLOW = 1 << 0,
> +
> +	/**
> +	 * @DRM_PANTHOR_PERF_SAMPLE_ERROR: This sample encountered an error condition during
> +	 * the sample duration.
> +	 */
> +	DRM_PANTHOR_PERF_SAMPLE_ERROR = 1 << 1,
> +};
> +
> +/**
> + * struct drm_panthor_perf_sample_header - Header present before every sample.
> + */
> +struct drm_panthor_perf_sample_header {
> +	/**
> +	 * @timestamp_start_ns: Earliest timestamp that values in this sample represent, in
> +	 * nanoseconds. Derived from CLOCK_MONOTONIC_RAW.
> +	 */
> +	__u64 timestamp_start_ns;
> +
> +	/**
> +	 * @timestamp_end_ns: Latest timestamp that values in this sample represent, in
> +	 * nanoseconds. Derived from CLOCK_MONOTONIC_RAW.
> +	 */
> +	__u64 timestamp_end_ns;
> +
> +	/** @block_set: Set of performance counter blocks. */
> +	__u8 block_set;
> +
> +	/** @pad: MBZ. */
> +	__u8 pad[3];
> +
> +	/** @flags: Current sample flags, combination of drm_panthor_perf_sample_flags. */
> +	__u32 flags;
> +
> +	/**
> +	 * @user_data: User data provided as part of the command that triggered this sample.
> +	 *
> +	 * - Automatic samples (periodic ones or those around non-counting periods or power state
> +	 * transitions) will be tagged with the user_data provided as part of the
> +	 * DRM_PANTHOR_PERF_COMMAND_START call.
> +	 * - Manual samples will be tagged with the user_data provided with the
> +	 * DRM_PANTHOR_PERF_COMMAND_SAMPLE call.
> +	 * - A session's final automatic sample will be tagged with the user_data provided with the
> +	 * DRM_PANTHOR_PERF_COMMAND_STOP call.
> +	 */
> +	__u64 user_data;
> +
> +	/**
> +	 * @toplevel_clock_cycles: The number of cycles elapsed between
> +	 * drm_panthor_perf_sample_header::timestamp_start_ns and
> +	 * drm_panthor_perf_sample_header::timestamp_end_ns on the top-level clock if the
> +	 * corresponding bit is set in drm_panthor_perf_info::supported_clocks.
> +	 */
> +	__u64 toplevel_clock_cycles;
> +
> +	/**
> +	 * @coregroup_clock_cycles: The number of cycles elapsed between
> +	 * drm_panthor_perf_sample_header::timestamp_start_ns and
> +	 * drm_panthor_perf_sample_header::timestamp_end_ns on the coregroup clock if the
> +	 * corresponding bit is set in drm_panthor_perf_info::supported_clocks.
> +	 */
> +	__u64 coregroup_clock_cycles;
> +
> +	/**
> +	 * @shader_clock_cycles: The number of cycles elapsed between
> +	 * drm_panthor_perf_sample_header::timestamp_start_ns and
> +	 * drm_panthor_perf_sample_header::timestamp_end_ns on the shader core clock if the
> +	 * corresponding bit is set in drm_panthor_perf_info::supported_clocks.
> +	 */
> +	__u64 shader_clock_cycles;
> +};
> +
> +/**
> + * enum drm_panthor_perf_command - Command type passed to the DRM_PANTHOR_PERF_CONTROL
> + * IOCTL.
> + */
> +enum drm_panthor_perf_command {
> +	/** @DRM_PANTHOR_PERF_COMMAND_SETUP: Create a new performance counter sampling context. */
> +	DRM_PANTHOR_PERF_COMMAND_SETUP,
> +
> +	/** @DRM_PANTHOR_PERF_COMMAND_TEARDOWN: Teardown a performance counter sampling context. */
> +	DRM_PANTHOR_PERF_COMMAND_TEARDOWN,
> +
> +	/** @DRM_PANTHOR_PERF_COMMAND_START: Start a sampling session on the indicated context. */
> +	DRM_PANTHOR_PERF_COMMAND_START,
> +
> +	/** @DRM_PANTHOR_PERF_COMMAND_STOP: Stop the sampling session on the indicated context. */
> +	DRM_PANTHOR_PERF_COMMAND_STOP,
> +
> +	/**
> +	 * @DRM_PANTHOR_PERF_COMMAND_SAMPLE: Request a manual sample on the indicated context.
> +	 *
> +	 * When the sampling session is configured with a non-zero sampling frequency, any
> +	 * DRM_PANTHOR_PERF_CONTROL calls with this command will be ignored and return an
> +	 * -EINVAL.
> +	 */
> +	DRM_PANTHOR_PERF_COMMAND_SAMPLE,
> +};
> +
> +/**
> + * struct drm_panthor_perf_control - Arguments passed to DRM_PANTHOR_IOCTL_PERF_CONTROL.
> + */
> +struct drm_panthor_perf_control {
> +	/** @cmd: Command from enum drm_panthor_perf_command. */
> +	__u32 cmd;
> +
> +	/**
> +	 * @handle: session handle.
> +	 *
> +	 * Returned by the DRM_PANTHOR_PERF_COMMAND_SETUP call.
> +	 * It must be used in subsequent commands for the same context.
> +	 */
> +	__u32 handle;
> +
> +	/**
> +	 * @size: size of the command structure.
> +	 *
> +	 * If the pointer is NULL, the size is updated by the driver to provide the size of the
> +	 * output structure. If the pointer is not NULL, the driver will only copy min(size,
> +	 * struct_size) to the pointer and update the size accordingly.
> +	 */
> +	__u64 size;
> +
> +	/** @pointer: user pointer to a command type struct. */
> +	__u64 pointer;
> +};
> +
> +
> +/**
> + * struct drm_panthor_perf_cmd_setup - Arguments passed to DRM_PANTHOR_IOCTL_PERF_CONTROL
> + * when the DRM_PANTHOR_PERF_COMMAND_SETUP command is specified.
> + */
> +struct drm_panthor_perf_cmd_setup {
> +	/**
> +	 * @block_set: Set of performance counter blocks.
> +	 *
> +	 * This is a global configuration and only one set can be active at a time. If
> +	 * another client has already requested a counter set, any further requests
> +	 * for a different counter set will fail and return an -EBUSY.
> +	 *
> +	 * If the requested set does not exist, the request will fail and return an -EINVAL.
> +	 */
> +	__u8 block_set;

How do we know for a given hardware model, what block sets it supports? When I wrote the
implementation of perfcnt for Panthor we're using at Collabora right now, that was a question
I could never find an answer for in the spec.

> +	/** @pad: MBZ. */
> +	__u8 pad[7];
> +
> +	/** @fd: eventfd for signalling the availability of a new sample. */
> +	__u32 fd;
> +
> +	/** @ringbuf_handle: Handle to the BO to write perf counter sample to. */
> +	__u32 ringbuf_handle;

If UM is in charge of creating this BO, how would it know how big it should be? I suppose this
would be conveyed by perf info returned by panthor_ioctl_dev_query().     

> +	/**
> +	 * @control_handle: Handle to the BO containing a contiguous 16 byte range, used for the
> +	 * insert and extract indices for the ringbuffer.
> +	 */
> +	__u32 control_handle;
> +
> +	/**
> +	 * @sample_slots: The number of slots available in the userspace-provided BO. Must be
> +	 * a power of 2.
> +	 *
> +	 * If sample_slots * sample_size does not match the BO size, the setup request will fail.
> +	 */
> +	__u32 sample_slots;

Does that mean that the number of user bo slots can be different than the kernel ringbuffer one?

> +
> +	/**
> +	 * @control_offset: Offset into the control BO where the insert and extract indices are
> +	 * located.
> +	 */
> +	__u64 control_offset;
> +
> +	/**
> +	 * @sample_freq_ns: Period between automatic counter sample collection in nanoseconds. Zero
> +	 * disables automatic collection and all collection must be done through explicit calls
> +	 * to DRM_PANTHOR_PERF_CONTROL.SAMPLE. Non-zero values will disable manual counter sampling
> +	 * via the DRM_PANTHOR_PERF_COMMAND_SAMPLE command.
> +	 *
> +	 * This disables software-triggered periodic sampling, but hardware will still trigger
> +	 * automatic samples on certain events, including shader core power transitions, and
> +	 * entries to and exits from non-counting periods. The final stop command will also
> +	 * trigger a sample to ensure no data is lost.
> +	 */
> +	__u64 sample_freq_ns;
> +
> +	/**
> +	 * @fw_enable_mask: Bitmask of counters to request from the FW counter block. Any bits
> +	 * past the first drm_panthor_perf_info.counters_per_block bits will be ignored.
> +	 */
> +	__u64 fw_enable_mask[2];
> +
> +	/**
> +	 * @csg_enable_mask: Bitmask of counters to request from the CSG counter blocks. Any bits
> +	 * past the first drm_panthor_perf_info.counters_per_block bits will be ignored.
> +	 */
> +	__u64 csg_enable_mask[2];
> +
> +	/**
> +	 * @cshw_enable_mask: Bitmask of counters to request from the CSHW counter block. Any bits
> +	 * past the first drm_panthor_perf_info.counters_per_block bits will be ignored.
> +	 */
> +	__u64 cshw_enable_mask[2];
> +
> +	/**
> +	 * @tiler_enable_mask: Bitmask of counters to request from the tiler counter block. Any
> +	 * bits past the first drm_panthor_perf_info.counters_per_block bits will be ignored.
> +	 */
> +	__u64 tiler_enable_mask[2];
> +
> +	/**
> +	 * @memsys_enable_mask: Bitmask of counters to request from the memsys counter blocks. Any
> +	 * bits past the first drm_panthor_perf_info.counters_per_block bits will be ignored.
> +	 */
> +	__u64 memsys_enable_mask[2];
> +
> +	/**
> +	 * @shader_enable_mask: Bitmask of counters to request from the shader core counter blocks.
> +	 * Any bits past the first drm_panthor_perf_info.counters_per_block bits will be ignored.
> +	 */
> +	__u64 shader_enable_mask[2];
> +};
> +
> +/**
> + * struct drm_panthor_perf_cmd_start - Arguments passed to DRM_PANTHOR_IOCTL_PERF_CONTROL
> + * when the DRM_PANTHOR_PERF_COMMAND_START command is specified.
> + */
> +struct drm_panthor_perf_cmd_start {
> +	/**
> +	 * @user_data: User provided data that will be attached to automatic samples collected
> +	 * until the next DRM_PANTHOR_PERF_COMMAND_STOP.
> +	 */
> +	__u64 user_data;

What is this user data pointer being used for in the samples? What kind of information would
it normally add by having it written into the user samples?

> +};
> +
> +/**
> + * struct drm_panthor_perf_cmd_stop - Arguments passed to DRM_PANTHOR_IOCTL_PERF_CONTROL
> + * when the DRM_PANTHOR_PERF_COMMAND_STOP command is specified.
> + */
> +struct drm_panthor_perf_cmd_stop {
> +	/**
> +	 * @user_data: User provided data that will be attached to the automatic sample collected
> +	 * at the end of this sampling session.
> +	 */
> +	__u64 user_data;
> +};
> +
> +/**
> + * struct drm_panthor_perf_cmd_sample - Arguments passed to DRM_PANTHOR_IOCTL_PERF_CONTROL
> + * when the DRM_PANTHOR_PERF_COMMAND_SAMPLE command is specified.
> + */
> +struct drm_panthor_perf_cmd_sample {
> +	/** @user_data: User provided data that will be attached to the sample.*/
> +	__u64 user_data;
> +};
> +
>  #if defined(__cplusplus)
>  }
>  #endif
> -- 
> 2.25.1

Adrian Larumbe

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC v2 2/8] drm/panthor: Add DEV_QUERY.PERF_INFO handling for Gx10
  2024-12-11 16:50 ` [RFC v2 2/8] drm/panthor: Add DEV_QUERY.PERF_INFO handling for Gx10 Lukas Zapolskas
@ 2025-01-27  9:56   ` Adrián Larumbe
  2025-01-27 22:17   ` Adrián Larumbe
  1 sibling, 0 replies; 28+ messages in thread
From: Adrián Larumbe @ 2025-01-27  9:56 UTC (permalink / raw)
  To: Lukas Zapolskas
  Cc: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	dri-devel, linux-kernel, Mihail Atanassov, nd

Reviewed-by: Adrián Larumbe <adrian.larumbe@collabora.com>

On 11.12.2024 16:50, Lukas Zapolskas wrote:
> This change adds the IOCTL to query data about the performance counter
> setup. Some of this data was available via previous DEV_QUERY calls,
> for instance for GPU info, but exposing it via PERF_INFO
> minimizes the overhead of creating a single session to just the one
> aggregate IOCTL.
> 
> To better align the FW interfaces with the arch spec, the patch also
> renames perfcnt to prfcnt.
> 
> Signed-off-by: Lukas Zapolskas <lukas.zapolskas@arm.com>
> ---
>  drivers/gpu/drm/panthor/Makefile         |  1 +
>  drivers/gpu/drm/panthor/panthor_device.h |  3 ++
>  drivers/gpu/drm/panthor/panthor_drv.c    | 11 +++++-
>  drivers/gpu/drm/panthor/panthor_fw.c     |  4 ++
>  drivers/gpu/drm/panthor/panthor_fw.h     |  4 ++
>  drivers/gpu/drm/panthor/panthor_perf.c   | 47 ++++++++++++++++++++++++
>  drivers/gpu/drm/panthor/panthor_perf.h   | 12 ++++++
>  7 files changed, 81 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/gpu/drm/panthor/panthor_perf.c
>  create mode 100644 drivers/gpu/drm/panthor/panthor_perf.h
> 
> diff --git a/drivers/gpu/drm/panthor/Makefile b/drivers/gpu/drm/panthor/Makefile
> index 15294719b09c..0df9947f3575 100644
> --- a/drivers/gpu/drm/panthor/Makefile
> +++ b/drivers/gpu/drm/panthor/Makefile
> @@ -9,6 +9,7 @@ panthor-y := \
>  	panthor_gpu.o \
>  	panthor_heap.o \
>  	panthor_mmu.o \
> +	panthor_perf.o \
>  	panthor_sched.o
>  
>  obj-$(CONFIG_DRM_PANTHOR) += panthor.o
> diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> index 0e68f5a70d20..636542c1dcbd 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.h
> +++ b/drivers/gpu/drm/panthor/panthor_device.h
> @@ -119,6 +119,9 @@ struct panthor_device {
>  	/** @csif_info: Command stream interface information. */
>  	struct drm_panthor_csif_info csif_info;
>  
> +	/** @perf_info: Performance counter interface information. */
> +	struct drm_panthor_perf_info perf_info;
> +
>  	/** @gpu: GPU management data. */
>  	struct panthor_gpu *gpu;
>  
> diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
> index ad46a40ed9e1..e0ac3107c69e 100644
> --- a/drivers/gpu/drm/panthor/panthor_drv.c
> +++ b/drivers/gpu/drm/panthor/panthor_drv.c
> @@ -175,7 +175,9 @@ panthor_get_uobj_array(const struct drm_panthor_obj_array *in, u32 min_stride,
>  		 PANTHOR_UOBJ_DECL(struct drm_panthor_sync_op, timeline_value), \
>  		 PANTHOR_UOBJ_DECL(struct drm_panthor_queue_submit, syncs), \
>  		 PANTHOR_UOBJ_DECL(struct drm_panthor_queue_create, ringbuf_size), \
> -		 PANTHOR_UOBJ_DECL(struct drm_panthor_vm_bind_op, syncs))
> +		 PANTHOR_UOBJ_DECL(struct drm_panthor_vm_bind_op, syncs), \
> +		 PANTHOR_UOBJ_DECL(struct drm_panthor_perf_info, shader_blocks))
> +
>  
>  /**
>   * PANTHOR_UOBJ_SET() - Copy a kernel object to a user object.
> @@ -834,6 +836,10 @@ static int panthor_ioctl_dev_query(struct drm_device *ddev, void *data, struct d
>  			args->size = sizeof(priorities_info);
>  			return 0;
>  
> +		case DRM_PANTHOR_DEV_QUERY_PERF_INFO:
> +			args->size = sizeof(ptdev->perf_info);
> +			return 0;
> +
>  		default:
>  			return -EINVAL;
>  		}
> @@ -858,6 +864,9 @@ static int panthor_ioctl_dev_query(struct drm_device *ddev, void *data, struct d
>  		panthor_query_group_priorities_info(file, &priorities_info);
>  		return PANTHOR_UOBJ_SET(args->pointer, args->size, priorities_info);
>  
> +	case DRM_PANTHOR_DEV_QUERY_PERF_INFO:
> +		return PANTHOR_UOBJ_SET(args->pointer, args->size, ptdev->perf_info);
> +
>  	default:
>  		return -EINVAL;
>  	}
> diff --git a/drivers/gpu/drm/panthor/panthor_fw.c b/drivers/gpu/drm/panthor/panthor_fw.c
> index 4a2e36504fea..e9530d1d9781 100644
> --- a/drivers/gpu/drm/panthor/panthor_fw.c
> +++ b/drivers/gpu/drm/panthor/panthor_fw.c
> @@ -21,6 +21,7 @@
>  #include "panthor_gem.h"
>  #include "panthor_gpu.h"
>  #include "panthor_mmu.h"
> +#include "panthor_perf.h"
>  #include "panthor_regs.h"
>  #include "panthor_sched.h"
>  
> @@ -1417,6 +1418,9 @@ int panthor_fw_init(struct panthor_device *ptdev)
>  		goto err_unplug_fw;
>  
>  	panthor_fw_init_global_iface(ptdev);
> +
> +	panthor_perf_info_init(ptdev);
> +
>  	return 0;
>  
>  err_unplug_fw:
> diff --git a/drivers/gpu/drm/panthor/panthor_fw.h b/drivers/gpu/drm/panthor/panthor_fw.h
> index 22448abde992..db10358e24bb 100644
> --- a/drivers/gpu/drm/panthor/panthor_fw.h
> +++ b/drivers/gpu/drm/panthor/panthor_fw.h
> @@ -5,6 +5,7 @@
>  #define __PANTHOR_MCU_H__
>  
>  #include <linux/types.h>
> +#include <linux/spinlock.h>
>  
>  struct panthor_device;
>  struct panthor_kernel_bo;
> @@ -197,8 +198,11 @@ struct panthor_fw_global_control_iface {
>  	u32 output_va;
>  	u32 group_num;
>  	u32 group_stride;
> +#define GLB_PERFCNT_FW_SIZE(x) ((((x) >> 16) << 8))
>  	u32 perfcnt_size;
>  	u32 instr_features;
> +#define PERFCNT_FEATURES_MD_SIZE(x) ((x) & GENMASK(3, 0))
> +	u32 perfcnt_features;
>  };
>  
>  struct panthor_fw_global_input_iface {
> diff --git a/drivers/gpu/drm/panthor/panthor_perf.c b/drivers/gpu/drm/panthor/panthor_perf.c
> new file mode 100644
> index 000000000000..0e3d769c1805
> --- /dev/null
> +++ b/drivers/gpu/drm/panthor/panthor_perf.c
> @@ -0,0 +1,47 @@
> +// SPDX-License-Identifier: GPL-2.0 or MIT
> +/* Copyright 2023 Collabora Ltd */
> +/* Copyright 2024 Arm ltd. */
> +
> +#include <drm/drm_file.h>
> +#include <drm/drm_gem_shmem_helper.h>
> +#include <drm/drm_managed.h>
> +#include <drm/panthor_drm.h>
> +
> +#include "panthor_device.h"
> +#include "panthor_fw.h"
> +#include "panthor_gpu.h"
> +#include "panthor_perf.h"
> +#include "panthor_regs.h"
> +
> +/**
> + * PANTHOR_PERF_COUNTERS_PER_BLOCK - On CSF architectures pre-11.x, the number of counters
> + * per block was hardcoded to be 64. Arch 11.0 onwards supports the PRFCNT_FEATURES GPU register,
> + * which indicates the same information.
> + */
> +#define PANTHOR_PERF_COUNTERS_PER_BLOCK (64)
> +
> +void panthor_perf_info_init(struct panthor_device *ptdev)
> +{
> +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
> +	struct drm_panthor_perf_info *const perf_info = &ptdev->perf_info;
> +
> +	if (PERFCNT_FEATURES_MD_SIZE(glb_iface->control->perfcnt_features))
> +		perf_info->flags |= DRM_PANTHOR_PERF_BLOCK_STATES_SUPPORT;
> +
> +	if (GPU_ARCH_MAJOR(ptdev->gpu_info.gpu_id) < 11)
> +		perf_info->counters_per_block = PANTHOR_PERF_COUNTERS_PER_BLOCK;
> +
> +	perf_info->sample_header_size = sizeof(struct drm_panthor_perf_sample_header);
> +	perf_info->block_header_size = sizeof(struct drm_panthor_perf_block_header);
> +
> +	if (GLB_PERFCNT_FW_SIZE(glb_iface->control->perfcnt_size)) {
> +		perf_info->fw_blocks = 1;
> +		perf_info->csg_blocks = glb_iface->control->group_num;
> +	}
> +
> +	perf_info->cshw_blocks = 1;
> +	perf_info->tiler_blocks = 1;
> +	perf_info->memsys_blocks = hweight64(ptdev->gpu_info.l2_present);
> +	perf_info->shader_blocks = hweight64(ptdev->gpu_info.shader_present);
> +}
> +
> diff --git a/drivers/gpu/drm/panthor/panthor_perf.h b/drivers/gpu/drm/panthor/panthor_perf.h
> new file mode 100644
> index 000000000000..cff537a370c9
> --- /dev/null
> +++ b/drivers/gpu/drm/panthor/panthor_perf.h
> @@ -0,0 +1,12 @@
> +/* SPDX-License-Identifier: GPL-2.0 or MIT */
> +/* Copyright 2024 Collabora Ltd */
> +/* Copyright 2024 Arm ltd. */
> +
> +#ifndef __PANTHOR_PERF_H__
> +#define __PANTHOR_PERF_H__
> +
> +struct panthor_device;
> +
> +void panthor_perf_info_init(struct panthor_device *ptdev);
> +
> +#endif /* __PANTHOR_PERF_H__ */
> -- 
> 2.25.1


Adrian Larumbe

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC v2 3/8] drm/panthor: Add panthor_perf_init and panthor_perf_unplug
  2024-12-11 16:50 ` [RFC v2 3/8] drm/panthor: Add panthor_perf_init and panthor_perf_unplug Lukas Zapolskas
@ 2025-01-27 12:46   ` Adrián Larumbe
  2025-03-26 14:36     ` Lukas Zapolskas
  2025-01-27 15:50   ` adrian.larumbe
  1 sibling, 1 reply; 28+ messages in thread
From: Adrián Larumbe @ 2025-01-27 12:46 UTC (permalink / raw)
  To: Lukas Zapolskas
  Cc: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	dri-devel, linux-kernel, Mihail Atanassov, nd

On 11.12.2024 16:50, Lukas Zapolskas wrote:
> Added the panthor_perf system initialization and unplug code to allow
> for the handling of userspace sessions to be added in follow-up patches.
> 
> Signed-off-by: Lukas Zapolskas <lukas.zapolskas@arm.com>
> ---
>  drivers/gpu/drm/panthor/panthor_device.c |  7 +++
>  drivers/gpu/drm/panthor/panthor_device.h |  5 +-
>  drivers/gpu/drm/panthor/panthor_perf.c   | 77 ++++++++++++++++++++++++
>  drivers/gpu/drm/panthor/panthor_perf.h   |  3 +
>  4 files changed, 91 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
> index 00f7b8ce935a..1a81a436143b 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.c
> +++ b/drivers/gpu/drm/panthor/panthor_device.c
> @@ -19,6 +19,7 @@
>  #include "panthor_fw.h"
>  #include "panthor_gpu.h"
>  #include "panthor_mmu.h"
> +#include "panthor_perf.h"
>  #include "panthor_regs.h"
>  #include "panthor_sched.h"
>  
> @@ -97,6 +98,7 @@ void panthor_device_unplug(struct panthor_device *ptdev)
>  	/* Now, try to cleanly shutdown the GPU before the device resources
>  	 * get reclaimed.
>  	 */
> +	panthor_perf_unplug(ptdev);
>  	panthor_sched_unplug(ptdev);
>  	panthor_fw_unplug(ptdev);
>  	panthor_mmu_unplug(ptdev);
> @@ -262,6 +264,10 @@ int panthor_device_init(struct panthor_device *ptdev)
>  	if (ret)
>  		goto err_unplug_fw;
>  
> +	ret = panthor_perf_init(ptdev);
> +	if (ret)
> +		goto err_unplug_fw;
> +
>  	/* ~3 frames */
>  	pm_runtime_set_autosuspend_delay(ptdev->base.dev, 50);
>  	pm_runtime_use_autosuspend(ptdev->base.dev);
> @@ -275,6 +281,7 @@ int panthor_device_init(struct panthor_device *ptdev)
>  
>  err_disable_autosuspend:
>  	pm_runtime_dont_use_autosuspend(ptdev->base.dev);
> +	panthor_perf_unplug(ptdev);
>  	panthor_sched_unplug(ptdev);
>  
>  err_unplug_fw:
> diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> index 636542c1dcbd..aca33d03036c 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.h
> +++ b/drivers/gpu/drm/panthor/panthor_device.h
> @@ -26,7 +26,7 @@ struct panthor_heap_pool;
>  struct panthor_job;
>  struct panthor_mmu;
>  struct panthor_fw;
> -struct panthor_perfcnt;
> +struct panthor_perf;
>  struct panthor_vm;
>  struct panthor_vm_pool;
>  
> @@ -137,6 +137,9 @@ struct panthor_device {
>  	/** @devfreq: Device frequency scaling management data. */
>  	struct panthor_devfreq *devfreq;
>  
> +	/** @perf: Performance counter management data. */
> +	struct panthor_perf *perf;
> +
>  	/** @unplug: Device unplug related fields. */
>  	struct {
>  		/** @lock: Lock used to serialize unplug operations. */
> diff --git a/drivers/gpu/drm/panthor/panthor_perf.c b/drivers/gpu/drm/panthor/panthor_perf.c
> index 0e3d769c1805..e0dc6c4b0cf1 100644
> --- a/drivers/gpu/drm/panthor/panthor_perf.c
> +++ b/drivers/gpu/drm/panthor/panthor_perf.c
> @@ -13,6 +13,24 @@
>  #include "panthor_perf.h"
>  #include "panthor_regs.h"
>  
> +struct panthor_perf {
> +	/**
> +	 * @block_set: The global counter set configured onto the HW.
> +	 */
> +	u8 block_set;

I think this field is not used in any further patches. Only in the sampler
struct definition later on you include the same field and assign it from
the ioctl setup arguments.

> +	/** @next_session: The ID of the next session. */
> +	u32 next_session;
> +
> +	/** @session_range: The number of sessions supported at a time. */
> +	struct xa_limit session_range;
> +
> +	/**
> +	 * @sessions: Global map of sessions, accessed by their ID.
> +	 */
> +	struct xarray sessions;
> +};
> +
>  /**
>   * PANTHOR_PERF_COUNTERS_PER_BLOCK - On CSF architectures pre-11.x, the number of counters
>   * per block was hardcoded to be 64. Arch 11.0 onwards supports the PRFCNT_FEATURES GPU register,
> @@ -45,3 +63,62 @@ void panthor_perf_info_init(struct panthor_device *ptdev)
>  	perf_info->shader_blocks = hweight64(ptdev->gpu_info.shader_present);
>  }
>  
> +/**
> + * panthor_perf_init - Initialize the performance counter subsystem.
> + * @ptdev: Panthor device
> + *
> + * The performance counters require the FW interface to be available to setup the
> + * sampling ringbuffers, so this must be called only after FW is initialized.
> + *
> + * Return: 0 on success, negative error code on failure.
> + */
> +int panthor_perf_init(struct panthor_device *ptdev)
> +{
> +	struct panthor_perf *perf;
> +
> +	if (!ptdev)
> +		return -EINVAL;
> +
> +	perf = devm_kzalloc(ptdev->base.dev, sizeof(*perf), GFP_KERNEL);
> +	if (ZERO_OR_NULL_PTR(perf))
> +		return -ENOMEM;
> +
> +	xa_init_flags(&perf->sessions, XA_FLAGS_ALLOC);
> +
> +	/* Currently, we only support a single session at a time. */
> +	perf->session_range = (struct xa_limit) {
> +		.min = 0,
> +		.max = 1,
> +	};

I guess at the moment we only allow a single session because periodic sampling
isn't yet implemented. Does that mean multisession support will not be made
available for manual samplers in the future?

> +
> +	drm_info(&ptdev->base, "Performance counter subsystem initialized");
> +
> +	ptdev->perf = perf;
> +
> +	return 0;
> +}
> +
> +/**
> + * panthor_perf_unplug - Terminate the performance counter subsystem.
> + * @ptdev: Panthor device.
> + *
> + * This function will terminate the performance counter control structures and any remaining
> + * sessions, after waiting for any pending interrupts.
> + */
> +void panthor_perf_unplug(struct panthor_device *ptdev)
> +{
> +	struct panthor_perf *perf = ptdev->perf;
> +
> +	if (!perf)
> +		return;
> +
> +	if (!xa_empty(&perf->sessions))
> +		drm_err(&ptdev->base,
> +				"Performance counter sessions active when unplugging the driver!");
> +
> +	xa_destroy(&perf->sessions);
> +
> +	devm_kfree(ptdev->base.dev, ptdev->perf);

If we always call devm_kfree, then what is the point of allocating ptdev->perf
with devm_kzalloc?

> +	ptdev->perf = NULL;
> +}
> diff --git a/drivers/gpu/drm/panthor/panthor_perf.h b/drivers/gpu/drm/panthor/panthor_perf.h
> index cff537a370c9..90af8b18358c 100644
> --- a/drivers/gpu/drm/panthor/panthor_perf.h
> +++ b/drivers/gpu/drm/panthor/panthor_perf.h
> @@ -9,4 +9,7 @@ struct panthor_device;
>  
>  void panthor_perf_info_init(struct panthor_device *ptdev);
>  
> +int panthor_perf_init(struct panthor_device *ptdev);
> +void panthor_perf_unplug(struct panthor_device *ptdev);
> +
>  #endif /* __PANTHOR_PERF_H__ */
> -- 
> 2.25.1


Adrian Larumbe

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC v2 4/8] drm/panthor: Add panthor perf ioctls
  2024-12-11 16:50 ` [RFC v2 4/8] drm/panthor: Add panthor perf ioctls Lukas Zapolskas
@ 2025-01-27 14:06   ` Adrián Larumbe
  2025-03-26 14:40     ` Lukas Zapolskas
  0 siblings, 1 reply; 28+ messages in thread
From: Adrián Larumbe @ 2025-01-27 14:06 UTC (permalink / raw)
  To: Lukas Zapolskas
  Cc: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	dri-devel, linux-kernel, Mihail Atanassov, nd

On 11.12.2024 16:50, Lukas Zapolskas wrote:
> This patch implements the PANTHOR_PERF_CONTROL ioctl series, and
> a PANTHOR_GET_UOBJ wrapper to deal with the backwards and forwards
> compatibility of the uAPI.
> 
> Stub function definitions are added to ensure the patch builds on its own,
> and will be removed later in the series.
> 
> Signed-off-by: Lukas Zapolskas <lukas.zapolskas@arm.com>
> ---
>  drivers/gpu/drm/panthor/panthor_drv.c  | 155 ++++++++++++++++++++++++-
>  drivers/gpu/drm/panthor/panthor_perf.c |  34 ++++++
>  drivers/gpu/drm/panthor/panthor_perf.h |  19 +++
>  3 files changed, 206 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
> index e0ac3107c69e..458175f58b15 100644
> --- a/drivers/gpu/drm/panthor/panthor_drv.c
> +++ b/drivers/gpu/drm/panthor/panthor_drv.c
> @@ -7,6 +7,7 @@
>  #include <asm/arch_timer.h>
>  #endif
>  
> +#include <linux/cleanup.h>
>  #include <linux/list.h>
>  #include <linux/module.h>
>  #include <linux/of_platform.h>
> @@ -31,6 +32,7 @@
>  #include "panthor_gpu.h"
>  #include "panthor_heap.h"
>  #include "panthor_mmu.h"
> +#include "panthor_perf.h"
>  #include "panthor_regs.h"
>  #include "panthor_sched.h"
>  
> @@ -73,6 +75,39 @@ panthor_set_uobj(u64 usr_ptr, u32 usr_size, u32 min_size, u32 kern_size, const v
>  	return 0;
>  }
>  
> +/**
> + * panthor_get_uobj() - Copy kernel object to user object.
> + * @usr_ptr: Users pointer.
> + * @usr_size: Size of the user object.
> + * @min_size: Minimum size for this object.
> + *
> + * Helper automating kernel -> user object copies.
> + *
> + * Don't use this function directly, use PANTHOR_UOBJ_GET() instead.
> + *
> + * Return: valid pointer on success, an encoded error code otherwise.
> + */
> +static void*
> +panthor_get_uobj(u64 usr_ptr, u32 usr_size, u32 min_size)
> +{
> +	int ret;
> +	void *out_alloc __free(kvfree) = NULL;
> +
> +	/* User size shouldn't be smaller than the minimal object size. */
> +	if (usr_size < min_size)
> +		return ERR_PTR(-EINVAL);
> +
> +	out_alloc = kvmalloc(min_size, GFP_KERNEL);
> +	if (!out_alloc)
> +		return ERR_PTR(-ENOMEM);
> +
> +	ret = copy_struct_from_user(out_alloc, min_size, u64_to_user_ptr(usr_ptr), usr_size);
> +	if (ret)
> +		return ERR_PTR(ret);
> +
> +	return_ptr(out_alloc);
> +}
> +
>  /**
>   * panthor_get_uobj_array() - Copy a user object array into a kernel accessible object array.
>   * @in: The object array to copy.
> @@ -176,8 +211,11 @@ panthor_get_uobj_array(const struct drm_panthor_obj_array *in, u32 min_stride,
>  		 PANTHOR_UOBJ_DECL(struct drm_panthor_queue_submit, syncs), \
>  		 PANTHOR_UOBJ_DECL(struct drm_panthor_queue_create, ringbuf_size), \
>  		 PANTHOR_UOBJ_DECL(struct drm_panthor_vm_bind_op, syncs), \
> -		 PANTHOR_UOBJ_DECL(struct drm_panthor_perf_info, shader_blocks))
> -
> +		 PANTHOR_UOBJ_DECL(struct drm_panthor_perf_info, shader_blocks), \
> +		 PANTHOR_UOBJ_DECL(struct drm_panthor_perf_cmd_setup, shader_enable_mask), \
> +		 PANTHOR_UOBJ_DECL(struct drm_panthor_perf_cmd_start, user_data), \
> +		 PANTHOR_UOBJ_DECL(struct drm_panthor_perf_cmd_stop, user_data), \
> +		 PANTHOR_UOBJ_DECL(struct drm_panthor_perf_cmd_sample, user_data))
>  
>  /**
>   * PANTHOR_UOBJ_SET() - Copy a kernel object to a user object.
> @@ -192,6 +230,24 @@ panthor_get_uobj_array(const struct drm_panthor_obj_array *in, u32 min_stride,
>  			 PANTHOR_UOBJ_MIN_SIZE(_src_obj), \
>  			 sizeof(_src_obj), &(_src_obj))
>  
> +/**
> + * PANTHOR_UOBJ_GET() - Copies a user object from _usr_ptr to a kernel accessible _dest_ptr.
> + * @_dest_ptr: Local varialbe
> + * @_usr_size: Size of the user object.
> + * @_usr_ptr: The pointer of the object in userspace.
> + *
> + * Return: Error code. See panthor_get_uobj().
> + */
> +#define PANTHOR_UOBJ_GET(_dest_ptr, _usr_size, _usr_ptr) \
> +	({ \
> +		typeof(_dest_ptr) _tmp; \
> +		_tmp = panthor_get_uobj(_usr_ptr, _usr_size, \
> +				PANTHOR_UOBJ_MIN_SIZE(_tmp[0])); \
> +		if (!IS_ERR(_tmp)) \
> +			_dest_ptr = _tmp; \
> +		PTR_ERR_OR_ZERO(_tmp); \
> +	})
> +
>  /**
>   * PANTHOR_UOBJ_GET_ARRAY() - Copy a user object array to a kernel accessible
>   * object array.
> @@ -1339,6 +1395,99 @@ static int panthor_ioctl_vm_get_state(struct drm_device *ddev, void *data,
>  	return 0;
>  }
>  
> +static int panthor_ioctl_perf_control(struct drm_device *ddev, void *data,
> +		struct drm_file *file)
> +{
> +	struct panthor_device *ptdev = container_of(ddev, struct panthor_device, base);
> +	struct panthor_file *pfile = file->driver_priv;
> +	struct drm_panthor_perf_control *args = data;
> +	int ret;
> +
> +	if (!args->pointer) {
> +		switch (args->cmd) {
> +		case DRM_PANTHOR_PERF_COMMAND_SETUP:
> +			args->size = sizeof(struct drm_panthor_perf_cmd_setup);
> +			return 0;
> +
> +		case DRM_PANTHOR_PERF_COMMAND_TEARDOWN:
> +			args->size = 0;
> +			return 0;
> +
> +		case DRM_PANTHOR_PERF_COMMAND_START:
> +			args->size = sizeof(struct drm_panthor_perf_cmd_start);
> +			return 0;
> +
> +		case DRM_PANTHOR_PERF_COMMAND_STOP:
> +			args->size = sizeof(struct drm_panthor_perf_cmd_stop);
> +			return 0;
> +
> +		case DRM_PANTHOR_PERF_COMMAND_SAMPLE:
> +			args->size = sizeof(struct drm_panthor_perf_cmd_sample);
> +			return 0;
> +
> +		default:
> +			return -EINVAL;
> +		}
> +	}
> +
> +	switch (args->cmd) {
> +	case DRM_PANTHOR_PERF_COMMAND_SETUP:
> +	{
> +		struct drm_panthor_perf_cmd_setup *setup_args __free(kvfree) = NULL;
> +
> +		ret = PANTHOR_UOBJ_GET(setup_args, args->size, args->pointer);
> +		if (ret)
> +			return -EINVAL;
> +
> +		if (setup_args->pad[0])
> +			return -EINVAL;
> +
> +		ret = panthor_perf_session_setup(ptdev, ptdev->perf, setup_args, pfile);

Shouldn't we return the session id as an output param in setup_args or is the
ioctl's return value enough for this?

> +
> +		return ret;
> +	}
> +	case DRM_PANTHOR_PERF_COMMAND_TEARDOWN:
> +	{
> +		return panthor_perf_session_teardown(pfile, ptdev->perf, args->handle);
> +	}
> +	case DRM_PANTHOR_PERF_COMMAND_START:
> +	{
> +		struct drm_panthor_perf_cmd_start *start_args __free(kvfree) = NULL;
> +
> +		ret = PANTHOR_UOBJ_GET(start_args, args->size, args->pointer);
> +		if (ret)
> +			return -EINVAL;
> +
> +		return panthor_perf_session_start(pfile, ptdev->perf, args->handle,
> +				start_args->user_data);
> +	}
> +	case DRM_PANTHOR_PERF_COMMAND_STOP:
> +	{
> +		struct drm_panthor_perf_cmd_stop *stop_args __free(kvfree) = NULL;
> +
> +		ret = PANTHOR_UOBJ_GET(stop_args, args->size, args->pointer);
> +		if (ret)
> +			return -EINVAL;
> +
> +		return panthor_perf_session_stop(pfile, ptdev->perf, args->handle,
> +				stop_args->user_data);
> +	}
> +	case DRM_PANTHOR_PERF_COMMAND_SAMPLE:
> +	{
> +		struct drm_panthor_perf_cmd_sample *sample_args __free(kvfree) = NULL;
> +
> +		ret = PANTHOR_UOBJ_GET(sample_args, args->size, args->pointer);
> +		if (ret)
> +			return -EINVAL;
> +
> +		return panthor_perf_session_sample(pfile, ptdev->perf, args->handle,
> +					sample_args->user_data);
> +	}

For the three cases above, you could define a macro like:

#define perf_cmd(command)							\
	({								\
		struct drm_panthor_perf_cmd_##command * command##_args __free(kvfree) = NULL; \
									\
		ret = PANTHOR_UOBJ_GET(command##_args, args->size, args->pointer); \
		if (ret)						\
			return -EINVAL;					\
		return panthor_perf_session_##command(pfile, ptdev->perf, args->handle, command##_args->user_data); \
	})

	and then do 'perf_cmd(command);' inside each one of them

> +	default:
> +		return -EINVAL;
> +	}
> +}
> +
>  static int
>  panthor_open(struct drm_device *ddev, struct drm_file *file)
>  {
> @@ -1386,6 +1535,7 @@ panthor_postclose(struct drm_device *ddev, struct drm_file *file)
>  
>  	panthor_group_pool_destroy(pfile);
>  	panthor_vm_pool_destroy(pfile);
> +	panthor_perf_session_destroy(pfile, pfile->ptdev->perf);

I would perhaps do this first because pools are first created during file
opening, just to undo things in the opposite sequence.
>  
>  	kfree(pfile);
>  	module_put(THIS_MODULE);
> @@ -1408,6 +1558,7 @@ static const struct drm_ioctl_desc panthor_drm_driver_ioctls[] = {
>  	PANTHOR_IOCTL(TILER_HEAP_CREATE, tiler_heap_create, DRM_RENDER_ALLOW),
>  	PANTHOR_IOCTL(TILER_HEAP_DESTROY, tiler_heap_destroy, DRM_RENDER_ALLOW),
>  	PANTHOR_IOCTL(GROUP_SUBMIT, group_submit, DRM_RENDER_ALLOW),
> +	PANTHOR_IOCTL(PERF_CONTROL, perf_control, DRM_RENDER_ALLOW),
>  };
>  
>  static int panthor_mmap(struct file *filp, struct vm_area_struct *vma)
> diff --git a/drivers/gpu/drm/panthor/panthor_perf.c b/drivers/gpu/drm/panthor/panthor_perf.c
> index e0dc6c4b0cf1..6498279ec036 100644
> --- a/drivers/gpu/drm/panthor/panthor_perf.c
> +++ b/drivers/gpu/drm/panthor/panthor_perf.c
> @@ -63,6 +63,40 @@ void panthor_perf_info_init(struct panthor_device *ptdev)
>  	perf_info->shader_blocks = hweight64(ptdev->gpu_info.shader_present);
>  }
>  
> +int panthor_perf_session_setup(struct panthor_device *ptdev, struct panthor_perf *perf,
> +		struct drm_panthor_perf_cmd_setup *setup_args,
> +		struct panthor_file *pfile)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +int panthor_perf_session_teardown(struct panthor_file *pfile, struct panthor_perf *perf,
> +		u32 sid)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +int panthor_perf_session_start(struct panthor_file *pfile, struct panthor_perf *perf,
> +		u32 sid, u64 user_data)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +int panthor_perf_session_stop(struct panthor_file *pfile, struct panthor_perf *perf,
> +		u32 sid, u64 user_data)
> +{
> +		return -EOPNOTSUPP;
> +}
> +
> +int panthor_perf_session_sample(struct panthor_file *pfile, struct panthor_perf *perf,
> +		u32 sid, u64 user_data)
> +{
> +	return -EOPNOTSUPP;
> +
> +}
> +
> +void panthor_perf_session_destroy(struct panthor_file *pfile, struct panthor_perf *perf) { }
> +
>  /**
>   * panthor_perf_init - Initialize the performance counter subsystem.
>   * @ptdev: Panthor device
> diff --git a/drivers/gpu/drm/panthor/panthor_perf.h b/drivers/gpu/drm/panthor/panthor_perf.h
> index 90af8b18358c..bfef8874068b 100644
> --- a/drivers/gpu/drm/panthor/panthor_perf.h
> +++ b/drivers/gpu/drm/panthor/panthor_perf.h
> @@ -5,11 +5,30 @@
>  #ifndef __PANTHOR_PERF_H__
>  #define __PANTHOR_PERF_H__
>  
> +#include <linux/types.h>
> +
> +struct drm_gem_object;
> +struct drm_panthor_perf_cmd_setup;
>  struct panthor_device;
> +struct panthor_file;
> +struct panthor_perf;
>  
>  void panthor_perf_info_init(struct panthor_device *ptdev);
>  
>  int panthor_perf_init(struct panthor_device *ptdev);
>  void panthor_perf_unplug(struct panthor_device *ptdev);
>  
> +int panthor_perf_session_setup(struct panthor_device *ptdev, struct panthor_perf *perf,
> +		struct drm_panthor_perf_cmd_setup *setup_args,
> +		struct panthor_file *pfile);
> +int panthor_perf_session_teardown(struct panthor_file *pfile, struct panthor_perf *perf,
> +		u32 sid);
> +int panthor_perf_session_start(struct panthor_file *pfile, struct panthor_perf *perf,
> +		u32 sid, u64 user_data);
> +int panthor_perf_session_stop(struct panthor_file *pfile, struct panthor_perf *perf,
> +		u32 sid, u64 user_data);
> +int panthor_perf_session_sample(struct panthor_file *pfile, struct panthor_perf *perf,
> +		u32 sid, u64 user_data);
> +void panthor_perf_session_destroy(struct panthor_file *pfile, struct panthor_perf *perf);
> +
>  #endif /* __PANTHOR_PERF_H__ */
> -- 
> 2.25.1


Adrian Larumbe

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC v2 5/8] drm/panthor: Introduce sampling sessions to handle userspace clients
  2024-12-11 16:50 ` [RFC v2 5/8] drm/panthor: Introduce sampling sessions to handle userspace clients Lukas Zapolskas
@ 2025-01-27 15:43   ` Adrián Larumbe
  2025-03-26 15:14     ` Lukas Zapolskas
  2025-01-27 21:39   ` Adrián Larumbe
  1 sibling, 1 reply; 28+ messages in thread
From: Adrián Larumbe @ 2025-01-27 15:43 UTC (permalink / raw)
  To: Lukas Zapolskas
  Cc: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	dri-devel, linux-kernel, Mihail Atanassov, nd

On 11.12.2024 16:50, Lukas Zapolskas wrote:
> To allow for combining the requests from multiple userspace clients, an
> intermediary layer between the HW/FW interfaces and userspace is
> created, containing the information for the counter requests and
> tracking of insert and extract indices. Each session starts inactive and
> must be explicitly activated via PERF_CONTROL.START, and explicitly
> stopped via PERF_CONTROL.STOP. Userspace identifies a single client with
> its session ID and the panthor file it is associated with.
> 
> The SAMPLE and STOP commands both produce a single sample when called,
> and these samples can be disambiguated via the opaque user data field
> passed in the PERF_CONTROL uAPI. If this functionality is not desired,
> these fields can be kept as zero, as the kernel copies this value into
> the corresponding sample without attempting to interpret it.
> 
> Currently, only manual sampling sessions are supported, providing
> samples when userspace calls PERF_CONTROL.SAMPLE, and only a single
> session is allowed at a time. Multiple sessions and periodic sampling
> will be enabled in following patches.
> 
> No protected is provided against the 32-bit hardware counter overflows,

Spelling: protected

> so for the moment it is up to userspace to ensure that the counters are
> sampled at a reasonable frequency.

> The counter set enum is added to the uapi to clarify the restrictions on
> calling the interface.
> 
> Signed-off-by: Lukas Zapolskas <lukas.zapolskas@arm.com>
> ---
>  drivers/gpu/drm/panthor/panthor_device.h |   3 +
>  drivers/gpu/drm/panthor/panthor_drv.c    |   1 +
>  drivers/gpu/drm/panthor/panthor_perf.c   | 697 ++++++++++++++++++++++-
>  include/uapi/drm/panthor_drm.h           |  50 +-
>  4 files changed, 732 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> index aca33d03036c..9ed1e9aed521 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.h
> +++ b/drivers/gpu/drm/panthor/panthor_device.h
> @@ -210,6 +210,9 @@ struct panthor_file {
>  	/** @ptdev: Device attached to this file. */
>  	struct panthor_device *ptdev;
>  
> +	/** @drm_file: Corresponding drm_file */
> +	struct drm_file *drm_file;

I think you could do away with this member, wrote more about this below.

>  	/** @vms: VM pool attached to this file. */
>  	struct panthor_vm_pool *vms;
>  
> diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
> index 458175f58b15..2848ab442d10 100644
> --- a/drivers/gpu/drm/panthor/panthor_drv.c
> +++ b/drivers/gpu/drm/panthor/panthor_drv.c
> @@ -1505,6 +1505,7 @@ panthor_open(struct drm_device *ddev, struct drm_file *file)
>  	}
>  
>  	pfile->ptdev = ptdev;
> +	pfile->drm_file = file;

Same as above, feel like this is not necessary.

>  
>  	ret = panthor_vm_pool_create(pfile);
>  	if (ret)
> diff --git a/drivers/gpu/drm/panthor/panthor_perf.c b/drivers/gpu/drm/panthor/panthor_perf.c
> index 6498279ec036..42d8b6f8c45d 100644
> --- a/drivers/gpu/drm/panthor/panthor_perf.c
> +++ b/drivers/gpu/drm/panthor/panthor_perf.c
> @@ -3,16 +3,162 @@
>  /* Copyright 2024 Arm ltd. */
>  
>  #include <drm/drm_file.h>
> +#include <drm/drm_gem.h>
>  #include <drm/drm_gem_shmem_helper.h>
>  #include <drm/drm_managed.h>
> +#include <drm/drm_print.h>
>  #include <drm/panthor_drm.h>
>  
> +#include <linux/circ_buf.h>
> +#include <linux/iosys-map.h>
> +#include <linux/pm_runtime.h>
> +
>  #include "panthor_device.h"
>  #include "panthor_fw.h"
>  #include "panthor_gpu.h"
>  #include "panthor_perf.h"
>  #include "panthor_regs.h"
>  
> +/**
> + * PANTHOR_PERF_EM_BITS - Number of bits in a user-facing enable mask. This must correspond
> + *                        to the maximum number of counters available for selection on the newest
> + *                        Mali GPUs (128 as of the Mali-Gx15).
> + */
> +#define PANTHOR_PERF_EM_BITS (BITS_PER_TYPE(u64) * 2)
> +
> +/**
> + * enum panthor_perf_session_state - Session state bits.
> + */
> +enum panthor_perf_session_state {
> +	/** @PANTHOR_PERF_SESSION_ACTIVE: The session is active and can be used for sampling. */
> +	PANTHOR_PERF_SESSION_ACTIVE = 0,
> +
> +	/**
> +	 * @PANTHOR_PERF_SESSION_OVERFLOW: The session encountered an overflow in one of the
> +	 *                                 counters during the last sampling period. This flag
> +	 *                                 gets propagated as part of samples emitted for this
> +	 *                                 session, to ensure the userspace client can gracefully
> +	 *                                 handle this data corruption.
> +	 */

How would a client normally deal with data corruption in a sample?

> +	PANTHOR_PERF_SESSION_OVERFLOW,
> +
> +	/** @PANTHOR_PERF_SESSION_MAX: Bits needed to represent the state. Must be last.*/
> +	PANTHOR_PERF_SESSION_MAX,
> +};
> +
> +struct panthor_perf_enable_masks {
> +	/**
> +	 * @link: List node used to keep track of the enable masks aggregated by the sampler.
> +	 */
> +	struct list_head link;
> +
> +	/** @refs: Number of references taken out on an instantiated enable mask. */
> +	struct kref refs;
> +
> +	/**
> +	 * @mask: Array of bitmasks indicating the counters userspace requested, where
> +	 *        one bit represents a single counter. Used to build the firmware configuration
> +	 *        and ensure that userspace clients obtain only the counters they requested.
> +	 */
> +	DECLARE_BITMAP(mask, PANTHOR_PERF_EM_BITS)[DRM_PANTHOR_PERF_BLOCK_MAX];
> +};
> +
> +struct panthor_perf_counter_block {
> +	struct drm_panthor_perf_block_header header;
> +	u64 counters[];
> +};

I think I remember reading in the spec thata block header was 12 bytes in length
but the one defined here seems to have many more fields.

> +struct panthor_perf_session {
> +	DECLARE_BITMAP(state, PANTHOR_PERF_SESSION_MAX);

I'm wondering, because I don't remember having seen this pattern before.
Is it common in kernel code to declare bitmaps for masks of enum values in this way?

> +	/**
> +	 * @user_sample_size: The size of a single sample as exposed to userspace. For the sake of
> +	 *                    simplicity, the current implementation exposes the same structure
> +	 *                    as provided by firmware, after annotating the sample and the blocks,
> +	 *                    and zero-extending the counters themselves (to account for in-kernel
> +	 *                    accumulation).
> +	 *
> +	 *                    This may also allow further memory-optimizations of compressing the
> +	 *                    sample to provide only requested blocks, if deemed to be worth the
> +	 *                    additional complexity.
> +	 */
> +	size_t user_sample_size;
> +
> +	/**
> +	 * @sample_freq_ns: Period between subsequent sample requests. Zero indicates that
> +	 *                  userspace will be responsible for requesting samples.
> +	 */
> +	u64 sample_freq_ns;
> +
> +	/** @sample_start_ns: Sample request time, obtained from a monotonic raw clock. */
> +	u64 sample_start_ns;
> +
> +	/**
> +	 * @user_data: Opaque handle passed in when starting a session, requesting a sample (for
> +	 *             manual sampling sessions only) and when stopping a session. This handle
> +	 *             allows the disambiguation of a sample in the ringbuffer.
> +	 */
> +	u64 user_data;
> +
> +	/**
> +	 * @eventfd: Event file descriptor context used to signal userspace of a new sample
> +	 *           being emitted.
> +	 */
> +	struct eventfd_ctx *eventfd;
> +
> +	/**
> +	 * @enabled_counters: This session's requested counters. Note that these cannot change
> +	 *                    for the lifetime of the session.
> +	 */
> +	struct panthor_perf_enable_masks *enabled_counters;

It seems the enable mask for a session is tied to the session's lifetime. In
panthor_perf_session_setup(), you create one and then increase its reference
count from within panthor_perf_sampler_add(), which is not being called from
anywhere else. Maybe in that case you could do without the reference count and
have a non-pointer struct panthor_perf_enable_masks member here?

> +	/** @ringbuf_slots: Slots in the user-facing ringbuffer. */
> +	size_t ringbuf_slots;
> +
> +	/** @ring_buf: BO for the userspace ringbuffer. */
> +	struct drm_gem_object *ring_buf;
> +
> +	/**
> +	 * @control_buf: BO for the insert and extract indices.
> +	 */
> +	struct drm_gem_object *control_buf;
> +
> +	/**
> +	 * @extract_idx: The extract index is used by userspace to indicate the position of the
> +	 *               consumer in the ringbuffer.
> +	 */
> +	u32 *extract_idx;
> +
> +	/**
> +	 * @insert_idx: The insert index is used by the kernel to indicate the position of the
> +	 *              latest sample exposed to userspace.
> +	 */
> +	u32 *insert_idx;
> +
> +	/** @samples: The mapping of the @ring_buf into the kernel's VA space. */
> +	u8 *samples;
> +
> +	/**
> +	 * @waiting: The list node used by the sampler to track the sessions waiting for a sample.
> +	 */
> +	struct list_head waiting;
> +
> +	/**
> +	 * @pfile: The panthor file which was used to create a session, used for the postclose
> +	 *         handling and to prevent a misconfigured userspace from closing unrelated
> +	 *         sessions.
> +	 */
> +	struct panthor_file *pfile;
> +
> +	/**
> +	 * @ref: Session reference count. The sample delivery to userspace is asynchronous, meaning
> +	 *       the lifetime of the session must extend at least until the sample is exposed to
> +	 *       userspace.
> +	 */
> +	struct kref ref;
> +};
> +
> +
>  struct panthor_perf {
>  	/**
>  	 * @block_set: The global counter set configured onto the HW.
> @@ -63,39 +209,154 @@ void panthor_perf_info_init(struct panthor_device *ptdev)
>  	perf_info->shader_blocks = hweight64(ptdev->gpu_info.shader_present);
>  }
>  
> -int panthor_perf_session_setup(struct panthor_device *ptdev, struct panthor_perf *perf,
> -		struct drm_panthor_perf_cmd_setup *setup_args,
> -		struct panthor_file *pfile)
> +static struct panthor_perf_enable_masks *panthor_perf_em_new(void)
>  {
> -	return -EOPNOTSUPP;
> +	struct panthor_perf_enable_masks *em = kmalloc(sizeof(*em), GFP_KERNEL);
> +
> +	if (!em)
> +		return ERR_PTR(-ENOMEM);
> +
> +	INIT_LIST_HEAD(&em->link);
> +
> +	kref_init(&em->refs);
> +
> +	return em;
>  }
>  
> -int panthor_perf_session_teardown(struct panthor_file *pfile, struct panthor_perf *perf,
> -		u32 sid)
> +static struct panthor_perf_enable_masks *panthor_perf_create_em(struct drm_panthor_perf_cmd_setup
> +		*setup_args)
>  {
> -	return -EOPNOTSUPP;
> +	struct panthor_perf_enable_masks *em = panthor_perf_em_new();
> +
> +	if (IS_ERR_OR_NULL(em))
> +		return em;
> +
> +	bitmap_from_arr64(em->mask[DRM_PANTHOR_PERF_BLOCK_FW],
> +			setup_args->fw_enable_mask, PANTHOR_PERF_EM_BITS);
> +	bitmap_from_arr64(em->mask[DRM_PANTHOR_PERF_BLOCK_CSG],
> +			setup_args->csg_enable_mask, PANTHOR_PERF_EM_BITS);
> +	bitmap_from_arr64(em->mask[DRM_PANTHOR_PERF_BLOCK_CSHW],
> +			setup_args->cshw_enable_mask, PANTHOR_PERF_EM_BITS);
> +	bitmap_from_arr64(em->mask[DRM_PANTHOR_PERF_BLOCK_TILER],
> +			setup_args->tiler_enable_mask, PANTHOR_PERF_EM_BITS);
> +	bitmap_from_arr64(em->mask[DRM_PANTHOR_PERF_BLOCK_MEMSYS],
> +			setup_args->memsys_enable_mask, PANTHOR_PERF_EM_BITS);
> +	bitmap_from_arr64(em->mask[DRM_PANTHOR_PERF_BLOCK_SHADER],
> +			setup_args->shader_enable_mask, PANTHOR_PERF_EM_BITS);

To save some repetition, maybe do this, although it might depend on uAPI
structures being arranged in the right way, and the compiler not inserting
unusual padding between consecutive members:

unsigned int block; u64 *mask;
for (mask = &setup_args->fw_enable_mask[0], block = DRM_PANTHOR_PERF_BLOCK_FW;
     block < DRM_PANTHOR_PERF_BLOCK_LAST; block++, mask += 2)
	bitmap_from_arr64(em->mask[block], mask, PANTHOR_PERF_EM_BITS);

> +	return em;
>  }
>  
> -int panthor_perf_session_start(struct panthor_file *pfile, struct panthor_perf *perf,
> -		u32 sid, u64 user_data)
> +static void panthor_perf_destroy_em_kref(struct kref *em_kref)
>  {
> -	return -EOPNOTSUPP;
> +	struct panthor_perf_enable_masks *em = container_of(em_kref, typeof(*em), refs);
> +
> +	if (!list_empty(&em->link))
> +		return;

Could this lead to a situation where the enable mask's refcnt reaches 0,
but because it hadn't yet been removed from the session's list, the
mask object is never freed?

> +	kfree(em);
>  }
>  
> -int panthor_perf_session_stop(struct panthor_file *pfile, struct panthor_perf *perf,
> -		u32 sid, u64 user_data)
> +static size_t get_annotated_block_size(size_t counters_per_block)
>  {
> -		return -EOPNOTSUPP;
> +	return struct_size_t(struct panthor_perf_counter_block, counters, counters_per_block);
>  }
>  
> -int panthor_perf_session_sample(struct panthor_file *pfile, struct panthor_perf *perf,
> -		u32 sid, u64 user_data)
> +static u32 session_read_extract_idx(struct panthor_perf_session *session)
> +{
> +	/* Userspace will update their own extract index to indicate that a sample is consumed
> +	 * from the ringbuffer, and we must ensure we read the latest value.
> +	 */
> +	return smp_load_acquire(session->extract_idx);
> +}
> +
> +static u32 session_read_insert_idx(struct panthor_perf_session *session)
> +{
> +	return *session->insert_idx;
> +}
> +
> +static void session_get(struct panthor_perf_session *session)
> +{
> +	kref_get(&session->ref);
> +}
> +
> +static void session_free(struct kref *ref)
> +{
> +	struct panthor_perf_session *session = container_of(ref, typeof(*session), ref);
> +
> +	if (session->samples) {
> +		struct iosys_map map = IOSYS_MAP_INIT_VADDR(session->samples);
> +
> +		drm_gem_vunmap_unlocked(session->ring_buf, &map);
> +		drm_gem_object_put(session->ring_buf);
> +	}
> +
> +	if (session->insert_idx && session->extract_idx) {

I think none of these could ever be NULLif session setup succeeded.

> +		struct iosys_map map = IOSYS_MAP_INIT_VADDR(session->extract_idx);
> +
> +		drm_gem_vunmap_unlocked(session->control_buf, &map);
> +		drm_gem_object_put(session->control_buf);
> +	}
> +
> +	kref_put(&session->enabled_counters->refs, panthor_perf_destroy_em_kref);
> +	eventfd_ctx_put(session->eventfd);
> +
> +	devm_kfree(session->pfile->ptdev->base.dev, session);

What is the point of using devm allocations in this case, if we always free
the session manually?

> +}
> +
> +static void session_put(struct panthor_perf_session *session)
> +{
> +	kref_put(&session->ref, session_free);
> +}
> +
> +/**
> + * session_find - Find a session associated with the given session ID and
> + *                panthor_file.
> + * @pfile: Panthor file.
> + * @perf: Panthor perf.
> + * @sid: Session ID.
> + *
> + * The reference count of a valid session is increased to ensure it does not disappear
> + * in the window between the XA lock being dropped and the internal session functions
> + * being called.
> + *
> + * Return: valid session pointer or an ERR_PTR.
> + */
> +static struct panthor_perf_session *session_find(struct panthor_file *pfile,
> +		struct panthor_perf *perf, u32 sid)
>  {
> -	return -EOPNOTSUPP;
> +	struct panthor_perf_session *session;
>  
> +	if (!perf)
> +		return ERR_PTR(-EINVAL);
> +
> +	xa_lock(&perf->sessions);
> +	session = xa_load(&perf->sessions, sid);
> +
> +	if (!session || xa_is_err(session)) {
> +		xa_unlock(&perf->sessions);
> +		return ERR_PTR(-EBADF);

I think we should return NULL in case !session holds true, for panthor_perf_session_start to catch it and return -EINVAL.

> +	}
> +
> +	if (session->pfile != pfile) {
> +		xa_unlock(&perf->sessions);
> +		return ERR_PTR(-EINVAL);
> +	}
> +
> +	session_get(session);
> +	xa_unlock(&perf->sessions);
> +
> +	return session;
>  }
>  
> -void panthor_perf_session_destroy(struct panthor_file *pfile, struct panthor_perf *perf) { }
> +static size_t session_get_max_sample_size(const struct drm_panthor_perf_info *const info)

Since this seems to be the size of the sample given to UM, maybe renaming it to
contain _user_ would make its purpose more apparent.

> +{
> +	const size_t block_size = get_annotated_block_size(info->counters_per_block);
> +	const size_t block_nr = info->cshw_blocks + info->csg_blocks + info->fw_blocks +
> +		info->tiler_blocks + info->memsys_blocks + info->shader_blocks;
> +
> +	return sizeof(struct drm_panthor_perf_sample_header) + (block_size * block_nr);
> +}
>  
>  /**
>   * panthor_perf_init - Initialize the performance counter subsystem.
> @@ -130,6 +391,399 @@ int panthor_perf_init(struct panthor_device *ptdev)
>  	ptdev->perf = perf;
>  
>  	return 0;
> +
> +}
> +
> +static int session_validate_set(u8 set)
> +{
> +	if (set > DRM_PANTHOR_PERF_SET_TERTIARY)
> +		return -EINVAL;
> +
> +	if (set == DRM_PANTHOR_PERF_SET_PRIMARY)
> +		return 0;
> +
> +	if (set > DRM_PANTHOR_PERF_SET_PRIMARY)
> +		return capable(CAP_PERFMON) ? 0 : -EACCES;

I'm a bit clueless about the capability API, so I don't quite understand how
this is the way whe decide whether a counter set is legal.

> +	return -EINVAL;
> +}
> +
> +/**
> + * panthor_perf_session_setup - Create a user-visible session.
> + *
> + * @ptdev: Handle to the panthor device.
> + * @perf: Handle to the perf control structure.
> + * @setup_args: Setup arguments passed in via ioctl.
> + * @pfile: Panthor file associated with the request.
> + *
> + * Creates a new session associated with the session ID returned. When initialized, the
> + * session must explicitly request sampling to start with a successive call to PERF_CONTROL.START.
> + *
> + * Return: non-negative session identifier on success or negative error code on failure.
> + */
> +int panthor_perf_session_setup(struct panthor_device *ptdev, struct panthor_perf *perf,
> +		struct drm_panthor_perf_cmd_setup *setup_args,
> +		struct panthor_file *pfile)
> +{
> +	struct panthor_perf_session *session;
> +	struct drm_gem_object *ringbuffer;
> +	struct drm_gem_object *control;
> +	const size_t slots = setup_args->sample_slots;
> +	struct panthor_perf_enable_masks *em;
> +	struct iosys_map rb_map, ctrl_map;
> +	size_t user_sample_size;
> +	int session_id;
> +	int ret;
> +
> +	ret = session_validate_set(setup_args->block_set);
> +	if (ret)
> +		return ret;
> +
> +	session = devm_kzalloc(ptdev->base.dev, sizeof(*session), GFP_KERNEL);
> +	if (ZERO_OR_NULL_PTR(session))
> +		return -ENOMEM;
> +
> +	ringbuffer = drm_gem_object_lookup(pfile->drm_file, setup_args->ringbuf_handle);
> +	if (!ringbuffer) {
> +		ret = -EINVAL;
> +		goto cleanup_session;
> +	}

I guess this would never be the same ringbuffer we created in
panthor_perf_sampler_init(). I don't think it can be, because that one was is
created as a kernel bo, and has no public facing GEM handler. But then this
means we're doing a copy between the FW ringbuffer and the user-supplied
one. However, I remember from past conversations that the goal of a new
implementation was to avoid doing many perfcnt sample copies between the kernel
and UM, because that would require a huge bandwith when the sample period is
small.

> +	control = drm_gem_object_lookup(pfile->drm_file, setup_args->control_handle);
> +	if (!control) {
> +		ret = -EINVAL;
> +		goto cleanup_ringbuf;
> +	}
> +
> +	user_sample_size = session_get_max_sample_size(&ptdev->perf_info) * slots;
> +
> +	if (ringbuffer->size != PFN_ALIGN(user_sample_size)) {
> +		ret = -ENOMEM;
> +		goto cleanup_control;
> +	}

How is information about the max sample size given to UM? I guess through the
values returned through the getparam ioctl(), specifically sample_header_size
and block_header_size?

> +	ret = drm_gem_vmap_unlocked(ringbuffer, &rb_map);
> +	if (ret)
> +		goto cleanup_control;
> +
> +
> +	ret = drm_gem_vmap_unlocked(control, &ctrl_map);
> +	if (ret)
> +		goto cleanup_ring_map;
> +
> +	session->eventfd = eventfd_ctx_fdget(setup_args->fd);
> +	if (IS_ERR_OR_NULL(session->eventfd)) {
> +		ret = PTR_ERR_OR_ZERO(session->eventfd) ?: -EINVAL;
> +		goto cleanup_control_map;
> +	}

I think eventfd_ctx_fdget can only return error values, so there's no need to
check for NULL or ZERO.

> +	em = panthor_perf_create_em(setup_args);
> +	if (IS_ERR_OR_NULL(em)) {
> +		ret = -ENOMEM;
> +		goto cleanup_eventfd;
> +	}
> +
> +	INIT_LIST_HEAD(&session->waiting);
> +	session->extract_idx = ctrl_map.vaddr;
> +	*session->extract_idx = 0;
> +	session->insert_idx = session->extract_idx + 1;
> +	*session->insert_idx = 0;
> +
> +	session->samples = rb_map.vaddr;

I think you might've forgotten this:

        session->ringbuf_slots = slots;

> +	/* TODO This will need validation when we support periodic sampling sessions */
> +	if (setup_args->sample_freq_ns) {
> +		ret = -EOPNOTSUPP;
> +		goto cleanup_em;
> +	}
> +
> +	session->sample_freq_ns = setup_args->sample_freq_ns;
> +	session->user_sample_size = user_sample_size;
> +	session->enabled_counters = em;
> +	session->ring_buf = ringbuffer;
> +	session->control_buf = control;
> +	session->pfile = pfile;
> +
> +	ret = xa_alloc_cyclic(&perf->sessions, &session_id, session, perf->session_range,
> +			&perf->next_session, GFP_KERNEL);

What do we need the next_session index for?

> +	if (ret < 0)
> +		goto cleanup_em;
> +
> +	kref_init(&session->ref);
> +
> +	return session_id;
> +
> +cleanup_em:
> +	kref_put(&em->refs, panthor_perf_destroy_em_kref);
> +
> +cleanup_eventfd:
> +	eventfd_ctx_put(session->eventfd);
> +
> +cleanup_control_map:
> +	drm_gem_vunmap_unlocked(control, &ctrl_map);
> +
> +cleanup_ring_map:
> +	drm_gem_vunmap_unlocked(ringbuffer, &rb_map);
> +
> +cleanup_control:
> +	drm_gem_object_put(control);
> +
> +cleanup_ringbuf:
> +	drm_gem_object_put(ringbuffer);
> +
> +cleanup_session:
> +	devm_kfree(ptdev->base.dev, session);
> +
> +	return ret;
> +}
> +
> +static int session_stop(struct panthor_perf *perf, struct panthor_perf_session *session,
> +		u64 user_data)
> +{
> +	if (!test_bit(PANTHOR_PERF_SESSION_ACTIVE, session->state))
> +		return 0;
> +
> +	const u32 extract_idx = session_read_extract_idx(session);
> +	const u32 insert_idx = session_read_insert_idx(session);
> +
> +	/* Must have at least one slot remaining in the ringbuffer to sample. */
> +	if (WARN_ON_ONCE(!CIRC_SPACE_TO_END(insert_idx, extract_idx, session->ringbuf_slots)))
> +		return -EBUSY;
> +
> +	session->user_data = user_data;
> +
> +	clear_bit(PANTHOR_PERF_SESSION_ACTIVE, session->state);
> +
> +	/* TODO Calls to the FW interface will go here in later patches. */
> +	return 0;
> +}
> +
> +static int session_start(struct panthor_perf *perf, struct panthor_perf_session *session,
> +		u64 user_data)
> +{
> +	if (test_bit(PANTHOR_PERF_SESSION_ACTIVE, session->state))
> +		return 0;
> +
> +	set_bit(PANTHOR_PERF_SESSION_ACTIVE, session->state);
> +
> +	/*
> +	 * For manual sampling sessions, a start command does not correspond to a sample,
> +	 * and so the user data gets discarded.
> +	 */
> +	if (session->sample_freq_ns)
> +		session->user_data = user_data;
> +
> +	/* TODO Calls to the FW interface will go here in later patches. */
> +	return 0;
> +}
> +
> +static int session_sample(struct panthor_perf *perf, struct panthor_perf_session *session,
> +		u64 user_data)
> +{
> +	if (!test_bit(PANTHOR_PERF_SESSION_ACTIVE, session->state))
> +		return -EACCES;
> +
> +	const u32 extract_idx = session_read_extract_idx(session);
> +	const u32 insert_idx = session_read_insert_idx(session);
> +
> +	/* Manual sampling for periodic sessions is forbidden. */
> +	if (session->sample_freq_ns)
> +		return -EINVAL;
> +
> +	/*
> +	 * Must have at least two slots remaining in the ringbuffer to sample: one for
> +	 * the current sample, and one for a stop sample, since a stop command should
> +	 * always be acknowledged by taking a final sample and stopping the session.
> +	 */
> +	if (CIRC_SPACE_TO_END(insert_idx, extract_idx, session->ringbuf_slots) < 2)
> +		return -EBUSY;
> +
> +	session->sample_start_ns = ktime_get_raw_ns();
> +	session->user_data = user_data;
> +
> +	/* TODO Calls to the FW interface will go here in later patches. */
> +	return 0;
> +}
> +
> +static int session_destroy(struct panthor_perf *perf, struct panthor_perf_session *session)
> +{
> +	session_put(session);
> +
> +	return 0;
> +}
> +
> +static int session_teardown(struct panthor_perf *perf, struct panthor_perf_session *session)
> +{
> +	if (test_bit(PANTHOR_PERF_SESSION_ACTIVE, session->state))
> +		return -EINVAL;
> +
> +	if (!list_empty(&session->waiting))
> +		return -EBUSY;
> +
> +	return session_destroy(perf, session);
> +}
> +
> +/**
> + * panthor_perf_session_teardown - Teardown the session associated with the @sid.
> + * @pfile: Open panthor file.
> + * @perf: Handle to the perf control structure.
> + * @sid: Session identifier.
> + *
> + * Destroys a stopped session where the last sample has been explicitly consumed
> + * or discarded. Active sessions will be ignored.
> + *
> + * Return: 0 on success, negative error code on failure.
> + */
> +int panthor_perf_session_teardown(struct panthor_file *pfile, struct panthor_perf *perf, u32 sid)
> +{
> +	int err;
> +	struct panthor_perf_session *session;
> +
> +	xa_lock(&perf->sessions);
> +	session = __xa_store(&perf->sessions, sid, NULL, GFP_KERNEL);

Why not xa_erase() here instead?

> +	if (xa_is_err(session)) {
> +		err = xa_err(session);
> +		goto restore;
> +	}
> +
> +	if (session->pfile != pfile) {
> +		err = -EINVAL;
> +		goto restore;
> +	}
> +
> +	session_get(session);
> +	xa_unlock(&perf->sessions);
> +
> +	err = session_teardown(perf, session);
> +
> +	session_put(session);

I haven't made sure that reference counting is balanced, but noticed session_teardown() is already
putting the session's kref. I'll have a deeper look into it later on.

> +
> +	return err;
> +
> +restore:
> +	__xa_store(&perf->sessions, sid, session, GFP_KERNEL);
> +	xa_unlock(&perf->sessions);
> +
> +	return err;
> +}
> +
> +/**
> + * panthor_perf_session_start - Start sampling on a stopped session.
> + * @pfile: Open panthor file.
> + * @perf: Handle to the panthor perf control structure.
> + * @sid: Session identifier for the desired session.
> + * @user_data: An opaque value passed in from userspace.
> + *
> + * A session counts as stopped when it is created or when it is explicitly stopped after being
> + * started. Starting an active session is treated as a no-op.
> + *
> + * The @user_data parameter will be associated with all subsequent samples for a periodic
> + * sampling session and will be ignored for manual sampling ones in favor of the user data
> + * passed in the PERF_CONTROL.SAMPLE ioctl call.
> + *
> + * Return: 0 on success, negative error code on failure.
> + */
> +int panthor_perf_session_start(struct panthor_file *pfile, struct panthor_perf *perf,
> +		u32 sid, u64 user_data)
> +{
> +	struct panthor_perf_session *session = session_find(pfile, perf, sid);
> +	int err;
> +
> +	if (IS_ERR_OR_NULL(session))
> +		return IS_ERR(session) ? PTR_ERR(session) : -EINVAL;
> +
> +	err = session_start(perf, session, user_data);
> +
> +	session_put(session);
> +
> +	return err;
> +}
> +
> +/**
> + * panthor_perf_session_stop - Stop sampling on an active session.
> + * @pfile: Open panthor file.
> + * @perf: Handle to the panthor perf control structure.
> + * @sid: Session identifier for the desired session.
> + * @user_data: An opaque value passed in from userspace.
> + *
> + * A session counts as active when it has been explicitly started via the PERF_CONTROL.START
> + * ioctl. Stopping a stopped session is treated as a no-op.
> + *
> + * To ensure data is not lost when sampling is stopping, there must always be at least one slot
> + * available for the final automatic sample, and the stop command will be rejected if there is not.
> + *
> + * The @user_data will always be associated with the final sample.
> + *
> + * Return: 0 on success, negative error code on failure.
> + */
> +int panthor_perf_session_stop(struct panthor_file *pfile, struct panthor_perf *perf,
> +		u32 sid, u64 user_data)
> +{
> +	struct panthor_perf_session *session = session_find(pfile, perf, sid);
> +	int err;
> +
> +	if (IS_ERR_OR_NULL(session))
> +		return IS_ERR(session) ? PTR_ERR(session) : -EINVAL;
> +
> +	err = session_stop(perf, session, user_data);
> +
> +	session_put(session);
> +
> +	return err;
> +}
> +
> +/**
> + * panthor_perf_session_sample - Request a sample on a manual sampling session.
> + * @pfile: Open panthor file.
> + * @perf: Handle to the panthor perf control structure.
> + * @sid: Session identifier for the desired session.
> + * @user_data: An opaque value passed in from userspace.
> + *
> + * Only an active manual sampler is permitted to request samples directly. Failing to meet either
> + * of these conditions will cause the sampling request to be rejected. Requesting a manual sample
> + * with a full ringbuffer will see the request being rejected.
> + *
> + * The @user_data will always be unambiguously associated one-to-one with the resultant sample.
> + *
> + * Return: 0 on success, negative error code on failure.
> + */
> +int panthor_perf_session_sample(struct panthor_file *pfile, struct panthor_perf *perf,
> +		u32 sid, u64 user_data)
> +{
> +	struct panthor_perf_session *session = session_find(pfile, perf, sid);
> +	int err;
> +
> +	if (IS_ERR_OR_NULL(session))
> +		return IS_ERR(session) ? PTR_ERR(session) : -EINVAL;
> +
> +	err = session_sample(perf, session, user_data);
> +
> +	session_put(session);
> +
> +	return err;
> +}
> +
> +/**
> + * panthor_perf_session_destroy - Destroy a sampling session associated with the @pfile.
> + * @perf: Handle to the panthor perf control structure.
> + * @pfile: The file being closed.
> + *
> + * Must be called when the corresponding userspace process is destroyed and cannot close its
> + * own sessions. As such, we offer no guarantees about data delivery.
> + */
> +void panthor_perf_session_destroy(struct panthor_file *pfile, struct panthor_perf *perf)
> +{
> +	unsigned long sid;
> +	struct panthor_perf_session *session;
> +
> +	xa_for_each(&perf->sessions, sid, session)
> +	{
> +		if (session->pfile == pfile) {
> +			session_destroy(perf, session);
> +			xa_erase(&perf->sessions, sid);
> +		}
> +	}
>  }
>  
>  /**
> @@ -146,10 +800,17 @@ void panthor_perf_unplug(struct panthor_device *ptdev)
>  	if (!perf)
>  		return;
>  
> -	if (!xa_empty(&perf->sessions))
> +	if (!xa_empty(&perf->sessions)) {
> +		unsigned long sid;
> +		struct panthor_perf_session *session;
> +
>  		drm_err(&ptdev->base,
>  				"Performance counter sessions active when unplugging the driver!");
>  
> +		xa_for_each(&perf->sessions, sid, session)
> +			session_destroy(perf, session);
> +	}
> +
>  	xa_destroy(&perf->sessions);
>  
>  	devm_kfree(ptdev->base.dev, ptdev->perf);
> diff --git a/include/uapi/drm/panthor_drm.h b/include/uapi/drm/panthor_drm.h
> index 8a431431da6b..576d3ad46e6d 100644
> --- a/include/uapi/drm/panthor_drm.h
> +++ b/include/uapi/drm/panthor_drm.h
> @@ -458,6 +458,12 @@ enum drm_panthor_perf_block_type {
>  
>  	/** @DRM_PANTHOR_PERF_BLOCK_SHADER: A shader core counter block. */
>  	DRM_PANTHOR_PERF_BLOCK_SHADER,
> +
> +	/** @DRM_PANTHOR_PERF_BLOCK_LAST: Internal use only. */
> +	DRM_PANTHOR_PERF_BLOCK_LAST = DRM_PANTHOR_PERF_BLOCK_SHADER,
> +
> +	/** @DRM_PANTHOR_PERF_BLOCK_MAX: Internal use only. */
> +	DRM_PANTHOR_PERF_BLOCK_MAX = DRM_PANTHOR_PERF_BLOCK_LAST + 1,
>  };
>  
>  /**
> @@ -1368,6 +1374,44 @@ struct drm_panthor_perf_control {
>  	__u64 pointer;
>  };
>  
> +/**
> + * enum drm_panthor_perf_counter_set - The counter set to be requested from the hardware.
> + *
> + * The hardware supports a single performance counter set at a time, so requesting any set other
> + * than the primary may fail if another process is sampling at the same time.
> + *
> + * If in doubt, the primary counter set has the most commonly used counters and requires no
> + * additional permissions to open.
> + */
> +enum drm_panthor_perf_counter_set {
> +	/**
> +	 * @DRM_PANTHOR_PERF_SET_PRIMARY: The default set configured on the hardware.
> +	 *
> +	 * This is the only set for which all counters in all blocks are defined.
> +	 */
> +	DRM_PANTHOR_PERF_SET_PRIMARY,
> +
> +	/**
> +	 * @DRM_PANTHOR_PERF_SET_SECONDARY: The secondary performance counter set.
> +	 *
> +	 * Some blocks may not have any defined counters for this set, and the block will
> +	 * have the UNAVAILABLE block state permanently set in the block header.
> +	 *
> +	 * Accessing this set requires the calling process to have the CAP_PERFMON capability.
> +	 */
> +	DRM_PANTHOR_PERF_SET_SECONDARY,
> +
> +	/**
> +	 * @DRM_PANTHOR_PERF_SET_TERTIARY: The tertiary performance counter set.
> +	 *
> +	 * Some blocks may not have any defined counters for this set, and the block will have
> +	 * the UNAVAILABLE block state permanently set in the block header. Note that the
> +	 * tertiary set has the fewest defined counter blocks.
> +	 *
> +	 * Accessing this set requires the calling process to have the CAP_PERFMON capability.
> +	 */
> +	DRM_PANTHOR_PERF_SET_TERTIARY,
> +};
>  
>  /**
>   * struct drm_panthor_perf_cmd_setup - Arguments passed to DRM_PANTHOR_IOCTL_PERF_CONTROL
> @@ -1375,13 +1419,17 @@ struct drm_panthor_perf_control {
>   */
>  struct drm_panthor_perf_cmd_setup {
>  	/**
> -	 * @block_set: Set of performance counter blocks.
> +	 * @block_set: Set of performance counter blocks, member of
> +	 *             enum drm_panthor_perf_block_set.
>  	 *
>  	 * This is a global configuration and only one set can be active at a time. If
>  	 * another client has already requested a counter set, any further requests
>  	 * for a different counter set will fail and return an -EBUSY.
>  	 *
>  	 * If the requested set does not exist, the request will fail and return an -EINVAL.
> +	 *
> +	 * Some sets have additional requirements to be enabled, and the setup request will
> +	 * fail with an -EACCES if these requirements are not satisfied.
>  	 */

Is this what we check inside session_validate_set() ?

>  	__u8 block_set;
>  
> -- 
> 2.25.1


Adrian Larumbe

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC v2 3/8] drm/panthor: Add panthor_perf_init and panthor_perf_unplug
  2024-12-11 16:50 ` [RFC v2 3/8] drm/panthor: Add panthor_perf_init and panthor_perf_unplug Lukas Zapolskas
  2025-01-27 12:46   ` Adrián Larumbe
@ 2025-01-27 15:50   ` adrian.larumbe
  1 sibling, 0 replies; 28+ messages in thread
From: adrian.larumbe @ 2025-01-27 15:50 UTC (permalink / raw)
  To: Lukas Zapolskas
  Cc: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	dri-devel, linux-kernel, Mihail Atanassov, nd

On 11.12.2024 16:50, Lukas Zapolskas wrote:
> Added the panthor_perf system initialization and unplug code to allow
> for the handling of userspace sessions to be added in follow-up patches.
> 
> Signed-off-by: Lukas Zapolskas <lukas.zapolskas@arm.com>
> ---
>  drivers/gpu/drm/panthor/panthor_device.c |  7 +++
>  drivers/gpu/drm/panthor/panthor_device.h |  5 +-
>  drivers/gpu/drm/panthor/panthor_perf.c   | 77 ++++++++++++++++++++++++
>  drivers/gpu/drm/panthor/panthor_perf.h   |  3 +
>  4 files changed, 91 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
> index 00f7b8ce935a..1a81a436143b 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.c
> +++ b/drivers/gpu/drm/panthor/panthor_device.c
> @@ -19,6 +19,7 @@
>  #include "panthor_fw.h"
>  #include "panthor_gpu.h"
>  #include "panthor_mmu.h"
> +#include "panthor_perf.h"
>  #include "panthor_regs.h"
>  #include "panthor_sched.h"
>  
> @@ -97,6 +98,7 @@ void panthor_device_unplug(struct panthor_device *ptdev)
>  	/* Now, try to cleanly shutdown the GPU before the device resources
>  	 * get reclaimed.
>  	 */
> +	panthor_perf_unplug(ptdev);
>  	panthor_sched_unplug(ptdev);
>  	panthor_fw_unplug(ptdev);
>  	panthor_mmu_unplug(ptdev);
> @@ -262,6 +264,10 @@ int panthor_device_init(struct panthor_device *ptdev)
>  	if (ret)
>  		goto err_unplug_fw;
>  
> +	ret = panthor_perf_init(ptdev);
> +	if (ret)
> +		goto err_unplug_fw;
> +
>  	/* ~3 frames */
>  	pm_runtime_set_autosuspend_delay(ptdev->base.dev, 50);
>  	pm_runtime_use_autosuspend(ptdev->base.dev);
> @@ -275,6 +281,7 @@ int panthor_device_init(struct panthor_device *ptdev)
>  
>  err_disable_autosuspend:
>  	pm_runtime_dont_use_autosuspend(ptdev->base.dev);
> +	panthor_perf_unplug(ptdev);
>  	panthor_sched_unplug(ptdev);
>  
>  err_unplug_fw:
> diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> index 636542c1dcbd..aca33d03036c 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.h
> +++ b/drivers/gpu/drm/panthor/panthor_device.h
> @@ -26,7 +26,7 @@ struct panthor_heap_pool;
>  struct panthor_job;
>  struct panthor_mmu;
>  struct panthor_fw;
> -struct panthor_perfcnt;
> +struct panthor_perf;
>  struct panthor_vm;
>  struct panthor_vm_pool;
>  
> @@ -137,6 +137,9 @@ struct panthor_device {
>  	/** @devfreq: Device frequency scaling management data. */
>  	struct panthor_devfreq *devfreq;
>  
> +	/** @perf: Performance counter management data. */
> +	struct panthor_perf *perf;
> +
>  	/** @unplug: Device unplug related fields. */
>  	struct {
>  		/** @lock: Lock used to serialize unplug operations. */
> diff --git a/drivers/gpu/drm/panthor/panthor_perf.c b/drivers/gpu/drm/panthor/panthor_perf.c
> index 0e3d769c1805..e0dc6c4b0cf1 100644
> --- a/drivers/gpu/drm/panthor/panthor_perf.c
> +++ b/drivers/gpu/drm/panthor/panthor_perf.c
> @@ -13,6 +13,24 @@
>  #include "panthor_perf.h"
>  #include "panthor_regs.h"
>  
> +struct panthor_perf {
> +	/**
> +	 * @block_set: The global counter set configured onto the HW.
> +	 */
> +	u8 block_set;
> +
> +	/** @next_session: The ID of the next session. */
> +	u32 next_session;
> +
> +	/** @session_range: The number of sessions supported at a time. */
> +	struct xa_limit session_range;
> +
> +	/**
> +	 * @sessions: Global map of sessions, accessed by their ID.
> +	 */
> +	struct xarray sessions;
> +};
> +
>  /**
>   * PANTHOR_PERF_COUNTERS_PER_BLOCK - On CSF architectures pre-11.x, the number of counters
>   * per block was hardcoded to be 64. Arch 11.0 onwards supports the PRFCNT_FEATURES GPU register,
> @@ -45,3 +63,62 @@ void panthor_perf_info_init(struct panthor_device *ptdev)
>  	perf_info->shader_blocks = hweight64(ptdev->gpu_info.shader_present);
>  }
>  
> +/**
> + * panthor_perf_init - Initialize the performance counter subsystem.
> + * @ptdev: Panthor device
> + *
> + * The performance counters require the FW interface to be available to setup the
> + * sampling ringbuffers, so this must be called only after FW is initialized.
> + *
> + * Return: 0 on success, negative error code on failure.
> + */
> +int panthor_perf_init(struct panthor_device *ptdev)
> +{
> +	struct panthor_perf *perf;
> +
> +	if (!ptdev)
> +		return -EINVAL;
> +
> +	perf = devm_kzalloc(ptdev->base.dev, sizeof(*perf), GFP_KERNEL);
> +	if (ZERO_OR_NULL_PTR(perf))
> +		return -ENOMEM;
> +
> +	xa_init_flags(&perf->sessions, XA_FLAGS_ALLOC);
> +
> +	/* Currently, we only support a single session at a time. */
> +	perf->session_range = (struct xa_limit) {
> +		.min = 0,
> +		.max = 1,
> +	};
> +
> +	drm_info(&ptdev->base, "Performance counter subsystem initialized");
> +
> +	ptdev->perf = perf;
> +
> +	return 0;
> +}
> +
> +/**
> + * panthor_perf_unplug - Terminate the performance counter subsystem.
> + * @ptdev: Panthor device.
> + *
> + * This function will terminate the performance counter control structures and any remaining
> + * sessions, after waiting for any pending interrupts.
> + */
> +void panthor_perf_unplug(struct panthor_device *ptdev)
> +{
> +	struct panthor_perf *perf = ptdev->perf;
> +
> +	if (!perf)
> +		return;
> +
> +	if (!xa_empty(&perf->sessions))
> +		drm_err(&ptdev->base,
> +				"Performance counter sessions active when unplugging the driver!");
> +
> +	xa_destroy(&perf->sessions);
> +
> +	devm_kfree(ptdev->base.dev, ptdev->perf);
> +
> +	ptdev->perf = NULL;

I thought this could be racy with the perfcnt IRQ handler, but maybe if the sessions array is
already empty by then then it doesn't matter.

> +}
> diff --git a/drivers/gpu/drm/panthor/panthor_perf.h b/drivers/gpu/drm/panthor/panthor_perf.h
> index cff537a370c9..90af8b18358c 100644
> --- a/drivers/gpu/drm/panthor/panthor_perf.h
> +++ b/drivers/gpu/drm/panthor/panthor_perf.h
> @@ -9,4 +9,7 @@ struct panthor_device;
>  
>  void panthor_perf_info_init(struct panthor_device *ptdev);
>  
> +int panthor_perf_init(struct panthor_device *ptdev);
> +void panthor_perf_unplug(struct panthor_device *ptdev);
> +
>  #endif /* __PANTHOR_PERF_H__ */
> -- 
> 2.25.1

Adrian Larumbe

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC v2 6/8] drm/panthor: Implement the counter sampler and sample handling
  2024-12-11 16:50 ` [RFC v2 6/8] drm/panthor: Implement the counter sampler and sample handling Lukas Zapolskas
@ 2025-01-27 16:53   ` Adrián Larumbe
  2025-03-27  8:53     ` Lukas Zapolskas
  2025-01-27 21:09   ` Adrián Larumbe
  1 sibling, 1 reply; 28+ messages in thread
From: Adrián Larumbe @ 2025-01-27 16:53 UTC (permalink / raw)
  To: Lukas Zapolskas
  Cc: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	dri-devel, linux-kernel, Mihail Atanassov, nd

On 11.12.2024 16:50, Lukas Zapolskas wrote:
> From: Adrián Larumbe <adrian.larumbe@collabora.com>
> 
> The sampler aggregates counter and set requests coming from userspace
> and mediates interactions with the FW interface, to ensure that user
> sessions cannot override the global configuration.
> 
> From the top-level interface, the sampler supports two different types
> of samples: clearing samples and regular samples. Clearing samples are
> a special sample type that allow for the creation of a sampling
> baseline, to ensure that a session does not obtain counter data from
> before its creation.
> 
> Upon receipt of a relevant interrupt, corresponding to one of the three
> relevant bits of the GLB_ACK register, the sampler takes any samples
> that occurred, and, based on the insert and extract indices, accumulates
> them to an internal storage buffer after zero-extending the counters
> from the 32-bit counters emitted by the hardware to 64-bit counters
> for internal accumulation.
> 
> When the performance counters are enabled, the FW ensures no counter
> data is lost when entering and leaving non-counting regions by producing
> automatic samples that do not correspond to a GLB_REQ.PRFCNT_SAMPLE
> request. Such regions may be per hardware unit, such as when a shader
> core powers down, or global. Most of these events do not directly
> correspond to session sample requests, so any intermediary counter data
> must be stored into a temporary accumulation buffer.
> 
> If there are sessions waiting for a sample, this accumulated buffer will
> be taken, and emitted for each waiting client. During this phase,
> information like the timestamps of sample request and sample emission,
> type of the counter block and block index annotations are added to the
> sample header and block headers. If no sessions are waiting for
> a sample, this accumulation buffer is kept until the next time a sample
> is requested.
> 
> Special handling is needed for the PRFCNT_OVERFLOW interrupt, which is
> an indication that the internal sample handling rate was insufficient.
> 
> The sampler also maintains a buffer descriptor indicating the structure
> of a firmware sample, since neither the firmware nor the hardware give
> any indication of the sample structure, only that it is composed out of
> three parts:
>  - the metadata is an optional initial counter block on supporting
>    firmware versions that contains a single counter, indicating the
>    reason a sample was taken when entering global non-counting regions.
>    This is used to provide coarse-grained information about why a sample
>    was taken to userspace, to help userspace interpret variations in
>    counter magnitude.
>  - the firmware component of the sample is composed out of a global
>    firmware counter block on supporting firmware versions.
>  - the hardware component is the most sizeable of the three and contains
>    a block of counters for each of the underlying hardware resources. It
>    has a fixed structure that is described in the architecture
>    specification, and contains the command stream hardware block(s), the
>    tiler block(s), the MMU and L2 blocks (collectively named the memsys
>    blocks) and the shader core blocks, in that order.
> The structure of this buffer changes based on the firmware and hardware
> combination, but is constant on a single system.

I already brought this up in a previous patch review. This approach of
describing the layout of counters in the FW ringbuffer inside the kernel
deviates from what was being done for Panfrost already, where the kernel does
the minimal job of providing raw samples to user mode, and the UM programs need
to rely on layout files that lend a meaning to the raw data block. In the case
of Panfrost, because many generations of HW are supported, it seems keeping this
deal of information in the kernel isn't very scalable, and also in my view goes
against the practice of having the driver do as little as possible, other than
streaming raw data to UM and letting programs handle it in more sophisticated
ways.

> Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
> Co-developed-by: Lukas Zapolskas <lukas.zapolskas@arm.com>
> Signed-off-by: Lukas Zapolskas <lukas.zapolskas@arm.com>
> ---
>  drivers/gpu/drm/panthor/panthor_fw.c   |   5 +
>  drivers/gpu/drm/panthor/panthor_fw.h   |   9 +-
>  drivers/gpu/drm/panthor/panthor_perf.c | 882 ++++++++++++++++++++++++-
>  drivers/gpu/drm/panthor/panthor_perf.h |   2 +
>  include/uapi/drm/panthor_drm.h         |   5 +-
>  5 files changed, 892 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_fw.c b/drivers/gpu/drm/panthor/panthor_fw.c
> index e9530d1d9781..cd68870ced18 100644
> --- a/drivers/gpu/drm/panthor/panthor_fw.c
> +++ b/drivers/gpu/drm/panthor/panthor_fw.c
> @@ -1000,9 +1000,12 @@ static void panthor_fw_init_global_iface(struct panthor_device *ptdev)
>  
>  	/* Enable interrupts we care about. */
>  	glb_iface->input->ack_irq_mask = GLB_CFG_ALLOC_EN |
> +					 GLB_PERFCNT_SAMPLE |
>  					 GLB_PING |
>  					 GLB_CFG_PROGRESS_TIMER |
>  					 GLB_CFG_POWEROFF_TIMER |
> +					 GLB_PERFCNT_THRESHOLD |
> +					 GLB_PERFCNT_OVERFLOW |
>  					 GLB_IDLE_EN |
>  					 GLB_IDLE;
>  
> @@ -1031,6 +1034,8 @@ static void panthor_job_irq_handler(struct panthor_device *ptdev, u32 status)
>  		return;
>  
>  	panthor_sched_report_fw_events(ptdev, status);
> +
> +	panthor_perf_report_irq(ptdev, status);
>  }
>  PANTHOR_IRQ_HANDLER(job, JOB, panthor_job_irq_handler);
>  
> diff --git a/drivers/gpu/drm/panthor/panthor_fw.h b/drivers/gpu/drm/panthor/panthor_fw.h
> index db10358e24bb..7ed34d2de8b4 100644
> --- a/drivers/gpu/drm/panthor/panthor_fw.h
> +++ b/drivers/gpu/drm/panthor/panthor_fw.h
> @@ -199,9 +199,10 @@ struct panthor_fw_global_control_iface {
>  	u32 group_num;
>  	u32 group_stridei;
>  #define GLB_PERFCNT_FW_SIZE(x) ((((x) >> 16) << 8))
> +#define GLB_PERFCNT_HW_SIZE(x) (((x) & GENMASK(15, 0)) << 8)
>  	u32 perfcnt_size;
>  	u32 instr_features;
> -#define PERFCNT_FEATURES_MD_SIZE(x) ((x) & GENMASK(3, 0))
> +#define PERFCNT_FEATURES_MD_SIZE(x) (((x) & GENMASK(3, 0)) << 8)
>  	u32 perfcnt_features;

I've checked the spec and this field isn't mentioned:
docs/g510/gpu/html/register_set/GLB_CONTROL_BLOCK.htm

>  };
>  
> @@ -211,7 +212,7 @@ struct panthor_fw_global_input_iface {
>  #define GLB_CFG_ALLOC_EN			BIT(2)
>  #define GLB_CFG_POWEROFF_TIMER			BIT(3)
>  #define GLB_PROTM_ENTER				BIT(4)
> -#define GLB_PERFCNT_EN				BIT(5)
> +#define GLB_PERFCNT_ENABLE			BIT(5)
>  #define GLB_PERFCNT_SAMPLE			BIT(6)
>  #define GLB_COUNTER_EN				BIT(7)
>  #define GLB_PING				BIT(8)
> @@ -234,7 +235,6 @@ struct panthor_fw_global_input_iface {
>  	u32 doorbell_req;
>  	u32 reserved1;
>  	u32 progress_timer;
> -
>  #define GLB_TIMER_VAL(x)			((x) & GENMASK(30, 0))
>  #define GLB_TIMER_SOURCE_GPU_COUNTER		BIT(31)
>  	u32 poweroff_timer;
> @@ -244,6 +244,9 @@ struct panthor_fw_global_input_iface {
>  	u64 perfcnt_base;
>  	u32 perfcnt_extract;
>  	u32 reserved3[3];
> +#define GLB_PRFCNT_CONFIG_SIZE(x) ((x) & GENMASK(7, 0))
> +#define GLB_PRFCNT_CONFIG_SET(x) (((x) & GENMASK(1, 0)) << 8)
> +#define GLB_PRFCNT_METADATA_ENABLE BIT(10)
>  	u32 perfcnt_config;
>  	u32 perfcnt_csg_select;
>  	u32 perfcnt_fw_enable;

In this very same file, you might want to add the following halt status bits to
panthor_fw_global_output_iface:

struct panthor_fw_global_output_iface {
	u32 ack;
	u32 reserved1;
	u32 doorbell_ack;
	u32 reserved2;
	u32 halt_status;
+#define GLB_PERFCNT_STATUS_FAILED            BIT(0)
+#define GLB_PERFCNT_STATUS_POWERON           BIT(1)
+#define GLB_PERFCNT_STATUS_POWEROFF          BIT(2)
+#define GLB_PERFCNT_STATUS_PROTSESSION       BIT(3)
	u32 perfcnt_status;
	u32 perfcnt_insert;
};
> diff --git a/drivers/gpu/drm/panthor/panthor_perf.c b/drivers/gpu/drm/panthor/panthor_perf.c
> index 42d8b6f8c45d..d62d97c448da 100644
> --- a/drivers/gpu/drm/panthor/panthor_perf.c
> +++ b/drivers/gpu/drm/panthor/panthor_perf.c
> @@ -15,7 +15,9 @@
>  
>  #include "panthor_device.h"
>  #include "panthor_fw.h"
> +#include "panthor_gem.h"
>  #include "panthor_gpu.h"
> +#include "panthor_mmu.h"
>  #include "panthor_perf.h"
>  #include "panthor_regs.h"
>  
> @@ -26,6 +28,41 @@
>   */
>  #define PANTHOR_PERF_EM_BITS (BITS_PER_TYPE(u64) * 2)
>  
> +/**
> + * PANTHOR_PERF_FW_RINGBUF_SLOTS - Number of slots allocated for individual samples when configuring
> + *                                 the performance counter ring buffer to firmware. This can be
> + *                                 used to reduce memory consumption on low memory systems.
> + */
> +#define PANTHOR_PERF_FW_RINGBUF_SLOTS (32)

Perhaps this should be a module parameter with a default value of 32?

> +
> +/**
> + * PANTHOR_CTR_TIMESTAMP_LO - The first architecturally mandated counter of every block type
> + *                            contains the low 32-bits of the TIMESTAMP value.
> + */
> +#define PANTHOR_CTR_TIMESTAMP_LO (0)
> +
> +/**
> + * PANTHOR_CTR_TIMESTAMP_HI - The register offset containinig the high 32-bits of the TIMESTAMP
> + *                            value.
> + */
> +#define PANTHOR_CTR_TIMESTAMP_HI (1)
> +
> +/**
> + * PANTHOR_CTR_PRFCNT_EN - The register offset containing the enable mask for the enabled counters
> + *                         that were written to memory.
> + */
> +#define PANTHOR_CTR_PRFCNT_EN (2)
> +
> +/**
> + * PANTHOR_HEADER_COUNTERS - The first four counters of every block type are architecturally
> + *                           defined to be equivalent. The fourth counter is always reserved,
> + *                           and should be zero and as such, does not have a separate define.
> + *
> + *                           These are the only four counters that are the same between different
> + *                           blocks and are consistent between different architectures.
> + */
> +#define PANTHOR_HEADER_COUNTERS (4)
> +
>  /**
>   * enum panthor_perf_session_state - Session state bits.
>   */
> @@ -158,6 +195,135 @@ struct panthor_perf_session {
>  	struct kref ref;
>  };
>  
> +struct panthor_perf_buffer_descriptor {
> +	/**
> +	 * @block_size: The size of a single block in the FW ring buffer, equal to
> +	 *              sizeof(u32) * counters_per_block.
> +	 */
> +	size_t block_size;
> +
> +	/**
> +	 * @buffer_size: The total size of the buffer, equal to (#hardware blocks +
> +	 *               #firmware blocks) * block_size.
> +	 */
> +	size_t buffer_size;
> +
> +	/**
> +	 * @available_blocks: Bitmask indicating the blocks supported by the hardware and firmware
> +	 *                    combination. Note that this can also include blocks that will not
> +	 *                    be exposed to the user.
> +	 */
> +	DECLARE_BITMAP(available_blocks, DRM_PANTHOR_PERF_BLOCK_MAX);
> +	struct {
> +		/** @offset: Starting offset of a block of type @type in the FW ringbuffer. */
> +		size_t offset;
> +
> +		/** @type: Type of the blocks between @blocks[i].offset and @blocks[i+1].offset. */
> +		enum drm_panthor_perf_block_type type;

I think perhaps you could avoid using this, because a block type number is the
same as its index in the blocks array. See [1]

> +		/** @block_count: Number of blocks of the given @type, starting at @offset. */
> +		size_t block_count;
> +	} blocks[DRM_PANTHOR_PERF_BLOCK_MAX];
> +};

It seems with this approach for storing the layout of counters, this would depend
on them being arranged always in the same way. In Panfros, however, counter layout
was being handled in UM by parsing XML files into a C file that filled in as a
buffer descriptor. I'm afraid maybe pushing this into kernel space might make adding
support for new devices with different counter layout not easy.

> +
> +/**
> + * STRUCT panthor_perf_sampler - Interface to de-multiplex firmware interaction and handle
> + *                               global interactions.
> + */
> +struct panthor_perf_sampler {
> +	/** @sample_requested: A sample has been requested. */
> +	bool sample_requested;
> +
> +	/**
> +	 * @last_ack: Temporarily storing the last GLB_ACK status. Without storing this data,
> +	 *            we do not know whether a toggle bit has been handled.
> +	 */
> +	u32 last_ack;
> +
> +	/**
> +	 * @enabled_clients: The number of clients concurrently requesting samples. To ensure that
> +	 *                   one client cannot deny samples to another, we must ensure that clients
> +	 *                   are effectively reference counted.
> +	 */
> +	atomic_t enabled_clients;
> +
> +	/**
> +	 * @sample_handled: Synchronization point between the interrupt bottom half and the
> +	 *                  main sampler interface. Must be re-armed solely on a new request
> +	 *                  coming to the sampler.
> +	 */
> +	struct completion sample_handled;
> +
> +	/** @rb: Kernel BO in the FW AS containing the sample ringbuffer. */
> +	struct panthor_kernel_bo *rb;
> +
> +	/**
> +	 * @sample_size: The size of a single sample in the FW ringbuffer. This is computed using
> +	 *               the hardware configuration according to the architecture specification,
> +	 *               and cross-validated against the sample size reported by FW to ensure
> +	 *               a consistent view of the buffer size.
> +	 */
> +	size_t sample_size;
> +
> +	/**
> +	 * @sample_slots: Number of slots for samples in the FW ringbuffer. Could be static,
> +	 *		  but may be useful to customize for low-memory devices.
> +	 */
> +	size_t sample_slots;
> +
> +	/**
> +	 * @config_lock: Lock serializing changes to the global counter configuration, including
> +	 *               requested counter set and the counters themselves.
> +	 */
> +	struct mutex config_lock;
> +
> +	/**
> +	 * @ems: List of enable maps of the active sessions. When removing a session, the number
> +	 *       of requested counters may decrease, and the union of enable masks from the multiple
> +	 *       sessions does not provide sufficient information to reconstruct the previous
> +	 *       enable mask.
> +	 */
> +	struct list_head ems;

Maybe ems and config lock could be in the same anonymous structure.

> +
> +	/** @em: Combined enable mask for all of the active sessions. */
> +	struct panthor_perf_enable_masks *em;
> +
> +	/**
> +	 * @desc: Buffer descriptor for a sample in the FW ringbuffer. Note that this buffer
> +	 *        at current time does some interesting things with the zeroth block type. On
> +	 *        newer FW revisions, the first counter block of the sample is the METADATA block,
> +	 *        which contains a single value indicating the reason the sample was taken (if
> +	 *        any). This block must not be exposed to userspace, as userspace does not
> +	 *        have sufficient context to interpret it. As such, this block type is not
> +	 *        added to the uAPI, but we still use it in the kernel.
> +	 */
> +	struct panthor_perf_buffer_descriptor desc;
> +
> +	/**
> +	 * @sample: Pointer to an upscaled and annotated sample that may be emitted to userspace.
> +	 *          This is used both as an intermediate buffer to do the zero-extension of the
> +	 *          32-bit counters to 64-bits and as a storage buffer in case the sampler
> +	 *          requests an additional sample that was not requested by any of the top-level
> +	 *          sessions (for instance, when changing the enable masks).
> +	 */
> +	u8 *sample;
> +
> +	/** @sampler_lock: Lock used to guard the list of sessions requesting samples. */
> +	struct mutex sampler_lock;
> +
> +	/** @sampler_list: List of sessions requesting samples. */
> +	struct list_head sampler_list;

Shouldn't this be called session list instead?
Wouldn't it better to include both the list and its mutex into a single anonymous structure?

> +	/** @set_config: The set that will be configured onto the hardware. */
> +	u8 set_config;
> +
> +	/**
> +	 * @ptdev: Backpointer to the Panthor device, needed to ring the global doorbell and
> +	 *         interface with FW.
> +	 */
> +	struct panthor_device *ptdev;
> +};
>  
>  struct panthor_perf {
>  	/**
> @@ -175,6 +341,9 @@ struct panthor_perf {
>  	 * @sessions: Global map of sessions, accessed by their ID.
>  	 */
>  	struct xarray sessions;
> +
> +	/** @sampler: FW control interface. */
> +	struct panthor_perf_sampler sampler;
>  };
>  
>  /**
> @@ -247,6 +416,23 @@ static struct panthor_perf_enable_masks *panthor_perf_create_em(struct drm_panth
>  	return em;
>  }
>  
> +static void panthor_perf_em_add(struct panthor_perf_enable_masks *dst_em,
> +		const struct panthor_perf_enable_masks *const src_em)
> +{
> +	size_t i = 0;
> +
> +	for (i = DRM_PANTHOR_PERF_BLOCK_FW; i <= DRM_PANTHOR_PERF_BLOCK_LAST; i++)
> +		bitmap_or(dst_em->mask[i], dst_em->mask[i], src_em->mask[i], PANTHOR_PERF_EM_BITS);
> +}
> +
> +static void panthor_perf_em_zero(struct panthor_perf_enable_masks *em)
> +{
> +	size_t i = 0;
> +
> +	for (i = DRM_PANTHOR_PERF_BLOCK_FW; i <= DRM_PANTHOR_PERF_BLOCK_LAST; i++)
> +		bitmap_zero(em->mask[i], PANTHOR_PERF_EM_BITS);
> +}
> +
>  static void panthor_perf_destroy_em_kref(struct kref *em_kref)
>  {
>  	struct panthor_perf_enable_masks *em = container_of(em_kref, typeof(*em), refs);
> @@ -270,6 +456,12 @@ static u32 session_read_extract_idx(struct panthor_perf_session *session)
>  	return smp_load_acquire(session->extract_idx);
>  }
>  
> +static void session_write_insert_idx(struct panthor_perf_session *session, u32 idx)
> +{
> +	/* Userspace needs the insert index to know where to look for the sample. */
> +	smp_store_release(session->insert_idx, idx);
> +}
> +
>  static u32 session_read_insert_idx(struct panthor_perf_session *session)
>  {
>  	return *session->insert_idx;
> @@ -349,6 +541,70 @@ static struct panthor_perf_session *session_find(struct panthor_file *pfile,
>  	return session;
>  }
>  
> +static u32 compress_enable_mask(unsigned long *const src)
> +{
> +	size_t i;
> +	u32 result = 0;
> +	unsigned long clump;
> +
> +	for_each_set_clump8(i, clump, src, PANTHOR_PERF_EM_BITS) {
> +		const unsigned long shift = div_u64(i, 4);
> +
> +		result |= !!(clump & GENMASK(3, 0)) << shift;
> +		result |= !!(clump & GENMASK(7, 4)) << (shift + 1);
> +	}
> +
> +	return result;
> +}
> +
> +static void expand_enable_mask(u32 em, unsigned long *const dst)
> +{
> +	size_t i;
> +	DECLARE_BITMAP(emb, BITS_PER_TYPE(u32));
> +
> +	bitmap_from_arr32(emb, &em, BITS_PER_TYPE(u32));
> +
> +	for_each_set_bit(i, emb, BITS_PER_TYPE(u32))
> +		bitmap_set(dst, i * 4, 4);
> +}
> +
> +/**
> + * panthor_perf_block_data - Identify the block index and type based on the offset.
> + *
> + * @desc:   FW buffer descriptor.
> + * @offset: The current offset being examined.
> + * @idx:    Pointer to an output index.
> + * @type:   Pointer to an output block type.
> + *
> + * To disambiguate different types of blocks as well as different blocks of the same type,
> + * the offset into the FW ringbuffer is used to uniquely identify the block being considered.
> + *
> + * In the future, this is a good time to identify whether a block will be empty,
> + * allowing us to short-circuit its processing after emitting header information.
> + */
> +static void panthor_perf_block_data(struct panthor_perf_buffer_descriptor *const desc,
> +		size_t offset, u32 *idx, enum drm_panthor_perf_block_type *type)
> +{
> +	unsigned long id;
> +
> +	for_each_set_bit(id, desc->available_blocks, DRM_PANTHOR_PERF_BLOCK_LAST) {
> +		const size_t block_start = desc->blocks[id].offset;
> +		const size_t block_count = desc->blocks[id].block_count;
> +		const size_t block_end = desc->blocks[id].offset +
> +			desc->block_size * block_count;
> +
> +		if (!block_count)
> +			continue;
> +
> +		if ((offset >= block_start) && (offset < block_end)) {
> +			*type = desc->blocks[id].type;

  [1] I think in this case, id will always be the same as desc->blocks[id].type, so maybe
  

> +			*idx = div_u64(offset - desc->blocks[id].offset, desc->block_size);
> +
> +			return;
> +		}
> +	}
> +}
> +
>  static size_t session_get_max_sample_size(const struct drm_panthor_perf_info *const info)
>  {
>  	const size_t block_size = get_annotated_block_size(info->counters_per_block);
> @@ -358,6 +614,520 @@ static size_t session_get_max_sample_size(const struct drm_panthor_perf_info *co
>  	return sizeof(struct drm_panthor_perf_sample_header) + (block_size * block_nr);
>  }
>  
> +static u32 panthor_perf_handle_sample(struct panthor_device *ptdev, u32 extract_idx, u32 insert_idx)
> +{
> +	struct panthor_perf *perf = ptdev->perf;
> +	struct panthor_perf_sampler *sampler = &ptdev->perf->sampler;
> +	const size_t ann_block_size =
> +		get_annotated_block_size(ptdev->perf_info.counters_per_block);
> +	u32 i;
> +
> +	for (i = extract_idx; i != insert_idx; i = (i + 1) % sampler->sample_slots) {
> +		u8 *fw_sample = (u8 *)sampler->rb->kmap + i * sampler->sample_size;
> +
> +		for (size_t fw_off = 0, ann_off = sizeof(struct drm_panthor_perf_sample_header);
> +				fw_off < sampler->desc.buffer_size;
> +				fw_off += sampler->desc.block_size)
> +
> +		{
> +			u32 idx;
> +			enum drm_panthor_perf_block_type type;
> +			DECLARE_BITMAP(expanded_em, PANTHOR_PERF_EM_BITS);
> +			struct panthor_perf_counter_block *blk =
> +				(typeof(blk))(perf->sampler.sample + ann_off);
> +			const u32 prfcnt_en = blk->counters[PANTHOR_CTR_PRFCNT_EN];

This implies there was something in blk->counters[PANTHOR_CTR_PRFCNT_EN], but
where was it first written?
Shouldn't this be fetched from fw_sample? sampler.sample is being reset to 0 after a sample
is copied into the UM ringbuffer, so prfcnt_en will always be 0 here.

> +
> +			panthor_perf_block_data(&sampler->desc, fw_off, &idx, &type);
> +
> +			/**
> +			 * TODO Data from the metadata block must be used to populate the
> +			 * block state information.
> +			 */
> +			if (type == DRM_PANTHOR_PERF_BLOCK_METADATA)
> +				continue;
> +
> +			expand_enable_mask(prfcnt_en, expanded_em);
> +
> +			blk->header = (struct drm_panthor_perf_block_header) {
> +				.clock = 0,
> +				.block_idx = idx,
> +				.block_type = type,
> +				.block_states = DRM_PANTHOR_PERF_BLOCK_STATE_UNKNOWN
> +			};
> +			bitmap_to_arr64(blk->header.enable_mask, expanded_em, PANTHOR_PERF_EM_BITS);
> +
> +			u32 *block = (u32 *)(fw_sample + fw_off);
> +
> +			/*
> +			 * The four header counters must be treated differently, because they are
> +			 * not additive. For the fourth, the assignment does not matter, as it
> +			 * is reserved and should be zero.
> +			 */
> +			blk->counters[PANTHOR_CTR_TIMESTAMP_LO] = block[PANTHOR_CTR_TIMESTAMP_LO];
> +			blk->counters[PANTHOR_CTR_TIMESTAMP_HI] = block[PANTHOR_CTR_TIMESTAMP_HI];
> +			blk->counters[PANTHOR_CTR_PRFCNT_EN] = block[PANTHOR_CTR_PRFCNT_EN];
> +
> +			for (size_t k = PANTHOR_HEADER_COUNTERS;
> +					k < ptdev->perf_info.counters_per_block;
> +					k++)
> +				blk->counters[k] += block[k];
> +
> +			ann_off += ann_block_size;
> +		}
> +	}
> +
> +	return i;

This will always return insert_idx, so why return it the caller when it makes no difference?

> +}
> +
> +static size_t panthor_perf_get_fw_reported_size(struct panthor_device *ptdev)
> +{
> +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
> +
> +	size_t fw_size = GLB_PERFCNT_FW_SIZE(glb_iface->control->perfcnt_size);
> +	size_t hw_size = GLB_PERFCNT_HW_SIZE(glb_iface->control->perfcnt_size);
> +	size_t md_size = PERFCNT_FEATURES_MD_SIZE(glb_iface->control->perfcnt_features);
> +
> +	return md_size + fw_size + hw_size;
> +}
> +
> +#define PANTHOR_PERF_SET_BLOCK_DESC_DATA(desc, typ, blk_count, offset) \
> +	({ \
> +		(desc)->blocks[(typ)].type = (typ); \
> +		(desc)->blocks[(typ)].offset = (offset); \
> +		(desc)->blocks[(typ)].block_count = (blk_count);  \
> +		if ((blk_count))                                    \
> +			set_bit((typ), (desc)->available_blocks); \
> +		(offset) + ((desc)->block_size) * (blk_count); \
> +	 })
> +
> +static int panthor_perf_setup_fw_buffer_desc(struct panthor_device *ptdev,
> +		struct panthor_perf_sampler *sampler)
> +{
> +	const struct drm_panthor_perf_info *const info = &ptdev->perf_info;
> +	const size_t block_size = info->counters_per_block * sizeof(u32);
> +	struct panthor_perf_buffer_descriptor *desc = &sampler->desc;
> +	const size_t fw_sample_size = panthor_perf_get_fw_reported_size(ptdev);
> +	size_t offset = 0;
> +
> +	desc->block_size = block_size;

block_size is only used in this assignment, so maybe do away with the automatic
variable altogether?

desc->block_size = info->counters_per_block * sizeof(u32);

Also, where is desc->available_blocks being set?


> +	for (enum drm_panthor_perf_block_type type = 0; type < DRM_PANTHOR_PERF_BLOCK_MAX; type++) {
> +		switch (type) {
> +		case DRM_PANTHOR_PERF_BLOCK_METADATA:
> +			if (info->flags & DRM_PANTHOR_PERF_BLOCK_STATES_SUPPORT)
> +				offset = PANTHOR_PERF_SET_BLOCK_DESC_DATA(desc,
> +					DRM_PANTHOR_PERF_BLOCK_METADATA, 1, offset);
> +			break;
> +		case DRM_PANTHOR_PERF_BLOCK_FW:
> +			offset = PANTHOR_PERF_SET_BLOCK_DESC_DATA(desc, type, info->fw_blocks,
> +					offset);
> +			break;
> +		case DRM_PANTHOR_PERF_BLOCK_CSG:
> +			offset = PANTHOR_PERF_SET_BLOCK_DESC_DATA(desc, type, info->csg_blocks,
> +					offset);
> +			break;
> +		case DRM_PANTHOR_PERF_BLOCK_CSHW:
> +			offset = PANTHOR_PERF_SET_BLOCK_DESC_DATA(desc, type, info->cshw_blocks,
> +					offset);
> +			break;
> +		case DRM_PANTHOR_PERF_BLOCK_TILER:
> +			offset = PANTHOR_PERF_SET_BLOCK_DESC_DATA(desc, type, info->tiler_blocks,
> +					offset);
> +			break;
> +		case DRM_PANTHOR_PERF_BLOCK_MEMSYS:
> +			offset = PANTHOR_PERF_SET_BLOCK_DESC_DATA(desc, type, info->memsys_blocks,
> +					offset);
> +			break;
> +		case DRM_PANTHOR_PERF_BLOCK_SHADER:
> +			offset = PANTHOR_PERF_SET_BLOCK_DESC_DATA(desc, type, info->shader_blocks,
> +					offset);
> +			break;
> +		case DRM_PANTHOR_PERF_BLOCK_MAX:
> +			drm_WARN_ON_ONCE(&ptdev->base,
> +					"DRM_PANTHOR_PERF_BLOCK_MAX should be unreachable!");
> +			break;
> +		}
> +	}

Maybe to spare some code, because 'type' gives you the u32 pointer offset minus one from
drm_panthor_perf_info::fw_blocks, you could do as follows:

   for (enum drm_panthor_perf_block_type type = 0; type < DRM_PANTHOR_PERF_BLOCK_MAX; type++) {
	switch (type) {
	case DRM_PANTHOR_PERF_BLOCK_METADATA:
		if (info->flags & DRM_PANTHOR_PERF_BLOCK_STATES_SUPPORT)
			offset = PANTHOR_PERF_SET_BLOCK_DESC_DATA(desc,
				DRM_PANTHOR_PERF_BLOCK_METADATA, 1, offset);
		break;
	case DRM_PANTHOR_PERF_BLOCK_MAX:
		drm_WARN_ON_ONCE(&ptdev->base,
				"DRM_PANTHOR_PERF_BLOCK_MAX should be unreachable!");
		break;
	default:
		u32 blk_count = *((&info->fw_blocks)+(type-1));
		offset = PANTHOR_PERF_SET_BLOCK_DESC_DATA(desc, type, blk_count, offset);
		break;
	}
}

I already suggested a similar thing in a previous patch review, but this would depend
on the compiler not inserting funky padding space between consecutive structure members,
but all of them being u32 I guess that's not possible?  

> +
> +	/* Computed size is not the same as the reported size, so we should not proceed in
> +	 * initializing the sampling session.
> +	 */
> +	if (offset != fw_sample_size)
> +		return -EINVAL;
> +
> +	desc->buffer_size = offset;
> +
> +	return 0;
> +}
> +
> +static int panthor_perf_fw_stop_sampling(struct panthor_device *ptdev)
> +{
> +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
> +	u32 acked;
> +	int ret;
> +
> +	if (~READ_ONCE(glb_iface->input->req) & GLB_PERFCNT_ENABLE)
> +		return 0;
> +
> +	panthor_fw_update_reqs(glb_iface, req, 0, GLB_PERFCNT_ENABLE);
> +	gpu_write(ptdev, CSF_DOORBELL(CSF_GLB_DOORBELL_ID), 1);
> +	ret = panthor_fw_glb_wait_acks(ptdev, GLB_PERFCNT_ENABLE, &acked, 100);
> +	if (ret)
> +		drm_warn(&ptdev->base, "Could not disable performance counters");

After this, don't we have to set the selected counters to 0 through the fw's glb interface?

      	   glb_iface->input->* = 0;

> +
> +	return ret;
> +}
> +
> +static int panthor_perf_fw_start_sampling(struct panthor_device *ptdev)
> +{
> +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
> +	u32 acked;
> +	int ret;
> +
> +	if (READ_ONCE(glb_iface->input->req) & GLB_PERFCNT_ENABLE)
> +		return 0;
> +
> +	panthor_fw_update_reqs(glb_iface, req, GLB_PERFCNT_ENABLE, GLB_PERFCNT_ENABLE);
> +	gpu_write(ptdev, CSF_DOORBELL(CSF_GLB_DOORBELL_ID), 1);
> +	ret = panthor_fw_glb_wait_acks(ptdev, GLB_PERFCNT_ENABLE, &acked, 100);
> +	if (ret)
> +		drm_warn(&ptdev->base, "Could not enable performance counters");

Wouldn't you have to enable counters in the FW's global input interface? As in
calling panthor_perf_fw_write_em(sampler, sampler->em)

> +
> +	return ret;
> +}
> +
> +static void panthor_perf_fw_write_em(struct panthor_perf_sampler *sampler,
> +		struct panthor_perf_enable_masks *em)
> +{
> +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(sampler->ptdev);
> +	u32 perfcnt_config;
> +
> +	glb_iface->input->perfcnt_csf_enable =
> +		compress_enable_mask(em->mask[DRM_PANTHOR_PERF_BLOCK_CSHW]);

Wouldn't it be enough to get the lowest 32 bits of every bitmap for the actual mask?

> +	glb_iface->input->perfcnt_shader_enable =
> +		compress_enable_mask(em->mask[DRM_PANTHOR_PERF_BLOCK_SHADER]);
> +	glb_iface->input->perfcnt_mmu_l2_enable =
> +		compress_enable_mask(em->mask[DRM_PANTHOR_PERF_BLOCK_MEMSYS]);
> +	glb_iface->input->perfcnt_tiler_enable =
> +		compress_enable_mask(em->mask[DRM_PANTHOR_PERF_BLOCK_TILER]);
> +	glb_iface->input->perfcnt_fw_enable =
> +		compress_enable_mask(em->mask[DRM_PANTHOR_PERF_BLOCK_FW]);
> +	glb_iface->input->perfcnt_csg_enable =
> +		compress_enable_mask(em->mask[DRM_PANTHOR_PERF_BLOCK_CSG]);
> +
> +	perfcnt_config = GLB_PRFCNT_CONFIG_SIZE(PANTHOR_PERF_FW_RINGBUF_SLOTS);

Maybe replace this with
	perfcnt_config = GLB_PRFCNT_CONFIG_SIZE(sampler->sample_slots);
in case the number of slots is ever made a module parameter?

> +	perfcnt_config |= GLB_PRFCNT_CONFIG_SET(sampler->set_config);
> +	glb_iface->input->perfcnt_config = perfcnt_config;
> +
> +	/**
> +	 * The spec mandates that the host zero the PRFCNT_EXTRACT register before an enable
> +	 * operation, and each (re-)enable will require an enable-disable pair to program
> +	 * the new changes onto the FW interface.
> +	 */
> +	WRITE_ONCE(glb_iface->input->perfcnt_extract, 0);

 Wouldn't this better go in panthor_perf_fw_stop_sampling()?

> +}
> +
> +static void session_populate_sample_header(struct panthor_perf_session *session,
> +		struct drm_panthor_perf_sample_header *hdr)
> +{
> +	hdr->block_set = 0;
> +	hdr->user_data = session->user_data;
> +	hdr->timestamp_start_ns = session->sample_start_ns;
> +	/**
> +	 * TODO This should be changed to use the GPU clocks and the TIMESTAMP register,
> +	 * when support is added.
> +	 */

Timestamp register is already available and used in the driver.

> +	hdr->timestamp_end_ns = ktime_get_raw_ns();
> +}
> +
> +/**
> + * session_patch_sample - Update the PRFCNT_EN header counter and the counters exposed to the
> + *                        userspace client to only contain requested counters.
> + *
> + * @ptdev: Panthor device
> + * @session: Perf session
> + * @sample: Starting offset of the sample in the userspace mapping.
> + *
> + * The hardware supports counter selection at the granularity of 1 bit per 4 counters, and there
> + * is a single global FW frontend to program the counter requests from multiple sessions. This may
> + * lead to a large disparity between the requested and provided counters for an individual client.
> + * To remove this cross-talk, we patch out the counters that have not been requested by this
> + * session and update the PRFCNT_EN, the header counter containing a bitmask of enabled counters,
> + * accordingly.
> + */
> +static void session_patch_sample(struct panthor_device *ptdev,
> +		struct panthor_perf_session *session, u8 *sample)
> +{
> +	const struct drm_panthor_perf_info *const perf_info = &ptdev->perf_info;
> +
> +	const size_t block_size = get_annotated_block_size(perf_info->counters_per_block);
> +	const size_t sample_size = session_get_max_sample_size(perf_info);

I think maybe you want the fw ring sample size minus the user ring header:
const size_t blocks_size = sample_size - sizeof(struct drm_panthor_perf_sample_header);

> +	for (size_t i = 0; i < sample_size; i += block_size) {
Same as above:

	for (size_t i = 0; i < blocks_size; i += block_size) {

> +		size_t ctr_idx;
> +		DECLARE_BITMAP(em_diff, PANTHOR_PERF_EM_BITS);
> +		struct panthor_perf_counter_block *blk = (typeof(blk))(sample + block_size);
Same as above:
                struct panthor_perf_counter_block *blk = (typeof(blk))(sample + i);
Otherwise the iteration index isn't used.
> +		enum drm_panthor_perf_block_type type = blk->header.block_type;
> +		unsigned long *blk_em = session->enabled_counters->mask[type];
> +
> +		bitmap_from_arr64(em_diff, blk->header.enable_mask, PANTHOR_PERF_EM_BITS);
> +
> +		bitmap_andnot(em_diff, em_diff, blk_em, PANTHOR_PERF_EM_BITS);
> +
> +		for_each_set_bit(ctr_idx, em_diff, PANTHOR_PERF_EM_BITS)
> +			blk->counters[ctr_idx] = 0;
> +
> +		bitmap_to_arr64(blk->header.enable_mask, blk_em, PANTHOR_PERF_EM_BITS);
> +	}
> +}
> +
> +static int session_copy_samplei(struct panthor_device *ptdev,
> +		struct panthor_perf_session *session)
> +{
> +	struct panthor_perf *perf = ptdev->perf;
> +	const size_t sample_size = session_get_max_sample_size(&ptdev->perf_info);
> +	const u32 insert_idx = session_read_insert_idx(session);
> +	const u32 extract_idx = session_read_extract_idx(session);
> +	u8 *new_sample;
> +
> +	if (!CIRC_SPACE_TO_END(insert_idx, extract_idx, session->ringbuf_slots))
> +		return -ENOSPC;
> +
> +	new_sample = session->samples + extract_idx * sample_size;

Wouldn' this have to insert_idx?

> +
> +	memcpy(new_sample, perf->sampler.sample, sample_size);

So effectively we're doing two copies here, one from the FW ringbuffer into the
sampler, and another one from the sampler into the user ringbuffer. This sounds
expensive. This is in line with what I already mentioned above, essentially that
I thougth the FW ringbuffer would be somehow made available to UM samplers so
as to afford zero copy.           

> +	session_populate_sample_header(session,
> +			(struct drm_panthor_perf_sample_header *)new_sample);
> +
> +	session_patch_sample(ptdev, session, new_sample +
> +			sizeof(struct drm_panthor_perf_sample_header));
> +
> +	session_write_insert_idx(session, (insert_idx + 1) % session->ringbuf_slots);
> +
> +	/* Since we are about to notify userspace, we must ensure that all changes to memory
> +	 * are visible.
> +	 */
> +	wmb();
> +
> +	eventfd_signal(session->eventfd);
> +
> +	return 0;
> +}
> +
> +#define PERFCNT_IRQS (GLB_PERFCNT_OVERFLOW | GLB_PERFCNT_SAMPLE | GLB_PERFCNT_THRESHOLD)
> +
> +void panthor_perf_report_irq(struct panthor_device *ptdev, u32 status)
> +{
> +	struct panthor_perf *const perf = ptdev->perf;
> +	struct panthor_perf_sampler *sampler;
> +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
> +
> +	if (!(status & JOB_INT_GLOBAL_IF))
> +		return;
> +
> +	if (!perf)
> +		return;
> +
> +	sampler = &perf->sampler;
> +
> +	/* TODO This needs locking. */
> +	const u32 ack = READ_ONCE(glb_iface->output->ack);
> +	const u32 fw_events = sampler->last_ack ^ ack;

I think I would do it like this:

 	const u32 req = READ_ONCE(glb_iface->input->req);
 	const u32 ack = READ_ONCE(glb_iface->output->ack);

 	if (!(~(req ^ ack) & PERFCNT_IRQS))
 		return;

> +	sampler->last_ack = ack;
> +
> +	if (!(fw_events & PERFCNT_IRQS))
> +		return;
> +
> +	/* TODO Fix up the error handling for overflow. */
> +	if (fw_events & GLB_PERFCNT_OVERFLOW)
> +		return;
> +
> +	if (fw_events & (GLB_PERFCNT_SAMPLE | GLB_PERFCNT_THRESHOLD)) {
> +		const u32 extract_idx = READ_ONCE(glb_iface->input->perfcnt_extract);
> +		const u32 insert_idx = READ_ONCE(glb_iface->output->perfcnt_insert);
> +
> +		WRITE_ONCE(glb_iface->input->perfcnt_extract,
> +				panthor_perf_handle_sample(ptdev, extract_idx, insert_idx));
> +	}
> +
> +	scoped_guard(mutex, &sampler->sampler_lock)
> +	{
> +		struct list_head *pos, *temp;
> +
> +		list_for_each_safe(pos, temp, &sampler->sampler_list) {
> +			struct panthor_perf_session *session = list_entry(pos,
> +					struct panthor_perf_session, waiting);
> +
> +			session_copy_sample(ptdev, session);
> +			list_del_init(pos);

I guess in the case of samples being taken periodically, you would not delete
the session from the sampler's list?

> +
> +			session_put(session);
> +		}
> +	}
> +
> +	memset(sampler->sample, 0, session_get_max_sample_size(&ptdev->perf_info));

I don't think sampler->sample has the user sample header in it, so we might be
zero'ing out more memory than really needed.

> +	sampler->sample_requested = false;
> +	complete(&sampler->sample_handled);
> +}
> +
> +
> +static int panthor_perf_sampler_init(struct panthor_perf_sampler *sampler,
> +		struct panthor_device *ptdev)
> +{
> +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
> +	struct panthor_kernel_bo *bo;
> +	u8 *sample;
> +	int ret;
> +
> +	ret = panthor_perf_setup_fw_buffer_desc(ptdev, sampler);
> +	if (ret) {
> +		drm_err(&ptdev->base,
> +				"Failed to setup descriptor for FW ring buffer, err = %d", ret);
> +		return ret;
> +	}
> +
> +	bo = panthor_kernel_bo_create(ptdev, panthor_fw_vm(ptdev),
> +			sampler->desc.buffer_size * PANTHOR_PERF_FW_RINGBUF_SLOTS,
> +			DRM_PANTHOR_BO_NO_MMAP,
> +			DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC | DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
> +			PANTHOR_VM_KERNEL_AUTO_VA);
> +
> +	if (IS_ERR_OR_NULL(bo))
> +		return IS_ERR(bo) ? PTR_ERR(bo) : -ENOMEM;
> +
> +	ret = panthor_kernel_bo_vmap(bo);
> +	if (ret)
> +		goto cleanup_bo;
> +
> +	sample = devm_kzalloc(ptdev->base.dev,
> +			session_get_max_sample_size(&ptdev->perf_info), GFP_KERNEL);
> +	if (ZERO_OR_NULL_PTR(sample)) {
> +		ret = -ENOMEM;
> +		goto cleanup_vmap;
> +	}
> +
> +	glb_iface->input->perfcnt_as = panthor_vm_as(panthor_fw_vm(ptdev));
> +	glb_iface->input->perfcnt_base = panthor_kernel_bo_gpuva(bo);
> +	glb_iface->input->perfcnt_extract = 0;
> +	glb_iface->input->perfcnt_csg_select = GENMASK(glb_iface->control->group_num, 0);
> +
> +	sampler->rb = bo;
> +	sampler->sample = sample;
> +	sampler->sample_slots = PANTHOR_PERF_FW_RINGBUF_SLOTS;

I already mentioned this, but I think it'd be nice to have this configurable
through a module parameter.

Also, I suspect you might've forgotten this assignment right here:

	sampler->sample_size = sampler->desc.buffer_size;

> +
> +	sampler->em = panthor_perf_em_new();
> +
> +	mutex_init(&sampler->sampler_lock);
> +	mutex_init(&sampler->config_lock);
> +	INIT_LIST_HEAD(&sampler->sampler_list);
> +	INIT_LIST_HEAD(&sampler->ems);
> +	init_completion(&sampler->sample_handled);
> +
> +	sampler->ptdev = ptdev;
> +
> +	return 0;
> +
> +cleanup_vmap:
> +	panthor_kernel_bo_vunmap(bo);
> +
> +cleanup_bo:
> +	panthor_kernel_bo_destroy(bo);
> +
> +	return ret;
> +}
> +
> +static void panthor_perf_sampler_term(struct panthor_perf_sampler *sampler)
> +{
> +	int ret;
> +
> +	if (sampler->sample_requested)
> +		wait_for_completion_killable(&sampler->sample_handled);

Wouldn't it be better to wait until a certain deadline?

> +	panthor_perf_fw_write_em(sampler, &(struct panthor_perf_enable_masks) {});

I would handle this through a little helper and call it panthor_perf_fw_zero_em
because other calls of panthor_perf_fw_write_em() always feature the same arguments.

> +	ret = panthor_perf_fw_stop_sampling(sampler->ptdev);
> +	if (ret)
> +		drm_warn_once(&sampler->ptdev->base, "Sampler termination failed, ret = %d", ret);
> +
> +	devm_kfree(sampler->ptdev->base.dev, sampler->sample);
> +
> +	panthor_kernel_bo_destroy(sampler->rb);
> +}
> +
> +static int panthor_perf_sampler_add(struct panthor_perf_sampler *sampler,
> +		struct panthor_perf_enable_masks *const new_em,
> +		u8 set)
> +{
> +	int ret = 0;
> +
> +	guard(mutex)(&sampler->config_lock);
> +
> +	/* Early check for whether a new set can be configured. */
> +	if (!atomic_read(&sampler->enabled_clients))
> +		sampler->set_config = set;
> +	else
> +		if (sampler->set_config != set)
> +			return -EBUSY;
> +
> +	kref_get(&new_em->refs);
> +	list_add_tail(&sampler->ems, &new_em->link);
> +
> +	panthor_perf_em_add(sampler->em, new_em);
> +	pm_runtime_get_sync(sampler->ptdev->base.dev);
> +
> +	if (atomic_read(&sampler->enabled_clients)) {
> +		ret = panthor_perf_fw_stop_sampling(sampler->ptdev);
> +		if (ret)
> +			return ret;

What happens in the case of a manual session sample? Does that mean it'll get
interrupted? I guess if we use eventfd it's not a problem, because UM will be
notified of when the sample it's made available?

> +	}
> +
> +	panthor_perf_fw_write_em(sampler, sampler->em);

Almost every single use of panthor_perf_fw_write_em is called with
the same arguments, so I would define a helper that passes sampler->em underneath,
maybe something like panthor_perf_fw_enable_sampler_mask

> +	ret = panthor_perf_fw_start_sampling(sampler->ptdev);
> +	if (ret)
> +		return ret;

iWwouldn't you do this only if if (atomic_read(&sampler->enabled_clients) > 0) ?
Unless you want to start sampling now rather than waiting for session start.

> +	atomic_inc(&sampler->enabled_clients);
> +
> +	return 0;
> +}
> +
> +static int panthor_perf_sampler_remove(struct panthor_perf_sampler *sampler,
> +		struct panthor_perf_enable_masks *session_em)

I might prefer to call this sampler_remove_session.

> +{
> +	int ret;
> +	struct list_head *em_node;
> +
> +	guard(mutex)(&sampler->config_lock);
> +
> +	list_del_init(&session_em->link);
> +	kref_put(&session_em->refs, panthor_perf_destroy_em_kref);
> +
> +	panthor_perf_em_zero(sampler->em);
> +	list_for_each(em_node, &sampler->ems)
> +	{
> +		struct panthor_perf_enable_masks *curr_em =
> +			container_of(em_node, typeof(*curr_em), link);
> +
> +		panthor_perf_em_add(sampler->em, curr_em);
> +	}
> +
> +	ret = panthor_perf_fw_stop_sampling(sampler->ptdev);
> +	if (ret)
> +		return ret;
> +
> +	atomic_dec(&sampler->enabled_clients);
> +	pm_runtime_put_sync(sampler->ptdev->base.dev);
> +
> +	panthor_perf_fw_write_em(sampler, sampler->em);
> +
> +	if (atomic_read(&sampler->enabled_clients))
> +		return panthor_perf_fw_start_sampling(sampler->ptdev);
> +	return 0;
> +}
> +
>  /**
>   * panthor_perf_init - Initialize the performance counter subsystem.
>   * @ptdev: Panthor device
> @@ -370,6 +1140,7 @@ static size_t session_get_max_sample_size(const struct drm_panthor_perf_info *co
>  int panthor_perf_init(struct panthor_device *ptdev)
>  {
>  	struct panthor_perf *perf;
> +	int ret;
>  
>  	if (!ptdev)
>  		return -EINVAL;
> @@ -386,12 +1157,93 @@ int panthor_perf_init(struct panthor_device *ptdev)
>  		.max = 1,
>  	};
>  
> +	ret = panthor_perf_sampler_init(&perf->sampler, ptdev);
> +	if (ret)
> +		goto cleanup_perf;
> +
>  	drm_info(&ptdev->base, "Performance counter subsystem initialized");
>  
>  	ptdev->perf = perf;
>  
> -	return 0;
> +	return ret;
> +
> +cleanup_perf:
> +	devm_kfree(ptdev->base.dev, perf);
> +
> +	return ret;
> +}
> +

Nit: spurious blank line

> +static void panthor_perf_fw_request_sample(struct panthor_perf_sampler *sampler)
> +{
> +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(sampler->ptdev);
> +
> +	panthor_fw_toggle_reqs(glb_iface, req, ack, GLB_PERFCNT_SAMPLE);
> +	gpu_write(sampler->ptdev, CSF_DOORBELL(CSF_GLB_DOORBELL_ID), 1);
> +}

Is there a way to reset the counters to zero without requesting a clearing sample?

> +/**
> + * panthor_perf_sampler_request_clearing - Request a clearing sample.
> + * @sampler: Panthor sampler
> + *
> + * Perform a synchronous sample that gets immediately discarded. This sets a baseline at the point
> + * of time a new session is started, to avoid having counters from before the session.
> + *
> + */
> +static int panthor_perf_sampler_request_clearing(struct panthor_perf_sampler *sampler)
> +{
> +	scoped_guard(mutex, &sampler->sampler_lock) {
> +		if (!sampler->sample_requested) {
> +			panthor_perf_fw_request_sample(sampler);
> +			sampler->sample_requested = true;
> +		}
> +	}
> +
> +	return wait_for_completion_timeout(&sampler->sample_handled,
> +			msecs_to_jiffies(1000));
> +}
> +
> +/**
> + * panthor_perf_sampler_request_sample - Request a counter sample for the userspace client.
> + * @sampler: Panthor sampler
> + * @session: Target session
> + *
> + * A session that has already requested a sample cannot request another one until the previous
> + * sample has been delivered.
> + *
> + * Return:
> + * * %0       - The sample has been requested successfully.
> + * * %-EBUSY  - The target session has already requested a sample and has not received it yet.
> + */
> +static int panthor_perf_sampler_request_sample(struct panthor_perf_sampler *sampler,
> +		struct panthor_perf_session *session)
> +{
> +	struct list_head *head;
> +
> +	reinit_completion(&sampler->sample_handled);

Shouldn't you wait until checking whether a sample has already been requested?

> +	guard(mutex)(&sampler->sampler_lock);
> +
> +	/*
> +	 * If a previous sample has not been handled yet, the session cannot request another
> +	 * sample. If this happens too often, the requested sample rate is too high.
> +	 */
> +	list_for_each(head, &sampler->sampler_list) {
> +		struct panthor_perf_session *cur_session = list_entry(head,
> +				typeof(*cur_session), waiting);
> +
> +		if (session == cur_session)
> +			return -EBUSY;
> +	}
> +
> +	if (list_empty(&sampler->sampler_list) && !sampler->sample_requested)
> +		panthor_perf_fw_request_sample(sampler);
>  
> +	sampler->sample_requested = true;
> +	list_add_tail(&session->waiting, &sampler->sampler_list);
> +	session_get(session);
> +
> +	return 0;
>  }
>  
>  static int session_validate_set(u8 set)
> @@ -483,7 +1335,12 @@ int panthor_perf_session_setup(struct panthor_device *ptdev, struct panthor_perf
>  		goto cleanup_eventfd;
>  	}
>  
> +	ret = panthor_perf_sampler_add(&perf->sampler, em, setup_args->block_set);
> +	if (ret)
> +		goto cleanup_em;
> +
>  	INIT_LIST_HEAD(&session->waiting);
> +
>  	session->extract_idx = ctrl_map.vaddr;
>  	*session->extract_idx = 0;
>  	session->insert_idx = session->extract_idx + 1;
> @@ -507,12 +1364,15 @@ int panthor_perf_session_setup(struct panthor_device *ptdev, struct panthor_perf
>  	ret = xa_alloc_cyclic(&perf->sessions, &session_id, session, perf->session_range,
>  			&perf->next_session, GFP_KERNEL);
>  	if (ret < 0)
> -		goto cleanup_em;
> +		goto cleanup_sampler_add;
>  
>  	kref_init(&session->ref);
>  
>  	return session_id;
>  
> +cleanup_sampler_add:
> +	panthor_perf_sampler_remove(&perf->sampler, em);
> +
>  cleanup_em:
>  	kref_put(&em->refs, panthor_perf_destroy_em_kref);
>  
> @@ -540,6 +1400,8 @@ int panthor_perf_session_setup(struct panthor_device *ptdev, struct panthor_perf
>  static int session_stop(struct panthor_perf *perf, struct panthor_perf_session *session,
>  		u64 user_data)
>  {
> +	int ret;
> +
>  	if (!test_bit(PANTHOR_PERF_SESSION_ACTIVE, session->state))
>  		return 0;
>  
> @@ -552,6 +1414,10 @@ static int session_stop(struct panthor_perf *perf, struct panthor_perf_session *
>  
>  	session->user_data = user_data;
>  
> +	ret = panthor_perf_sampler_request_sample(&perf->sampler, session);
> +	if (ret)
> +		return ret;
> +
>  	clear_bit(PANTHOR_PERF_SESSION_ACTIVE, session->state);
>  
>  	/* TODO Calls to the FW interface will go here in later patches. */
> @@ -573,8 +1439,7 @@ static int session_start(struct panthor_perf *perf, struct panthor_perf_session
>  	if (session->sample_freq_ns)
>  		session->user_data = user_data;
>  
> -	/* TODO Calls to the FW interface will go here in later patches. */
> -	return 0;
> +	return panthor_perf_sampler_request_clearing(&perf->sampler);
>  }
>  
>  static int session_sample(struct panthor_perf *perf, struct panthor_perf_session *session,
> @@ -601,15 +1466,16 @@ static int session_sample(struct panthor_perf *perf, struct panthor_perf_session
>  	session->sample_start_ns = ktime_get_raw_ns();
>  	session->user_data = user_data;
>  
> -	/* TODO Calls to the FW interface will go here in later patches. */
> -	return 0;
> +	return panthor_perf_sampler_request_sample(&perf->sampler, session);
>  }
>  
>  static int session_destroy(struct panthor_perf *perf, struct panthor_perf_session *session)
>  {
> +	int ret = panthor_perf_sampler_remove(&perf->sampler, session->enabled_counters);
> +
>  	session_put(session);
>  
> -	return 0;
> +	return ret;
>  }
>  
>  static int session_teardown(struct panthor_perf *perf, struct panthor_perf_session *session)
> @@ -813,6 +1679,8 @@ void panthor_perf_unplug(struct panthor_device *ptdev)
>  
>  	xa_destroy(&perf->sessions);
>  
> +	panthor_perf_sampler_term(&perf->sampler);
> +
>  	devm_kfree(ptdev->base.dev, ptdev->perf);
>  
>  	ptdev->perf = NULL;
> diff --git a/drivers/gpu/drm/panthor/panthor_perf.h b/drivers/gpu/drm/panthor/panthor_perf.h
> index bfef8874068b..3485e4a55e15 100644
> --- a/drivers/gpu/drm/panthor/panthor_perf.h
> +++ b/drivers/gpu/drm/panthor/panthor_perf.h
> @@ -31,4 +31,6 @@ int panthor_perf_session_sample(struct panthor_file *pfile, struct panthor_perf
>  		u32 sid, u64 user_data);
>  void panthor_perf_session_destroy(struct panthor_file *pfile, struct panthor_perf *perf);
>  
> +void panthor_perf_report_irq(struct panthor_device *ptdev, u32 status);
> +
>  #endif /* __PANTHOR_PERF_H__ */
> diff --git a/include/uapi/drm/panthor_drm.h b/include/uapi/drm/panthor_drm.h
> index 576d3ad46e6d..a29b755d6556 100644
> --- a/include/uapi/drm/panthor_drm.h
> +++ b/include/uapi/drm/panthor_drm.h
> @@ -441,8 +441,11 @@ enum drm_panthor_perf_feat_flags {
>   * enum drm_panthor_perf_block_type - Performance counter supported block types.
>   */
>  enum drm_panthor_perf_block_type {
> +	/** DRM_PANTHOR_PERF_BLOCK_METADATA: Internal use only. */
> +	DRM_PANTHOR_PERF_BLOCK_METADATA = 0,
> +
>  	/** @DRM_PANTHOR_PERF_BLOCK_FW: The FW counter block. */
> -	DRM_PANTHOR_PERF_BLOCK_FW = 1,
> +	DRM_PANTHOR_PERF_BLOCK_FW,
>  
>  	/** @DRM_PANTHOR_PERF_BLOCK_CSG: A CSG counter block. */
>  	DRM_PANTHOR_PERF_BLOCK_CSG,
> -- 
> 2.25.1


Adrian Larumbe

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC v2 7/8] drm/panthor: Add suspend/resume handling for the performance counters
  2024-12-11 16:50 ` [RFC v2 7/8] drm/panthor: Add suspend/resume handling for the performance counters Lukas Zapolskas
@ 2025-01-27 20:06   ` Adrián Larumbe
  2025-03-27  8:57     ` Lukas Zapolskas
  0 siblings, 1 reply; 28+ messages in thread
From: Adrián Larumbe @ 2025-01-27 20:06 UTC (permalink / raw)
  To: Lukas Zapolskas
  Cc: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	dri-devel, linux-kernel, Mihail Atanassov, nd

On 11.12.2024 16:50, Lukas Zapolskas wrote:
> Signed-off-by: Lukas Zapolskas <lukas.zapolskas@arm.com>
> ---
>  drivers/gpu/drm/panthor/panthor_device.c |  3 +
>  drivers/gpu/drm/panthor/panthor_perf.c   | 86 ++++++++++++++++++++++++
>  drivers/gpu/drm/panthor/panthor_perf.h   |  2 +
>  3 files changed, 91 insertions(+)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
> index 1a81a436143b..69536fbdb5ef 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.c
> +++ b/drivers/gpu/drm/panthor/panthor_device.c
> @@ -475,6 +475,7 @@ int panthor_device_resume(struct device *dev)
>  		ret = drm_WARN_ON(&ptdev->base, panthor_fw_resume(ptdev));
>  		if (!ret) {
>  			panthor_sched_resume(ptdev);
> +			panthor_perf_resume(ptdev);
>  		} else {
>  			panthor_mmu_suspend(ptdev);
>  			panthor_gpu_suspend(ptdev);
> @@ -543,6 +544,7 @@ int panthor_device_suspend(struct device *dev)
>  	    drm_dev_enter(&ptdev->base, &cookie)) {
>  		cancel_work_sync(&ptdev->reset.work);
>  
> +		panthor_perf_suspend(ptdev);
>  		/* We prepare everything as if we were resetting the GPU.
>  		 * The end of the reset will happen in the resume path though.
>  		 */
> @@ -561,6 +563,7 @@ int panthor_device_suspend(struct device *dev)
>  			panthor_mmu_resume(ptdev);
>  			drm_WARN_ON(&ptdev->base, panthor_fw_resume(ptdev));
>  			panthor_sched_resume(ptdev);
> +			panthor_perf_resume(ptdev);
>  			drm_dev_exit(cookie);
>  		}
>  
> diff --git a/drivers/gpu/drm/panthor/panthor_perf.c b/drivers/gpu/drm/panthor/panthor_perf.c
> index d62d97c448da..727e66074eab 100644
> --- a/drivers/gpu/drm/panthor/panthor_perf.c
> +++ b/drivers/gpu/drm/panthor/panthor_perf.c
> @@ -433,6 +433,17 @@ static void panthor_perf_em_zero(struct panthor_perf_enable_masks *em)
>  		bitmap_zero(em->mask[i], PANTHOR_PERF_EM_BITS);
>  }
>  
> +static bool panthor_perf_em_empty(const struct panthor_perf_enable_masks *const em)
> +{
> +	bool empty = true;
> +	size_t i = 0;
> +
> +	for (i = DRM_PANTHOR_PERF_BLOCK_FW; i <= DRM_PANTHOR_PERF_BLOCK_LAST; i++)
> +		empty &= bitmap_empty(em->mask[i], PANTHOR_PERF_EM_BITS);
> +
> +	return empty;
> +}
> +
>  static void panthor_perf_destroy_em_kref(struct kref *em_kref)
>  {
>  	struct panthor_perf_enable_masks *em = container_of(em_kref, typeof(*em), refs);
> @@ -1652,6 +1663,81 @@ void panthor_perf_session_destroy(struct panthor_file *pfile, struct panthor_per
>  	}
>  }
>  
> +static int panthor_perf_sampler_resume(struct panthor_perf_sampler *sampler)
> +{
> +	int ret;
> +
> +	if (!atomic_read(&sampler->enabled_clients))
> +		return 0;
> +
> +	if (!panthor_perf_em_empty(sampler->em)) {
> +		guard(mutex)(&sampler->config_lock);
> +		panthor_perf_fw_write_em(sampler, sampler->em);
> +	}

Aren't panthor_perf_em_empty(sampler->em) and !atomic_read(&sampler->enabled_clients) functionally equivalent?

> +
> +	ret = panthor_perf_fw_start_sampling(sampler->ptdev);
> +	if (ret)
> +		return ret;
> +
> +	return 0;
> +}
> +
> +static int panthor_perf_sampler_suspend(struct panthor_perf_sampler *sampler)
> +{
> +	int ret;
> +
> +	if (!atomic_read(&sampler->enabled_clients))
> +		return 0;
> +
> +	ret = panthor_perf_fw_stop_sampling(sampler->ptdev);
> +	if (ret)
> +		return ret;
> +
> +	return 0;
> +}
> +
> +/**
> + * panthor_perf_suspend - Prepare the performance counter subsystem for system suspend.
> + * @ptdev: Panthor device.
> + *
> + * Indicate to the performance counters that the system is suspending.
> + *
> + * This function must not be used to handle MCU power state transitions: just before MCU goes
> + * from on to any inactive state, an automatic sample will be performed by the firmware, and
> + * the performance counter firmware state will be restored on warm boot.
> + *
> + * Return: 0 on success, negative error code on failure.
> + */
> +int panthor_perf_suspend(struct panthor_device *ptdev)
> +{
> +	struct panthor_perf *perf = ptdev->perf;
> +
> +	if (!perf)
> +		return 0;
> +
> +	return panthor_perf_sampler_suspend(&perf->sampler);
> +}
> +
> +/**
> + * panthor_perf_resume - Resume the performance counter subsystem after system resumption.
> + * @ptdev: Panthor device.
> + *
> + * Indicate to the performance counters that the system has resumed. This must not be used
> + * to handle MCU state transitions, for the same reasons as detailed in the kerneldoc for
> + * @panthor_perf_suspend.
> + *
> + * Return: 0 on success, negative error code on failure.
> + */
> +int panthor_perf_resume(struct panthor_device *ptdev)
> +{
> +	struct panthor_perf *perf = ptdev->perf;
> +
> +	if (!perf)
> +		return 0;
> +
> +	return panthor_perf_sampler_resume(&perf->sampler);
> +}
> +
>  /**
>   * panthor_perf_unplug - Terminate the performance counter subsystem.
>   * @ptdev: Panthor device.
> diff --git a/drivers/gpu/drm/panthor/panthor_perf.h b/drivers/gpu/drm/panthor/panthor_perf.h
> index 3485e4a55e15..a22a511a0809 100644
> --- a/drivers/gpu/drm/panthor/panthor_perf.h
> +++ b/drivers/gpu/drm/panthor/panthor_perf.h
> @@ -16,6 +16,8 @@ struct panthor_perf;
>  void panthor_perf_info_init(struct panthor_device *ptdev);
>  
>  int panthor_perf_init(struct panthor_device *ptdev);
> +int panthor_perf_suspend(struct panthor_device *ptdev);
> +int panthor_perf_resume(struct panthor_device *ptdev);
>  void panthor_perf_unplug(struct panthor_device *ptdev);
>  
>  int panthor_perf_session_setup(struct panthor_device *ptdev, struct panthor_perf *perf,
> -- 
> 2.25.1

Adrian Larumbe

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC v2 8/8] drm/panthor: Expose the panthor perf ioctls
  2024-12-11 16:50 ` [RFC v2 8/8] drm/panthor: Expose the panthor perf ioctls Lukas Zapolskas
@ 2025-01-27 20:14   ` Adrián Larumbe
  2025-03-27  8:58     ` Lukas Zapolskas
  0 siblings, 1 reply; 28+ messages in thread
From: Adrián Larumbe @ 2025-01-27 20:14 UTC (permalink / raw)
  To: Lukas Zapolskas
  Cc: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	dri-devel, linux-kernel, Mihail Atanassov, nd

I don't know what the usual practice is when adding a new DRM driver ioctl(), but wouldn't it make
more sense to add the PERF_CONTROL one to the panthor_drm_driver_ioctls array in this patch instead?

Other than that:

Reviewed-by: Adrián Larumbe <adrian.larumbe@collabora.com>

On 11.12.2024 16:50, Lukas Zapolskas wrote:
> Signed-off-by: Lukas Zapolskas <lukas.zapolskas@arm.com>
> ---
>  drivers/gpu/drm/panthor/panthor_drv.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
> index 2848ab442d10..ef081a383fa9 100644
> --- a/drivers/gpu/drm/panthor/panthor_drv.c
> +++ b/drivers/gpu/drm/panthor/panthor_drv.c
> @@ -1654,6 +1654,8 @@ static void panthor_debugfs_init(struct drm_minor *minor)
>   * - 1.1 - adds DEV_QUERY_TIMESTAMP_INFO query
>   * - 1.2 - adds DEV_QUERY_GROUP_PRIORITIES_INFO query
>   *       - adds PANTHOR_GROUP_PRIORITY_REALTIME priority
> + * - 1.3 - adds DEV_QUERY_PERF_INFO query
> + *         adds PERF_CONTROL ioctl
>   */
>  static const struct drm_driver panthor_drm_driver = {
>  	.driver_features = DRIVER_RENDER | DRIVER_GEM | DRIVER_SYNCOBJ |
> @@ -1667,7 +1669,7 @@ static const struct drm_driver panthor_drm_driver = {
>  	.name = "panthor",
>  	.desc = "Panthor DRM driver",
>  	.major = 1,
> -	.minor = 2,
> +	.minor = 3,
>  
>  	.gem_create_object = panthor_gem_create_object,
>  	.gem_prime_import_sg_table = drm_gem_shmem_prime_import_sg_table,
> -- 
> 2.25.1


Adrian Larumbe

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC v2 6/8] drm/panthor: Implement the counter sampler and sample handling
  2024-12-11 16:50 ` [RFC v2 6/8] drm/panthor: Implement the counter sampler and sample handling Lukas Zapolskas
  2025-01-27 16:53   ` Adrián Larumbe
@ 2025-01-27 21:09   ` Adrián Larumbe
  1 sibling, 0 replies; 28+ messages in thread
From: Adrián Larumbe @ 2025-01-27 21:09 UTC (permalink / raw)
  To: Lukas Zapolskas
  Cc: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	dri-devel, linux-kernel, Mihail Atanassov, nd

Resending partial review for patch 6/8 because of typing errors in the previous one.
Sorry about the hassle this might've caused.

On 11.12.2024 16:50, Lukas Zapolskas wrote:
> From: Adrián Larumbe <adrian.larumbe@collabora.com>
> 
> The sampler aggregates counter and set requests coming from userspace
> and mediates interactions with the FW interface, to ensure that user
> sessions cannot override the global configuration.
> 
> From the top-level interface, the sampler supports two different types
> of samples: clearing samples and regular samples. Clearing samples are
> a special sample type that allow for the creation of a sampling
> baseline, to ensure that a session does not obtain counter data from
> before its creation.
> 
> Upon receipt of a relevant interrupt, corresponding to one of the three
> relevant bits of the GLB_ACK register, the sampler takes any samples
> that occurred, and, based on the insert and extract indices, accumulates
> them to an internal storage buffer after zero-extending the counters
> from the 32-bit counters emitted by the hardware to 64-bit counters
> for internal accumulation.
> 
> When the performance counters are enabled, the FW ensures no counter
> data is lost when entering and leaving non-counting regions by producing
> automatic samples that do not correspond to a GLB_REQ.PRFCNT_SAMPLE
> request. Such regions may be per hardware unit, such as when a shader
> core powers down, or global. Most of these events do not directly
> correspond to session sample requests, so any intermediary counter data
> must be stored into a temporary accumulation buffer.
> 
> If there are sessions waiting for a sample, this accumulated buffer will
> be taken, and emitted for each waiting client. During this phase,
> information like the timestamps of sample request and sample emission,
> type of the counter block and block index annotations are added to the
> sample header and block headers. If no sessions are waiting for
> a sample, this accumulation buffer is kept until the next time a sample
> is requested.
> 
> Special handling is needed for the PRFCNT_OVERFLOW interrupt, which is
> an indication that the internal sample handling rate was insufficient.
> 
> The sampler also maintains a buffer descriptor indicating the structure
> of a firmware sample, since neither the firmware nor the hardware give
> any indication of the sample structure, only that it is composed out of
> three parts:
>  - the metadata is an optional initial counter block on supporting
>    firmware versions that contains a single counter, indicating the
>    reason a sample was taken when entering global non-counting regions.
>    This is used to provide coarse-grained information about why a sample
>    was taken to userspace, to help userspace interpret variations in
>    counter magnitude.
>  - the firmware component of the sample is composed out of a global
>    firmware counter block on supporting firmware versions.
>  - the hardware component is the most sizeable of the three and contains
>    a block of counters for each of the underlying hardware resources. It
>    has a fixed structure that is described in the architecture
>    specification, and contains the command stream hardware block(s), the
>    tiler block(s), the MMU and L2 blocks (collectively named the memsys
>    blocks) and the shader core blocks, in that order.
> The structure of this buffer changes based on the firmware and hardware
> combination, but is constant on a single system.
> 
> Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
> Co-developed-by: Lukas Zapolskas <lukas.zapolskas@arm.com>
> Signed-off-by: Lukas Zapolskas <lukas.zapolskas@arm.com>
> ---
>  drivers/gpu/drm/panthor/panthor_fw.c   |   5 +
>  drivers/gpu/drm/panthor/panthor_fw.h   |   9 +-
>  drivers/gpu/drm/panthor/panthor_perf.c | 882 ++++++++++++++++++++++++-
>  drivers/gpu/drm/panthor/panthor_perf.h |   2 +
>  include/uapi/drm/panthor_drm.h         |   5 +-
>  5 files changed, 892 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_fw.c b/drivers/gpu/drm/panthor/panthor_fw.c
> index e9530d1d9781..cd68870ced18 100644
> --- a/drivers/gpu/drm/panthor/panthor_fw.c
> +++ b/drivers/gpu/drm/panthor/panthor_fw.c
> @@ -1000,9 +1000,12 @@ static void panthor_fw_init_global_iface(struct panthor_device *ptdev)
>  
>  	/* Enable interrupts we care about. */
>  	glb_iface->input->ack_irq_mask = GLB_CFG_ALLOC_EN |
> +					 GLB_PERFCNT_SAMPLE |
>  					 GLB_PING |
>  					 GLB_CFG_PROGRESS_TIMER |
>  					 GLB_CFG_POWEROFF_TIMER |
> +					 GLB_PERFCNT_THRESHOLD |
> +					 GLB_PERFCNT_OVERFLOW |
>  					 GLB_IDLE_EN |
>  					 GLB_IDLE;
>  
> @@ -1031,6 +1034,8 @@ static void panthor_job_irq_handler(struct panthor_device *ptdev, u32 status)
>  		return;
>  
>  	panthor_sched_report_fw_events(ptdev, status);
> +
> +	panthor_perf_report_irq(ptdev, status);
>  }
>  PANTHOR_IRQ_HANDLER(job, JOB, panthor_job_irq_handler);
>  
> diff --git a/drivers/gpu/drm/panthor/panthor_fw.h b/drivers/gpu/drm/panthor/panthor_fw.h
> index db10358e24bb..7ed34d2de8b4 100644
> --- a/drivers/gpu/drm/panthor/panthor_fw.h
> +++ b/drivers/gpu/drm/panthor/panthor_fw.h
> @@ -199,9 +199,10 @@ struct panthor_fw_global_control_iface {
>  	u32 group_num;
>  	u32 group_stride;
>  #define GLB_PERFCNT_FW_SIZE(x) ((((x) >> 16) << 8))
> +#define GLB_PERFCNT_HW_SIZE(x) (((x) & GENMASK(15, 0)) << 8)
>  	u32 perfcnt_size;
>  	u32 instr_features;
> -#define PERFCNT_FEATURES_MD_SIZE(x) ((x) & GENMASK(3, 0))
> +#define PERFCNT_FEATURES_MD_SIZE(x) (((x) & GENMASK(3, 0)) << 8)
>  	u32 perfcnt_features;
>  };
>  
> @@ -211,7 +212,7 @@ struct panthor_fw_global_input_iface {
>  #define GLB_CFG_ALLOC_EN			BIT(2)
>  #define GLB_CFG_POWEROFF_TIMER			BIT(3)
>  #define GLB_PROTM_ENTER				BIT(4)
> -#define GLB_PERFCNT_EN				BIT(5)
> +#define GLB_PERFCNT_ENABLE			BIT(5)
>  #define GLB_PERFCNT_SAMPLE			BIT(6)
>  #define GLB_COUNTER_EN				BIT(7)
>  #define GLB_PING				BIT(8)
> @@ -234,7 +235,6 @@ struct panthor_fw_global_input_iface {
>  	u32 doorbell_req;
>  	u32 reserved1;
>  	u32 progress_timer;
> -
>  #define GLB_TIMER_VAL(x)			((x) & GENMASK(30, 0))
>  #define GLB_TIMER_SOURCE_GPU_COUNTER		BIT(31)
>  	u32 poweroff_timer;
> @@ -244,6 +244,9 @@ struct panthor_fw_global_input_iface {
>  	u64 perfcnt_base;
>  	u32 perfcnt_extract;
>  	u32 reserved3[3];
> +#define GLB_PRFCNT_CONFIG_SIZE(x) ((x) & GENMASK(7, 0))
> +#define GLB_PRFCNT_CONFIG_SET(x) (((x) & GENMASK(1, 0)) << 8)
> +#define GLB_PRFCNT_METADATA_ENABLE BIT(10)
>  	u32 perfcnt_config;
>  	u32 perfcnt_csg_select;
>  	u32 perfcnt_fw_enable;
> diff --git a/drivers/gpu/drm/panthor/panthor_perf.c b/drivers/gpu/drm/panthor/panthor_perf.c
> index 42d8b6f8c45d..d62d97c448da 100644
> --- a/drivers/gpu/drm/panthor/panthor_perf.c
> +++ b/drivers/gpu/drm/panthor/panthor_perf.c
> @@ -15,7 +15,9 @@
>  
>  #include "panthor_device.h"
>  #include "panthor_fw.h"
> +#include "panthor_gem.h"
>  #include "panthor_gpu.h"
> +#include "panthor_mmu.h"
>  #include "panthor_perf.h"
>  #include "panthor_regs.h"
>  
> @@ -26,6 +28,41 @@
>   */
>  #define PANTHOR_PERF_EM_BITS (BITS_PER_TYPE(u64) * 2)
>  
> +/**
> + * PANTHOR_PERF_FW_RINGBUF_SLOTS - Number of slots allocated for individual samples when configuring
> + *                                 the performance counter ring buffer to firmware. This can be
> + *                                 used to reduce memory consumption on low memory systems.
> + */
> +#define PANTHOR_PERF_FW_RINGBUF_SLOTS (32)
> +
> +/**
> + * PANTHOR_CTR_TIMESTAMP_LO - The first architecturally mandated counter of every block type
> + *                            contains the low 32-bits of the TIMESTAMP value.
> + */
> +#define PANTHOR_CTR_TIMESTAMP_LO (0)
> +
> +/**
> + * PANTHOR_CTR_TIMESTAMP_HI - The register offset containinig the high 32-bits of the TIMESTAMP
> + *                            value.
> + */
> +#define PANTHOR_CTR_TIMESTAMP_HI (1)
> +
> +/**
> + * PANTHOR_CTR_PRFCNT_EN - The register offset containing the enable mask for the enabled counters
> + *                         that were written to memory.
> + */
> +#define PANTHOR_CTR_PRFCNT_EN (2)
> +
> +/**
> + * PANTHOR_HEADER_COUNTERS - The first four counters of every block type are architecturally
> + *                           defined to be equivalent. The fourth counter is always reserved,
> + *                           and should be zero and as such, does not have a separate define.
> + *
> + *                           These are the only four counters that are the same between different
> + *                           blocks and are consistent between different architectures.
> + */
> +#define PANTHOR_HEADER_COUNTERS (4)
> +
>  /**
>   * enum panthor_perf_session_state - Session state bits.
>   */
> @@ -158,6 +195,135 @@ struct panthor_perf_session {
>  	struct kref ref;
>  };
>  
> +struct panthor_perf_buffer_descriptor {
> +	/**
> +	 * @block_size: The size of a single block in the FW ring buffer, equal to
> +	 *              sizeof(u32) * counters_per_block.
> +	 */
> +	size_t block_size;
> +
> +	/**
> +	 * @buffer_size: The total size of the buffer, equal to (#hardware blocks +
> +	 *               #firmware blocks) * block_size.
> +	 */
> +	size_t buffer_size;
> +
> +	/**
> +	 * @available_blocks: Bitmask indicating the blocks supported by the hardware and firmware
> +	 *                    combination. Note that this can also include blocks that will not
> +	 *                    be exposed to the user.
> +	 */
> +	DECLARE_BITMAP(available_blocks, DRM_PANTHOR_PERF_BLOCK_MAX);
> +	struct {
> +		/** @offset: Starting offset of a block of type @type in the FW ringbuffer. */
> +		size_t offset;
> +
> +		/** @type: Type of the blocks between @blocks[i].offset and @blocks[i+1].offset. */
> +		enum drm_panthor_perf_block_type type;

I think perhaps you could avoid declaring the type member, because a block type is the same as its
index in the blocks array. See [1]

> +		/** @block_count: Number of blocks of the given @type, starting at @offset. */
> +		size_t block_count;
> +	} blocks[DRM_PANTHOR_PERF_BLOCK_MAX];
> +};
> +
> +
> +/**
> + * struct panthor_perf_sampler - Interface to de-multiplex firmware interaction and handle
> + *                               global interactions.
> + */
> +struct panthor_perf_sampler {
> +	/** @sample_requested: A sample has been requested. */
> +	bool sample_requested;
> +
> +	/**
> +	 * @last_ack: Temporarily storing the last GLB_ACK status. Without storing this data,
> +	 *            we do not know whether a toggle bit has been handled.
> +	 */
> +	u32 last_ack;
> +
> +	/**
> +	 * @enabled_clients: The number of clients concurrently requesting samples. To ensure that
> +	 *                   one client cannot deny samples to another, we must ensure that clients
> +	 *                   are effectively reference counted.
> +	 */
> +	atomic_t enabled_clients;
> +
> +	/**
> +	 * @sample_handled: Synchronization point between the interrupt bottom half and the
> +	 *                  main sampler interface. Must be re-armed solely on a new request
> +	 *                  coming to the sampler.
> +	 */
> +	struct completion sample_handled;
> +
> +	/** @rb: Kernel BO in the FW AS containing the sample ringbuffer. */
> +	struct panthor_kernel_bo *rb;
> +
> +	/**
> +	 * @sample_size: The size of a single sample in the FW ringbuffer. This is computed using
> +	 *               the hardware configuration according to the architecture specification,
> +	 *               and cross-validated against the sample size reported by FW to ensure
> +	 *               a consistent view of the buffer size.
> +	 */
> +	size_t sample_size;
> +
> +	/**
> +	 * @sample_slots: Number of slots for samples in the FW ringbuffer. Could be static,
> +	 *		  but may be useful to customize for low-memory devices.
> +	 */
> +	size_t sample_slots;
> +
> +	/**
> +	 * @config_lock: Lock serializing changes to the global counter configuration, including
> +	 *               requested counter set and the counters themselves.
> +	 */
> +	struct mutex config_lock;
> +
> +	/**
> +	 * @ems: List of enable maps of the active sessions. When removing a session, the number
> +	 *       of requested counters may decrease, and the union of enable masks from the multiple
> +	 *       sessions does not provide sufficient information to reconstruct the previous
> +	 *       enable mask.
> +	 */
> +	struct list_head ems;
> +
> +	/** @em: Combined enable mask for all of the active sessions. */
> +	struct panthor_perf_enable_masks *em;
> +
> +	/**
> +	 * @desc: Buffer descriptor for a sample in the FW ringbuffer. Note that this buffer
> +	 *        at current time does some interesting things with the zeroth block type. On
> +	 *        newer FW revisions, the first counter block of the sample is the METADATA block,
> +	 *        which contains a single value indicating the reason the sample was taken (if
> +	 *        any). This block must not be exposed to userspace, as userspace does not
> +	 *        have sufficient context to interpret it. As such, this block type is not
> +	 *        added to the uAPI, but we still use it in the kernel.
> +	 */
> +	struct panthor_perf_buffer_descriptor desc;
> +
> +	/**
> +	 * @sample: Pointer to an upscaled and annotated sample that may be emitted to userspace.
> +	 *          This is used both as an intermediate buffer to do the zero-extension of the
> +	 *          32-bit counters to 64-bits and as a storage buffer in case the sampler
> +	 *          requests an additional sample that was not requested by any of the top-level
> +	 *          sessions (for instance, when changing the enable masks).
> +	 */
> +	u8 *sample;
> +
> +	/** @sampler_lock: Lock used to guard the list of sessions requesting samples. */
> +	struct mutex sampler_lock;
> +
> +	/** @sampler_list: List of sessions requesting samples. */
> +	struct list_head sampler_list;
> +
> +	/** @set_config: The set that will be configured onto the hardware. */
> +	u8 set_config;
> +
> +	/**
> +	 * @ptdev: Backpointer to the Panthor device, needed to ring the global doorbell and
> +	 *         interface with FW.
> +	 */
> +	struct panthor_device *ptdev;
> +};
>  
>  struct panthor_perf {
>  	/**
> @@ -175,6 +341,9 @@ struct panthor_perf {
>  	 * @sessions: Global map of sessions, accessed by their ID.
>  	 */
>  	struct xarray sessions;
> +
> +	/** @sampler: FW control interface. */
> +	struct panthor_perf_sampler sampler;
>  };
>  
>  /**
> @@ -247,6 +416,23 @@ static struct panthor_perf_enable_masks *panthor_perf_create_em(struct drm_panth
>  	return em;
>  }
>  
> +static void panthor_perf_em_add(struct panthor_perf_enable_masks *dst_em,
> +		const struct panthor_perf_enable_masks *const src_em)
> +{
> +	size_t i = 0;
> +
> +	for (i = DRM_PANTHOR_PERF_BLOCK_FW; i <= DRM_PANTHOR_PERF_BLOCK_LAST; i++)
> +		bitmap_or(dst_em->mask[i], dst_em->mask[i], src_em->mask[i], PANTHOR_PERF_EM_BITS);
> +}
> +
> +static void panthor_perf_em_zero(struct panthor_perf_enable_masks *em)
> +{
> +	size_t i = 0;
> +
> +	for (i = DRM_PANTHOR_PERF_BLOCK_FW; i <= DRM_PANTHOR_PERF_BLOCK_LAST; i++)
> +		bitmap_zero(em->mask[i], PANTHOR_PERF_EM_BITS);
> +}
> +
>  static void panthor_perf_destroy_em_kref(struct kref *em_kref)
>  {
>  	struct panthor_perf_enable_masks *em = container_of(em_kref, typeof(*em), refs);
> @@ -270,6 +456,12 @@ static u32 session_read_extract_idx(struct panthor_perf_session *session)
>  	return smp_load_acquire(session->extract_idx);
>  }
>  
> +static void session_write_insert_idx(struct panthor_perf_session *session, u32 idx)
> +{
> +	/* Userspace needs the insert index to know where to look for the sample. */
> +	smp_store_release(session->insert_idx, idx);
> +}
> +
>  static u32 session_read_insert_idx(struct panthor_perf_session *session)
>  {
>  	return *session->insert_idx;
> @@ -349,6 +541,70 @@ static struct panthor_perf_session *session_find(struct panthor_file *pfile,
>  	return session;
>  }
>  
> +static u32 compress_enable_mask(unsigned long *const src)
> +{
> +	size_t i;
> +	u32 result = 0;
> +	unsigned long clump;
> +
> +	for_each_set_clump8(i, clump, src, PANTHOR_PERF_EM_BITS) {
> +		const unsigned long shift = div_u64(i, 4);
> +
> +		result |= !!(clump & GENMASK(3, 0)) << shift;
> +		result |= !!(clump & GENMASK(7, 4)) << (shift + 1);
> +	}
> +
> +	return result;
> +}
> +
> +static void expand_enable_mask(u32 em, unsigned long *const dst)
> +{
> +	size_t i;
> +	DECLARE_BITMAP(emb, BITS_PER_TYPE(u32));
> +
> +	bitmap_from_arr32(emb, &em, BITS_PER_TYPE(u32));
> +
> +	for_each_set_bit(i, emb, BITS_PER_TYPE(u32))
> +		bitmap_set(dst, i * 4, 4);
> +}
> +
> +/**
> + * panthor_perf_block_data - Identify the block index and type based on the offset.
> + *
> + * @desc:   FW buffer descriptor.
> + * @offset: The current offset being examined.
> + * @idx:    Pointer to an output index.
> + * @type:   Pointer to an output block type.
> + *
> + * To disambiguate different types of blocks as well as different blocks of the same type,
> + * the offset into the FW ringbuffer is used to uniquely identify the block being considered.
> + *
> + * In the future, this is a good time to identify whether a block will be empty,
> + * allowing us to short-circuit its processing after emitting header information.
> + */
> +static void panthor_perf_block_data(struct panthor_perf_buffer_descriptor *const desc,
> +		size_t offset, u32 *idx, enum drm_panthor_perf_block_type *type)
> +{
> +	unsigned long id;
> +
> +	for_each_set_bit(id, desc->available_blocks, DRM_PANTHOR_PERF_BLOCK_LAST) {
> +		const size_t block_start = desc->blocks[id].offset;
> +		const size_t block_count = desc->blocks[id].block_count;
> +		const size_t block_end = desc->blocks[id].offset +
> +			desc->block_size * block_count;
> +
> +		if (!block_count)
> +			continue;
> +
> +		if ((offset >= block_start) && (offset < block_end)) {
> +			*type = desc->blocks[id].type;

  [1] I think in this case, 'id' will always be the same as desc->blocks[id].type, so maybe
      just return 'id' instead of the type field, and you can remove it altogether.

> +			*idx = div_u64(offset - desc->blocks[id].offset, desc->block_size);
> +
> +			return;
> +		}
> +	}
> +}
> +
>  static size_t session_get_max_sample_size(const struct drm_panthor_perf_info *const info)
>  {
>  	const size_t block_size = get_annotated_block_size(info->counters_per_block);
> @@ -358,6 +614,520 @@ static size_t session_get_max_sample_size(const struct drm_panthor_perf_info *co
>  	return sizeof(struct drm_panthor_perf_sample_header) + (block_size * block_nr);
>  }
>  
> +static u32 panthor_perf_handle_sample(struct panthor_device *ptdev, u32 extract_idx, u32 insert_idx)
> +{
> +	struct panthor_perf *perf = ptdev->perf;
> +	struct panthor_perf_sampler *sampler = &ptdev->perf->sampler;
> +	const size_t ann_block_size =
> +		get_annotated_block_size(ptdev->perf_info.counters_per_block);
> +	u32 i;
> +
> +	for (i = extract_idx; i != insert_idx; i = (i + 1) % sampler->sample_slots) {
> +		u8 *fw_sample = (u8 *)sampler->rb->kmap + i * sampler->sample_size;
> +
> +		for (size_t fw_off = 0, ann_off = sizeof(struct drm_panthor_perf_sample_header);
> +				fw_off < sampler->desc.buffer_size;
> +				fw_off += sampler->desc.block_size)
> +
> +		{
> +			u32 idx;
> +			enum drm_panthor_perf_block_type type;
> +			DECLARE_BITMAP(expanded_em, PANTHOR_PERF_EM_BITS);
> +			struct panthor_perf_counter_block *blk =
> +				(typeof(blk))(perf->sampler.sample + ann_off);
> +			const u32 prfcnt_en = blk->counters[PANTHOR_CTR_PRFCNT_EN];
> +
> +			panthor_perf_block_data(&sampler->desc, fw_off, &idx, &type);
> +
> +			/**
> +			 * TODO Data from the metadata block must be used to populate the
> +			 * block state information.
> +			 */
> +			if (type == DRM_PANTHOR_PERF_BLOCK_METADATA)
> +				continue;
> +
> +			expand_enable_mask(prfcnt_en, expanded_em);
> +
> +			blk->header = (struct drm_panthor_perf_block_header) {
> +				.clock = 0,
> +				.block_idx = idx,
> +				.block_type = type,
> +				.block_states = DRM_PANTHOR_PERF_BLOCK_STATE_UNKNOWN
> +			};
> +			bitmap_to_arr64(blk->header.enable_mask, expanded_em, PANTHOR_PERF_EM_BITS);
> +
> +			u32 *block = (u32 *)(fw_sample + fw_off);
> +
> +			/*
> +			 * The four header counters must be treated differently, because they are
> +			 * not additive. For the fourth, the assignment does not matter, as it
> +			 * is reserved and should be zero.
> +			 */
> +			blk->counters[PANTHOR_CTR_TIMESTAMP_LO] = block[PANTHOR_CTR_TIMESTAMP_LO];
> +			blk->counters[PANTHOR_CTR_TIMESTAMP_HI] = block[PANTHOR_CTR_TIMESTAMP_HI];
> +			blk->counters[PANTHOR_CTR_PRFCNT_EN] = block[PANTHOR_CTR_PRFCNT_EN];
> +
> +			for (size_t k = PANTHOR_HEADER_COUNTERS;
> +					k < ptdev->perf_info.counters_per_block;
> +					k++)
> +				blk->counters[k] += block[k];
> +
> +			ann_off += ann_block_size;
> +		}
> +	}
> +
> +	return i;
> +}
> +
> +static size_t panthor_perf_get_fw_reported_size(struct panthor_device *ptdev)
> +{
> +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
> +
> +	size_t fw_size = GLB_PERFCNT_FW_SIZE(glb_iface->control->perfcnt_size);
> +	size_t hw_size = GLB_PERFCNT_HW_SIZE(glb_iface->control->perfcnt_size);
> +	size_t md_size = PERFCNT_FEATURES_MD_SIZE(glb_iface->control->perfcnt_features);
> +
> +	return md_size + fw_size + hw_size;
> +}
> +
> +#define PANTHOR_PERF_SET_BLOCK_DESC_DATA(desc, typ, blk_count, offset) \
> +	({ \
> +		(desc)->blocks[(typ)].type = (typ); \
> +		(desc)->blocks[(typ)].offset = (offset); \
> +		(desc)->blocks[(typ)].block_count = (blk_count);  \
> +		if ((blk_count))                                    \
> +			set_bit((typ), (desc)->available_blocks); \
> +		(offset) + ((desc)->block_size) * (blk_count); \
> +	 })
> +
> +static int panthor_perf_setup_fw_buffer_desc(struct panthor_device *ptdev,
> +		struct panthor_perf_sampler *sampler)
> +{
> +	const struct drm_panthor_perf_info *const info = &ptdev->perf_info;
> +	const size_t block_size = info->counters_per_block * sizeof(u32);
> +	struct panthor_perf_buffer_descriptor *desc = &sampler->desc;
> +	const size_t fw_sample_size = panthor_perf_get_fw_reported_size(ptdev);
> +	size_t offset = 0;
> +
> +	desc->block_size = block_size;
> +
> +	for (enum drm_panthor_perf_block_type type = 0; type < DRM_PANTHOR_PERF_BLOCK_MAX; type++) {
> +		switch (type) {
> +		case DRM_PANTHOR_PERF_BLOCK_METADATA:
> +			if (info->flags & DRM_PANTHOR_PERF_BLOCK_STATES_SUPPORT)
> +				offset = PANTHOR_PERF_SET_BLOCK_DESC_DATA(desc,
> +					DRM_PANTHOR_PERF_BLOCK_METADATA, 1, offset);
> +			break;
> +		case DRM_PANTHOR_PERF_BLOCK_FW:
> +			offset = PANTHOR_PERF_SET_BLOCK_DESC_DATA(desc, type, info->fw_blocks,
> +					offset);
> +			break;
> +		case DRM_PANTHOR_PERF_BLOCK_CSG:
> +			offset = PANTHOR_PERF_SET_BLOCK_DESC_DATA(desc, type, info->csg_blocks,
> +					offset);
> +			break;
> +		case DRM_PANTHOR_PERF_BLOCK_CSHW:
> +			offset = PANTHOR_PERF_SET_BLOCK_DESC_DATA(desc, type, info->cshw_blocks,
> +					offset);
> +			break;
> +		case DRM_PANTHOR_PERF_BLOCK_TILER:
> +			offset = PANTHOR_PERF_SET_BLOCK_DESC_DATA(desc, type, info->tiler_blocks,
> +					offset);
> +			break;
> +		case DRM_PANTHOR_PERF_BLOCK_MEMSYS:
> +			offset = PANTHOR_PERF_SET_BLOCK_DESC_DATA(desc, type, info->memsys_blocks,
> +					offset);
> +			break;
> +		case DRM_PANTHOR_PERF_BLOCK_SHADER:
> +			offset = PANTHOR_PERF_SET_BLOCK_DESC_DATA(desc, type, info->shader_blocks,
> +					offset);
> +			break;
> +		case DRM_PANTHOR_PERF_BLOCK_MAX:
> +			drm_WARN_ON_ONCE(&ptdev->base,
> +					"DRM_PANTHOR_PERF_BLOCK_MAX should be unreachable!");
> +			break;
> +		}
> +	}
> +
> +	/* Computed size is not the same as the reported size, so we should not proceed in
> +	 * initializing the sampling session.
> +	 */
> +	if (offset != fw_sample_size)
> +		return -EINVAL;
> +
> +	desc->buffer_size = offset;
> +
> +	return 0;
> +}
> +
> +static int panthor_perf_fw_stop_sampling(struct panthor_device *ptdev)
> +{
> +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
> +	u32 acked;
> +	int ret;
> +
> +	if (~READ_ONCE(glb_iface->input->req) & GLB_PERFCNT_ENABLE)
> +		return 0;
> +
> +	panthor_fw_update_reqs(glb_iface, req, 0, GLB_PERFCNT_ENABLE);
> +	gpu_write(ptdev, CSF_DOORBELL(CSF_GLB_DOORBELL_ID), 1);
> +	ret = panthor_fw_glb_wait_acks(ptdev, GLB_PERFCNT_ENABLE, &acked, 100);
> +	if (ret)
> +		drm_warn(&ptdev->base, "Could not disable performance counters");
> +
> +	return ret;
> +}
> +
> +static int panthor_perf_fw_start_sampling(struct panthor_device *ptdev)
> +{
> +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
> +	u32 acked;
> +	int ret;
> +
> +	if (READ_ONCE(glb_iface->input->req) & GLB_PERFCNT_ENABLE)
> +		return 0;
> +
> +	panthor_fw_update_reqs(glb_iface, req, GLB_PERFCNT_ENABLE, GLB_PERFCNT_ENABLE);
> +	gpu_write(ptdev, CSF_DOORBELL(CSF_GLB_DOORBELL_ID), 1);
> +	ret = panthor_fw_glb_wait_acks(ptdev, GLB_PERFCNT_ENABLE, &acked, 100);
> +	if (ret)
> +		drm_warn(&ptdev->base, "Could not enable performance counters");
> +
> +	return ret;
> +}
> +
> +static void panthor_perf_fw_write_em(struct panthor_perf_sampler *sampler,
> +		struct panthor_perf_enable_masks *em)
> +{
> +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(sampler->ptdev);
> +	u32 perfcnt_config;
> +
> +	glb_iface->input->perfcnt_csf_enable =
> +		compress_enable_mask(em->mask[DRM_PANTHOR_PERF_BLOCK_CSHW]);
> +	glb_iface->input->perfcnt_shader_enable =
> +		compress_enable_mask(em->mask[DRM_PANTHOR_PERF_BLOCK_SHADER]);
> +	glb_iface->input->perfcnt_mmu_l2_enable =
> +		compress_enable_mask(em->mask[DRM_PANTHOR_PERF_BLOCK_MEMSYS]);
> +	glb_iface->input->perfcnt_tiler_enable =
> +		compress_enable_mask(em->mask[DRM_PANTHOR_PERF_BLOCK_TILER]);
> +	glb_iface->input->perfcnt_fw_enable =
> +		compress_enable_mask(em->mask[DRM_PANTHOR_PERF_BLOCK_FW]);
> +	glb_iface->input->perfcnt_csg_enable =
> +		compress_enable_mask(em->mask[DRM_PANTHOR_PERF_BLOCK_CSG]);
> +
> +	perfcnt_config = GLB_PRFCNT_CONFIG_SIZE(PANTHOR_PERF_FW_RINGBUF_SLOTS);
> +	perfcnt_config |= GLB_PRFCNT_CONFIG_SET(sampler->set_config);
> +	glb_iface->input->perfcnt_config = perfcnt_config;
> +
> +	/**
> +	 * The spec mandates that the host zero the PRFCNT_EXTRACT register before an enable
> +	 * operation, and each (re-)enable will require an enable-disable pair to program
> +	 * the new changes onto the FW interface.
> +	 */
> +	WRITE_ONCE(glb_iface->input->perfcnt_extract, 0);
> +}
> +
> +static void session_populate_sample_header(struct panthor_perf_session *session,
> +		struct drm_panthor_perf_sample_header *hdr)
> +{
> +	hdr->block_set = 0;
> +	hdr->user_data = session->user_data;
> +	hdr->timestamp_start_ns = session->sample_start_ns;
> +	/**
> +	 * TODO This should be changed to use the GPU clocks and the TIMESTAMP register,
> +	 * when support is added.
> +	 */
> +	hdr->timestamp_end_ns = ktime_get_raw_ns();
> +}
> +
> +/**
> + * session_patch_sample - Update the PRFCNT_EN header counter and the counters exposed to the
> + *                        userspace client to only contain requested counters.
> + *
> + * @ptdev: Panthor device
> + * @session: Perf session
> + * @sample: Starting offset of the sample in the userspace mapping.
> + *
> + * The hardware supports counter selection at the granularity of 1 bit per 4 counters, and there
> + * is a single global FW frontend to program the counter requests from multiple sessions. This may
> + * lead to a large disparity between the requested and provided counters for an individual client.
> + * To remove this cross-talk, we patch out the counters that have not been requested by this
> + * session and update the PRFCNT_EN, the header counter containing a bitmask of enabled counters,
> + * accordingly.
> + */
> +static void session_patch_sample(struct panthor_device *ptdev,
> +		struct panthor_perf_session *session, u8 *sample)
> +{
> +	const struct drm_panthor_perf_info *const perf_info = &ptdev->perf_info;
> +
> +	const size_t block_size = get_annotated_block_size(perf_info->counters_per_block);
> +	const size_t sample_size = session_get_max_sample_size(perf_info);
> +
> +	for (size_t i = 0; i < sample_size; i += block_size) {
> +		size_t ctr_idx;
> +		DECLARE_BITMAP(em_diff, PANTHOR_PERF_EM_BITS);
> +		struct panthor_perf_counter_block *blk = (typeof(blk))(sample + block_size);
> +		enum drm_panthor_perf_block_type type = blk->header.block_type;
> +		unsigned long *blk_em = session->enabled_counters->mask[type];
> +
> +		bitmap_from_arr64(em_diff, blk->header.enable_mask, PANTHOR_PERF_EM_BITS);
> +
> +		bitmap_andnot(em_diff, em_diff, blk_em, PANTHOR_PERF_EM_BITS);
> +
> +		for_each_set_bit(ctr_idx, em_diff, PANTHOR_PERF_EM_BITS)
> +			blk->counters[ctr_idx] = 0;
> +
> +		bitmap_to_arr64(blk->header.enable_mask, blk_em, PANTHOR_PERF_EM_BITS);
> +	}
> +}
> +
> +static int session_copy_sample(struct panthor_device *ptdev,
> +		struct panthor_perf_session *session)
> +{
> +	struct panthor_perf *perf = ptdev->perf;
> +	const size_t sample_size = session_get_max_sample_size(&ptdev->perf_info);
> +	const u32 insert_idx = session_read_insert_idx(session);
> +	const u32 extract_idx = session_read_extract_idx(session);
> +	u8 *new_sample;
> +
> +	if (!CIRC_SPACE_TO_END(insert_idx, extract_idx, session->ringbuf_slots))
> +		return -ENOSPC;
> +
> +	new_sample = session->samples + extract_idx * sample_size;

Wouldn't this have to be insert_idx instead? Since we're about to copy into the UM
ringbuffer, we should do it at the insert_idx location.

> +
> +	memcpy(new_sample, perf->sampler.sample, sample_size);
> +
> +	session_populate_sample_header(session,
> +			(struct drm_panthor_perf_sample_header *)new_sample);
> +
> +	session_patch_sample(ptdev, session, new_sample +
> +			sizeof(struct drm_panthor_perf_sample_header));
> +
> +	session_write_insert_idx(session, (insert_idx + 1) % session->ringbuf_slots);
> +
> +	/* Since we are about to notify userspace, we must ensure that all changes to memory
> +	 * are visible.
> +	 */
> +	wmb();
> +
> +	eventfd_signal(session->eventfd);
> +
> +	return 0;
> +}
> +
> +#define PERFCNT_IRQS (GLB_PERFCNT_OVERFLOW | GLB_PERFCNT_SAMPLE | GLB_PERFCNT_THRESHOLD)
> +
> +void panthor_perf_report_irq(struct panthor_device *ptdev, u32 status)
> +{
> +	struct panthor_perf *const perf = ptdev->perf;
> +	struct panthor_perf_sampler *sampler;
> +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
> +
> +	if (!(status & JOB_INT_GLOBAL_IF))
> +		return;
> +
> +	if (!perf)
> +		return;
> +
> +	sampler = &perf->sampler;
> +
> +	/* TODO This needs locking. */
> +	const u32 ack = READ_ONCE(glb_iface->output->ack);
> +	const u32 fw_events = sampler->last_ack ^ ack;
> +
> +	sampler->last_ack = ack;
> +
> +	if (!(fw_events & PERFCNT_IRQS))
> +		return;
> +
> +	/* TODO Fix up the error handling for overflow. */
> +	if (fw_events & GLB_PERFCNT_OVERFLOW)
> +		return;
> +
> +	if (fw_events & (GLB_PERFCNT_SAMPLE | GLB_PERFCNT_THRESHOLD)) {
> +		const u32 extract_idx = READ_ONCE(glb_iface->input->perfcnt_extract);
> +		const u32 insert_idx = READ_ONCE(glb_iface->output->perfcnt_insert);
> +
> +		WRITE_ONCE(glb_iface->input->perfcnt_extract,
> +				panthor_perf_handle_sample(ptdev, extract_idx, insert_idx));
> +	}
> +
> +	scoped_guard(mutex, &sampler->sampler_lock)
> +	{
> +		struct list_head *pos, *temp;
> +
> +		list_for_each_safe(pos, temp, &sampler->sampler_list) {
> +			struct panthor_perf_session *session = list_entry(pos,
> +					struct panthor_perf_session, waiting);
> +
> +			session_copy_sample(ptdev, session);
> +			list_del_init(pos);
> +
> +			session_put(session);
> +		}
> +	}
> +
> +	memset(sampler->sample, 0, session_get_max_sample_size(&ptdev->perf_info));
> +	sampler->sample_requested = false;
> +	complete(&sampler->sample_handled);
> +}
> +
> +
> +static int panthor_perf_sampler_init(struct panthor_perf_sampler *sampler,
> +		struct panthor_device *ptdev)
> +{
> +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
> +	struct panthor_kernel_bo *bo;
> +	u8 *sample;
> +	int ret;
> +
> +	ret = panthor_perf_setup_fw_buffer_desc(ptdev, sampler);
> +	if (ret) {
> +		drm_err(&ptdev->base,
> +				"Failed to setup descriptor for FW ring buffer, err = %d", ret);
> +		return ret;
> +	}
> +
> +	bo = panthor_kernel_bo_create(ptdev, panthor_fw_vm(ptdev),
> +			sampler->desc.buffer_size * PANTHOR_PERF_FW_RINGBUF_SLOTS,
> +			DRM_PANTHOR_BO_NO_MMAP,
> +			DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC | DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
> +			PANTHOR_VM_KERNEL_AUTO_VA);
> +
> +	if (IS_ERR_OR_NULL(bo))
> +		return IS_ERR(bo) ? PTR_ERR(bo) : -ENOMEM;
> +
> +	ret = panthor_kernel_bo_vmap(bo);
> +	if (ret)
> +		goto cleanup_bo;
> +
> +	sample = devm_kzalloc(ptdev->base.dev,
> +			session_get_max_sample_size(&ptdev->perf_info), GFP_KERNEL);
> +	if (ZERO_OR_NULL_PTR(sample)) {
> +		ret = -ENOMEM;
> +		goto cleanup_vmap;
> +	}
> +
> +	glb_iface->input->perfcnt_as = panthor_vm_as(panthor_fw_vm(ptdev));
> +	glb_iface->input->perfcnt_base = panthor_kernel_bo_gpuva(bo);
> +	glb_iface->input->perfcnt_extract = 0;
> +	glb_iface->input->perfcnt_csg_select = GENMASK(glb_iface->control->group_num, 0);
> +
> +	sampler->rb = bo;
> +	sampler->sample = sample;
> +	sampler->sample_slots = PANTHOR_PERF_FW_RINGBUF_SLOTS;
> +
> +	sampler->em = panthor_perf_em_new();
> +
> +	mutex_init(&sampler->sampler_lock);
> +	mutex_init(&sampler->config_lock);
> +	INIT_LIST_HEAD(&sampler->sampler_list);
> +	INIT_LIST_HEAD(&sampler->ems);
> +	init_completion(&sampler->sample_handled);
> +
> +	sampler->ptdev = ptdev;
> +
> +	return 0;
> +
> +cleanup_vmap:
> +	panthor_kernel_bo_vunmap(bo);
> +
> +cleanup_bo:
> +	panthor_kernel_bo_destroy(bo);
> +
> +	return ret;
> +}
> +
> +static void panthor_perf_sampler_term(struct panthor_perf_sampler *sampler)
> +{
> +	int ret;
> +
> +	if (sampler->sample_requested)
> +		wait_for_completion_killable(&sampler->sample_handled);
> +
> +	panthor_perf_fw_write_em(sampler, &(struct panthor_perf_enable_masks) {});
> +
> +	ret = panthor_perf_fw_stop_sampling(sampler->ptdev);
> +	if (ret)
> +		drm_warn_once(&sampler->ptdev->base, "Sampler termination failed, ret = %d", ret);
> +
> +	devm_kfree(sampler->ptdev->base.dev, sampler->sample);
> +
> +	panthor_kernel_bo_destroy(sampler->rb);
> +}
> +
> +static int panthor_perf_sampler_add(struct panthor_perf_sampler *sampler,
> +		struct panthor_perf_enable_masks *const new_em,
> +		u8 set)
> +{
> +	int ret = 0;
> +
> +	guard(mutex)(&sampler->config_lock);
> +
> +	/* Early check for whether a new set can be configured. */
> +	if (!atomic_read(&sampler->enabled_clients))
> +		sampler->set_config = set;
> +	else
> +		if (sampler->set_config != set)
> +			return -EBUSY;
> +
> +	kref_get(&new_em->refs);
> +	list_add_tail(&sampler->ems, &new_em->link);
> +
> +	panthor_perf_em_add(sampler->em, new_em);
> +	pm_runtime_get_sync(sampler->ptdev->base.dev);
> +
> +	if (atomic_read(&sampler->enabled_clients)) {
> +		ret = panthor_perf_fw_stop_sampling(sampler->ptdev);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	panthor_perf_fw_write_em(sampler, sampler->em);
> +
> +	ret = panthor_perf_fw_start_sampling(sampler->ptdev);
> +	if (ret)
> +		return ret;
> +
> +	atomic_inc(&sampler->enabled_clients);
> +
> +	return 0;
> +}
> +
> +static int panthor_perf_sampler_remove(struct panthor_perf_sampler *sampler,
> +		struct panthor_perf_enable_masks *session_em)
> +{
> +	int ret;
> +	struct list_head *em_node;
> +
> +	guard(mutex)(&sampler->config_lock);
> +
> +	list_del_init(&session_em->link);
> +	kref_put(&session_em->refs, panthor_perf_destroy_em_kref);
> +
> +	panthor_perf_em_zero(sampler->em);
> +	list_for_each(em_node, &sampler->ems)
> +	{
> +		struct panthor_perf_enable_masks *curr_em =
> +			container_of(em_node, typeof(*curr_em), link);
> +
> +		panthor_perf_em_add(sampler->em, curr_em);
> +	}
> +
> +	ret = panthor_perf_fw_stop_sampling(sampler->ptdev);
> +	if (ret)
> +		return ret;
> +
> +	atomic_dec(&sampler->enabled_clients);
> +	pm_runtime_put_sync(sampler->ptdev->base.dev);
> +
> +	panthor_perf_fw_write_em(sampler, sampler->em);
> +
> +	if (atomic_read(&sampler->enabled_clients))
> +		return panthor_perf_fw_start_sampling(sampler->ptdev);
> +	return 0;
> +}
> +
>  /**
>   * panthor_perf_init - Initialize the performance counter subsystem.
>   * @ptdev: Panthor device
> @@ -370,6 +1140,7 @@ static size_t session_get_max_sample_size(const struct drm_panthor_perf_info *co
>  int panthor_perf_init(struct panthor_device *ptdev)
>  {
>  	struct panthor_perf *perf;
> +	int ret;
>  
>  	if (!ptdev)
>  		return -EINVAL;
> @@ -386,12 +1157,93 @@ int panthor_perf_init(struct panthor_device *ptdev)
>  		.max = 1,
>  	};
>  
> +	ret = panthor_perf_sampler_init(&perf->sampler, ptdev);
> +	if (ret)
> +		goto cleanup_perf;
> +
>  	drm_info(&ptdev->base, "Performance counter subsystem initialized");
>  
>  	ptdev->perf = perf;
>  
> -	return 0;
> +	return ret;
> +
> +cleanup_perf:
> +	devm_kfree(ptdev->base.dev, perf);
> +
> +	return ret;
> +}
> +
> +
> +static void panthor_perf_fw_request_sample(struct panthor_perf_sampler *sampler)
> +{
> +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(sampler->ptdev);
> +
> +	panthor_fw_toggle_reqs(glb_iface, req, ack, GLB_PERFCNT_SAMPLE);
> +	gpu_write(sampler->ptdev, CSF_DOORBELL(CSF_GLB_DOORBELL_ID), 1);
> +}
> +
> +/**
> + * panthor_perf_sampler_request_clearing - Request a clearing sample.
> + * @sampler: Panthor sampler
> + *
> + * Perform a synchronous sample that gets immediately discarded. This sets a baseline at the point
> + * of time a new session is started, to avoid having counters from before the session.
> + *
> + */
> +static int panthor_perf_sampler_request_clearing(struct panthor_perf_sampler *sampler)
> +{
> +	scoped_guard(mutex, &sampler->sampler_lock) {
> +		if (!sampler->sample_requested) {
> +			panthor_perf_fw_request_sample(sampler);
> +			sampler->sample_requested = true;
> +		}
> +	}
> +
> +	return wait_for_completion_timeout(&sampler->sample_handled,
> +			msecs_to_jiffies(1000));
> +}
> +
> +/**
> + * panthor_perf_sampler_request_sample - Request a counter sample for the userspace client.
> + * @sampler: Panthor sampler
> + * @session: Target session
> + *
> + * A session that has already requested a sample cannot request another one until the previous
> + * sample has been delivered.
> + *
> + * Return:
> + * * %0       - The sample has been requested successfully.
> + * * %-EBUSY  - The target session has already requested a sample and has not received it yet.
> + */
> +static int panthor_perf_sampler_request_sample(struct panthor_perf_sampler *sampler,
> +		struct panthor_perf_session *session)
> +{
> +	struct list_head *head;
> +
> +	reinit_completion(&sampler->sample_handled);
> +
> +	guard(mutex)(&sampler->sampler_lock);
> +
> +	/*
> +	 * If a previous sample has not been handled yet, the session cannot request another
> +	 * sample. If this happens too often, the requested sample rate is too high.
> +	 */
> +	list_for_each(head, &sampler->sampler_list) {
> +		struct panthor_perf_session *cur_session = list_entry(head,
> +				typeof(*cur_session), waiting);
> +
> +		if (session == cur_session)
> +			return -EBUSY;
> +	}
> +
> +	if (list_empty(&sampler->sampler_list) && !sampler->sample_requested)
> +		panthor_perf_fw_request_sample(sampler);
>  
> +	sampler->sample_requested = true;
> +	list_add_tail(&session->waiting, &sampler->sampler_list);
> +	session_get(session);
> +
> +	return 0;
>  }
>  
>  static int session_validate_set(u8 set)
> @@ -483,7 +1335,12 @@ int panthor_perf_session_setup(struct panthor_device *ptdev, struct panthor_perf
>  		goto cleanup_eventfd;
>  	}
>  
> +	ret = panthor_perf_sampler_add(&perf->sampler, em, setup_args->block_set);
> +	if (ret)
> +		goto cleanup_em;
> +
>  	INIT_LIST_HEAD(&session->waiting);
> +
>  	session->extract_idx = ctrl_map.vaddr;
>  	*session->extract_idx = 0;
>  	session->insert_idx = session->extract_idx + 1;
> @@ -507,12 +1364,15 @@ int panthor_perf_session_setup(struct panthor_device *ptdev, struct panthor_perf
>  	ret = xa_alloc_cyclic(&perf->sessions, &session_id, session, perf->session_range,
>  			&perf->next_session, GFP_KERNEL);
>  	if (ret < 0)
> -		goto cleanup_em;
> +		goto cleanup_sampler_add;
>  
>  	kref_init(&session->ref);
>  
>  	return session_id;
>  
> +cleanup_sampler_add:
> +	panthor_perf_sampler_remove(&perf->sampler, em);
> +
>  cleanup_em:
>  	kref_put(&em->refs, panthor_perf_destroy_em_kref);
>  
> @@ -540,6 +1400,8 @@ int panthor_perf_session_setup(struct panthor_device *ptdev, struct panthor_perf
>  static int session_stop(struct panthor_perf *perf, struct panthor_perf_session *session,
>  		u64 user_data)
>  {
> +	int ret;
> +
>  	if (!test_bit(PANTHOR_PERF_SESSION_ACTIVE, session->state))
>  		return 0;
>  
> @@ -552,6 +1414,10 @@ static int session_stop(struct panthor_perf *perf, struct panthor_perf_session *
>  
>  	session->user_data = user_data;
>  
> +	ret = panthor_perf_sampler_request_sample(&perf->sampler, session);
> +	if (ret)
> +		return ret;
> +
>  	clear_bit(PANTHOR_PERF_SESSION_ACTIVE, session->state);
>  
>  	/* TODO Calls to the FW interface will go here in later patches. */
> @@ -573,8 +1439,7 @@ static int session_start(struct panthor_perf *perf, struct panthor_perf_session
>  	if (session->sample_freq_ns)
>  		session->user_data = user_data;
>  
> -	/* TODO Calls to the FW interface will go here in later patches. */
> -	return 0;
> +	return panthor_perf_sampler_request_clearing(&perf->sampler);
>  }
>  
>  static int session_sample(struct panthor_perf *perf, struct panthor_perf_session *session,
> @@ -601,15 +1466,16 @@ static int session_sample(struct panthor_perf *perf, struct panthor_perf_session
>  	session->sample_start_ns = ktime_get_raw_ns();
>  	session->user_data = user_data;
>  
> -	/* TODO Calls to the FW interface will go here in later patches. */
> -	return 0;
> +	return panthor_perf_sampler_request_sample(&perf->sampler, session);
>  }
>  
>  static int session_destroy(struct panthor_perf *perf, struct panthor_perf_session *session)
>  {
> +	int ret = panthor_perf_sampler_remove(&perf->sampler, session->enabled_counters);
> +
>  	session_put(session);
>  
> -	return 0;
> +	return ret;
>  }
>  
>  static int session_teardown(struct panthor_perf *perf, struct panthor_perf_session *session)
> @@ -813,6 +1679,8 @@ void panthor_perf_unplug(struct panthor_device *ptdev)
>  
>  	xa_destroy(&perf->sessions);
>  
> +	panthor_perf_sampler_term(&perf->sampler);
> +
>  	devm_kfree(ptdev->base.dev, ptdev->perf);
>  
>  	ptdev->perf = NULL;
> diff --git a/drivers/gpu/drm/panthor/panthor_perf.h b/drivers/gpu/drm/panthor/panthor_perf.h
> index bfef8874068b..3485e4a55e15 100644
> --- a/drivers/gpu/drm/panthor/panthor_perf.h
> +++ b/drivers/gpu/drm/panthor/panthor_perf.h
> @@ -31,4 +31,6 @@ int panthor_perf_session_sample(struct panthor_file *pfile, struct panthor_perf
>  		u32 sid, u64 user_data);
>  void panthor_perf_session_destroy(struct panthor_file *pfile, struct panthor_perf *perf);
>  
> +void panthor_perf_report_irq(struct panthor_device *ptdev, u32 status);
> +
>  #endif /* __PANTHOR_PERF_H__ */
> diff --git a/include/uapi/drm/panthor_drm.h b/include/uapi/drm/panthor_drm.h
> index 576d3ad46e6d..a29b755d6556 100644
> --- a/include/uapi/drm/panthor_drm.h
> +++ b/include/uapi/drm/panthor_drm.h
> @@ -441,8 +441,11 @@ enum drm_panthor_perf_feat_flags {
>   * enum drm_panthor_perf_block_type - Performance counter supported block types.
>   */
>  enum drm_panthor_perf_block_type {
> +	/** DRM_PANTHOR_PERF_BLOCK_METADATA: Internal use only. */
> +	DRM_PANTHOR_PERF_BLOCK_METADATA = 0,
> +
>  	/** @DRM_PANTHOR_PERF_BLOCK_FW: The FW counter block. */
> -	DRM_PANTHOR_PERF_BLOCK_FW = 1,
> +	DRM_PANTHOR_PERF_BLOCK_FW,
>  
>  	/** @DRM_PANTHOR_PERF_BLOCK_CSG: A CSG counter block. */
>  	DRM_PANTHOR_PERF_BLOCK_CSG,
> -- 
> 2.25.1


Adrian Larumbe

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC v2 5/8] drm/panthor: Introduce sampling sessions to handle userspace clients
  2024-12-11 16:50 ` [RFC v2 5/8] drm/panthor: Introduce sampling sessions to handle userspace clients Lukas Zapolskas
  2025-01-27 15:43   ` Adrián Larumbe
@ 2025-01-27 21:39   ` Adrián Larumbe
  1 sibling, 0 replies; 28+ messages in thread
From: Adrián Larumbe @ 2025-01-27 21:39 UTC (permalink / raw)
  To: Lukas Zapolskas
  Cc: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	dri-devel, linux-kernel, Mihail Atanassov, nd

On 11.12.2024 16:50, Lukas Zapolskas wrote:
> To allow for combining the requests from multiple userspace clients, an
> intermediary layer between the HW/FW interfaces and userspace is
> created, containing the information for the counter requests and
> tracking of insert and extract indices. Each session starts inactive and
> must be explicitly activated via PERF_CONTROL.START, and explicitly
> stopped via PERF_CONTROL.STOP. Userspace identifies a single client with
> its session ID and the panthor file it is associated with.
> 
> The SAMPLE and STOP commands both produce a single sample when called,
> and these samples can be disambiguated via the opaque user data field
> passed in the PERF_CONTROL uAPI. If this functionality is not desired,
> these fields can be kept as zero, as the kernel copies this value into
> the corresponding sample without attempting to interpret it.
> 
> Currently, only manual sampling sessions are supported, providing
> samples when userspace calls PERF_CONTROL.SAMPLE, and only a single
> session is allowed at a time. Multiple sessions and periodic sampling
> will be enabled in following patches.
> 
> No protected is provided against the 32-bit hardware counter overflows,
> so for the moment it is up to userspace to ensure that the counters are
> sampled at a reasonable frequency.
> 
> The counter set enum is added to the uapi to clarify the restrictions on
> calling the interface.
> 
> Signed-off-by: Lukas Zapolskas <lukas.zapolskas@arm.com>
> ---
>  drivers/gpu/drm/panthor/panthor_device.h |   3 +
>  drivers/gpu/drm/panthor/panthor_drv.c    |   1 +
>  drivers/gpu/drm/panthor/panthor_perf.c   | 697 ++++++++++++++++++++++-
>  include/uapi/drm/panthor_drm.h           |  50 +-
>  4 files changed, 732 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> index aca33d03036c..9ed1e9aed521 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.h
> +++ b/drivers/gpu/drm/panthor/panthor_device.h
> @@ -210,6 +210,9 @@ struct panthor_file {
>  	/** @ptdev: Device attached to this file. */
>  	struct panthor_device *ptdev;
>  
> +	/** @drm_file: Corresponding drm_file */
> +	struct drm_file *drm_file;

I think we could avoid keeping this reference here, see [1].

>  	/** @vms: VM pool attached to this file. */
>  	struct panthor_vm_pool *vms;
>  
> diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
> index 458175f58b15..2848ab442d10 100644
> --- a/drivers/gpu/drm/panthor/panthor_drv.c
> +++ b/drivers/gpu/drm/panthor/panthor_drv.c
> @@ -1505,6 +1505,7 @@ panthor_open(struct drm_device *ddev, struct drm_file *file)
>  	}
>  
>  	pfile->ptdev = ptdev;
> +	pfile->drm_file = file;
>  
>  	ret = panthor_vm_pool_create(pfile);
>  	if (ret)
> diff --git a/drivers/gpu/drm/panthor/panthor_perf.c b/drivers/gpu/drm/panthor/panthor_perf.c
> index 6498279ec036..42d8b6f8c45d 100644
> --- a/drivers/gpu/drm/panthor/panthor_perf.c
> +++ b/drivers/gpu/drm/panthor/panthor_perf.c
> @@ -3,16 +3,162 @@
>  /* Copyright 2024 Arm ltd. */
>  
>  #include <drm/drm_file.h>
> +#include <drm/drm_gem.h>
>  #include <drm/drm_gem_shmem_helper.h>
>  #include <drm/drm_managed.h>
> +#include <drm/drm_print.h>
>  #include <drm/panthor_drm.h>
>  
> +#include <linux/circ_buf.h>
> +#include <linux/iosys-map.h>
> +#include <linux/pm_runtime.h>
> +
>  #include "panthor_device.h"
>  #include "panthor_fw.h"
>  #include "panthor_gpu.h"
>  #include "panthor_perf.h"
>  #include "panthor_regs.h"
>  
> +/**
> + * PANTHOR_PERF_EM_BITS - Number of bits in a user-facing enable mask. This must correspond
> + *                        to the maximum number of counters available for selection on the newest
> + *                        Mali GPUs (128 as of the Mali-Gx15).
> + */
> +#define PANTHOR_PERF_EM_BITS (BITS_PER_TYPE(u64) * 2)
> +
> +/**
> + * enum panthor_perf_session_state - Session state bits.
> + */
> +enum panthor_perf_session_state {
> +	/** @PANTHOR_PERF_SESSION_ACTIVE: The session is active and can be used for sampling. */
> +	PANTHOR_PERF_SESSION_ACTIVE = 0,
> +
> +	/**
> +	 * @PANTHOR_PERF_SESSION_OVERFLOW: The session encountered an overflow in one of the
> +	 *                                 counters during the last sampling period. This flag
> +	 *                                 gets propagated as part of samples emitted for this
> +	 *                                 session, to ensure the userspace client can gracefully
> +	 *                                 handle this data corruption.
> +	 */
> +	PANTHOR_PERF_SESSION_OVERFLOW,
> +
> +	/** @PANTHOR_PERF_SESSION_MAX: Bits needed to represent the state. Must be last.*/
> +	PANTHOR_PERF_SESSION_MAX,
> +};
> +
> +struct panthor_perf_enable_masks {
> +	/**
> +	 * @link: List node used to keep track of the enable masks aggregated by the sampler.
> +	 */
> +	struct list_head link;
> +
> +	/** @refs: Number of references taken out on an instantiated enable mask. */
> +	struct kref refs;
> +
> +	/**
> +	 * @mask: Array of bitmasks indicating the counters userspace requested, where
> +	 *        one bit represents a single counter. Used to build the firmware configuration
> +	 *        and ensure that userspace clients obtain only the counters they requested.
> +	 */
> +	DECLARE_BITMAP(mask, PANTHOR_PERF_EM_BITS)[DRM_PANTHOR_PERF_BLOCK_MAX];
> +};
> +
> +struct panthor_perf_counter_block {
> +	struct drm_panthor_perf_block_header header;
> +	u64 counters[];
> +};
> +
> +struct panthor_perf_session {
> +	DECLARE_BITMAP(state, PANTHOR_PERF_SESSION_MAX);
> +
> +	/**
> +	 * @user_sample_size: The size of a single sample as exposed to userspace. For the sake of
> +	 *                    simplicity, the current implementation exposes the same structure
> +	 *                    as provided by firmware, after annotating the sample and the blocks,
> +	 *                    and zero-extending the counters themselves (to account for in-kernel
> +	 *                    accumulation).
> +	 *
> +	 *                    This may also allow further memory-optimizations of compressing the
> +	 *                    sample to provide only requested blocks, if deemed to be worth the
> +	 *                    additional complexity.
> +	 */
> +	size_t user_sample_size;
> +
> +	/**
> +	 * @sample_freq_ns: Period between subsequent sample requests. Zero indicates that
> +	 *                  userspace will be responsible for requesting samples.
> +	 */
> +	u64 sample_freq_ns;
> +
> +	/** @sample_start_ns: Sample request time, obtained from a monotonic raw clock. */
> +	u64 sample_start_ns;
> +
> +	/**
> +	 * @user_data: Opaque handle passed in when starting a session, requesting a sample (for
> +	 *             manual sampling sessions only) and when stopping a session. This handle
> +	 *             allows the disambiguation of a sample in the ringbuffer.
> +	 */
> +	u64 user_data;
> +
> +	/**
> +	 * @eventfd: Event file descriptor context used to signal userspace of a new sample
> +	 *           being emitted.
> +	 */
> +	struct eventfd_ctx *eventfd;
> +
> +	/**
> +	 * @enabled_counters: This session's requested counters. Note that these cannot change
> +	 *                    for the lifetime of the session.
> +	 */
> +	struct panthor_perf_enable_masks *enabled_counters;
> +
> +	/** @ringbuf_slots: Slots in the user-facing ringbuffer. */
> +	size_t ringbuf_slots;
> +
> +	/** @ring_buf: BO for the userspace ringbuffer. */
> +	struct drm_gem_object *ring_buf;
> +
> +	/**
> +	 * @control_buf: BO for the insert and extract indices.
> +	 */
> +	struct drm_gem_object *control_buf;
> +
> +	/**
> +	 * @extract_idx: The extract index is used by userspace to indicate the position of the
> +	 *               consumer in the ringbuffer.
> +	 */
> +	u32 *extract_idx;
> +
> +	/**
> +	 * @insert_idx: The insert index is used by the kernel to indicate the position of the
> +	 *              latest sample exposed to userspace.
> +	 */
> +	u32 *insert_idx;
> +
> +	/** @samples: The mapping of the @ring_buf into the kernel's VA space. */
> +	u8 *samples;
> +
> +	/**
> +	 * @waiting: The list node used by the sampler to track the sessions waiting for a sample.
> +	 */
> +	struct list_head waiting;
> +
> +	/**
> +	 * @pfile: The panthor file which was used to create a session, used for the postclose
> +	 *         handling and to prevent a misconfigured userspace from closing unrelated
> +	 *         sessions.
> +	 */
> +	struct panthor_file *pfile;
> +
> +	/**
> +	 * @ref: Session reference count. The sample delivery to userspace is asynchronous, meaning
> +	 *       the lifetime of the session must extend at least until the sample is exposed to
> +	 *       userspace.
> +	 */
> +	struct kref ref;
> +};
> +
> +
>  struct panthor_perf {
>  	/**
>  	 * @block_set: The global counter set configured onto the HW.
> @@ -63,39 +209,154 @@ void panthor_perf_info_init(struct panthor_device *ptdev)
>  	perf_info->shader_blocks = hweight64(ptdev->gpu_info.shader_present);
>  }
>  
> -int panthor_perf_session_setup(struct panthor_device *ptdev, struct panthor_perf *perf,
> -		struct drm_panthor_perf_cmd_setup *setup_args,
> -		struct panthor_file *pfile)
> +static struct panthor_perf_enable_masks *panthor_perf_em_new(void)
>  {
> -	return -EOPNOTSUPP;
> +	struct panthor_perf_enable_masks *em = kmalloc(sizeof(*em), GFP_KERNEL);
> +
> +	if (!em)
> +		return ERR_PTR(-ENOMEM);
> +
> +	INIT_LIST_HEAD(&em->link);
> +
> +	kref_init(&em->refs);
> +
> +	return em;
>  }
>  
> -int panthor_perf_session_teardown(struct panthor_file *pfile, struct panthor_perf *perf,
> -		u32 sid)
> +static struct panthor_perf_enable_masks *panthor_perf_create_em(struct drm_panthor_perf_cmd_setup
> +		*setup_args)
>  {
> -	return -EOPNOTSUPP;
> +	struct panthor_perf_enable_masks *em = panthor_perf_em_new();
> +
> +	if (IS_ERR_OR_NULL(em))
> +		return em;
> +
> +	bitmap_from_arr64(em->mask[DRM_PANTHOR_PERF_BLOCK_FW],
> +			setup_args->fw_enable_mask, PANTHOR_PERF_EM_BITS);
> +	bitmap_from_arr64(em->mask[DRM_PANTHOR_PERF_BLOCK_CSG],
> +			setup_args->csg_enable_mask, PANTHOR_PERF_EM_BITS);
> +	bitmap_from_arr64(em->mask[DRM_PANTHOR_PERF_BLOCK_CSHW],
> +			setup_args->cshw_enable_mask, PANTHOR_PERF_EM_BITS);
> +	bitmap_from_arr64(em->mask[DRM_PANTHOR_PERF_BLOCK_TILER],
> +			setup_args->tiler_enable_mask, PANTHOR_PERF_EM_BITS);
> +	bitmap_from_arr64(em->mask[DRM_PANTHOR_PERF_BLOCK_MEMSYS],
> +			setup_args->memsys_enable_mask, PANTHOR_PERF_EM_BITS);
> +	bitmap_from_arr64(em->mask[DRM_PANTHOR_PERF_BLOCK_SHADER],
> +			setup_args->shader_enable_mask, PANTHOR_PERF_EM_BITS);
> +
> +	return em;
>  }
>  
> -int panthor_perf_session_start(struct panthor_file *pfile, struct panthor_perf *perf,
> -		u32 sid, u64 user_data)
> +static void panthor_perf_destroy_em_kref(struct kref *em_kref)
>  {
> -	return -EOPNOTSUPP;
> +	struct panthor_perf_enable_masks *em = container_of(em_kref, typeof(*em), refs);
> +
> +	if (!list_empty(&em->link))
> +		return;
> +
> +	kfree(em);
>  }
>  
> -int panthor_perf_session_stop(struct panthor_file *pfile, struct panthor_perf *perf,
> -		u32 sid, u64 user_data)
> +static size_t get_annotated_block_size(size_t counters_per_block)
>  {
> -		return -EOPNOTSUPP;
> +	return struct_size_t(struct panthor_perf_counter_block, counters, counters_per_block);
>  }
>  
> -int panthor_perf_session_sample(struct panthor_file *pfile, struct panthor_perf *perf,
> -		u32 sid, u64 user_data)
> +static u32 session_read_extract_idx(struct panthor_perf_session *session)
> +{
> +	/* Userspace will update their own extract index to indicate that a sample is consumed
> +	 * from the ringbuffer, and we must ensure we read the latest value.
> +	 */
> +	return smp_load_acquire(session->extract_idx);
> +}
> +
> +static u32 session_read_insert_idx(struct panthor_perf_session *session)
> +{
> +	return *session->insert_idx;
> +}
> +
> +static void session_get(struct panthor_perf_session *session)
> +{
> +	kref_get(&session->ref);
> +}
> +
> +static void session_free(struct kref *ref)
> +{
> +	struct panthor_perf_session *session = container_of(ref, typeof(*session), ref);
> +
> +	if (session->samples) {
> +		struct iosys_map map = IOSYS_MAP_INIT_VADDR(session->samples);
> +
> +		drm_gem_vunmap_unlocked(session->ring_buf, &map);
> +		drm_gem_object_put(session->ring_buf);
> +	}
> +
> +	if (session->insert_idx && session->extract_idx) {
> +		struct iosys_map map = IOSYS_MAP_INIT_VADDR(session->extract_idx);
> +
> +		drm_gem_vunmap_unlocked(session->control_buf, &map);
> +		drm_gem_object_put(session->control_buf);
> +	}
> +
> +	kref_put(&session->enabled_counters->refs, panthor_perf_destroy_em_kref);
> +	eventfd_ctx_put(session->eventfd);
> +
> +	devm_kfree(session->pfile->ptdev->base.dev, session);
> +}
> +
> +static void session_put(struct panthor_perf_session *session)
> +{
> +	kref_put(&session->ref, session_free);
> +}
> +
> +/**
> + * session_find - Find a session associated with the given session ID and
> + *                panthor_file.
> + * @pfile: Panthor file.
> + * @perf: Panthor perf.
> + * @sid: Session ID.
> + *
> + * The reference count of a valid session is increased to ensure it does not disappear
> + * in the window between the XA lock being dropped and the internal session functions
> + * being called.
> + *
> + * Return: valid session pointer or an ERR_PTR.
> + */
> +static struct panthor_perf_session *session_find(struct panthor_file *pfile,
> +		struct panthor_perf *perf, u32 sid)
>  {
> -	return -EOPNOTSUPP;
> +	struct panthor_perf_session *session;
>  
> +	if (!perf)
> +		return ERR_PTR(-EINVAL);
> +
> +	xa_lock(&perf->sessions);
> +	session = xa_load(&perf->sessions, sid);
> +
> +	if (!session || xa_is_err(session)) {
> +		xa_unlock(&perf->sessions);
> +		return ERR_PTR(-EBADF);
> +	}
> +
> +	if (session->pfile != pfile) {
> +		xa_unlock(&perf->sessions);
> +		return ERR_PTR(-EINVAL);
> +	}
> +
> +	session_get(session);
> +	xa_unlock(&perf->sessions);
> +
> +	return session;
>  }
>  
> -void panthor_perf_session_destroy(struct panthor_file *pfile, struct panthor_perf *perf) { }
> +static size_t session_get_max_sample_size(const struct drm_panthor_perf_info *const info)
> +{
> +	const size_t block_size = get_annotated_block_size(info->counters_per_block);
> +	const size_t block_nr = info->cshw_blocks + info->csg_blocks + info->fw_blocks +
> +		info->tiler_blocks + info->memsys_blocks + info->shader_blocks;
> +
> +	return sizeof(struct drm_panthor_perf_sample_header) + (block_size * block_nr);
> +}
>  
>  /**
>   * panthor_perf_init - Initialize the performance counter subsystem.
> @@ -130,6 +391,399 @@ int panthor_perf_init(struct panthor_device *ptdev)
>  	ptdev->perf = perf;
>  
>  	return 0;
> +
> +}
> +
> +static int session_validate_set(u8 set)
> +{
> +	if (set > DRM_PANTHOR_PERF_SET_TERTIARY)
> +		return -EINVAL;
> +
> +	if (set == DRM_PANTHOR_PERF_SET_PRIMARY)
> +		return 0;
> +
> +	if (set > DRM_PANTHOR_PERF_SET_PRIMARY)
> +		return capable(CAP_PERFMON) ? 0 : -EACCES;
> +
> +	return -EINVAL;
> +}
> +
> +/**
> + * panthor_perf_session_setup - Create a user-visible session.
> + *
> + * @ptdev: Handle to the panthor device.
> + * @perf: Handle to the perf control structure.
> + * @setup_args: Setup arguments passed in via ioctl.
> + * @pfile: Panthor file associated with the request.
> + *
> + * Creates a new session associated with the session ID returned. When initialized, the
> + * session must explicitly request sampling to start with a successive call to PERF_CONTROL.START.
> + *
> + * Return: non-negative session identifier on success or negative error code on failure.
> + */
> +int panthor_perf_session_setup(struct panthor_device *ptdev, struct panthor_perf *perf,
> +		struct drm_panthor_perf_cmd_setup *setup_args,
> +		struct panthor_file *pfile)
> +{
> +	struct panthor_perf_session *session;
> +	struct drm_gem_object *ringbuffer;
> +	struct drm_gem_object *control;
> +	const size_t slots = setup_args->sample_slots;
> +	struct panthor_perf_enable_masks *em;
> +	struct iosys_map rb_map, ctrl_map;
> +	size_t user_sample_size;
> +	int session_id;
> +	int ret;
> +
> +	ret = session_validate_set(setup_args->block_set);
> +	if (ret)
> +		return ret;
> +
> +	session = devm_kzalloc(ptdev->base.dev, sizeof(*session), GFP_KERNEL);
> +	if (ZERO_OR_NULL_PTR(session))
> +		return -ENOMEM;
> +
> +	ringbuffer = drm_gem_object_lookup(pfile->drm_file, setup_args->ringbuf_handle);
> +	if (!ringbuffer) {
> +		ret = -EINVAL;
> +		goto cleanup_session;
> +	}
> +
> +	control = drm_gem_object_lookup(pfile->drm_file, setup_args->control_handle);
> +	if (!control) {
> +		ret = -EINVAL;
> +		goto cleanup_ringbuf;
> +	}

If you pass panthor_perf_session_setup() the drm_file pointer instead of the
panthor_file one, you could do

struct panthor_file *pfile = file->driver_priv;

at the beginning of the function, and avoid storing the pointer back to the drm_file in the
panthor_file struct, because panthor_file::drm_file isn't used anywhere else outside of this
function.

> +	user_sample_size = session_get_max_sample_size(&ptdev->perf_info) * slots;
> +
> +	if (ringbuffer->size != PFN_ALIGN(user_sample_size)) {
> +		ret = -ENOMEM;
> +		goto cleanup_control;
> +	}
> +
> +	ret = drm_gem_vmap_unlocked(ringbuffer, &rb_map);
> +	if (ret)
> +		goto cleanup_control;
> +
> +
> +	ret = drm_gem_vmap_unlocked(control, &ctrl_map);
> +	if (ret)
> +		goto cleanup_ring_map;
> +
> +	session->eventfd = eventfd_ctx_fdget(setup_args->fd);
> +	if (IS_ERR_OR_NULL(session->eventfd)) {
> +		ret = PTR_ERR_OR_ZERO(session->eventfd) ?: -EINVAL;
> +		goto cleanup_control_map;
> +	}
> +
> +	em = panthor_perf_create_em(setup_args);
> +	if (IS_ERR_OR_NULL(em)) {
> +		ret = -ENOMEM;
> +		goto cleanup_eventfd;
> +	}
> +
> +	INIT_LIST_HEAD(&session->waiting);
> +	session->extract_idx = ctrl_map.vaddr;
> +	*session->extract_idx = 0;
> +	session->insert_idx = session->extract_idx + 1;
> +	*session->insert_idx = 0;
> +
> +	session->samples = rb_map.vaddr;
> +
> +	/* TODO This will need validation when we support periodic sampling sessions */
> +	if (setup_args->sample_freq_ns) {
> +		ret = -EOPNOTSUPP;
> +		goto cleanup_em;
> +	}
> +
> +	session->sample_freq_ns = setup_args->sample_freq_ns;
> +	session->user_sample_size = user_sample_size;
> +	session->enabled_counters = em;
> +	session->ring_buf = ringbuffer;
> +	session->control_buf = control;
> +	session->pfile = pfile;
> +
> +	ret = xa_alloc_cyclic(&perf->sessions, &session_id, session, perf->session_range,
> +			&perf->next_session, GFP_KERNEL);
> +	if (ret < 0)
> +		goto cleanup_em;
> +
> +	kref_init(&session->ref);
> +
> +	return session_id;
> +
> +cleanup_em:
> +	kref_put(&em->refs, panthor_perf_destroy_em_kref);
> +
> +cleanup_eventfd:
> +	eventfd_ctx_put(session->eventfd);
> +
> +cleanup_control_map:
> +	drm_gem_vunmap_unlocked(control, &ctrl_map);
> +
> +cleanup_ring_map:
> +	drm_gem_vunmap_unlocked(ringbuffer, &rb_map);
> +
> +cleanup_control:
> +	drm_gem_object_put(control);
> +
> +cleanup_ringbuf:
> +	drm_gem_object_put(ringbuffer);
> +
> +cleanup_session:
> +	devm_kfree(ptdev->base.dev, session);
> +
> +	return ret;
> +}
> +
> +static int session_stop(struct panthor_perf *perf, struct panthor_perf_session *session,
> +		u64 user_data)
> +{
> +	if (!test_bit(PANTHOR_PERF_SESSION_ACTIVE, session->state))
> +		return 0;
> +
> +	const u32 extract_idx = session_read_extract_idx(session);
> +	const u32 insert_idx = session_read_insert_idx(session);
> +
> +	/* Must have at least one slot remaining in the ringbuffer to sample. */
> +	if (WARN_ON_ONCE(!CIRC_SPACE_TO_END(insert_idx, extract_idx, session->ringbuf_slots)))
> +		return -EBUSY;
> +
> +	session->user_data = user_data;
> +
> +	clear_bit(PANTHOR_PERF_SESSION_ACTIVE, session->state);
> +
> +	/* TODO Calls to the FW interface will go here in later patches. */
> +	return 0;
> +}
> +
> +static int session_start(struct panthor_perf *perf, struct panthor_perf_session *session,
> +		u64 user_data)
> +{
> +	if (test_bit(PANTHOR_PERF_SESSION_ACTIVE, session->state))
> +		return 0;
> +
> +	set_bit(PANTHOR_PERF_SESSION_ACTIVE, session->state);
> +
> +	/*
> +	 * For manual sampling sessions, a start command does not correspond to a sample,
> +	 * and so the user data gets discarded.
> +	 */
> +	if (session->sample_freq_ns)
> +		session->user_data = user_data;
> +
> +	/* TODO Calls to the FW interface will go here in later patches. */
> +	return 0;
> +}
> +
> +static int session_sample(struct panthor_perf *perf, struct panthor_perf_session *session,
> +		u64 user_data)
> +{
> +	if (!test_bit(PANTHOR_PERF_SESSION_ACTIVE, session->state))
> +		return -EACCES;
> +
> +	const u32 extract_idx = session_read_extract_idx(session);
> +	const u32 insert_idx = session_read_insert_idx(session);
> +
> +	/* Manual sampling for periodic sessions is forbidden. */
> +	if (session->sample_freq_ns)
> +		return -EINVAL;
> +
> +	/*
> +	 * Must have at least two slots remaining in the ringbuffer to sample: one for
> +	 * the current sample, and one for a stop sample, since a stop command should
> +	 * always be acknowledged by taking a final sample and stopping the session.
> +	 */
> +	if (CIRC_SPACE_TO_END(insert_idx, extract_idx, session->ringbuf_slots) < 2)
> +		return -EBUSY;
> +
> +	session->sample_start_ns = ktime_get_raw_ns();
> +	session->user_data = user_data;
> +
> +	/* TODO Calls to the FW interface will go here in later patches. */
> +	return 0;
> +}
> +
> +static int session_destroy(struct panthor_perf *perf, struct panthor_perf_session *session)
> +{
> +	session_put(session);
> +
> +	return 0;
> +}
> +
> +static int session_teardown(struct panthor_perf *perf, struct panthor_perf_session *session)
> +{
> +	if (test_bit(PANTHOR_PERF_SESSION_ACTIVE, session->state))
> +		return -EINVAL;
> +
> +	if (!list_empty(&session->waiting))
> +		return -EBUSY;
> +
> +	return session_destroy(perf, session);
> +}
> +
> +/**
> + * panthor_perf_session_teardown - Teardown the session associated with the @sid.
> + * @pfile: Open panthor file.
> + * @perf: Handle to the perf control structure.
> + * @sid: Session identifier.
> + *
> + * Destroys a stopped session where the last sample has been explicitly consumed
> + * or discarded. Active sessions will be ignored.
> + *
> + * Return: 0 on success, negative error code on failure.
> + */
> +int panthor_perf_session_teardown(struct panthor_file *pfile, struct panthor_perf *perf, u32 sid)
> +{
> +	int err;
> +	struct panthor_perf_session *session;
> +
> +	xa_lock(&perf->sessions);
> +	session = __xa_store(&perf->sessions, sid, NULL, GFP_KERNEL);
> +
> +	if (xa_is_err(session)) {
> +		err = xa_err(session);
> +		goto restore;
> +	}
> +
> +	if (session->pfile != pfile) {
> +		err = -EINVAL;
> +		goto restore;
> +	}
> +
> +	session_get(session);
> +	xa_unlock(&perf->sessions);
> +
> +	err = session_teardown(perf, session);
> +
> +	session_put(session);
> +
> +	return err;
> +
> +restore:
> +	__xa_store(&perf->sessions, sid, session, GFP_KERNEL);
> +	xa_unlock(&perf->sessions);
> +
> +	return err;
> +}
> +
> +/**
> + * panthor_perf_session_start - Start sampling on a stopped session.
> + * @pfile: Open panthor file.
> + * @perf: Handle to the panthor perf control structure.
> + * @sid: Session identifier for the desired session.
> + * @user_data: An opaque value passed in from userspace.
> + *
> + * A session counts as stopped when it is created or when it is explicitly stopped after being
> + * started. Starting an active session is treated as a no-op.
> + *
> + * The @user_data parameter will be associated with all subsequent samples for a periodic
> + * sampling session and will be ignored for manual sampling ones in favor of the user data
> + * passed in the PERF_CONTROL.SAMPLE ioctl call.
> + *
> + * Return: 0 on success, negative error code on failure.
> + */
> +int panthor_perf_session_start(struct panthor_file *pfile, struct panthor_perf *perf,
> +		u32 sid, u64 user_data)
> +{
> +	struct panthor_perf_session *session = session_find(pfile, perf, sid);
> +	int err;
> +
> +	if (IS_ERR_OR_NULL(session))
> +		return IS_ERR(session) ? PTR_ERR(session) : -EINVAL;
> +
> +	err = session_start(perf, session, user_data);
> +
> +	session_put(session);
> +
> +	return err;
> +}
> +
> +/**
> + * panthor_perf_session_stop - Stop sampling on an active session.
> + * @pfile: Open panthor file.
> + * @perf: Handle to the panthor perf control structure.
> + * @sid: Session identifier for the desired session.
> + * @user_data: An opaque value passed in from userspace.
> + *
> + * A session counts as active when it has been explicitly started via the PERF_CONTROL.START
> + * ioctl. Stopping a stopped session is treated as a no-op.
> + *
> + * To ensure data is not lost when sampling is stopping, there must always be at least one slot
> + * available for the final automatic sample, and the stop command will be rejected if there is not.
> + *
> + * The @user_data will always be associated with the final sample.
> + *
> + * Return: 0 on success, negative error code on failure.
> + */
> +int panthor_perf_session_stop(struct panthor_file *pfile, struct panthor_perf *perf,
> +		u32 sid, u64 user_data)
> +{
> +	struct panthor_perf_session *session = session_find(pfile, perf, sid);
> +	int err;
> +
> +	if (IS_ERR_OR_NULL(session))
> +		return IS_ERR(session) ? PTR_ERR(session) : -EINVAL;
> +
> +	err = session_stop(perf, session, user_data);
> +
> +	session_put(session);
> +
> +	return err;
> +}
> +
> +/**
> + * panthor_perf_session_sample - Request a sample on a manual sampling session.
> + * @pfile: Open panthor file.
> + * @perf: Handle to the panthor perf control structure.
> + * @sid: Session identifier for the desired session.
> + * @user_data: An opaque value passed in from userspace.
> + *
> + * Only an active manual sampler is permitted to request samples directly. Failing to meet either
> + * of these conditions will cause the sampling request to be rejected. Requesting a manual sample
> + * with a full ringbuffer will see the request being rejected.
> + *
> + * The @user_data will always be unambiguously associated one-to-one with the resultant sample.
> + *
> + * Return: 0 on success, negative error code on failure.
> + */
> +int panthor_perf_session_sample(struct panthor_file *pfile, struct panthor_perf *perf,
> +		u32 sid, u64 user_data)
> +{
> +	struct panthor_perf_session *session = session_find(pfile, perf, sid);
> +	int err;
> +
> +	if (IS_ERR_OR_NULL(session))
> +		return IS_ERR(session) ? PTR_ERR(session) : -EINVAL;
> +
> +	err = session_sample(perf, session, user_data);
> +
> +	session_put(session);
> +
> +	return err;
> +}
> +
> +/**
> + * panthor_perf_session_destroy - Destroy a sampling session associated with the @pfile.
> + * @perf: Handle to the panthor perf control structure.
> + * @pfile: The file being closed.
> + *
> + * Must be called when the corresponding userspace process is destroyed and cannot close its
> + * own sessions. As such, we offer no guarantees about data delivery.
> + */
> +void panthor_perf_session_destroy(struct panthor_file *pfile, struct panthor_perf *perf)
> +{
> +	unsigned long sid;
> +	struct panthor_perf_session *session;
> +
> +	xa_for_each(&perf->sessions, sid, session)
> +	{
> +		if (session->pfile == pfile) {
> +			session_destroy(perf, session);
> +			xa_erase(&perf->sessions, sid);
> +		}
> +	}
>  }
>  
>  /**
> @@ -146,10 +800,17 @@ void panthor_perf_unplug(struct panthor_device *ptdev)
>  	if (!perf)
>  		return;
>  
> -	if (!xa_empty(&perf->sessions))
> +	if (!xa_empty(&perf->sessions)) {
> +		unsigned long sid;
> +		struct panthor_perf_session *session;
> +
>  		drm_err(&ptdev->base,
>  				"Performance counter sessions active when unplugging the driver!");
>  
> +		xa_for_each(&perf->sessions, sid, session)
> +			session_destroy(perf, session);
> +	}
> +
>  	xa_destroy(&perf->sessions);
>  
>  	devm_kfree(ptdev->base.dev, ptdev->perf);
> diff --git a/include/uapi/drm/panthor_drm.h b/include/uapi/drm/panthor_drm.h
> index 8a431431da6b..576d3ad46e6d 100644
> --- a/include/uapi/drm/panthor_drm.h
> +++ b/include/uapi/drm/panthor_drm.h
> @@ -458,6 +458,12 @@ enum drm_panthor_perf_block_type {
>  
>  	/** @DRM_PANTHOR_PERF_BLOCK_SHADER: A shader core counter block. */
>  	DRM_PANTHOR_PERF_BLOCK_SHADER,
> +
> +	/** @DRM_PANTHOR_PERF_BLOCK_LAST: Internal use only. */
> +	DRM_PANTHOR_PERF_BLOCK_LAST = DRM_PANTHOR_PERF_BLOCK_SHADER,
> +
> +	/** @DRM_PANTHOR_PERF_BLOCK_MAX: Internal use only. */
> +	DRM_PANTHOR_PERF_BLOCK_MAX = DRM_PANTHOR_PERF_BLOCK_LAST + 1,
>  };
>  
>  /**
> @@ -1368,6 +1374,44 @@ struct drm_panthor_perf_control {
>  	__u64 pointer;
>  };
>  
> +/**
> + * enum drm_panthor_perf_counter_set - The counter set to be requested from the hardware.
> + *
> + * The hardware supports a single performance counter set at a time, so requesting any set other
> + * than the primary may fail if another process is sampling at the same time.
> + *
> + * If in doubt, the primary counter set has the most commonly used counters and requires no
> + * additional permissions to open.
> + */
> +enum drm_panthor_perf_counter_set {
> +	/**
> +	 * @DRM_PANTHOR_PERF_SET_PRIMARY: The default set configured on the hardware.
> +	 *
> +	 * This is the only set for which all counters in all blocks are defined.
> +	 */
> +	DRM_PANTHOR_PERF_SET_PRIMARY,
> +
> +	/**
> +	 * @DRM_PANTHOR_PERF_SET_SECONDARY: The secondary performance counter set.
> +	 *
> +	 * Some blocks may not have any defined counters for this set, and the block will
> +	 * have the UNAVAILABLE block state permanently set in the block header.
> +	 *
> +	 * Accessing this set requires the calling process to have the CAP_PERFMON capability.
> +	 */
> +	DRM_PANTHOR_PERF_SET_SECONDARY,
> +
> +	/**
> +	 * @DRM_PANTHOR_PERF_SET_TERTIARY: The tertiary performance counter set.
> +	 *
> +	 * Some blocks may not have any defined counters for this set, and the block will have
> +	 * the UNAVAILABLE block state permanently set in the block header. Note that the
> +	 * tertiary set has the fewest defined counter blocks.
> +	 *
> +	 * Accessing this set requires the calling process to have the CAP_PERFMON capability.
> +	 */
> +	DRM_PANTHOR_PERF_SET_TERTIARY,
> +};
>  
>  /**
>   * struct drm_panthor_perf_cmd_setup - Arguments passed to DRM_PANTHOR_IOCTL_PERF_CONTROL
> @@ -1375,13 +1419,17 @@ struct drm_panthor_perf_control {
>   */
>  struct drm_panthor_perf_cmd_setup {
>  	/**
> -	 * @block_set: Set of performance counter blocks.
> +	 * @block_set: Set of performance counter blocks, member of
> +	 *             enum drm_panthor_perf_block_set.
>  	 *
>  	 * This is a global configuration and only one set can be active at a time. If
>  	 * another client has already requested a counter set, any further requests
>  	 * for a different counter set will fail and return an -EBUSY.
>  	 *
>  	 * If the requested set does not exist, the request will fail and return an -EINVAL.
> +	 *
> +	 * Some sets have additional requirements to be enabled, and the setup request will
> +	 * fail with an -EACCES if these requirements are not satisfied.
>  	 */
>  	__u8 block_set;
>  
> -- 
> 2.25.1


Adrian Larumbe

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC v2 2/8] drm/panthor: Add DEV_QUERY.PERF_INFO handling for Gx10
  2024-12-11 16:50 ` [RFC v2 2/8] drm/panthor: Add DEV_QUERY.PERF_INFO handling for Gx10 Lukas Zapolskas
  2025-01-27  9:56   ` Adrián Larumbe
@ 2025-01-27 22:17   ` Adrián Larumbe
  1 sibling, 0 replies; 28+ messages in thread
From: Adrián Larumbe @ 2025-01-27 22:17 UTC (permalink / raw)
  To: Lukas Zapolskas
  Cc: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	dri-devel, linux-kernel, Mihail Atanassov, nd

On 11.12.2024 16:50, Lukas Zapolskas wrote:
> This change adds the IOCTL to query data about the performance counter
> setup. Some of this data was available via previous DEV_QUERY calls,
> for instance for GPU info, but exposing it via PERF_INFO
> minimizes the overhead of creating a single session to just the one
> aggregate IOCTL.
> 
> To better align the FW interfaces with the arch spec, the patch also
> renames perfcnt to prfcnt.
> 
> Signed-off-by: Lukas Zapolskas <lukas.zapolskas@arm.com>
> ---
>  drivers/gpu/drm/panthor/Makefile         |  1 +
>  drivers/gpu/drm/panthor/panthor_device.h |  3 ++
>  drivers/gpu/drm/panthor/panthor_drv.c    | 11 +++++-
>  drivers/gpu/drm/panthor/panthor_fw.c     |  4 ++
>  drivers/gpu/drm/panthor/panthor_fw.h     |  4 ++
>  drivers/gpu/drm/panthor/panthor_perf.c   | 47 ++++++++++++++++++++++++
>  drivers/gpu/drm/panthor/panthor_perf.h   | 12 ++++++
>  7 files changed, 81 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/gpu/drm/panthor/panthor_perf.c
>  create mode 100644 drivers/gpu/drm/panthor/panthor_perf.h
> 
> diff --git a/drivers/gpu/drm/panthor/Makefile b/drivers/gpu/drm/panthor/Makefile
> index 15294719b09c..0df9947f3575 100644
> --- a/drivers/gpu/drm/panthor/Makefile
> +++ b/drivers/gpu/drm/panthor/Makefile
> @@ -9,6 +9,7 @@ panthor-y := \
>  	panthor_gpu.o \
>  	panthor_heap.o \
>  	panthor_mmu.o \
> +	panthor_perf.o \
>  	panthor_sched.o
>  
>  obj-$(CONFIG_DRM_PANTHOR) += panthor.o
> diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> index 0e68f5a70d20..636542c1dcbd 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.h
> +++ b/drivers/gpu/drm/panthor/panthor_device.h
> @@ -119,6 +119,9 @@ struct panthor_device {
>  	/** @csif_info: Command stream interface information. */
>  	struct drm_panthor_csif_info csif_info;
>  
> +	/** @perf_info: Performance counter interface information. */
> +	struct drm_panthor_perf_info perf_info;
> +
>  	/** @gpu: GPU management data. */
>  	struct panthor_gpu *gpu;
>  
> diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
> index ad46a40ed9e1..e0ac3107c69e 100644
> --- a/drivers/gpu/drm/panthor/panthor_drv.c
> +++ b/drivers/gpu/drm/panthor/panthor_drv.c
> @@ -175,7 +175,9 @@ panthor_get_uobj_array(const struct drm_panthor_obj_array *in, u32 min_stride,
>  		 PANTHOR_UOBJ_DECL(struct drm_panthor_sync_op, timeline_value), \
>  		 PANTHOR_UOBJ_DECL(struct drm_panthor_queue_submit, syncs), \
>  		 PANTHOR_UOBJ_DECL(struct drm_panthor_queue_create, ringbuf_size), \
> -		 PANTHOR_UOBJ_DECL(struct drm_panthor_vm_bind_op, syncs))
> +		 PANTHOR_UOBJ_DECL(struct drm_panthor_vm_bind_op, syncs), \
> +		 PANTHOR_UOBJ_DECL(struct drm_panthor_perf_info, shader_blocks))
> +
>  
>  /**
>   * PANTHOR_UOBJ_SET() - Copy a kernel object to a user object.
> @@ -834,6 +836,10 @@ static int panthor_ioctl_dev_query(struct drm_device *ddev, void *data, struct d
>  			args->size = sizeof(priorities_info);
>  			return 0;
>  
> +		case DRM_PANTHOR_DEV_QUERY_PERF_INFO:
> +			args->size = sizeof(ptdev->perf_info);
> +			return 0;
> +
>  		default:
>  			return -EINVAL;
>  		}
> @@ -858,6 +864,9 @@ static int panthor_ioctl_dev_query(struct drm_device *ddev, void *data, struct d
>  		panthor_query_group_priorities_info(file, &priorities_info);
>  		return PANTHOR_UOBJ_SET(args->pointer, args->size, priorities_info);
>  
> +	case DRM_PANTHOR_DEV_QUERY_PERF_INFO:
> +		return PANTHOR_UOBJ_SET(args->pointer, args->size, ptdev->perf_info);
> +
>  	default:
>  		return -EINVAL;
>  	}
> diff --git a/drivers/gpu/drm/panthor/panthor_fw.c b/drivers/gpu/drm/panthor/panthor_fw.c
> index 4a2e36504fea..e9530d1d9781 100644
> --- a/drivers/gpu/drm/panthor/panthor_fw.c
> +++ b/drivers/gpu/drm/panthor/panthor_fw.c
> @@ -21,6 +21,7 @@
>  #include "panthor_gem.h"
>  #include "panthor_gpu.h"
>  #include "panthor_mmu.h"
> +#include "panthor_perf.h"
>  #include "panthor_regs.h"
>  #include "panthor_sched.h"
>  
> @@ -1417,6 +1418,9 @@ int panthor_fw_init(struct panthor_device *ptdev)
>  		goto err_unplug_fw;
>  
>  	panthor_fw_init_global_iface(ptdev);
> +
> +	panthor_perf_info_init(ptdev);

I think this might better go into end of panthor_device_init(), or inside panthor_perf_init()
since it doesn't program any FW interfaces.

> +
>  	return 0;
>  
>  err_unplug_fw:
> diff --git a/drivers/gpu/drm/panthor/panthor_fw.h b/drivers/gpu/drm/panthor/panthor_fw.h
> index 22448abde992..db10358e24bb 100644
> --- a/drivers/gpu/drm/panthor/panthor_fw.h
> +++ b/drivers/gpu/drm/panthor/panthor_fw.h
> @@ -5,6 +5,7 @@
>  #define __PANTHOR_MCU_H__
>  
>  #include <linux/types.h>
> +#include <linux/spinlock.h>
>  
>  struct panthor_device;
>  struct panthor_kernel_bo;
> @@ -197,8 +198,11 @@ struct panthor_fw_global_control_iface {
>  	u32 output_va;
>  	u32 group_num;
>  	u32 group_stride;
> +#define GLB_PERFCNT_FW_SIZE(x) ((((x) >> 16) << 8))
>  	u32 perfcnt_size;
>  	u32 instr_features;
> +#define PERFCNT_FEATURES_MD_SIZE(x) ((x) & GENMASK(3, 0))
> +	u32 perfcnt_features;
>  };
>  
>  struct panthor_fw_global_input_iface {
> diff --git a/drivers/gpu/drm/panthor/panthor_perf.c b/drivers/gpu/drm/panthor/panthor_perf.c
> new file mode 100644
> index 000000000000..0e3d769c1805
> --- /dev/null
> +++ b/drivers/gpu/drm/panthor/panthor_perf.c
> @@ -0,0 +1,47 @@
> +// SPDX-License-Identifier: GPL-2.0 or MIT
> +/* Copyright 2023 Collabora Ltd */
> +/* Copyright 2024 Arm ltd. */
> +
> +#include <drm/drm_file.h>
> +#include <drm/drm_gem_shmem_helper.h>
> +#include <drm/drm_managed.h>
> +#include <drm/panthor_drm.h>
> +
> +#include "panthor_device.h"
> +#include "panthor_fw.h"
> +#include "panthor_gpu.h"
> +#include "panthor_perf.h"
> +#include "panthor_regs.h"
> +
> +/**
> + * PANTHOR_PERF_COUNTERS_PER_BLOCK - On CSF architectures pre-11.x, the number of counters
> + * per block was hardcoded to be 64. Arch 11.0 onwards supports the PRFCNT_FEATURES GPU register,
> + * which indicates the same information.
> + */
> +#define PANTHOR_PERF_COUNTERS_PER_BLOCK (64)
> +
> +void panthor_perf_info_init(struct panthor_device *ptdev)
> +{
> +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
> +	struct drm_panthor_perf_info *const perf_info = &ptdev->perf_info;
> +
> +	if (PERFCNT_FEATURES_MD_SIZE(glb_iface->control->perfcnt_features))
> +		perf_info->flags |= DRM_PANTHOR_PERF_BLOCK_STATES_SUPPORT;
> +
> +	if (GPU_ARCH_MAJOR(ptdev->gpu_info.gpu_id) < 11)
> +		perf_info->counters_per_block = PANTHOR_PERF_COUNTERS_PER_BLOCK;
> +
> +	perf_info->sample_header_size = sizeof(struct drm_panthor_perf_sample_header);
> +	perf_info->block_header_size = sizeof(struct drm_panthor_perf_block_header);
> +
> +	if (GLB_PERFCNT_FW_SIZE(glb_iface->control->perfcnt_size)) {
> +		perf_info->fw_blocks = 1;
> +		perf_info->csg_blocks = glb_iface->control->group_num;
> +	}
> +
> +	perf_info->cshw_blocks = 1;
> +	perf_info->tiler_blocks = 1;
> +	perf_info->memsys_blocks = hweight64(ptdev->gpu_info.l2_present);
> +	perf_info->shader_blocks = hweight64(ptdev->gpu_info.shader_present);
> +}
> +
> diff --git a/drivers/gpu/drm/panthor/panthor_perf.h b/drivers/gpu/drm/panthor/panthor_perf.h
> new file mode 100644
> index 000000000000..cff537a370c9
> --- /dev/null
> +++ b/drivers/gpu/drm/panthor/panthor_perf.h
> @@ -0,0 +1,12 @@
> +/* SPDX-License-Identifier: GPL-2.0 or MIT */
> +/* Copyright 2024 Collabora Ltd */
> +/* Copyright 2024 Arm ltd. */
> +
> +#ifndef __PANTHOR_PERF_H__
> +#define __PANTHOR_PERF_H__
> +
> +struct panthor_device;
> +
> +void panthor_perf_info_init(struct panthor_device *ptdev);
> +
> +#endif /* __PANTHOR_PERF_H__ */
> -- 
> 2.25.1


Adrian Larumbe

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC v2 1/8] drm/panthor: Add performance counter uAPI
  2025-01-27  9:47   ` Adrián Larumbe
@ 2025-03-26 14:24     ` Lukas Zapolskas
  0 siblings, 0 replies; 28+ messages in thread
From: Lukas Zapolskas @ 2025-03-26 14:24 UTC (permalink / raw)
  To: Adrián Larumbe
  Cc: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	dri-devel, linux-kernel, Mihail Atanassov, nd

Hello Adrián,

Thank you for taking a look. I'm currently working on getting a v3 ready 
for this and to drop the RFC tag, and am going through all your comments.

On 27/01/2025 09:47, Adrián Larumbe wrote:
 > Hi Lukas,
 >
 > On 11.12.2024 16:50, Lukas Zapolskas wrote:
 >> This patch extends the DEV_QUERY ioctl to return information about the
 >> performance counter setup for userspace, and introduces the new
 >> ioctl DRM_PANTHOR_PERF_CONTROL in order to allow for the sampling of
 >> performance counters.
 >>
 >> The new design is inspired by the perf aux ringbuffer, with the insert
 >> and extract indices being mapped to userspace, allowing multiple samples
 >> to be exposed at any given time. To avoid pointer chasing, the sample
 >> metadata and block metadata are inline with the elements they
 >> describe.
 >>
 >> Userspace is responsible for passing in resources for samples to be
 >> exposed, including the event file descriptor for notification of new
 >> sample availability, the ringbuffer BO to store samples, and the control
 >> BO along with the offset for mapping the insert and extract indices.
 >> Though these indices are only a total of 8 bytes, userspace can then
 >> reuse the same physical page for tracking the state of multiple buffers
 >> by giving different offsets from the BO start to map them.
 >>
 >> Co-developed-by: Mihail Atanassov <mihail.atanassov@arm.com>
 >> Signed-off-by: Mihail Atanassov <mihail.atanassov@arm.com>
 >> Signed-off-by: Lukas Zapolskas <lukas.zapolskas@arm.com>
 >> ---
 >>   include/uapi/drm/panthor_drm.h | 487 +++++++++++++++++++++++++++++++++
 >>   1 file changed, 487 insertions(+)
 >>
 >> diff --git a/include/uapi/drm/panthor_drm.h 
b/include/uapi/drm/panthor_drm.h
 >> index 87c9cb555dd1..8a431431da6b 100644
 >> --- a/include/uapi/drm/panthor_drm.h
 >> +++ b/include/uapi/drm/panthor_drm.h
 >> @@ -127,6 +127,9 @@ enum drm_panthor_ioctl_id {
 >>
 >>   	/** @DRM_PANTHOR_TILER_HEAP_DESTROY: Destroy a tiler heap. */
 >>   	DRM_PANTHOR_TILER_HEAP_DESTROY,
 >> +
 >> +	/** @DRM_PANTHOR_PERF_CONTROL: Control a performance counter 
session. */
 >> +	DRM_PANTHOR_PERF_CONTROL,
 >>   };
 >>
 >>   /**
 >> @@ -170,6 +173,8 @@ enum drm_panthor_ioctl_id {
 >>   	DRM_IOCTL_PANTHOR(WR, TILER_HEAP_CREATE, tiler_heap_create)
 >>   #define DRM_IOCTL_PANTHOR_TILER_HEAP_DESTROY \
 >>   	DRM_IOCTL_PANTHOR(WR, TILER_HEAP_DESTROY, tiler_heap_destroy)
 >> +#define DRM_IOCTL_PANTHOR_PERF_CONTROL \
 >> +	DRM_IOCTL_PANTHOR(WR, PERF_CONTROL, perf_control)
 >>
 >>   /**
 >>    * DOC: IOCTL arguments
 >> @@ -268,6 +273,9 @@ enum drm_panthor_dev_query_type {
 >>   	 * @DRM_PANTHOR_DEV_QUERY_GROUP_PRIORITIES_INFO: Query allowed 
group priorities information.
 >>   	 */
 >>   	DRM_PANTHOR_DEV_QUERY_GROUP_PRIORITIES_INFO,
 >> +
 >> +	/** @DRM_PANTHOR_DEV_QUERY_PERF_INFO: Query performance counter 
interface information. */
 >> +	DRM_PANTHOR_DEV_QUERY_PERF_INFO,
 >>   };
 >>
 >>   /**
 >> @@ -421,6 +429,120 @@ struct drm_panthor_group_priorities_info {
 >>   	__u8 pad[3];
 >>   };
 >>
 >> +/**
 >> + * enum drm_panthor_perf_feat_flags - Performance counter 
configuration feature flags.
 >> + */
 >> +enum drm_panthor_perf_feat_flags {
 >> +	/** @DRM_PANTHOR_PERF_BLOCK_STATES_SUPPORT: Coarse-grained block 
states are supported. */
 >> +	DRM_PANTHOR_PERF_BLOCK_STATES_SUPPORT = 1 << 0,
 >> +};
 >> +
 >> +/**
 >> + * enum drm_panthor_perf_block_type - Performance counter supported 
block types.
 >> + */
 >> +enum drm_panthor_perf_block_type {
 >> +	/** @DRM_PANTHOR_PERF_BLOCK_FW: The FW counter block. */
 >> +	DRM_PANTHOR_PERF_BLOCK_FW = 1,
 >> +
 >> +	/** @DRM_PANTHOR_PERF_BLOCK_CSG: A CSG counter block. */
 >> +	DRM_PANTHOR_PERF_BLOCK_CSG,
 >> +
 >> +	/** @DRM_PANTHOR_PERF_BLOCK_CSHW: The CSHW counter block. */
 >> +	DRM_PANTHOR_PERF_BLOCK_CSHW,
 >> +
 >> +	/** @DRM_PANTHOR_PERF_BLOCK_TILER: The tiler counter block. */
 >> +	DRM_PANTHOR_PERF_BLOCK_TILER,
 >> +
 >> +	/** @DRM_PANTHOR_PERF_BLOCK_MEMSYS: A memsys counter block. */
 >> +	DRM_PANTHOR_PERF_BLOCK_MEMSYS,
 >> +
 >> +	/** @DRM_PANTHOR_PERF_BLOCK_SHADER: A shader core counter block. */
 >> +	DRM_PANTHOR_PERF_BLOCK_SHADER,
 >> +};
 >> +
 >> +/**
 >> + * enum drm_panthor_perf_clock - Identifier of the clock used to 
produce the cycle count values
 >> + * in a given block.
 >> + *
 >> + * Since the integrator has the choice of using one or more clocks, 
there may be some confusion
 >> + * as to which blocks are counted by which clock values unless this 
information is explicitly
 >> + * provided as part of every block sample. Not every single clock 
here can be used: in the simplest
 >> + * case, all cycle counts will be associated with the top-level clock.
 >> + */
 >> +enum drm_panthor_perf_clock {
 >> +	/** @DRM_PANTHOR_PERF_CLOCK_TOPLEVEL: Top-level CSF clock. */
 >> +	DRM_PANTHOR_PERF_CLOCK_TOPLEVEL,
 >> +
 >> +	/**
 >> +	 * @DRM_PANTHOR_PERF_CLOCK_COREGROUP: Core group clock, 
responsible for the MMU, L2
 >> +	 * caches and the tiler.
 >> +	 */
 >> +	DRM_PANTHOR_PERF_CLOCK_COREGROUP,
 >> +
 >> +	/** @DRM_PANTHOR_PERF_CLOCK_SHADER: Clock for the shader cores. */
 >> +	DRM_PANTHOR_PERF_CLOCK_SHADER,
 >> +};
 >> +
 >> +/**
 >> + * struct drm_panthor_perf_info - Performance counter interface 
information
 >> + *
 >> + * Structure grouping all queryable information relating to the 
performance counter
 >> + * interfaces.
 >> + */
 >> +struct drm_panthor_perf_info {
 >> +	/**
 >> +	 * @counters_per_block: The number of 8-byte counters available in 
a block.
 >> +	 */
 >> +	__u32 counters_per_block;
 >> +
 >> +	/**
 >> +	 * @sample_header_size: The size of the header struct available at 
the beginning
 >> +	 * of every sample.
 >> +	 */
 >> +	__u32 sample_header_size;
 >> +
 >> +	/**
 >> +	 * @block_header_size: The size of the header struct inline with 
the counters for a
 >> +	 * single block.
 >> +	 */
 >> +	__u32 block_header_size;
 >> +
 >> +	/** @flags: Combination of drm_panthor_perf_feat_flags flags. */
 >> +	__u32 flags;
 >> +
 >> +	/**
 >> +	 * @supported_clocks: Bitmask of the clocks supported by the GPU.
 >> +	 *
 >> +	 * Each bit represents a variant of the enum drm_panthor_perf_clock.
 >> +	 *
 >> +	 * For the same GPU, different implementers may have different 
clocks for the same hardware
 >> +	 * block. At the moment, up to four clocks are supported, and any 
clocks that are present
 >> +	 * will be reported here.
 >> +	 */
 >> +	__u32 supported_clocks;
 >> +
 >> +	/** @fw_blocks: Number of FW blocks available. */
 >> +	__u32 fw_blocks;
 >> +
 >> +	/** @csg_blocks: Number of CSG blocks available. */
 >> +	__u32 csg_blocks;
 >> +
 >> +	/** @cshw_blocks: Number of CSHW blocks available. */
 >> +	__u32 cshw_blocks;
 >> +
 >> +	/** @tiler_blocks: Number of tiler blocks available. */
 >> +	__u32 tiler_blocks;
 >> +
 >> +	/** @memsys_blocks: Number of memsys blocks available. */
 >> +	__u32 memsys_blocks;
 >> +
 >> +	/** @shader_blocks: Number of shader core blocks available. */
 >> +	__u32 shader_blocks;
 >> +
 >> +	/** @pad: MBZ. */
 >> +	__u32 pad;
 >> +};
 >> +
 >>   /**
 >>    * struct drm_panthor_dev_query - Arguments passed to 
DRM_PANTHOR_IOCTL_DEV_QUERY
 >>    */
 >> @@ -1010,6 +1132,371 @@ struct drm_panthor_tiler_heap_destroy {
 >>   	__u32 pad;
 >>   };
 >>
 >> +/**
 >> + * DOC: Performance counter decoding in userspace.
 >> + *
 >> + * Each sample will be exposed to userspace in the following manner:
 >> + *
 >> + * 
+--------+--------+------------------------+--------+-------------------------+-----+
 >> + * | Sample | Block  |        Block           | Block  | 
Block           | ... |
 >> + * | header | header |        counters        | header | 
counters        |     |
 >> + * 
+--------+--------+------------------------+--------+-------------------------+-----+
 >> + *
 >> + * Each sample will start with a sample header of type @struct 
drm_panthor_perf_sample header,
 >> + * providing sample-wide information like the start and end 
timestamps, the counter set currently
 >> + * configured, and any errors that may have occurred during sampling.
 >> + *
 >> + * After the fixed size header, the sample will consist of blocks of
 >> + * 64-bit @drm_panthor_dev_query_perf_info::counters_per_block 
counters, each prefaced with a
 >> + * header of its own, indicating source block type, as well as the 
cycle count needed to normalize
 >> + * cycle values within that block, and a clock source identifier.
 >> + */
 >
 > At first I was a bit confused about this header, because I could not 
find it anywhere in the spec.
 > Then I realised it's been devised specifically for user samples. Is 
it really impossible for
 > user space to be able to pick up these values from the FW sample 
itself, other than the
 > timestamp and cycles values? I think as of lately some of these can 
also be queried from UM.

That's right, this is done specifically for user samples. Going through
all of the fields, the only one that is easily obtainable from userspace
would be the timestamp_end_ns, since it will be present in the counter
header (after some adjustment).

Starting from the sample header, the flags are a catch-all if we want to
indicate to the user that something has gone wrong, and there is no
separate mechanism to report this.

The block set is technically known by userspace, but not reported
anywhere. Having it as part of the sample header allows for simpler
userspace parsing.

The user data is effectively a way of tagging a sample. Let's say you
make a manual session, then sample multiple times, and stop it. The user
data gives you a way of attaching some user-relevant data to each of
these samples in the ring buffer. This could be something like the frame 
number, or even a pointer to some userspace data. The kernel does not 
interpret this in any way.

Together, having the block set and the user data allows for context-less 
parsing: upon receiving a sample, you can read the header, obtain the 
tracking structure, and then walk the sample looking at each block 
header to gain full context of the data you are trying to read.

For the cycle counts, I haven't seen a good way of knowing how many
cycles elapsed over the sampling period from userspace, given that the
precise moment the sample was taken was controlled by the firmware, and 
devfreq may have kicked in several times while sampling.

Looking at the block header, the block type also facilitates doing more
context-less parsing. I see that you've commented on having the FW ring
buffer handling in the kernel on another patch, and the rationale is as 
follows: we have tried doing something similar with one of the earlier
performance counter interface in mali_kbase with the vinstr interface.
Implementing the userspace for it was very difficult, and the interface
was not at all flexible to changes in the arch spec, the underlying
hardware or the firmware.

However, for CSF, we have had three new counter blocks added to date. 
These are tied not to the HW revision but to the firmware. Having the 
kernel be in charge of interpreting what blocks are available and their 
offsets allow us to not break existing userspaces when this happens, 
along with decoupling userspace from the layout algorithm. There is no 
good way to indicate to userspace when the layout changes in 
non-backwards compatible ways.

Not to mention, the kernel already has to do some interpretation of the 
spec in order to do spec-compliant sampling: the PRFCNT_EN must be 
zeroed after a FW sample is consumed, so we need to know its offset. If 
the metadata block is present, the sample_reason must also be zeroed to 
signal the same. And finally, for the user to get a useful timestamp, we 
have to adjust the MCU timestamps to the GPU timestamps, so we need to 
know their offset too.

Minimal kernel interepretation of the data also allows us to do certain 
optimizations with the memory layout without having to change the 
underlying layout algorithm. We can ignore the effect of sparseness on 
the shader core counter layout, which on platforms like the Rock 5B can 
lead to significant memory savings.

And finally, perhaps the weakest argument is that we have multiple 
userspaces, and moving a lot of the complexity to userspace may cause 
more problems. We cannot regress libGPUCounters [1], which is used as 
part of external projects already, and having two independent 
implementations increases the risk of divergence.

Hopefully that clears up the major reasons for the uAPI design we have 
chosen here

[1]: https://github.com/ARM-software/libGPUCounters

 >> +/**
 >> + * enum drm_panthor_perf_block_state - Bitmask of the power and 
execution states that an individual
 >> + * hardware block went through in a sampling period.
 >> + *
 >> + * Because the sampling period is controlled from userspace, the 
block may undergo multiple
 >> + * state transitions, so this must be interpreted as one or more 
such transitions occurring.
 >> + */
 >> +enum drm_panthor_perf_block_state {
 >> +	/**
 >> +	 * @DRM_PANTHOR_PERF_BLOCK_STATE_UNKNOWN: The state of this block 
was unknown during
 >> +	 * the sampling period.
 >> +	 */
 >> +	DRM_PANTHOR_PERF_BLOCK_STATE_UNKNOWN = 0,
 >> +
 >> +	/**
 >> +	 * @DRM_PANTHOR_PERF_BLOCK_STATE_ON: This block was powered on for 
some or all of
 >> +	 * the sampling period.
 >> +	 */
 >> +	DRM_PANTHOR_PERF_BLOCK_STATE_ON = 1 << 0,
 >> +
 >> +	/**
 >> +	 * @DRM_PANTHOR_PERF_BLOCK_STATE_OFF: This block was powered off 
for some or all of the
 >> +	 * sampling period.
 >> +	 */
 >> +	DRM_PANTHOR_PERF_BLOCK_STATE_OFF = 1 << 1,
 >> +
 >> +	/**
 >> +	 * @DRM_PANTHOR_PERF_BLOCK_STATE_AVAILABLE: This block was 
available for execution for
 >> +	 * some or all of the sampling period.
 >> +	 */
 >> +	DRM_PANTHOR_PERF_BLOCK_STATE_AVAILABLE = 1 << 2,
 >> +	/**
 >> +	 * @DRM_PANTHOR_PERF_BLOCK_STATE_UNAVAILABLE: This block was 
unavailable for execution for
 >> +	 * some or all of the sampling period.
 >> +	 */
 >> +	DRM_PANTHOR_PERF_BLOCK_STATE_UNAVAILABLE = 1 << 3,
 >> +
 >> +	/**
 >> +	 * @DRM_PANTHOR_PERF_BLOCK_STATE_NORMAL: This block was executing 
in normal mode
 >> +	 * for some or all of the sampling period.
 >> +	 */
 >> +	DRM_PANTHOR_PERF_BLOCK_STATE_NORMAL = 1 << 4,
 >> +
 >> +	/**
 >> +	 * @DRM_PANTHOR_PERF_BLOCK_STATE_PROTECTED: This block was 
executing in protected mode
 >> +	 * for some or all of the sampling period.
 >> +	 */
 >> +	DRM_PANTHOR_PERF_BLOCK_STATE_PROTECTED = 1 << 5,
 >> +};
 >> +
 >> +/**
 >> + * struct drm_panthor_perf_block_header - Header present before 
every block in the
 >> + * sample ringbuffer.
 >> + */
 >> +struct drm_panthor_perf_block_header {
 >> +	/** @block_type: Type of the block. */
 >> +	__u8 block_type;
 >> +
 >> +	/** @block_idx: Block index. */
 >> +	__u8 block_idx;
 >> +
 >> +	/**
 >> +	 * @block_states: Coarse-grained block transitions, bitmask of enum
 >> +	 * drm_panthor_perf_block_states.
 >> +	 */
 >> +	__u8 block_states;
 >> +
 >> +	/**
 >> +	 * @clock: Clock used to produce the cycle count for this block, 
taken from
 >> +	 * enum drm_panthor_perf_clock. The cycle counts are stored in the 
sample header.
 >> +	 */
 >> +	__u8 clock;
 >> +
 >> +	/** @pad: MBZ. */
 >> +	__u8 pad[4];
 >> +
 >> +	/** @enable_mask: Bitmask of counters requested during the session 
setup. */
 >> +	__u64 enable_mask[2];
 >> +};
 >> +
 >> +/**
 >> + * enum drm_panthor_perf_sample_flags - Sample-wide events that 
occurred over the sampling
 >> + * period.
 >> + */
 >> +enum drm_panthor_perf_sample_flags {
 >> +	/**
 >> +	 * @DRM_PANTHOR_PERF_SAMPLE_OVERFLOW: This sample contains 
overflows due to the duration
 >> +	 * of the sampling period.
 >> +	 */
 >> +	DRM_PANTHOR_PERF_SAMPLE_OVERFLOW = 1 << 0,
 >> +
 >> +	/**
 >> +	 * @DRM_PANTHOR_PERF_SAMPLE_ERROR: This sample encountered an 
error condition during
 >> +	 * the sample duration.
 >> +	 */
 >> +	DRM_PANTHOR_PERF_SAMPLE_ERROR = 1 << 1,
 >> +};
 >> +
 >> +/**
 >> + * struct drm_panthor_perf_sample_header - Header present before 
every sample.
 >> + */
 >> +struct drm_panthor_perf_sample_header {
 >> +	/**
 >> +	 * @timestamp_start_ns: Earliest timestamp that values in this 
sample represent, in
 >> +	 * nanoseconds. Derived from CLOCK_MONOTONIC_RAW.
 >> +	 */
 >> +	__u64 timestamp_start_ns;
 >> +
 >> +	/**
 >> +	 * @timestamp_end_ns: Latest timestamp that values in this sample 
represent, in
 >> +	 * nanoseconds. Derived from CLOCK_MONOTONIC_RAW.
 >> +	 */
 >> +	__u64 timestamp_end_ns;
 >> +
 >> +	/** @block_set: Set of performance counter blocks. */
 >> +	__u8 block_set;
 >> +
 >> +	/** @pad: MBZ. */
 >> +	__u8 pad[3];
 >> +
 >> +	/** @flags: Current sample flags, combination of 
drm_panthor_perf_sample_flags. */
 >> +	__u32 flags;
 >> +
 >> +	/**
 >> +	 * @user_data: User data provided as part of the command that 
triggered this sample.
 >> +	 *
 >> +	 * - Automatic samples (periodic ones or those around non-counting 
periods or power state
 >> +	 * transitions) will be tagged with the user_data provided as part 
of the
 >> +	 * DRM_PANTHOR_PERF_COMMAND_START call.
 >> +	 * - Manual samples will be tagged with the user_data provided 
with the
 >> +	 * DRM_PANTHOR_PERF_COMMAND_SAMPLE call.
 >> +	 * - A session's final automatic sample will be tagged with the 
user_data provided with the
 >> +	 * DRM_PANTHOR_PERF_COMMAND_STOP call.
 >> +	 */
 >> +	__u64 user_data;
 >> +
 >> +	/**
 >> +	 * @toplevel_clock_cycles: The number of cycles elapsed between
 >> +	 * drm_panthor_perf_sample_header::timestamp_start_ns and
 >> +	 * drm_panthor_perf_sample_header::timestamp_end_ns on the 
top-level clock if the
 >> +	 * corresponding bit is set in 
drm_panthor_perf_info::supported_clocks.
 >> +	 */
 >> +	__u64 toplevel_clock_cycles;
 >> +
 >> +	/**
 >> +	 * @coregroup_clock_cycles: The number of cycles elapsed between
 >> +	 * drm_panthor_perf_sample_header::timestamp_start_ns and
 >> +	 * drm_panthor_perf_sample_header::timestamp_end_ns on the 
coregroup clock if the
 >> +	 * corresponding bit is set in 
drm_panthor_perf_info::supported_clocks.
 >> +	 */
 >> +	__u64 coregroup_clock_cycles;
 >> +
 >> +	/**
 >> +	 * @shader_clock_cycles: The number of cycles elapsed between
 >> +	 * drm_panthor_perf_sample_header::timestamp_start_ns and
 >> +	 * drm_panthor_perf_sample_header::timestamp_end_ns on the shader 
core clock if the
 >> +	 * corresponding bit is set in 
drm_panthor_perf_info::supported_clocks.
 >> +	 */
 >> +	__u64 shader_clock_cycles;
 >> +};
 >> +
 >> +/**
 >> + * enum drm_panthor_perf_command - Command type passed to the 
DRM_PANTHOR_PERF_CONTROL
 >> + * IOCTL.
 >> + */
 >> +enum drm_panthor_perf_command {
 >> +	/** @DRM_PANTHOR_PERF_COMMAND_SETUP: Create a new performance 
counter sampling context. */
 >> +	DRM_PANTHOR_PERF_COMMAND_SETUP,
 >> +
 >> +	/** @DRM_PANTHOR_PERF_COMMAND_TEARDOWN: Teardown a performance 
counter sampling context. */
 >> +	DRM_PANTHOR_PERF_COMMAND_TEARDOWN,
 >> +
 >> +	/** @DRM_PANTHOR_PERF_COMMAND_START: Start a sampling session on 
the indicated context. */
 >> +	DRM_PANTHOR_PERF_COMMAND_START,
 >> +
 >> +	/** @DRM_PANTHOR_PERF_COMMAND_STOP: Stop the sampling session on 
the indicated context. */
 >> +	DRM_PANTHOR_PERF_COMMAND_STOP,
 >> +
 >> +	/**
 >> +	 * @DRM_PANTHOR_PERF_COMMAND_SAMPLE: Request a manual sample on 
the indicated context.
 >> +	 *
 >> +	 * When the sampling session is configured with a non-zero 
sampling frequency, any
 >> +	 * DRM_PANTHOR_PERF_CONTROL calls with this command will be 
ignored and return an
 >> +	 * -EINVAL.
 >> +	 */
 >> +	DRM_PANTHOR_PERF_COMMAND_SAMPLE,
 >> +};
 >> +
 >> +/**
 >> + * struct drm_panthor_perf_control - Arguments passed to 
DRM_PANTHOR_IOCTL_PERF_CONTROL.
 >> + */
 >> +struct drm_panthor_perf_control {
 >> +	/** @cmd: Command from enum drm_panthor_perf_command. */
 >> +	__u32 cmd;
 >> +
 >> +	/**
 >> +	 * @handle: session handle.
 >> +	 *
 >> +	 * Returned by the DRM_PANTHOR_PERF_COMMAND_SETUP call.
 >> +	 * It must be used in subsequent commands for the same context.
 >> +	 */
 >> +	__u32 handle;
 >> +
 >> +	/**
 >> +	 * @size: size of the command structure.
 >> +	 *
 >> +	 * If the pointer is NULL, the size is updated by the driver to 
provide the size of the
 >> +	 * output structure. If the pointer is not NULL, the driver will 
only copy min(size,
 >> +	 * struct_size) to the pointer and update the size accordingly.
 >> +	 */
 >> +	__u64 size;
 >> +
 >> +	/** @pointer: user pointer to a command type struct. */
 >> +	__u64 pointer;
 >> +};
 >> +
 >> +
 >> +/**
 >> + * struct drm_panthor_perf_cmd_setup - Arguments passed to 
DRM_PANTHOR_IOCTL_PERF_CONTROL
 >> + * when the DRM_PANTHOR_PERF_COMMAND_SETUP command is specified.
 >> + */
 >> +struct drm_panthor_perf_cmd_setup {
 >> +	/**
 >> +	 * @block_set: Set of performance counter blocks.
 >> +	 *
 >> +	 * This is a global configuration and only one set can be active 
at a time. If
 >> +	 * another client has already requested a counter set, any further 
requests
 >> +	 * for a different counter set will fail and return an -EBUSY.
 >> +	 *
 >> +	 * If the requested set does not exist, the request will fail and 
return an -EINVAL.
 >> +	 */
 >> +	__u8 block_set;
 >
 > How do we know for a given hardware model, what block sets it 
supports? When I wrote the
 > implementation of perfcnt for Panthor we're using at Collabora right 
now, that was a question
 > I could never find an answer for in the spec.

Unfortunately, there is no good way to query it from the hardware, and 
it's not considered a breaking change to add or remove support for a set 
of a particular block between GPU revisions. My intent will be to 
hardcode a data table into panthor_perf.c eventually, based on the GPU ID.

 >
 >> +	/** @pad: MBZ. */
 >> +	__u8 pad[7];
 >> +
 >> +	/** @fd: eventfd for signalling the availability of a new sample. */
 >> +	__u32 fd;
 >> +
 >> +	/** @ringbuf_handle: Handle to the BO to write perf counter sample 
to. */
 >> +	__u32 ringbuf_handle;
 >
 > If UM is in charge of creating this BO, how would it know how big it 
should be? I suppose this
 > would be conveyed by perf info returned by panthor_ioctl_dev_query().

That is the intent, yes. Currently this is the only "implicit" part of 
the interface, where we expect the user to calculate the sample size on 
their own, based on the data passed in dev_query. I'm wondering if it 
wouldn't be simpler to, alongside the current dev_query fields, to also 
provide the user with the sample size and the block size directly.

 >> +	/**
 >> +	 * @control_handle: Handle to the BO containing a contiguous 16 
byte range, used for the
 >> +	 * insert and extract indices for the ringbuffer.
 >> +	 */
 >> +	__u32 control_handle;
 >> +
 >> +	/**
 >> +	 * @sample_slots: The number of slots available in the 
userspace-provided BO. Must be
 >> +	 * a power of 2.
 >> +	 *
 >> +	 * If sample_slots * sample_size does not match the BO size, the 
setup request will fail.
 >> +	 */
 >> +	__u32 sample_slots;
 >
 > Does that mean that the number of user bo slots can be different than 
the kernel ringbuffer one?
 >
That's correct. I don't think there is a compelling reason to require 
both of them to be the same. Userspace could for instance decide to do a 
lot of samples, once a frame, periodically or tied to some relevant 
event, and interpret all of the counters in one go at the end. That way 
it does not matter how big/small the kernel to FW ring buffer is, just 
how much space the user chose to allocate for their BO.

 >> +
 >> +	/**
 >> +	 * @control_offset: Offset into the control BO where the insert 
and extract indices are
 >> +	 * located.
 >> +	 */
 >> +	__u64 control_offset;
 >> +
 >> +	/**
 >> +	 * @sample_freq_ns: Period between automatic counter sample 
collection in nanoseconds. Zero
 >> +	 * disables automatic collection and all collection must be done 
through explicit calls
 >> +	 * to DRM_PANTHOR_PERF_CONTROL.SAMPLE. Non-zero values will 
disable manual counter sampling
 >> +	 * via the DRM_PANTHOR_PERF_COMMAND_SAMPLE command.
 >> +	 *
 >> +	 * This disables software-triggered periodic sampling, but 
hardware will still trigger
 >> +	 * automatic samples on certain events, including shader core 
power transitions, and
 >> +	 * entries to and exits from non-counting periods. The final stop 
command will also
 >> +	 * trigger a sample to ensure no data is lost.
 >> +	 */
 >> +	__u64 sample_freq_ns;
 >> +
 >> +	/**
 >> +	 * @fw_enable_mask: Bitmask of counters to request from the FW 
counter block. Any bits
 >> +	 * past the first drm_panthor_perf_info.counters_per_block bits 
will be ignored.
 >> +	 */
 >> +	__u64 fw_enable_mask[2];
 >> +
 >> +	/**
 >> +	 * @csg_enable_mask: Bitmask of counters to request from the CSG 
counter blocks. Any bits
 >> +	 * past the first drm_panthor_perf_info.counters_per_block bits 
will be ignored.
 >> +	 */
 >> +	__u64 csg_enable_mask[2];
 >> +
 >> +	/**
 >> +	 * @cshw_enable_mask: Bitmask of counters to request from the CSHW 
counter block. Any bits
 >> +	 * past the first drm_panthor_perf_info.counters_per_block bits 
will be ignored.
 >> +	 */
 >> +	__u64 cshw_enable_mask[2];
 >> +
 >> +	/**
 >> +	 * @tiler_enable_mask: Bitmask of counters to request from the 
tiler counter block. Any
 >> +	 * bits past the first drm_panthor_perf_info.counters_per_block 
bits will be ignored.
 >> +	 */
 >> +	__u64 tiler_enable_mask[2];
 >> +
 >> +	/**
 >> +	 * @memsys_enable_mask: Bitmask of counters to request from the 
memsys counter blocks. Any
 >> +	 * bits past the first drm_panthor_perf_info.counters_per_block 
bits will be ignored.
 >> +	 */
 >> +	__u64 memsys_enable_mask[2];
 >> +
 >> +	/**
 >> +	 * @shader_enable_mask: Bitmask of counters to request from the 
shader core counter blocks.
 >> +	 * Any bits past the first 
drm_panthor_perf_info.counters_per_block bits will be ignored.
 >> +	 */
 >> +	__u64 shader_enable_mask[2];
 >> +};
 >> +
 >> +/**
 >> + * struct drm_panthor_perf_cmd_start - Arguments passed to 
DRM_PANTHOR_IOCTL_PERF_CONTROL
 >> + * when the DRM_PANTHOR_PERF_COMMAND_START command is specified.
 >> + */
 >> +struct drm_panthor_perf_cmd_start {
 >> +	/**
 >> +	 * @user_data: User provided data that will be attached to 
automatic samples collected
 >> +	 * until the next DRM_PANTHOR_PERF_COMMAND_STOP.
 >> +	 */
 >> +	__u64 user_data;
 >
 > What is this user data pointer being used for in the samples? What 
kind of information would
 > it normally add by having it written into the user samples?

Hopefully I have answered this above, but to re-iterate: it gives 
context as to what sample is associated with what request. If one 
chooses the model of interpreting all of the performance data at the end 
of the content run, then this allows the user to disambiguate all of the 
samples. This eases userspace parsing by not requiring the user to keep 
track of how many samples they have consumed and what event corresponds 
to which sample request.

 >> +};
 >> +
 >> +/**
 >> + * struct drm_panthor_perf_cmd_stop - Arguments passed to 
DRM_PANTHOR_IOCTL_PERF_CONTROL
 >> + * when the DRM_PANTHOR_PERF_COMMAND_STOP command is specified.
 >> + */
 >> +struct drm_panthor_perf_cmd_stop {
 >> +	/**
 >> +	 * @user_data: User provided data that will be attached to the 
automatic sample collected
 >> +	 * at the end of this sampling session.
 >> +	 */
 >> +	__u64 user_data;
 >> +};
 >> +
 >> +/**
 >> + * struct drm_panthor_perf_cmd_sample - Arguments passed to 
DRM_PANTHOR_IOCTL_PERF_CONTROL
 >> + * when the DRM_PANTHOR_PERF_COMMAND_SAMPLE command is specified.
 >> + */
 >> +struct drm_panthor_perf_cmd_sample {
 >> +	/** @user_data: User provided data that will be attached to the 
sample.*/
 >> +	__u64 user_data;
 >> +};
 >> +
 >>   #if defined(__cplusplus)
 >>   }
 >>   #endif
 >> --
 >> 2.25.1
 >
 > Adrian Larumbe

Kind regards,
Lukas Zapolskas

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC v2 3/8] drm/panthor: Add panthor_perf_init and panthor_perf_unplug
  2025-01-27 12:46   ` Adrián Larumbe
@ 2025-03-26 14:36     ` Lukas Zapolskas
  0 siblings, 0 replies; 28+ messages in thread
From: Lukas Zapolskas @ 2025-03-26 14:36 UTC (permalink / raw)
  To: Adrián Larumbe
  Cc: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	dri-devel, linux-kernel, Mihail Atanassov, nd


(please excuse the line-wrapping on the previous email,
it seems that it was soft-wrapped).

On 27/01/2025 12:46, Adrián Larumbe wrote:
> On 11.12.2024 16:50, Lukas Zapolskas wrote:
>> Added the panthor_perf system initialization and unplug code to allow
>> for the handling of userspace sessions to be added in follow-up patches.
>>
>> Signed-off-by: Lukas Zapolskas <lukas.zapolskas@arm.com>
>> ---
>>   drivers/gpu/drm/panthor/panthor_device.c |  7 +++
>>   drivers/gpu/drm/panthor/panthor_device.h |  5 +-
>>   drivers/gpu/drm/panthor/panthor_perf.c   | 77 ++++++++++++++++++++++++
>>   drivers/gpu/drm/panthor/panthor_perf.h   |  3 +
>>   4 files changed, 91 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
>> index 00f7b8ce935a..1a81a436143b 100644
>> --- a/drivers/gpu/drm/panthor/panthor_device.c
>> +++ b/drivers/gpu/drm/panthor/panthor_device.c
>> @@ -19,6 +19,7 @@
>>   #include "panthor_fw.h"
>>   #include "panthor_gpu.h"
>>   #include "panthor_mmu.h"
>> +#include "panthor_perf.h"
>>   #include "panthor_regs.h"
>>   #include "panthor_sched.h"
>>   
>> @@ -97,6 +98,7 @@ void panthor_device_unplug(struct panthor_device *ptdev)
>>   	/* Now, try to cleanly shutdown the GPU before the device resources
>>   	 * get reclaimed.
>>   	 */
>> +	panthor_perf_unplug(ptdev);
>>   	panthor_sched_unplug(ptdev);
>>   	panthor_fw_unplug(ptdev);
>>   	panthor_mmu_unplug(ptdev);
>> @@ -262,6 +264,10 @@ int panthor_device_init(struct panthor_device *ptdev)
>>   	if (ret)
>>   		goto err_unplug_fw;
>>   
>> +	ret = panthor_perf_init(ptdev);
>> +	if (ret)
>> +		goto err_unplug_fw;
>> +
>>   	/* ~3 frames */
>>   	pm_runtime_set_autosuspend_delay(ptdev->base.dev, 50);
>>   	pm_runtime_use_autosuspend(ptdev->base.dev);
>> @@ -275,6 +281,7 @@ int panthor_device_init(struct panthor_device *ptdev)
>>   
>>   err_disable_autosuspend:
>>   	pm_runtime_dont_use_autosuspend(ptdev->base.dev);
>> +	panthor_perf_unplug(ptdev);
>>   	panthor_sched_unplug(ptdev);
>>   
>>   err_unplug_fw:
>> diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
>> index 636542c1dcbd..aca33d03036c 100644
>> --- a/drivers/gpu/drm/panthor/panthor_device.h
>> +++ b/drivers/gpu/drm/panthor/panthor_device.h
>> @@ -26,7 +26,7 @@ struct panthor_heap_pool;
>>   struct panthor_job;
>>   struct panthor_mmu;
>>   struct panthor_fw;
>> -struct panthor_perfcnt;
>> +struct panthor_perf;
>>   struct panthor_vm;
>>   struct panthor_vm_pool;
>>   
>> @@ -137,6 +137,9 @@ struct panthor_device {
>>   	/** @devfreq: Device frequency scaling management data. */
>>   	struct panthor_devfreq *devfreq;
>>   
>> +	/** @perf: Performance counter management data. */
>> +	struct panthor_perf *perf;
>> +
>>   	/** @unplug: Device unplug related fields. */
>>   	struct {
>>   		/** @lock: Lock used to serialize unplug operations. */
>> diff --git a/drivers/gpu/drm/panthor/panthor_perf.c b/drivers/gpu/drm/panthor/panthor_perf.c
>> index 0e3d769c1805..e0dc6c4b0cf1 100644
>> --- a/drivers/gpu/drm/panthor/panthor_perf.c
>> +++ b/drivers/gpu/drm/panthor/panthor_perf.c
>> @@ -13,6 +13,24 @@
>>   #include "panthor_perf.h"
>>   #include "panthor_regs.h"
>>   
>> +struct panthor_perf {
>> +	/**
>> +	 * @block_set: The global counter set configured onto the HW.
>> +	 */
>> +	u8 block_set;
> 
> I think this field is not used in any further patches. Only in the sampler
> struct definition later on you include the same field and assign it from
> the ioctl setup arguments.

Will have to correct that, it should be used for the FW programming and
checking whether a session creation request can be satisfied.

> 
>> +	/** @next_session: The ID of the next session. */
>> +	u32 next_session;
>> +
>> +	/** @session_range: The number of sessions supported at a time. */
>> +	struct xa_limit session_range;
>> +
>> +	/**
>> +	 * @sessions: Global map of sessions, accessed by their ID.
>> +	 */
>> +	struct xarray sessions;
>> +};
>> +
>>   /**
>>    * PANTHOR_PERF_COUNTERS_PER_BLOCK - On CSF architectures pre-11.x, the number of counters
>>    * per block was hardcoded to be 64. Arch 11.0 onwards supports the PRFCNT_FEATURES GPU register,
>> @@ -45,3 +63,62 @@ void panthor_perf_info_init(struct panthor_device *ptdev)
>>   	perf_info->shader_blocks = hweight64(ptdev->gpu_info.shader_present);
>>   }
>>   
>> +/**
>> + * panthor_perf_init - Initialize the performance counter subsystem.
>> + * @ptdev: Panthor device
>> + *
>> + * The performance counters require the FW interface to be available to setup the
>> + * sampling ringbuffers, so this must be called only after FW is initialized.
>> + *
>> + * Return: 0 on success, negative error code on failure.
>> + */
>> +int panthor_perf_init(struct panthor_device *ptdev)
>> +{
>> +	struct panthor_perf *perf;
>> +
>> +	if (!ptdev)
>> +		return -EINVAL;
>> +
>> +	perf = devm_kzalloc(ptdev->base.dev, sizeof(*perf), GFP_KERNEL);
>> +	if (ZERO_OR_NULL_PTR(perf))
>> +		return -ENOMEM;
>> +
>> +	xa_init_flags(&perf->sessions, XA_FLAGS_ALLOC);
>> +
>> +	/* Currently, we only support a single session at a time. */
>> +	perf->session_range = (struct xa_limit) {
>> +		.min = 0,
>> +		.max = 1,
>> +	};
> 
> I guess at the moment we only allow a single session because periodic sampling
> isn't yet implemented. Does that mean multisession support will not be made
> available for manual samplers in the future?

The RFC was intended to purely give a functional implementation without
periodic or multi-client sampling. Multi-client will be available for
both periodic and manual sessions, once periodic is implemented.

>> +
>> +	drm_info(&ptdev->base, "Performance counter subsystem initialized");
>> +
>> +	ptdev->perf = perf;
>> +
>> +	return 0;
>> +}
>> +
>> +/**
>> + * panthor_perf_unplug - Terminate the performance counter subsystem.
>> + * @ptdev: Panthor device.
>> + *
>> + * This function will terminate the performance counter control structures and any remaining
>> + * sessions, after waiting for any pending interrupts.
>> + */
>> +void panthor_perf_unplug(struct panthor_device *ptdev)
>> +{
>> +	struct panthor_perf *perf = ptdev->perf;
>> +
>> +	if (!perf)
>> +		return;
>> +
>> +	if (!xa_empty(&perf->sessions))
>> +		drm_err(&ptdev->base,
>> +				"Performance counter sessions active when unplugging the driver!");
>> +
>> +	xa_destroy(&perf->sessions);
>> +
>> +	devm_kfree(ptdev->base.dev, ptdev->perf);
> 
> If we always call devm_kfree, then what is the point of allocating ptdev->perf
> with devm_kzalloc?

You're right, it's not necessary. The panthor_perf object could in
theory be allocated with devm, since the lifetime matches that of the
panthor_device, but it's simpler to drop the devm allocations.

> 
>> +	ptdev->perf = NULL;
>> +}
>> diff --git a/drivers/gpu/drm/panthor/panthor_perf.h b/drivers/gpu/drm/panthor/panthor_perf.h
>> index cff537a370c9..90af8b18358c 100644
>> --- a/drivers/gpu/drm/panthor/panthor_perf.h
>> +++ b/drivers/gpu/drm/panthor/panthor_perf.h
>> @@ -9,4 +9,7 @@ struct panthor_device;
>>   
>>   void panthor_perf_info_init(struct panthor_device *ptdev);
>>   
>> +int panthor_perf_init(struct panthor_device *ptdev);
>> +void panthor_perf_unplug(struct panthor_device *ptdev);
>> +
>>   #endif /* __PANTHOR_PERF_H__ */
>> -- 
>> 2.25.1
> 
> 
> Adrian Larumbe


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC v2 4/8] drm/panthor: Add panthor perf ioctls
  2025-01-27 14:06   ` Adrián Larumbe
@ 2025-03-26 14:40     ` Lukas Zapolskas
  0 siblings, 0 replies; 28+ messages in thread
From: Lukas Zapolskas @ 2025-03-26 14:40 UTC (permalink / raw)
  To: Adrián Larumbe
  Cc: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	dri-devel, linux-kernel, Mihail Atanassov, nd



On 27/01/2025 14:06, Adrián Larumbe wrote:
> On 11.12.2024 16:50, Lukas Zapolskas wrote:
>> This patch implements the PANTHOR_PERF_CONTROL ioctl series, and
>> a PANTHOR_GET_UOBJ wrapper to deal with the backwards and forwards
>> compatibility of the uAPI.
>>
>> Stub function definitions are added to ensure the patch builds on its own,
>> and will be removed later in the series.
>>
>> Signed-off-by: Lukas Zapolskas <lukas.zapolskas@arm.com>
>> ---
>>   drivers/gpu/drm/panthor/panthor_drv.c  | 155 ++++++++++++++++++++++++-
>>   drivers/gpu/drm/panthor/panthor_perf.c |  34 ++++++
>>   drivers/gpu/drm/panthor/panthor_perf.h |  19 +++
>>   3 files changed, 206 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
>> index e0ac3107c69e..458175f58b15 100644
>> --- a/drivers/gpu/drm/panthor/panthor_drv.c
>> +++ b/drivers/gpu/drm/panthor/panthor_drv.c
>> @@ -7,6 +7,7 @@
>>   #include <asm/arch_timer.h>
>>   #endif
>>   
>> +#include <linux/cleanup.h>
>>   #include <linux/list.h>
>>   #include <linux/module.h>
>>   #include <linux/of_platform.h>
>> @@ -31,6 +32,7 @@
>>   #include "panthor_gpu.h"
>>   #include "panthor_heap.h"
>>   #include "panthor_mmu.h"
>> +#include "panthor_perf.h"
>>   #include "panthor_regs.h"
>>   #include "panthor_sched.h"
>>   
>> @@ -73,6 +75,39 @@ panthor_set_uobj(u64 usr_ptr, u32 usr_size, u32 min_size, u32 kern_size, const v
>>   	return 0;
>>   }
>>   
>> +/**
>> + * panthor_get_uobj() - Copy kernel object to user object.
>> + * @usr_ptr: Users pointer.
>> + * @usr_size: Size of the user object.
>> + * @min_size: Minimum size for this object.
>> + *
>> + * Helper automating kernel -> user object copies.
>> + *
>> + * Don't use this function directly, use PANTHOR_UOBJ_GET() instead.
>> + *
>> + * Return: valid pointer on success, an encoded error code otherwise.
>> + */
>> +static void*
>> +panthor_get_uobj(u64 usr_ptr, u32 usr_size, u32 min_size)
>> +{
>> +	int ret;
>> +	void *out_alloc __free(kvfree) = NULL;
>> +
>> +	/* User size shouldn't be smaller than the minimal object size. */
>> +	if (usr_size < min_size)
>> +		return ERR_PTR(-EINVAL);
>> +
>> +	out_alloc = kvmalloc(min_size, GFP_KERNEL);
>> +	if (!out_alloc)
>> +		return ERR_PTR(-ENOMEM);
>> +
>> +	ret = copy_struct_from_user(out_alloc, min_size, u64_to_user_ptr(usr_ptr), usr_size);
>> +	if (ret)
>> +		return ERR_PTR(ret);
>> +
>> +	return_ptr(out_alloc);
>> +}
>> +
>>   /**
>>    * panthor_get_uobj_array() - Copy a user object array into a kernel accessible object array.
>>    * @in: The object array to copy.
>> @@ -176,8 +211,11 @@ panthor_get_uobj_array(const struct drm_panthor_obj_array *in, u32 min_stride,
>>   		 PANTHOR_UOBJ_DECL(struct drm_panthor_queue_submit, syncs), \
>>   		 PANTHOR_UOBJ_DECL(struct drm_panthor_queue_create, ringbuf_size), \
>>   		 PANTHOR_UOBJ_DECL(struct drm_panthor_vm_bind_op, syncs), \
>> -		 PANTHOR_UOBJ_DECL(struct drm_panthor_perf_info, shader_blocks))
>> -
>> +		 PANTHOR_UOBJ_DECL(struct drm_panthor_perf_info, shader_blocks), \
>> +		 PANTHOR_UOBJ_DECL(struct drm_panthor_perf_cmd_setup, shader_enable_mask), \
>> +		 PANTHOR_UOBJ_DECL(struct drm_panthor_perf_cmd_start, user_data), \
>> +		 PANTHOR_UOBJ_DECL(struct drm_panthor_perf_cmd_stop, user_data), \
>> +		 PANTHOR_UOBJ_DECL(struct drm_panthor_perf_cmd_sample, user_data))
>>   
>>   /**
>>    * PANTHOR_UOBJ_SET() - Copy a kernel object to a user object.
>> @@ -192,6 +230,24 @@ panthor_get_uobj_array(const struct drm_panthor_obj_array *in, u32 min_stride,
>>   			 PANTHOR_UOBJ_MIN_SIZE(_src_obj), \
>>   			 sizeof(_src_obj), &(_src_obj))
>>   
>> +/**
>> + * PANTHOR_UOBJ_GET() - Copies a user object from _usr_ptr to a kernel accessible _dest_ptr.
>> + * @_dest_ptr: Local varialbe
>> + * @_usr_size: Size of the user object.
>> + * @_usr_ptr: The pointer of the object in userspace.
>> + *
>> + * Return: Error code. See panthor_get_uobj().
>> + */
>> +#define PANTHOR_UOBJ_GET(_dest_ptr, _usr_size, _usr_ptr) \
>> +	({ \
>> +		typeof(_dest_ptr) _tmp; \
>> +		_tmp = panthor_get_uobj(_usr_ptr, _usr_size, \
>> +				PANTHOR_UOBJ_MIN_SIZE(_tmp[0])); \
>> +		if (!IS_ERR(_tmp)) \
>> +			_dest_ptr = _tmp; \
>> +		PTR_ERR_OR_ZERO(_tmp); \
>> +	})
>> +
>>   /**
>>    * PANTHOR_UOBJ_GET_ARRAY() - Copy a user object array to a kernel accessible
>>    * object array.
>> @@ -1339,6 +1395,99 @@ static int panthor_ioctl_vm_get_state(struct drm_device *ddev, void *data,
>>   	return 0;
>>   }
>>   
>> +static int panthor_ioctl_perf_control(struct drm_device *ddev, void *data,
>> +		struct drm_file *file)
>> +{
>> +	struct panthor_device *ptdev = container_of(ddev, struct panthor_device, base);
>> +	struct panthor_file *pfile = file->driver_priv;
>> +	struct drm_panthor_perf_control *args = data;
>> +	int ret;
>> +
>> +	if (!args->pointer) {
>> +		switch (args->cmd) {
>> +		case DRM_PANTHOR_PERF_COMMAND_SETUP:
>> +			args->size = sizeof(struct drm_panthor_perf_cmd_setup);
>> +			return 0;
>> +
>> +		case DRM_PANTHOR_PERF_COMMAND_TEARDOWN:
>> +			args->size = 0;
>> +			return 0;
>> +
>> +		case DRM_PANTHOR_PERF_COMMAND_START:
>> +			args->size = sizeof(struct drm_panthor_perf_cmd_start);
>> +			return 0;
>> +
>> +		case DRM_PANTHOR_PERF_COMMAND_STOP:
>> +			args->size = sizeof(struct drm_panthor_perf_cmd_stop);
>> +			return 0;
>> +
>> +		case DRM_PANTHOR_PERF_COMMAND_SAMPLE:
>> +			args->size = sizeof(struct drm_panthor_perf_cmd_sample);
>> +			return 0;
>> +
>> +		default:
>> +			return -EINVAL;
>> +		}
>> +	}
>> +
>> +	switch (args->cmd) {
>> +	case DRM_PANTHOR_PERF_COMMAND_SETUP:
>> +	{
>> +		struct drm_panthor_perf_cmd_setup *setup_args __free(kvfree) = NULL;
>> +
>> +		ret = PANTHOR_UOBJ_GET(setup_args, args->size, args->pointer);
>> +		if (ret)
>> +			return -EINVAL;
>> +
>> +		if (setup_args->pad[0])
>> +			return -EINVAL;
>> +
>> +		ret = panthor_perf_session_setup(ptdev, ptdev->perf, setup_args, pfile);
> 
> Shouldn't we return the session id as an output param in setup_args or is the
> ioctl's return value enough for this?

Returning it via the ioctl return value has worked for a single session,
but it may have an impact when handling multiple sessions, not sure. Is
there anything done by DRM to normalize the return values from ioctls? I
could not find anything.

> 
>> +
>> +		return ret;
>> +	}
>> +	case DRM_PANTHOR_PERF_COMMAND_TEARDOWN:
>> +	{
>> +		return panthor_perf_session_teardown(pfile, ptdev->perf, args->handle);
>> +	}
>> +	case DRM_PANTHOR_PERF_COMMAND_START:
>> +	{
>> +		struct drm_panthor_perf_cmd_start *start_args __free(kvfree) = NULL;
>> +
>> +		ret = PANTHOR_UOBJ_GET(start_args, args->size, args->pointer);
>> +		if (ret)
>> +			return -EINVAL;
>> +
>> +		return panthor_perf_session_start(pfile, ptdev->perf, args->handle,
>> +				start_args->user_data);
>> +	}
>> +	case DRM_PANTHOR_PERF_COMMAND_STOP:
>> +	{
>> +		struct drm_panthor_perf_cmd_stop *stop_args __free(kvfree) = NULL;
>> +
>> +		ret = PANTHOR_UOBJ_GET(stop_args, args->size, args->pointer);
>> +		if (ret)
>> +			return -EINVAL;
>> +
>> +		return panthor_perf_session_stop(pfile, ptdev->perf, args->handle,
>> +				stop_args->user_data);
>> +	}
>> +	case DRM_PANTHOR_PERF_COMMAND_SAMPLE:
>> +	{
>> +		struct drm_panthor_perf_cmd_sample *sample_args __free(kvfree) = NULL;
>> +
>> +		ret = PANTHOR_UOBJ_GET(sample_args, args->size, args->pointer);
>> +		if (ret)
>> +			return -EINVAL;
>> +
>> +		return panthor_perf_session_sample(pfile, ptdev->perf, args->handle,
>> +					sample_args->user_data);
>> +	}
> 
> For the three cases above, you could define a macro like:
> 
> #define perf_cmd(command)							\
> 	({								\
> 		struct drm_panthor_perf_cmd_##command * command##_args __free(kvfree) = NULL; \
> 									\
> 		ret = PANTHOR_UOBJ_GET(command##_args, args->size, args->pointer); \
> 		if (ret)						\
> 			return -EINVAL;					\
> 		return panthor_perf_session_##command(pfile, ptdev->perf, args->handle, command##_args->user_data); \
> 	})
> 
> 	and then do 'perf_cmd(command);' inside each one of them
> 
>> +	default:
>> +		return -EINVAL;
>> +	}
>> +}
>> +
>>   static int
>>   panthor_open(struct drm_device *ddev, struct drm_file *file)
>>   {
>> @@ -1386,6 +1535,7 @@ panthor_postclose(struct drm_device *ddev, struct drm_file *file)
>>   
>>   	panthor_group_pool_destroy(pfile);
>>   	panthor_vm_pool_destroy(pfile);
>> +	panthor_perf_session_destroy(pfile, pfile->ptdev->perf);
> 
> I would perhaps do this first because pools are first created during file
> opening, just to undo things in the opposite sequence.
>>   
>>   	kfree(pfile);
>>   	module_put(THIS_MODULE);
>> @@ -1408,6 +1558,7 @@ static const struct drm_ioctl_desc panthor_drm_driver_ioctls[] = {
>>   	PANTHOR_IOCTL(TILER_HEAP_CREATE, tiler_heap_create, DRM_RENDER_ALLOW),
>>   	PANTHOR_IOCTL(TILER_HEAP_DESTROY, tiler_heap_destroy, DRM_RENDER_ALLOW),
>>   	PANTHOR_IOCTL(GROUP_SUBMIT, group_submit, DRM_RENDER_ALLOW),
>> +	PANTHOR_IOCTL(PERF_CONTROL, perf_control, DRM_RENDER_ALLOW),
>>   };
>>   
>>   static int panthor_mmap(struct file *filp, struct vm_area_struct *vma)
>> diff --git a/drivers/gpu/drm/panthor/panthor_perf.c b/drivers/gpu/drm/panthor/panthor_perf.c
>> index e0dc6c4b0cf1..6498279ec036 100644
>> --- a/drivers/gpu/drm/panthor/panthor_perf.c
>> +++ b/drivers/gpu/drm/panthor/panthor_perf.c
>> @@ -63,6 +63,40 @@ void panthor_perf_info_init(struct panthor_device *ptdev)
>>   	perf_info->shader_blocks = hweight64(ptdev->gpu_info.shader_present);
>>   }
>>   
>> +int panthor_perf_session_setup(struct panthor_device *ptdev, struct panthor_perf *perf,
>> +		struct drm_panthor_perf_cmd_setup *setup_args,
>> +		struct panthor_file *pfile)
>> +{
>> +	return -EOPNOTSUPP;
>> +}
>> +
>> +int panthor_perf_session_teardown(struct panthor_file *pfile, struct panthor_perf *perf,
>> +		u32 sid)
>> +{
>> +	return -EOPNOTSUPP;
>> +}
>> +
>> +int panthor_perf_session_start(struct panthor_file *pfile, struct panthor_perf *perf,
>> +		u32 sid, u64 user_data)
>> +{
>> +	return -EOPNOTSUPP;
>> +}
>> +
>> +int panthor_perf_session_stop(struct panthor_file *pfile, struct panthor_perf *perf,
>> +		u32 sid, u64 user_data)
>> +{
>> +		return -EOPNOTSUPP;
>> +}
>> +
>> +int panthor_perf_session_sample(struct panthor_file *pfile, struct panthor_perf *perf,
>> +		u32 sid, u64 user_data)
>> +{
>> +	return -EOPNOTSUPP;
>> +
>> +}
>> +
>> +void panthor_perf_session_destroy(struct panthor_file *pfile, struct panthor_perf *perf) { }
>> +
>>   /**
>>    * panthor_perf_init - Initialize the performance counter subsystem.
>>    * @ptdev: Panthor device
>> diff --git a/drivers/gpu/drm/panthor/panthor_perf.h b/drivers/gpu/drm/panthor/panthor_perf.h
>> index 90af8b18358c..bfef8874068b 100644
>> --- a/drivers/gpu/drm/panthor/panthor_perf.h
>> +++ b/drivers/gpu/drm/panthor/panthor_perf.h
>> @@ -5,11 +5,30 @@
>>   #ifndef __PANTHOR_PERF_H__
>>   #define __PANTHOR_PERF_H__
>>   
>> +#include <linux/types.h>
>> +
>> +struct drm_gem_object;
>> +struct drm_panthor_perf_cmd_setup;
>>   struct panthor_device;
>> +struct panthor_file;
>> +struct panthor_perf;
>>   
>>   void panthor_perf_info_init(struct panthor_device *ptdev);
>>   
>>   int panthor_perf_init(struct panthor_device *ptdev);
>>   void panthor_perf_unplug(struct panthor_device *ptdev);
>>   
>> +int panthor_perf_session_setup(struct panthor_device *ptdev, struct panthor_perf *perf,
>> +		struct drm_panthor_perf_cmd_setup *setup_args,
>> +		struct panthor_file *pfile);
>> +int panthor_perf_session_teardown(struct panthor_file *pfile, struct panthor_perf *perf,
>> +		u32 sid);
>> +int panthor_perf_session_start(struct panthor_file *pfile, struct panthor_perf *perf,
>> +		u32 sid, u64 user_data);
>> +int panthor_perf_session_stop(struct panthor_file *pfile, struct panthor_perf *perf,
>> +		u32 sid, u64 user_data);
>> +int panthor_perf_session_sample(struct panthor_file *pfile, struct panthor_perf *perf,
>> +		u32 sid, u64 user_data);
>> +void panthor_perf_session_destroy(struct panthor_file *pfile, struct panthor_perf *perf);
>> +
>>   #endif /* __PANTHOR_PERF_H__ */
>> -- 
>> 2.25.1
> 
> 
> Adrian Larumbe


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC v2 5/8] drm/panthor: Introduce sampling sessions to handle userspace clients
  2025-01-27 15:43   ` Adrián Larumbe
@ 2025-03-26 15:14     ` Lukas Zapolskas
  0 siblings, 0 replies; 28+ messages in thread
From: Lukas Zapolskas @ 2025-03-26 15:14 UTC (permalink / raw)
  To: Adrián Larumbe
  Cc: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	dri-devel, linux-kernel, Mihail Atanassov, nd



On 27/01/2025 15:43, Adrián Larumbe wrote:
> On 11.12.2024 16:50, Lukas Zapolskas wrote:
>> To allow for combining the requests from multiple userspace clients, an
>> intermediary layer between the HW/FW interfaces and userspace is
>> created, containing the information for the counter requests and
>> tracking of insert and extract indices. Each session starts inactive and
>> must be explicitly activated via PERF_CONTROL.START, and explicitly
>> stopped via PERF_CONTROL.STOP. Userspace identifies a single client with
>> its session ID and the panthor file it is associated with.
>>
>> The SAMPLE and STOP commands both produce a single sample when called,
>> and these samples can be disambiguated via the opaque user data field
>> passed in the PERF_CONTROL uAPI. If this functionality is not desired,
>> these fields can be kept as zero, as the kernel copies this value into
>> the corresponding sample without attempting to interpret it.
>>
>> Currently, only manual sampling sessions are supported, providing
>> samples when userspace calls PERF_CONTROL.SAMPLE, and only a single
>> session is allowed at a time. Multiple sessions and periodic sampling
>> will be enabled in following patches.
>>
>> No protected is provided against the 32-bit hardware counter overflows,
> 
> Spelling: protected

Ack.

>> so for the moment it is up to userspace to ensure that the counters are
>> sampled at a reasonable frequency.
> 
>> The counter set enum is added to the uapi to clarify the restrictions on
>> calling the interface.
>>
>> Signed-off-by: Lukas Zapolskas <lukas.zapolskas@arm.com>
>> ---
>>   drivers/gpu/drm/panthor/panthor_device.h |   3 +
>>   drivers/gpu/drm/panthor/panthor_drv.c    |   1 +
>>   drivers/gpu/drm/panthor/panthor_perf.c   | 697 ++++++++++++++++++++++-
>>   include/uapi/drm/panthor_drm.h           |  50 +-
>>   4 files changed, 732 insertions(+), 19 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
>> index aca33d03036c..9ed1e9aed521 100644
>> --- a/drivers/gpu/drm/panthor/panthor_device.h
>> +++ b/drivers/gpu/drm/panthor/panthor_device.h
>> @@ -210,6 +210,9 @@ struct panthor_file {
>>   	/** @ptdev: Device attached to this file. */
>>   	struct panthor_device *ptdev;
>>   
>> +	/** @drm_file: Corresponding drm_file */
>> +	struct drm_file *drm_file;
> 
> I think you could do away with this member, wrote more about this below.
> 
>>   	/** @vms: VM pool attached to this file. */
>>   	struct panthor_vm_pool *vms;
>>   
>> diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
>> index 458175f58b15..2848ab442d10 100644
>> --- a/drivers/gpu/drm/panthor/panthor_drv.c
>> +++ b/drivers/gpu/drm/panthor/panthor_drv.c
>> @@ -1505,6 +1505,7 @@ panthor_open(struct drm_device *ddev, struct drm_file *file)
>>   	}
>>   
>>   	pfile->ptdev = ptdev;
>> +	pfile->drm_file = file;
> 
> Same as above, feel like this is not necessary.
> 
>>   
>>   	ret = panthor_vm_pool_create(pfile);
>>   	if (ret)
>> diff --git a/drivers/gpu/drm/panthor/panthor_perf.c b/drivers/gpu/drm/panthor/panthor_perf.c
>> index 6498279ec036..42d8b6f8c45d 100644
>> --- a/drivers/gpu/drm/panthor/panthor_perf.c
>> +++ b/drivers/gpu/drm/panthor/panthor_perf.c
>> @@ -3,16 +3,162 @@
>>   /* Copyright 2024 Arm ltd. */
>>   
>>   #include <drm/drm_file.h>
>> +#include <drm/drm_gem.h>
>>   #include <drm/drm_gem_shmem_helper.h>
>>   #include <drm/drm_managed.h>
>> +#include <drm/drm_print.h>
>>   #include <drm/panthor_drm.h>
>>   
>> +#include <linux/circ_buf.h>
>> +#include <linux/iosys-map.h>
>> +#include <linux/pm_runtime.h>
>> +
>>   #include "panthor_device.h"
>>   #include "panthor_fw.h"
>>   #include "panthor_gpu.h"
>>   #include "panthor_perf.h"
>>   #include "panthor_regs.h"
>>   
>> +/**
>> + * PANTHOR_PERF_EM_BITS - Number of bits in a user-facing enable mask. This must correspond
>> + *                        to the maximum number of counters available for selection on the newest
>> + *                        Mali GPUs (128 as of the Mali-Gx15).
>> + */
>> +#define PANTHOR_PERF_EM_BITS (BITS_PER_TYPE(u64) * 2)
>> +
>> +/**
>> + * enum panthor_perf_session_state - Session state bits.
>> + */
>> +enum panthor_perf_session_state {
>> +	/** @PANTHOR_PERF_SESSION_ACTIVE: The session is active and can be used for sampling. */
>> +	PANTHOR_PERF_SESSION_ACTIVE = 0,
>> +
>> +	/**
>> +	 * @PANTHOR_PERF_SESSION_OVERFLOW: The session encountered an overflow in one of the
>> +	 *                                 counters during the last sampling period. This flag
>> +	 *                                 gets propagated as part of samples emitted for this
>> +	 *                                 session, to ensure the userspace client can gracefully
>> +	 *                                 handle this data corruption.
>> +	 */
> 
> How would a client normally deal with data corruption in a sample?

At that point, the client will have to effectively restart a sampling
session (either stop/start pair or re-create the session). In practice,
we haven't seen this happen very much, but the combination of a fast
MCU and a low sampling frequency would have that effect.

>> +	PANTHOR_PERF_SESSION_OVERFLOW,
>> +
>> +	/** @PANTHOR_PERF_SESSION_MAX: Bits needed to represent the state. Must be last.*/
>> +	PANTHOR_PERF_SESSION_MAX,
>> +};
>> +
>> +struct panthor_perf_enable_masks {
>> +	/**
>> +	 * @link: List node used to keep track of the enable masks aggregated by the sampler.
>> +	 */
>> +	struct list_head link;
>> +
>> +	/** @refs: Number of references taken out on an instantiated enable mask. */
>> +	struct kref refs;
>> +
>> +	/**
>> +	 * @mask: Array of bitmasks indicating the counters userspace requested, where
>> +	 *        one bit represents a single counter. Used to build the firmware configuration
>> +	 *        and ensure that userspace clients obtain only the counters they requested.
>> +	 */
>> +	DECLARE_BITMAP(mask, PANTHOR_PERF_EM_BITS)[DRM_PANTHOR_PERF_BLOCK_MAX];
>> +};
>> +
>> +struct panthor_perf_counter_block {
>> +	struct drm_panthor_perf_block_header header;
>> +	u64 counters[];
>> +};
> 
> I think I remember reading in the spec thata block header was 12 bytes in length
> but the one defined here seems to have many more fields.

Two different block headers: I believe you are talking about the header
coming in the block sample. This is an additional header synthesized in
the kernel to provide more context about the block.

> 
>> +struct panthor_perf_session {
>> +	DECLARE_BITMAP(state, PANTHOR_PERF_SESSION_MAX);
> 
> I'm wondering, because I don't remember having seen this pattern before.
> Is it common in kernel code to declare bitmaps for masks of enum values in this way?
> 

I saw it in several places, including drivers/net/ethernet/sfc/
ef100_nic.h (the first example I found when looking).

>> +	/**
>> +	 * @user_sample_size: The size of a single sample as exposed to userspace. For the sake of
>> +	 *                    simplicity, the current implementation exposes the same structure
>> +	 *                    as provided by firmware, after annotating the sample and the blocks,
>> +	 *                    and zero-extending the counters themselves (to account for in-kernel
>> +	 *                    accumulation).
>> +	 *
>> +	 *                    This may also allow further memory-optimizations of compressing the
>> +	 *                    sample to provide only requested blocks, if deemed to be worth the
>> +	 *                    additional complexity.
>> +	 */
>> +	size_t user_sample_size;
>> +
>> +	/**
>> +	 * @sample_freq_ns: Period between subsequent sample requests. Zero indicates that
>> +	 *                  userspace will be responsible for requesting samples.
>> +	 */
>> +	u64 sample_freq_ns;
>> +
>> +	/** @sample_start_ns: Sample request time, obtained from a monotonic raw clock. */
>> +	u64 sample_start_ns;
>> +
>> +	/**
>> +	 * @user_data: Opaque handle passed in when starting a session, requesting a sample (for
>> +	 *             manual sampling sessions only) and when stopping a session. This handle
>> +	 *             allows the disambiguation of a sample in the ringbuffer.
>> +	 */
>> +	u64 user_data;
>> +
>> +	/**
>> +	 * @eventfd: Event file descriptor context used to signal userspace of a new sample
>> +	 *           being emitted.
>> +	 */
>> +	struct eventfd_ctx *eventfd;
>> +
>> +	/**
>> +	 * @enabled_counters: This session's requested counters. Note that these cannot change
>> +	 *                    for the lifetime of the session.
>> +	 */
>> +	struct panthor_perf_enable_masks *enabled_counters;
> 
> It seems the enable mask for a session is tied to the session's lifetime. In
> panthor_perf_session_setup(), you create one and then increase its reference
> count from within panthor_perf_sampler_add(), which is not being called from
> anywhere else. Maybe in that case you could do without the reference count and
> have a non-pointer struct panthor_perf_enable_masks member here?
> 

When working on the multi-client handling, that is exactly what I found
as well. The reference count was overcomplicating the lifetime
management of it.

>> +	/** @ringbuf_slots: Slots in the user-facing ringbuffer. */
>> +	size_t ringbuf_slots;
>> +
>> +	/** @ring_buf: BO for the userspace ringbuffer. */
>> +	struct drm_gem_object *ring_buf;
>> +
>> +	/**
>> +	 * @control_buf: BO for the insert and extract indices.
>> +	 */
>> +	struct drm_gem_object *control_buf;
>> +
>> +	/**
>> +	 * @extract_idx: The extract index is used by userspace to indicate the position of the
>> +	 *               consumer in the ringbuffer.
>> +	 */
>> +	u32 *extract_idx;
>> +
>> +	/**
>> +	 * @insert_idx: The insert index is used by the kernel to indicate the position of the
>> +	 *              latest sample exposed to userspace.
>> +	 */
>> +	u32 *insert_idx;
>> +
>> +	/** @samples: The mapping of the @ring_buf into the kernel's VA space. */
>> +	u8 *samples;
>> +
>> +	/**
>> +	 * @waiting: The list node used by the sampler to track the sessions waiting for a sample.
>> +	 */
>> +	struct list_head waiting;
>> +
>> +	/**
>> +	 * @pfile: The panthor file which was used to create a session, used for the postclose
>> +	 *         handling and to prevent a misconfigured userspace from closing unrelated
>> +	 *         sessions.
>> +	 */
>> +	struct panthor_file *pfile;
>> +
>> +	/**
>> +	 * @ref: Session reference count. The sample delivery to userspace is asynchronous, meaning
>> +	 *       the lifetime of the session must extend at least until the sample is exposed to
>> +	 *       userspace.
>> +	 */
>> +	struct kref ref;
>> +};
>> +
>> +
>>   struct panthor_perf {
>>   	/**
>>   	 * @block_set: The global counter set configured onto the HW.
>> @@ -63,39 +209,154 @@ void panthor_perf_info_init(struct panthor_device *ptdev)
>>   	perf_info->shader_blocks = hweight64(ptdev->gpu_info.shader_present);
>>   }
>>   
>> -int panthor_perf_session_setup(struct panthor_device *ptdev, struct panthor_perf *perf,
>> -		struct drm_panthor_perf_cmd_setup *setup_args,
>> -		struct panthor_file *pfile)
>> +static struct panthor_perf_enable_masks *panthor_perf_em_new(void)
>>   {
>> -	return -EOPNOTSUPP;
>> +	struct panthor_perf_enable_masks *em = kmalloc(sizeof(*em), GFP_KERNEL);
>> +
>> +	if (!em)
>> +		return ERR_PTR(-ENOMEM);
>> +
>> +	INIT_LIST_HEAD(&em->link);
>> +
>> +	kref_init(&em->refs);
>> +
>> +	return em;
>>   }
>>   
>> -int panthor_perf_session_teardown(struct panthor_file *pfile, struct panthor_perf *perf,
>> -		u32 sid)
>> +static struct panthor_perf_enable_masks *panthor_perf_create_em(struct drm_panthor_perf_cmd_setup
>> +		*setup_args)
>>   {
>> -	return -EOPNOTSUPP;
>> +	struct panthor_perf_enable_masks *em = panthor_perf_em_new();
>> +
>> +	if (IS_ERR_OR_NULL(em))
>> +		return em;
>> +
>> +	bitmap_from_arr64(em->mask[DRM_PANTHOR_PERF_BLOCK_FW],
>> +			setup_args->fw_enable_mask, PANTHOR_PERF_EM_BITS);
>> +	bitmap_from_arr64(em->mask[DRM_PANTHOR_PERF_BLOCK_CSG],
>> +			setup_args->csg_enable_mask, PANTHOR_PERF_EM_BITS);
>> +	bitmap_from_arr64(em->mask[DRM_PANTHOR_PERF_BLOCK_CSHW],
>> +			setup_args->cshw_enable_mask, PANTHOR_PERF_EM_BITS);
>> +	bitmap_from_arr64(em->mask[DRM_PANTHOR_PERF_BLOCK_TILER],
>> +			setup_args->tiler_enable_mask, PANTHOR_PERF_EM_BITS);
>> +	bitmap_from_arr64(em->mask[DRM_PANTHOR_PERF_BLOCK_MEMSYS],
>> +			setup_args->memsys_enable_mask, PANTHOR_PERF_EM_BITS);
>> +	bitmap_from_arr64(em->mask[DRM_PANTHOR_PERF_BLOCK_SHADER],
>> +			setup_args->shader_enable_mask, PANTHOR_PERF_EM_BITS);
> 
> To save some repetition, maybe do this, although it might depend on uAPI
> structures being arranged in the right way, and the compiler not inserting
> unusual padding between consecutive members:
> 
> unsigned int block; u64 *mask;
> for (mask = &setup_args->fw_enable_mask[0], block = DRM_PANTHOR_PERF_BLOCK_FW;
>       block < DRM_PANTHOR_PERF_BLOCK_LAST; block++, mask += 2)
> 	bitmap_from_arr64(em->mask[block], mask, PANTHOR_PERF_EM_BITS);
> 
>> +	return em;
>>   }
>>

I think that would work, but I think we may strike the right balance
between repetition and maintainability here. When the enable masks are
converted from userspace, we can use the enum values to iterate through
them, saving a lot of repetition elsewhere.

>> -int panthor_perf_session_start(struct panthor_file *pfile, struct panthor_perf *perf,
>> -		u32 sid, u64 user_data)
>> +static void panthor_perf_destroy_em_kref(struct kref *em_kref)
>>   {
>> -	return -EOPNOTSUPP;
>> +	struct panthor_perf_enable_masks *em = container_of(em_kref, typeof(*em), refs);
>> +
>> +	if (!list_empty(&em->link))
>> +		return;
> 
> Could this lead to a situation where the enable mask's refcnt reaches 0,
> but because it hadn't yet been removed from the session's list, the
> mask object is never freed?

Hopefully not, but will be dropping the enable mask refcount, since it's
not necessary.

>> +	kfree(em);
>>   }
>>   
>> -int panthor_perf_session_stop(struct panthor_file *pfile, struct panthor_perf *perf,
>> -		u32 sid, u64 user_data)
>> +static size_t get_annotated_block_size(size_t counters_per_block)
>>   {
>> -		return -EOPNOTSUPP;
>> +	return struct_size_t(struct panthor_perf_counter_block, counters, counters_per_block);
>>   }
>>   
>> -int panthor_perf_session_sample(struct panthor_file *pfile, struct panthor_perf *perf,
>> -		u32 sid, u64 user_data)
>> +static u32 session_read_extract_idx(struct panthor_perf_session *session)
>> +{
>> +	/* Userspace will update their own extract index to indicate that a sample is consumed
>> +	 * from the ringbuffer, and we must ensure we read the latest value.
>> +	 */
>> +	return smp_load_acquire(session->extract_idx);
>> +}
>> +
>> +static u32 session_read_insert_idx(struct panthor_perf_session *session)
>> +{
>> +	return *session->insert_idx;
>> +}
>> +
>> +static void session_get(struct panthor_perf_session *session)
>> +{
>> +	kref_get(&session->ref);
>> +}
>> +
>> +static void session_free(struct kref *ref)
>> +{
>> +	struct panthor_perf_session *session = container_of(ref, typeof(*session), ref);
>> +
>> +	if (session->samples) {
>> +		struct iosys_map map = IOSYS_MAP_INIT_VADDR(session->samples);
>> +
>> +		drm_gem_vunmap_unlocked(session->ring_buf, &map);
>> +		drm_gem_object_put(session->ring_buf);
>> +	}
>> +
>> +	if (session->insert_idx && session->extract_idx) {
> 
> I think none of these could ever be NULLif session setup succeeded.
> 
>> +		struct iosys_map map = IOSYS_MAP_INIT_VADDR(session->extract_idx);
>> +
>> +		drm_gem_vunmap_unlocked(session->control_buf, &map);
>> +		drm_gem_object_put(session->control_buf);
>> +	}
>> +
>> +	kref_put(&session->enabled_counters->refs, panthor_perf_destroy_em_kref);
>> +	eventfd_ctx_put(session->eventfd);
>> +
>> +	devm_kfree(session->pfile->ptdev->base.dev, session);
> 
> What is the point of using devm allocations in this case, if we always free
> the session manually?

Not useful in this case, will drop the devm_ in the next revision.

>> +}
>> +
>> +static void session_put(struct panthor_perf_session *session)
>> +{
>> +	kref_put(&session->ref, session_free);
>> +}
>> +
>> +/**
>> + * session_find - Find a session associated with the given session ID and
>> + *                panthor_file.
>> + * @pfile: Panthor file.
>> + * @perf: Panthor perf.
>> + * @sid: Session ID.
>> + *
>> + * The reference count of a valid session is increased to ensure it does not disappear
>> + * in the window between the XA lock being dropped and the internal session functions
>> + * being called.
>> + *
>> + * Return: valid session pointer or an ERR_PTR.
>> + */
>> +static struct panthor_perf_session *session_find(struct panthor_file *pfile,
>> +		struct panthor_perf *perf, u32 sid)
>>   {
>> -	return -EOPNOTSUPP;
>> +	struct panthor_perf_session *session;
>>   
>> +	if (!perf)
>> +		return ERR_PTR(-EINVAL);
>> +
>> +	xa_lock(&perf->sessions);
>> +	session = xa_load(&perf->sessions, sid);
>> +
>> +	if (!session || xa_is_err(session)) {
>> +		xa_unlock(&perf->sessions);
>> +		return ERR_PTR(-EBADF);
> 
> I think we should return NULL in case !session holds true, for panthor_perf_session_start to catch it and return -EINVAL.
> 

What would be the reason to prefer -EINVAL in this case over -EBADF? I'm
not sure if this is standard, but the intent was to provide unique-ish
error codes to help userspace debug potential issues on their side.

>> +	}
>> +
>> +	if (session->pfile != pfile) {
>> +		xa_unlock(&perf->sessions);
>> +		return ERR_PTR(-EINVAL);
>> +	}
>> +
>> +	session_get(session);
>> +	xa_unlock(&perf->sessions);
>> +
>> +	return session;
>>   }
>>   
>> -void panthor_perf_session_destroy(struct panthor_file *pfile, struct panthor_perf *perf) { }
>> +static size_t session_get_max_sample_size(const struct drm_panthor_perf_info *const info)
> 
> Since this seems to be the size of the sample given to UM, maybe renaming it to
> contain _user_ would make its purpose more apparent.
> 

Will do.

>> +{
>> +	const size_t block_size = get_annotated_block_size(info->counters_per_block);
>> +	const size_t block_nr = info->cshw_blocks + info->csg_blocks + info->fw_blocks +
>> +		info->tiler_blocks + info->memsys_blocks + info->shader_blocks;
>> +
>> +	return sizeof(struct drm_panthor_perf_sample_header) + (block_size * block_nr);
>> +}
>>   
>>   /**
>>    * panthor_perf_init - Initialize the performance counter subsystem.
>> @@ -130,6 +391,399 @@ int panthor_perf_init(struct panthor_device *ptdev)
>>   	ptdev->perf = perf;
>>   
>>   	return 0;
>> +
>> +}
>> +
>> +static int session_validate_set(u8 set)
>> +{
>> +	if (set > DRM_PANTHOR_PERF_SET_TERTIARY)
>> +		return -EINVAL;
>> +
>> +	if (set == DRM_PANTHOR_PERF_SET_PRIMARY)
>> +		return 0;
>> +
>> +	if (set > DRM_PANTHOR_PERF_SET_PRIMARY)
>> +		return capable(CAP_PERFMON) ? 0 : -EACCES;
> 
> I'm a bit clueless about the capability API, so I don't quite understand how
> this is the way whe decide whether a counter set is legal.

There are two different notions of validity here: this validates
the request from userspace. In the context of multiple clients,
the likelihood is that all of the clients are primarily interested
in the first set. In that scenario, a single client can deny
access to the primary set to all other clients, so the request
is restricted to only those processes that have elevated
capabilities.

In terms of what sets are valid coming from the hardware,
neither the hardware nor the firmware provide any information.

>> +	return -EINVAL;
>> +}
>> +
>> +/**
>> + * panthor_perf_session_setup - Create a user-visible session.
>> + *
>> + * @ptdev: Handle to the panthor device.
>> + * @perf: Handle to the perf control structure.
>> + * @setup_args: Setup arguments passed in via ioctl.
>> + * @pfile: Panthor file associated with the request.
>> + *
>> + * Creates a new session associated with the session ID returned. When initialized, the
>> + * session must explicitly request sampling to start with a successive call to PERF_CONTROL.START.
>> + *
>> + * Return: non-negative session identifier on success or negative error code on failure.
>> + */
>> +int panthor_perf_session_setup(struct panthor_device *ptdev, struct panthor_perf *perf,
>> +		struct drm_panthor_perf_cmd_setup *setup_args,
>> +		struct panthor_file *pfile)
>> +{
>> +	struct panthor_perf_session *session;
>> +	struct drm_gem_object *ringbuffer;
>> +	struct drm_gem_object *control;
>> +	const size_t slots = setup_args->sample_slots;
>> +	struct panthor_perf_enable_masks *em;
>> +	struct iosys_map rb_map, ctrl_map;
>> +	size_t user_sample_size;
>> +	int session_id;
>> +	int ret;
>> +
>> +	ret = session_validate_set(setup_args->block_set);
>> +	if (ret)
>> +		return ret;
>> +
>> +	session = devm_kzalloc(ptdev->base.dev, sizeof(*session), GFP_KERNEL);
>> +	if (ZERO_OR_NULL_PTR(session))
>> +		return -ENOMEM;
>> +
>> +	ringbuffer = drm_gem_object_lookup(pfile->drm_file, setup_args->ringbuf_handle);
>> +	if (!ringbuffer) {
>> +		ret = -EINVAL;
>> +		goto cleanup_session;
>> +	}
> 
> I guess this would never be the same ringbuffer we created in
> panthor_perf_sampler_init(). I don't think it can be, because that one was is
> created as a kernel bo, and has no public facing GEM handler. But then this
> means we're doing a copy between the FW ringbuffer and the user-supplied
> one. However, I remember from past conversations that the goal of a new
> implementation was to avoid doing many perfcnt sample copies between the kernel
> and UM, because that would require a huge bandwith when the sample period is
> small.

Rather than doing any copies, we want to be able to avoid as many copies
as possible. When multi-client comes into the picture, there has to be
at least one copy into the staging buffer (FW -> kernel) and either one
copy to every session or one accumulation for every session (kernel ->
user). Going this route, rather than copy everything every time, we
try to elide as many of the memory operations as possible, including
skipping empty shader core counter blocks, and doing on-demand
accumulation into existing sample slots.

> 
>> +	control = drm_gem_object_lookup(pfile->drm_file, setup_args->control_handle);
>> +	if (!control) {
>> +		ret = -EINVAL;
>> +		goto cleanup_ringbuf;
>> +	}
>> +
>> +	user_sample_size = session_get_max_sample_size(&ptdev->perf_info) * slots;
>> +
>> +	if (ringbuffer->size != PFN_ALIGN(user_sample_size)) {
>> +		ret = -ENOMEM;
>> +		goto cleanup_control;
>> +	}
> 
> How is information about the max sample size given to UM? I guess through the
> values returned through the getparam ioctl(), specifically sample_header_size
> and block_header_size?
> 

Effectively, yes.

>> +	ret = drm_gem_vmap_unlocked(ringbuffer, &rb_map);
>> +	if (ret)
>> +		goto cleanup_control;
>> +
>> +
>> +	ret = drm_gem_vmap_unlocked(control, &ctrl_map);
>> +	if (ret)
>> +		goto cleanup_ring_map;
>> +
>> +	session->eventfd = eventfd_ctx_fdget(setup_args->fd);
>> +	if (IS_ERR_OR_NULL(session->eventfd)) {
>> +		ret = PTR_ERR_OR_ZERO(session->eventfd) ?: -EINVAL;
>> +		goto cleanup_control_map;
>> +	}
> 
> I think eventfd_ctx_fdget can only return error values, so there's no need to
> check for NULL or ZERO.

You're right, will need to fix this.
> 
>> +	em = panthor_perf_create_em(setup_args);
>> +	if (IS_ERR_OR_NULL(em)) {
>> +		ret = -ENOMEM;
>> +		goto cleanup_eventfd;
>> +	}
>> +
>> +	INIT_LIST_HEAD(&session->waiting);
>> +	session->extract_idx = ctrl_map.vaddr;
>> +	*session->extract_idx = 0;
>> +	session->insert_idx = session->extract_idx + 1;
>> +	*session->insert_idx = 0;
>> +
>> +	session->samples = rb_map.vaddr;
> 
> I think you might've forgotten this:
> 
>          session->ringbuf_slots = slots;
> 

I did, caught it in internal testing later.

>> +	/* TODO This will need validation when we support periodic sampling sessions */
>> +	if (setup_args->sample_freq_ns) {
>> +		ret = -EOPNOTSUPP;
>> +		goto cleanup_em;
>> +	}
>> +
>> +	session->sample_freq_ns = setup_args->sample_freq_ns;
>> +	session->user_sample_size = user_sample_size;
>> +	session->enabled_counters = em;
>> +	session->ring_buf = ringbuffer;
>> +	session->control_buf = control;
>> +	session->pfile = pfile;
>> +
>> +	ret = xa_alloc_cyclic(&perf->sessions, &session_id, session, perf->session_range,
>> +			&perf->next_session, GFP_KERNEL);
> 
> What do we need the next_session index for?

When next calling xa_alloc_cyclic, it starts searching for a valid
session ID from the next_session index.

>> +	if (ret < 0)
>> +		goto cleanup_em;
>> +
>> +	kref_init(&session->ref);
>> +
>> +	return session_id;
>> +
>> +cleanup_em:
>> +	kref_put(&em->refs, panthor_perf_destroy_em_kref);
>> +
>> +cleanup_eventfd:
>> +	eventfd_ctx_put(session->eventfd);
>> +
>> +cleanup_control_map:
>> +	drm_gem_vunmap_unlocked(control, &ctrl_map);
>> +
>> +cleanup_ring_map:
>> +	drm_gem_vunmap_unlocked(ringbuffer, &rb_map);
>> +
>> +cleanup_control:
>> +	drm_gem_object_put(control);
>> +
>> +cleanup_ringbuf:
>> +	drm_gem_object_put(ringbuffer);
>> +
>> +cleanup_session:
>> +	devm_kfree(ptdev->base.dev, session);
>> +
>> +	return ret;
>> +}
>> +
>> +static int session_stop(struct panthor_perf *perf, struct panthor_perf_session *session,
>> +		u64 user_data)
>> +{
>> +	if (!test_bit(PANTHOR_PERF_SESSION_ACTIVE, session->state))
>> +		return 0;
>> +
>> +	const u32 extract_idx = session_read_extract_idx(session);
>> +	const u32 insert_idx = session_read_insert_idx(session);
>> +
>> +	/* Must have at least one slot remaining in the ringbuffer to sample. */
>> +	if (WARN_ON_ONCE(!CIRC_SPACE_TO_END(insert_idx, extract_idx, session->ringbuf_slots)))
>> +		return -EBUSY;
>> +
>> +	session->user_data = user_data;
>> +
>> +	clear_bit(PANTHOR_PERF_SESSION_ACTIVE, session->state);
>> +
>> +	/* TODO Calls to the FW interface will go here in later patches. */
>> +	return 0;
>> +}
>> +
>> +static int session_start(struct panthor_perf *perf, struct panthor_perf_session *session,
>> +		u64 user_data)
>> +{
>> +	if (test_bit(PANTHOR_PERF_SESSION_ACTIVE, session->state))
>> +		return 0;
>> +
>> +	set_bit(PANTHOR_PERF_SESSION_ACTIVE, session->state);
>> +
>> +	/*
>> +	 * For manual sampling sessions, a start command does not correspond to a sample,
>> +	 * and so the user data gets discarded.
>> +	 */
>> +	if (session->sample_freq_ns)
>> +		session->user_data = user_data;
>> +
>> +	/* TODO Calls to the FW interface will go here in later patches. */
>> +	return 0;
>> +}
>> +
>> +static int session_sample(struct panthor_perf *perf, struct panthor_perf_session *session,
>> +		u64 user_data)
>> +{
>> +	if (!test_bit(PANTHOR_PERF_SESSION_ACTIVE, session->state))
>> +		return -EACCES;
>> +
>> +	const u32 extract_idx = session_read_extract_idx(session);
>> +	const u32 insert_idx = session_read_insert_idx(session);
>> +
>> +	/* Manual sampling for periodic sessions is forbidden. */
>> +	if (session->sample_freq_ns)
>> +		return -EINVAL;
>> +
>> +	/*
>> +	 * Must have at least two slots remaining in the ringbuffer to sample: one for
>> +	 * the current sample, and one for a stop sample, since a stop command should
>> +	 * always be acknowledged by taking a final sample and stopping the session.
>> +	 */
>> +	if (CIRC_SPACE_TO_END(insert_idx, extract_idx, session->ringbuf_slots) < 2)
>> +		return -EBUSY;
>> +
>> +	session->sample_start_ns = ktime_get_raw_ns();
>> +	session->user_data = user_data;
>> +
>> +	/* TODO Calls to the FW interface will go here in later patches. */
>> +	return 0;
>> +}
>> +
>> +static int session_destroy(struct panthor_perf *perf, struct panthor_perf_session *session)
>> +{
>> +	session_put(session);
>> +
>> +	return 0;
>> +}
>> +
>> +static int session_teardown(struct panthor_perf *perf, struct panthor_perf_session *session)
>> +{
>> +	if (test_bit(PANTHOR_PERF_SESSION_ACTIVE, session->state))
>> +		return -EINVAL;
>> +
>> +	if (!list_empty(&session->waiting))
>> +		return -EBUSY;
>> +
>> +	return session_destroy(perf, session);
>> +}
>> +
>> +/**
>> + * panthor_perf_session_teardown - Teardown the session associated with the @sid.
>> + * @pfile: Open panthor file.
>> + * @perf: Handle to the perf control structure.
>> + * @sid: Session identifier.
>> + *
>> + * Destroys a stopped session where the last sample has been explicitly consumed
>> + * or discarded. Active sessions will be ignored.
>> + *
>> + * Return: 0 on success, negative error code on failure.
>> + */
>> +int panthor_perf_session_teardown(struct panthor_file *pfile, struct panthor_perf *perf, u32 sid)
>> +{
>> +	int err;
>> +	struct panthor_perf_session *session;
>> +
>> +	xa_lock(&perf->sessions);
>> +	session = __xa_store(&perf->sessions, sid, NULL, GFP_KERNEL);
> 
> Why not xa_erase() here instead?

To avoid race conditions with session_get/session_put. This way, we
serialize the session handling by using the XArray lock.

> 
>> +	if (xa_is_err(session)) {
>> +		err = xa_err(session);
>> +		goto restore;
>> +	}
>> +
>> +	if (session->pfile != pfile) {
>> +		err = -EINVAL;
>> +		goto restore;
>> +	}
>> +
>> +	session_get(session);
>> +	xa_unlock(&perf->sessions);
>> +
>> +	err = session_teardown(perf, session);
>> +
>> +	session_put(session);
> 
> I haven't made sure that reference counting is balanced, but noticed session_teardown() is already
> putting the session's kref. I'll have a deeper look into it later on.
> 

The idea behind the reference counting scheem was as follows:
- session_create() takes a reference
- session_teardown() drops that reference
- session_{start,stop,sample}() take a reference for the duration
   of the function call.
- when requesting a sample, session_sample() takes an additional
   reference to put the session onto the list of sessions waiting
   for samples.
- the irq bottom half then puts the reference after emitting
   the data.

This way, even if a user terminates the session while it is
actively waiting for a sample, the lifetime is handled
gracefully.

>> +
>> +	return err;
>> +
>> +restore:
>> +	__xa_store(&perf->sessions, sid, session, GFP_KERNEL);
>> +	xa_unlock(&perf->sessions);
>> +
>> +	return err;
>> +}
>> +
>> +/**
>> + * panthor_perf_session_start - Start sampling on a stopped session.
>> + * @pfile: Open panthor file.
>> + * @perf: Handle to the panthor perf control structure.
>> + * @sid: Session identifier for the desired session.
>> + * @user_data: An opaque value passed in from userspace.
>> + *
>> + * A session counts as stopped when it is created or when it is explicitly stopped after being
>> + * started. Starting an active session is treated as a no-op.
>> + *
>> + * The @user_data parameter will be associated with all subsequent samples for a periodic
>> + * sampling session and will be ignored for manual sampling ones in favor of the user data
>> + * passed in the PERF_CONTROL.SAMPLE ioctl call.
>> + *
>> + * Return: 0 on success, negative error code on failure.
>> + */
>> +int panthor_perf_session_start(struct panthor_file *pfile, struct panthor_perf *perf,
>> +		u32 sid, u64 user_data)
>> +{
>> +	struct panthor_perf_session *session = session_find(pfile, perf, sid);
>> +	int err;
>> +
>> +	if (IS_ERR_OR_NULL(session))
>> +		return IS_ERR(session) ? PTR_ERR(session) : -EINVAL;
>> +
>> +	err = session_start(perf, session, user_data);
>> +
>> +	session_put(session);
>> +
>> +	return err;
>> +}
>> +
>> +/**
>> + * panthor_perf_session_stop - Stop sampling on an active session.
>> + * @pfile: Open panthor file.
>> + * @perf: Handle to the panthor perf control structure.
>> + * @sid: Session identifier for the desired session.
>> + * @user_data: An opaque value passed in from userspace.
>> + *
>> + * A session counts as active when it has been explicitly started via the PERF_CONTROL.START
>> + * ioctl. Stopping a stopped session is treated as a no-op.
>> + *
>> + * To ensure data is not lost when sampling is stopping, there must always be at least one slot
>> + * available for the final automatic sample, and the stop command will be rejected if there is not.
>> + *
>> + * The @user_data will always be associated with the final sample.
>> + *
>> + * Return: 0 on success, negative error code on failure.
>> + */
>> +int panthor_perf_session_stop(struct panthor_file *pfile, struct panthor_perf *perf,
>> +		u32 sid, u64 user_data)
>> +{
>> +	struct panthor_perf_session *session = session_find(pfile, perf, sid);
>> +	int err;
>> +
>> +	if (IS_ERR_OR_NULL(session))
>> +		return IS_ERR(session) ? PTR_ERR(session) : -EINVAL;
>> +
>> +	err = session_stop(perf, session, user_data);
>> +
>> +	session_put(session);
>> +
>> +	return err;
>> +}
>> +
>> +/**
>> + * panthor_perf_session_sample - Request a sample on a manual sampling session.
>> + * @pfile: Open panthor file.
>> + * @perf: Handle to the panthor perf control structure.
>> + * @sid: Session identifier for the desired session.
>> + * @user_data: An opaque value passed in from userspace.
>> + *
>> + * Only an active manual sampler is permitted to request samples directly. Failing to meet either
>> + * of these conditions will cause the sampling request to be rejected. Requesting a manual sample
>> + * with a full ringbuffer will see the request being rejected.
>> + *
>> + * The @user_data will always be unambiguously associated one-to-one with the resultant sample.
>> + *
>> + * Return: 0 on success, negative error code on failure.
>> + */
>> +int panthor_perf_session_sample(struct panthor_file *pfile, struct panthor_perf *perf,
>> +		u32 sid, u64 user_data)
>> +{
>> +	struct panthor_perf_session *session = session_find(pfile, perf, sid);
>> +	int err;
>> +
>> +	if (IS_ERR_OR_NULL(session))
>> +		return IS_ERR(session) ? PTR_ERR(session) : -EINVAL;
>> +
>> +	err = session_sample(perf, session, user_data);
>> +
>> +	session_put(session);
>> +
>> +	return err;
>> +}
>> +
>> +/**
>> + * panthor_perf_session_destroy - Destroy a sampling session associated with the @pfile.
>> + * @perf: Handle to the panthor perf control structure.
>> + * @pfile: The file being closed.
>> + *
>> + * Must be called when the corresponding userspace process is destroyed and cannot close its
>> + * own sessions. As such, we offer no guarantees about data delivery.
>> + */
>> +void panthor_perf_session_destroy(struct panthor_file *pfile, struct panthor_perf *perf)
>> +{
>> +	unsigned long sid;
>> +	struct panthor_perf_session *session;
>> +
>> +	xa_for_each(&perf->sessions, sid, session)
>> +	{
>> +		if (session->pfile == pfile) {
>> +			session_destroy(perf, session);
>> +			xa_erase(&perf->sessions, sid);
>> +		}
>> +	}
>>   }
>>   
>>   /**
>> @@ -146,10 +800,17 @@ void panthor_perf_unplug(struct panthor_device *ptdev)
>>   	if (!perf)
>>   		return;
>>   
>> -	if (!xa_empty(&perf->sessions))
>> +	if (!xa_empty(&perf->sessions)) {
>> +		unsigned long sid;
>> +		struct panthor_perf_session *session;
>> +
>>   		drm_err(&ptdev->base,
>>   				"Performance counter sessions active when unplugging the driver!");
>>   
>> +		xa_for_each(&perf->sessions, sid, session)
>> +			session_destroy(perf, session);
>> +	}
>> +
>>   	xa_destroy(&perf->sessions);
>>   
>>   	devm_kfree(ptdev->base.dev, ptdev->perf);
>> diff --git a/include/uapi/drm/panthor_drm.h b/include/uapi/drm/panthor_drm.h
>> index 8a431431da6b..576d3ad46e6d 100644
>> --- a/include/uapi/drm/panthor_drm.h
>> +++ b/include/uapi/drm/panthor_drm.h
>> @@ -458,6 +458,12 @@ enum drm_panthor_perf_block_type {
>>   
>>   	/** @DRM_PANTHOR_PERF_BLOCK_SHADER: A shader core counter block. */
>>   	DRM_PANTHOR_PERF_BLOCK_SHADER,
>> +
>> +	/** @DRM_PANTHOR_PERF_BLOCK_LAST: Internal use only. */
>> +	DRM_PANTHOR_PERF_BLOCK_LAST = DRM_PANTHOR_PERF_BLOCK_SHADER,
>> +
>> +	/** @DRM_PANTHOR_PERF_BLOCK_MAX: Internal use only. */
>> +	DRM_PANTHOR_PERF_BLOCK_MAX = DRM_PANTHOR_PERF_BLOCK_LAST + 1,
>>   };
>>   
>>   /**
>> @@ -1368,6 +1374,44 @@ struct drm_panthor_perf_control {
>>   	__u64 pointer;
>>   };
>>   
>> +/**
>> + * enum drm_panthor_perf_counter_set - The counter set to be requested from the hardware.
>> + *
>> + * The hardware supports a single performance counter set at a time, so requesting any set other
>> + * than the primary may fail if another process is sampling at the same time.
>> + *
>> + * If in doubt, the primary counter set has the most commonly used counters and requires no
>> + * additional permissions to open.
>> + */
>> +enum drm_panthor_perf_counter_set {
>> +	/**
>> +	 * @DRM_PANTHOR_PERF_SET_PRIMARY: The default set configured on the hardware.
>> +	 *
>> +	 * This is the only set for which all counters in all blocks are defined.
>> +	 */
>> +	DRM_PANTHOR_PERF_SET_PRIMARY,
>> +
>> +	/**
>> +	 * @DRM_PANTHOR_PERF_SET_SECONDARY: The secondary performance counter set.
>> +	 *
>> +	 * Some blocks may not have any defined counters for this set, and the block will
>> +	 * have the UNAVAILABLE block state permanently set in the block header.
>> +	 *
>> +	 * Accessing this set requires the calling process to have the CAP_PERFMON capability.
>> +	 */
>> +	DRM_PANTHOR_PERF_SET_SECONDARY,
>> +
>> +	/**
>> +	 * @DRM_PANTHOR_PERF_SET_TERTIARY: The tertiary performance counter set.
>> +	 *
>> +	 * Some blocks may not have any defined counters for this set, and the block will have
>> +	 * the UNAVAILABLE block state permanently set in the block header. Note that the
>> +	 * tertiary set has the fewest defined counter blocks.
>> +	 *
>> +	 * Accessing this set requires the calling process to have the CAP_PERFMON capability.
>> +	 */
>> +	DRM_PANTHOR_PERF_SET_TERTIARY,
>> +};
>>   
>>   /**
>>    * struct drm_panthor_perf_cmd_setup - Arguments passed to DRM_PANTHOR_IOCTL_PERF_CONTROL
>> @@ -1375,13 +1419,17 @@ struct drm_panthor_perf_control {
>>    */
>>   struct drm_panthor_perf_cmd_setup {
>>   	/**
>> -	 * @block_set: Set of performance counter blocks.
>> +	 * @block_set: Set of performance counter blocks, member of
>> +	 *             enum drm_panthor_perf_block_set.
>>   	 *
>>   	 * This is a global configuration and only one set can be active at a time. If
>>   	 * another client has already requested a counter set, any further requests
>>   	 * for a different counter set will fail and return an -EBUSY.
>>   	 *
>>   	 * If the requested set does not exist, the request will fail and return an -EINVAL.
>> +	 *
>> +	 * Some sets have additional requirements to be enabled, and the setup request will
>> +	 * fail with an -EACCES if these requirements are not satisfied.
>>   	 */
> 
> Is this what we check inside session_validate_set() ?

That's right.

>>   	__u8 block_set;
>>   
>> -- 
>> 2.25.1
> 
> 
> Adrian Larumbe


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC v2 6/8] drm/panthor: Implement the counter sampler and sample handling
  2025-01-27 16:53   ` Adrián Larumbe
@ 2025-03-27  8:53     ` Lukas Zapolskas
  0 siblings, 0 replies; 28+ messages in thread
From: Lukas Zapolskas @ 2025-03-27  8:53 UTC (permalink / raw)
  To: Adrián Larumbe
  Cc: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	dri-devel, linux-kernel, Mihail Atanassov, nd



On 27/01/2025 16:53, Adrián Larumbe wrote:
> On 11.12.2024 16:50, Lukas Zapolskas wrote:
>> From: Adrián Larumbe <adrian.larumbe@collabora.com>
>>
>> The sampler aggregates counter and set requests coming from userspace
>> and mediates interactions with the FW interface, to ensure that user
>> sessions cannot override the global configuration.
>>
>>  From the top-level interface, the sampler supports two different types
>> of samples: clearing samples and regular samples. Clearing samples are
>> a special sample type that allow for the creation of a sampling
>> baseline, to ensure that a session does not obtain counter data from
>> before its creation.
>>
>> Upon receipt of a relevant interrupt, corresponding to one of the three
>> relevant bits of the GLB_ACK register, the sampler takes any samples
>> that occurred, and, based on the insert and extract indices, accumulates
>> them to an internal storage buffer after zero-extending the counters
>> from the 32-bit counters emitted by the hardware to 64-bit counters
>> for internal accumulation.
>>
>> When the performance counters are enabled, the FW ensures no counter
>> data is lost when entering and leaving non-counting regions by producing
>> automatic samples that do not correspond to a GLB_REQ.PRFCNT_SAMPLE
>> request. Such regions may be per hardware unit, such as when a shader
>> core powers down, or global. Most of these events do not directly
>> correspond to session sample requests, so any intermediary counter data
>> must be stored into a temporary accumulation buffer.
>>
>> If there are sessions waiting for a sample, this accumulated buffer will
>> be taken, and emitted for each waiting client. During this phase,
>> information like the timestamps of sample request and sample emission,
>> type of the counter block and block index annotations are added to the
>> sample header and block headers. If no sessions are waiting for
>> a sample, this accumulation buffer is kept until the next time a sample
>> is requested.
>>
>> Special handling is needed for the PRFCNT_OVERFLOW interrupt, which is
>> an indication that the internal sample handling rate was insufficient.
>>
>> The sampler also maintains a buffer descriptor indicating the structure
>> of a firmware sample, since neither the firmware nor the hardware give
>> any indication of the sample structure, only that it is composed out of
>> three parts:
>>   - the metadata is an optional initial counter block on supporting
>>     firmware versions that contains a single counter, indicating the
>>     reason a sample was taken when entering global non-counting regions.
>>     This is used to provide coarse-grained information about why a sample
>>     was taken to userspace, to help userspace interpret variations in
>>     counter magnitude.
>>   - the firmware component of the sample is composed out of a global
>>     firmware counter block on supporting firmware versions.
>>   - the hardware component is the most sizeable of the three and contains
>>     a block of counters for each of the underlying hardware resources. It
>>     has a fixed structure that is described in the architecture
>>     specification, and contains the command stream hardware block(s), the
>>     tiler block(s), the MMU and L2 blocks (collectively named the memsys
>>     blocks) and the shader core blocks, in that order.
>> The structure of this buffer changes based on the firmware and hardware
>> combination, but is constant on a single system.
> 
> I already brought this up in a previous patch review. This approach of
> describing the layout of counters in the FW ringbuffer inside the kernel
> deviates from what was being done for Panfrost already, where the kernel does
> the minimal job of providing raw samples to user mode, and the UM programs need
> to rely on layout files that lend a meaning to the raw data block. In the case
> of Panfrost, because many generations of HW are supported, it seems keeping this
> deal of information in the kernel isn't very scalable, and also in my view goes
> against the practice of having the driver do as little as possible, other than
> streaming raw data to UM and letting programs handle it in more sophisticated
> ways.

Please see my response on the uAPI changes, I hope I have addressed
all of your points there.

> 
>> Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
>> Co-developed-by: Lukas Zapolskas <lukas.zapolskas@arm.com>
>> Signed-off-by: Lukas Zapolskas <lukas.zapolskas@arm.com>
>> ---
>>   drivers/gpu/drm/panthor/panthor_fw.c   |   5 +
>>   drivers/gpu/drm/panthor/panthor_fw.h   |   9 +-
>>   drivers/gpu/drm/panthor/panthor_perf.c | 882 ++++++++++++++++++++++++-
>>   drivers/gpu/drm/panthor/panthor_perf.h |   2 +
>>   include/uapi/drm/panthor_drm.h         |   5 +-
>>   5 files changed, 892 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/panthor/panthor_fw.c b/drivers/gpu/drm/panthor/panthor_fw.c
>> index e9530d1d9781..cd68870ced18 100644
>> --- a/drivers/gpu/drm/panthor/panthor_fw.c
>> +++ b/drivers/gpu/drm/panthor/panthor_fw.c
>> @@ -1000,9 +1000,12 @@ static void panthor_fw_init_global_iface(struct panthor_device *ptdev)
>>   
>>   	/* Enable interrupts we care about. */
>>   	glb_iface->input->ack_irq_mask = GLB_CFG_ALLOC_EN |
>> +					 GLB_PERFCNT_SAMPLE |
>>   					 GLB_PING |
>>   					 GLB_CFG_PROGRESS_TIMER |
>>   					 GLB_CFG_POWEROFF_TIMER |
>> +					 GLB_PERFCNT_THRESHOLD |
>> +					 GLB_PERFCNT_OVERFLOW |
>>   					 GLB_IDLE_EN |
>>   					 GLB_IDLE;
>>   
>> @@ -1031,6 +1034,8 @@ static void panthor_job_irq_handler(struct panthor_device *ptdev, u32 status)
>>   		return;
>>   
>>   	panthor_sched_report_fw_events(ptdev, status);
>> +
>> +	panthor_perf_report_irq(ptdev, status);
>>   }
>>   PANTHOR_IRQ_HANDLER(job, JOB, panthor_job_irq_handler);
>>   
>> diff --git a/drivers/gpu/drm/panthor/panthor_fw.h b/drivers/gpu/drm/panthor/panthor_fw.h
>> index db10358e24bb..7ed34d2de8b4 100644
>> --- a/drivers/gpu/drm/panthor/panthor_fw.h
>> +++ b/drivers/gpu/drm/panthor/panthor_fw.h
>> @@ -199,9 +199,10 @@ struct panthor_fw_global_control_iface {
>>   	u32 group_num;
>>   	u32 group_stridei;
>>   #define GLB_PERFCNT_FW_SIZE(x) ((((x) >> 16) << 8))
>> +#define GLB_PERFCNT_HW_SIZE(x) (((x) & GENMASK(15, 0)) << 8)
>>   	u32 perfcnt_size;
>>   	u32 instr_features;
>> -#define PERFCNT_FEATURES_MD_SIZE(x) ((x) & GENMASK(3, 0))
>> +#define PERFCNT_FEATURES_MD_SIZE(x) (((x) & GENMASK(3, 0)) << 8)
>>   	u32 perfcnt_features;
> 
> I've checked the spec and this field isn't mentioned:
> docs/g510/gpu/html/register_set/GLB_CONTROL_BLOCK.htm
> 
>>   };
>>   
>> @@ -211,7 +212,7 @@ struct panthor_fw_global_input_iface {
>>   #define GLB_CFG_ALLOC_EN			BIT(2)
>>   #define GLB_CFG_POWEROFF_TIMER			BIT(3)
>>   #define GLB_PROTM_ENTER				BIT(4)
>> -#define GLB_PERFCNT_EN				BIT(5)
>> +#define GLB_PERFCNT_ENABLE			BIT(5)
>>   #define GLB_PERFCNT_SAMPLE			BIT(6)
>>   #define GLB_COUNTER_EN				BIT(7)
>>   #define GLB_PING				BIT(8)
>> @@ -234,7 +235,6 @@ struct panthor_fw_global_input_iface {
>>   	u32 doorbell_req;
>>   	u32 reserved1;
>>   	u32 progress_timer;
>> -
>>   #define GLB_TIMER_VAL(x)			((x) & GENMASK(30, 0))
>>   #define GLB_TIMER_SOURCE_GPU_COUNTER		BIT(31)
>>   	u32 poweroff_timer;
>> @@ -244,6 +244,9 @@ struct panthor_fw_global_input_iface {
>>   	u64 perfcnt_base;
>>   	u32 perfcnt_extract;
>>   	u32 reserved3[3];
>> +#define GLB_PRFCNT_CONFIG_SIZE(x) ((x) & GENMASK(7, 0))
>> +#define GLB_PRFCNT_CONFIG_SET(x) (((x) & GENMASK(1, 0)) << 8)
>> +#define GLB_PRFCNT_METADATA_ENABLE BIT(10)
>>   	u32 perfcnt_config;
>>   	u32 perfcnt_csg_select;
>>   	u32 perfcnt_fw_enable;
> 
> In this very same file, you might want to add the following halt status bits to
> panthor_fw_global_output_iface:
> 
> struct panthor_fw_global_output_iface {
> 	u32 ack;
> 	u32 reserved1;
> 	u32 doorbell_ack;
> 	u32 reserved2;
> 	u32 halt_status;
> +#define GLB_PERFCNT_STATUS_FAILED            BIT(0)
> +#define GLB_PERFCNT_STATUS_POWERON           BIT(1)
> +#define GLB_PERFCNT_STATUS_POWEROFF          BIT(2)
> +#define GLB_PERFCNT_STATUS_PROTSESSION       BIT(3)
> 	u32 perfcnt_status;
> 	u32 perfcnt_insert;
> };

Not including them was intentional. IIRC, these are never
actually populated, so the field is always zero. The
register is removed for future GPU revisions.

>> diff --git a/drivers/gpu/drm/panthor/panthor_perf.c b/drivers/gpu/drm/panthor/panthor_perf.c
>> index 42d8b6f8c45d..d62d97c448da 100644
>> --- a/drivers/gpu/drm/panthor/panthor_perf.c
>> +++ b/drivers/gpu/drm/panthor/panthor_perf.c
>> @@ -15,7 +15,9 @@
>>   
>>   #include "panthor_device.h"
>>   #include "panthor_fw.h"
>> +#include "panthor_gem.h"
>>   #include "panthor_gpu.h"
>> +#include "panthor_mmu.h"
>>   #include "panthor_perf.h"
>>   #include "panthor_regs.h"
>>   
>> @@ -26,6 +28,41 @@
>>    */
>>   #define PANTHOR_PERF_EM_BITS (BITS_PER_TYPE(u64) * 2)
>>   
>> +/**
>> + * PANTHOR_PERF_FW_RINGBUF_SLOTS - Number of slots allocated for individual samples when configuring
>> + *                                 the performance counter ring buffer to firmware. This can be
>> + *                                 used to reduce memory consumption on low memory systems.
>> + */
>> +#define PANTHOR_PERF_FW_RINGBUF_SLOTS (32)
> 
> Perhaps this should be a module parameter with a default value of 32?
> 

Either module param or build-time option would work for me.

>> +
>> +/**
>> + * PANTHOR_CTR_TIMESTAMP_LO - The first architecturally mandated counter of every block type
>> + *                            contains the low 32-bits of the TIMESTAMP value.
>> + */
>> +#define PANTHOR_CTR_TIMESTAMP_LO (0)
>> +
>> +/**
>> + * PANTHOR_CTR_TIMESTAMP_HI - The register offset containinig the high 32-bits of the TIMESTAMP
>> + *                            value.
>> + */
>> +#define PANTHOR_CTR_TIMESTAMP_HI (1)
>> +
>> +/**
>> + * PANTHOR_CTR_PRFCNT_EN - The register offset containing the enable mask for the enabled counters
>> + *                         that were written to memory.
>> + */
>> +#define PANTHOR_CTR_PRFCNT_EN (2)
>> +
>> +/**
>> + * PANTHOR_HEADER_COUNTERS - The first four counters of every block type are architecturally
>> + *                           defined to be equivalent. The fourth counter is always reserved,
>> + *                           and should be zero and as such, does not have a separate define.
>> + *
>> + *                           These are the only four counters that are the same between different
>> + *                           blocks and are consistent between different architectures.
>> + */
>> +#define PANTHOR_HEADER_COUNTERS (4)
>> +
>>   /**
>>    * enum panthor_perf_session_state - Session state bits.
>>    */
>> @@ -158,6 +195,135 @@ struct panthor_perf_session {
>>   	struct kref ref;
>>   };
>>   
>> +struct panthor_perf_buffer_descriptor {
>> +	/**
>> +	 * @block_size: The size of a single block in the FW ring buffer, equal to
>> +	 *              sizeof(u32) * counters_per_block.
>> +	 */
>> +	size_t block_size;
>> +
>> +	/**
>> +	 * @buffer_size: The total size of the buffer, equal to (#hardware blocks +
>> +	 *               #firmware blocks) * block_size.
>> +	 */
>> +	size_t buffer_size;
>> +
>> +	/**
>> +	 * @available_blocks: Bitmask indicating the blocks supported by the hardware and firmware
>> +	 *                    combination. Note that this can also include blocks that will not
>> +	 *                    be exposed to the user.
>> +	 */
>> +	DECLARE_BITMAP(available_blocks, DRM_PANTHOR_PERF_BLOCK_MAX);
>> +	struct {
>> +		/** @offset: Starting offset of a block of type @type in the FW ringbuffer. */
>> +		size_t offset;
>> +
>> +		/** @type: Type of the blocks between @blocks[i].offset and @blocks[i+1].offset. */
>> +		enum drm_panthor_perf_block_type type;
> 
> I think perhaps you could avoid using this, because a block type number is the
> same as its index in the blocks array. See [1]
> 

That does simplify the implementation a bit, thank you.

>> +		/** @block_count: Number of blocks of the given @type, starting at @offset. */
>> +		size_t block_count;
>> +	} blocks[DRM_PANTHOR_PERF_BLOCK_MAX];
>> +};
> 
> It seems with this approach for storing the layout of counters, this would depend
> on them being arranged always in the same way. In Panfros, however, counter layout
> was being handled in UM by parsing XML files into a C file that filled in as a
> buffer descriptor. I'm afraid maybe pushing this into kernel space might make adding
> support for new devices with different counter layout not easy.
> 

 From what I understand of the Mesa implementation, Panfrost has the
exact same expectation, but in userspace: the number of hardware
blocks is hard-coded, as is their ordering, whereas the offsets
of counters are then stored in the XML.

I agree that the counters should be handled in userspace, since
they change from GPU to GPU. The layout algorithm mostly stays
consistent and has primarily backwards compatible changes
made to it.

>> +
>> +/**
>> + * STRUCT panthor_perf_sampler - Interface to de-multiplex firmware interaction and handle
>> + *                               global interactions.
>> + */
>> +struct panthor_perf_sampler {
>> +	/** @sample_requested: A sample has been requested. */
>> +	bool sample_requested;
>> +
>> +	/**
>> +	 * @last_ack: Temporarily storing the last GLB_ACK status. Without storing this data,
>> +	 *            we do not know whether a toggle bit has been handled.
>> +	 */
>> +	u32 last_ack;
>> +
>> +	/**
>> +	 * @enabled_clients: The number of clients concurrently requesting samples. To ensure that
>> +	 *                   one client cannot deny samples to another, we must ensure that clients
>> +	 *                   are effectively reference counted.
>> +	 */
>> +	atomic_t enabled_clients;
>> +
>> +	/**
>> +	 * @sample_handled: Synchronization point between the interrupt bottom half and the
>> +	 *                  main sampler interface. Must be re-armed solely on a new request
>> +	 *                  coming to the sampler.
>> +	 */
>> +	struct completion sample_handled;
>> +
>> +	/** @rb: Kernel BO in the FW AS containing the sample ringbuffer. */
>> +	struct panthor_kernel_bo *rb;
>> +
>> +	/**
>> +	 * @sample_size: The size of a single sample in the FW ringbuffer. This is computed using
>> +	 *               the hardware configuration according to the architecture specification,
>> +	 *               and cross-validated against the sample size reported by FW to ensure
>> +	 *               a consistent view of the buffer size.
>> +	 */
>> +	size_t sample_size;
>> +
>> +	/**
>> +	 * @sample_slots: Number of slots for samples in the FW ringbuffer. Could be static,
>> +	 *		  but may be useful to customize for low-memory devices.
>> +	 */
>> +	size_t sample_slots;
>> +
>> +	/**
>> +	 * @config_lock: Lock serializing changes to the global counter configuration, including
>> +	 *               requested counter set and the counters themselves.
>> +	 */
>> +	struct mutex config_lock;
>> +
>> +	/**
>> +	 * @ems: List of enable maps of the active sessions. When removing a session, the number
>> +	 *       of requested counters may decrease, and the union of enable masks from the multiple
>> +	 *       sessions does not provide sufficient information to reconstruct the previous
>> +	 *       enable mask.
>> +	 */
>> +	struct list_head ems;
> 
> Maybe ems and config lock could be in the same anonymous structure.
> 
>> +
>> +	/** @em: Combined enable mask for all of the active sessions. */
>> +	struct panthor_perf_enable_masks *em;
>> +
>> +	/**
>> +	 * @desc: Buffer descriptor for a sample in the FW ringbuffer. Note that this buffer
>> +	 *        at current time does some interesting things with the zeroth block type. On
>> +	 *        newer FW revisions, the first counter block of the sample is the METADATA block,
>> +	 *        which contains a single value indicating the reason the sample was taken (if
>> +	 *        any). This block must not be exposed to userspace, as userspace does not
>> +	 *        have sufficient context to interpret it. As such, this block type is not
>> +	 *        added to the uAPI, but we still use it in the kernel.
>> +	 */
>> +	struct panthor_perf_buffer_descriptor desc;
>> +
>> +	/**
>> +	 * @sample: Pointer to an upscaled and annotated sample that may be emitted to userspace.
>> +	 *          This is used both as an intermediate buffer to do the zero-extension of the
>> +	 *          32-bit counters to 64-bits and as a storage buffer in case the sampler
>> +	 *          requests an additional sample that was not requested by any of the top-level
>> +	 *          sessions (for instance, when changing the enable masks).
>> +	 */
>> +	u8 *sample;
>> +
>> +	/** @sampler_lock: Lock used to guard the list of sessions requesting samples. */
>> +	struct mutex sampler_lock;
>> +
>> +	/** @sampler_list: List of sessions requesting samples. */
>> +	struct list_head sampler_list;
> 
> Shouldn't this be called session list instead?
> Wouldn't it better to include both the list and its mutex into a single anonymous structure?
> 
>> +	/** @set_config: The set that will be configured onto the hardware. */
>> +	u8 set_config;
>> +
>> +	/**
>> +	 * @ptdev: Backpointer to the Panthor device, needed to ring the global doorbell and
>> +	 *         interface with FW.
>> +	 */
>> +	struct panthor_device *ptdev;
>> +};
>>   
>>   struct panthor_perf {
>>   	/**
>> @@ -175,6 +341,9 @@ struct panthor_perf {
>>   	 * @sessions: Global map of sessions, accessed by their ID.
>>   	 */
>>   	struct xarray sessions;
>> +
>> +	/** @sampler: FW control interface. */
>> +	struct panthor_perf_sampler sampler;
>>   };
>>   
>>   /**
>> @@ -247,6 +416,23 @@ static struct panthor_perf_enable_masks *panthor_perf_create_em(struct drm_panth
>>   	return em;
>>   }
>>   
>> +static void panthor_perf_em_add(struct panthor_perf_enable_masks *dst_em,
>> +		const struct panthor_perf_enable_masks *const src_em)
>> +{
>> +	size_t i = 0;
>> +
>> +	for (i = DRM_PANTHOR_PERF_BLOCK_FW; i <= DRM_PANTHOR_PERF_BLOCK_LAST; i++)
>> +		bitmap_or(dst_em->mask[i], dst_em->mask[i], src_em->mask[i], PANTHOR_PERF_EM_BITS);
>> +}
>> +
>> +static void panthor_perf_em_zero(struct panthor_perf_enable_masks *em)
>> +{
>> +	size_t i = 0;
>> +
>> +	for (i = DRM_PANTHOR_PERF_BLOCK_FW; i <= DRM_PANTHOR_PERF_BLOCK_LAST; i++)
>> +		bitmap_zero(em->mask[i], PANTHOR_PERF_EM_BITS);
>> +}
>> +
>>   static void panthor_perf_destroy_em_kref(struct kref *em_kref)
>>   {
>>   	struct panthor_perf_enable_masks *em = container_of(em_kref, typeof(*em), refs);
>> @@ -270,6 +456,12 @@ static u32 session_read_extract_idx(struct panthor_perf_session *session)
>>   	return smp_load_acquire(session->extract_idx);
>>   }
>>   
>> +static void session_write_insert_idx(struct panthor_perf_session *session, u32 idx)
>> +{
>> +	/* Userspace needs the insert index to know where to look for the sample. */
>> +	smp_store_release(session->insert_idx, idx);
>> +}
>> +
>>   static u32 session_read_insert_idx(struct panthor_perf_session *session)
>>   {
>>   	return *session->insert_idx;
>> @@ -349,6 +541,70 @@ static struct panthor_perf_session *session_find(struct panthor_file *pfile,
>>   	return session;
>>   }
>>   
>> +static u32 compress_enable_mask(unsigned long *const src)
>> +{
>> +	size_t i;
>> +	u32 result = 0;
>> +	unsigned long clump;
>> +
>> +	for_each_set_clump8(i, clump, src, PANTHOR_PERF_EM_BITS) {
>> +		const unsigned long shift = div_u64(i, 4);
>> +
>> +		result |= !!(clump & GENMASK(3, 0)) << shift;
>> +		result |= !!(clump & GENMASK(7, 4)) << (shift + 1);
>> +	}
>> +
>> +	return result;
>> +}
>> +
>> +static void expand_enable_mask(u32 em, unsigned long *const dst)
>> +{
>> +	size_t i;
>> +	DECLARE_BITMAP(emb, BITS_PER_TYPE(u32));
>> +
>> +	bitmap_from_arr32(emb, &em, BITS_PER_TYPE(u32));
>> +
>> +	for_each_set_bit(i, emb, BITS_PER_TYPE(u32))
>> +		bitmap_set(dst, i * 4, 4);
>> +}
>> +
>> +/**
>> + * panthor_perf_block_data - Identify the block index and type based on the offset.
>> + *
>> + * @desc:   FW buffer descriptor.
>> + * @offset: The current offset being examined.
>> + * @idx:    Pointer to an output index.
>> + * @type:   Pointer to an output block type.
>> + *
>> + * To disambiguate different types of blocks as well as different blocks of the same type,
>> + * the offset into the FW ringbuffer is used to uniquely identify the block being considered.
>> + *
>> + * In the future, this is a good time to identify whether a block will be empty,
>> + * allowing us to short-circuit its processing after emitting header information.
>> + */
>> +static void panthor_perf_block_data(struct panthor_perf_buffer_descriptor *const desc,
>> +		size_t offset, u32 *idx, enum drm_panthor_perf_block_type *type)
>> +{
>> +	unsigned long id;
>> +
>> +	for_each_set_bit(id, desc->available_blocks, DRM_PANTHOR_PERF_BLOCK_LAST) {
>> +		const size_t block_start = desc->blocks[id].offset;
>> +		const size_t block_count = desc->blocks[id].block_count;
>> +		const size_t block_end = desc->blocks[id].offset +
>> +			desc->block_size * block_count;
>> +
>> +		if (!block_count)
>> +			continue;
>> +
>> +		if ((offset >= block_start) && (offset < block_end)) {
>> +			*type = desc->blocks[id].type;
> 
>    [1] I think in this case, id will always be the same as desc->blocks[id].type, so maybe
>    
> 
>> +			*idx = div_u64(offset - desc->blocks[id].offset, desc->block_size);
>> +
>> +			return;
>> +		}
>> +	}
>> +}
>> +
>>   static size_t session_get_max_sample_size(const struct drm_panthor_perf_info *const info)
>>   {
>>   	const size_t block_size = get_annotated_block_size(info->counters_per_block);
>> @@ -358,6 +614,520 @@ static size_t session_get_max_sample_size(const struct drm_panthor_perf_info *co
>>   	return sizeof(struct drm_panthor_perf_sample_header) + (block_size * block_nr);
>>   }
>>   
>> +static u32 panthor_perf_handle_sample(struct panthor_device *ptdev, u32 extract_idx, u32 insert_idx)
>> +{
>> +	struct panthor_perf *perf = ptdev->perf;
>> +	struct panthor_perf_sampler *sampler = &ptdev->perf->sampler;
>> +	const size_t ann_block_size =
>> +		get_annotated_block_size(ptdev->perf_info.counters_per_block);
>> +	u32 i;
>> +
>> +	for (i = extract_idx; i != insert_idx; i = (i + 1) % sampler->sample_slots) {
>> +		u8 *fw_sample = (u8 *)sampler->rb->kmap + i * sampler->sample_size;
>> +
>> +		for (size_t fw_off = 0, ann_off = sizeof(struct drm_panthor_perf_sample_header);
>> +				fw_off < sampler->desc.buffer_size;
>> +				fw_off += sampler->desc.block_size)
>> +
>> +		{
>> +			u32 idx;
>> +			enum drm_panthor_perf_block_type type;
>> +			DECLARE_BITMAP(expanded_em, PANTHOR_PERF_EM_BITS);
>> +			struct panthor_perf_counter_block *blk =
>> +				(typeof(blk))(perf->sampler.sample + ann_off);
>> +			const u32 prfcnt_en = blk->counters[PANTHOR_CTR_PRFCNT_EN];
> 
> This implies there was something in blk->counters[PANTHOR_CTR_PRFCNT_EN], but
> where was it first written?
> Shouldn't this be fetched from fw_sample? sampler.sample is being reset to 0 after a sample
> is copied into the UM ringbuffer, so prfcnt_en will always be 0 here.
> 

It should be, I mixed the two up.

>> +
>> +			panthor_perf_block_data(&sampler->desc, fw_off, &idx, &type);
>> +
>> +			/**
>> +			 * TODO Data from the metadata block must be used to populate the
>> +			 * block state information.
>> +			 */
>> +			if (type == DRM_PANTHOR_PERF_BLOCK_METADATA)
>> +				continue;
>> +
>> +			expand_enable_mask(prfcnt_en, expanded_em);
>> +
>> +			blk->header = (struct drm_panthor_perf_block_header) {
>> +				.clock = 0,
>> +				.block_idx = idx,
>> +				.block_type = type,
>> +				.block_states = DRM_PANTHOR_PERF_BLOCK_STATE_UNKNOWN
>> +			};
>> +			bitmap_to_arr64(blk->header.enable_mask, expanded_em, PANTHOR_PERF_EM_BITS);
>> +
>> +			u32 *block = (u32 *)(fw_sample + fw_off);
>> +
>> +			/*
>> +			 * The four header counters must be treated differently, because they are
>> +			 * not additive. For the fourth, the assignment does not matter, as it
>> +			 * is reserved and should be zero.
>> +			 */
>> +			blk->counters[PANTHOR_CTR_TIMESTAMP_LO] = block[PANTHOR_CTR_TIMESTAMP_LO];
>> +			blk->counters[PANTHOR_CTR_TIMESTAMP_HI] = block[PANTHOR_CTR_TIMESTAMP_HI];
>> +			blk->counters[PANTHOR_CTR_PRFCNT_EN] = block[PANTHOR_CTR_PRFCNT_EN];
>> +
>> +			for (size_t k = PANTHOR_HEADER_COUNTERS;
>> +					k < ptdev->perf_info.counters_per_block;
>> +					k++)
>> +				blk->counters[k] += block[k];
>> +
>> +			ann_off += ann_block_size;
>> +		}
>> +	}
>> +
>> +	return i;
> 
> This will always return insert_idx, so why return it the caller when it makes no difference?
> 

Can stop returning the insert index here. Depending on the sampling
rate, we may need to implement some back-pressure to ensure this
loop does not consume too much time.

>> +}
>> +
>> +static size_t panthor_perf_get_fw_reported_size(struct panthor_device *ptdev)
>> +{
>> +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
>> +
>> +	size_t fw_size = GLB_PERFCNT_FW_SIZE(glb_iface->control->perfcnt_size);
>> +	size_t hw_size = GLB_PERFCNT_HW_SIZE(glb_iface->control->perfcnt_size);
>> +	size_t md_size = PERFCNT_FEATURES_MD_SIZE(glb_iface->control->perfcnt_features);
>> +
>> +	return md_size + fw_size + hw_size;
>> +}
>> +
>> +#define PANTHOR_PERF_SET_BLOCK_DESC_DATA(desc, typ, blk_count, offset) \
>> +	({ \
>> +		(desc)->blocks[(typ)].type = (typ); \
>> +		(desc)->blocks[(typ)].offset = (offset); \
>> +		(desc)->blocks[(typ)].block_count = (blk_count);  \
>> +		if ((blk_count))                                    \
>> +			set_bit((typ), (desc)->available_blocks); \
>> +		(offset) + ((desc)->block_size) * (blk_count); \
>> +	 })
>> +
>> +static int panthor_perf_setup_fw_buffer_desc(struct panthor_device *ptdev,
>> +		struct panthor_perf_sampler *sampler)
>> +{
>> +	const struct drm_panthor_perf_info *const info = &ptdev->perf_info;
>> +	const size_t block_size = info->counters_per_block * sizeof(u32);
>> +	struct panthor_perf_buffer_descriptor *desc = &sampler->desc;
>> +	const size_t fw_sample_size = panthor_perf_get_fw_reported_size(ptdev);
>> +	size_t offset = 0;
>> +
>> +	desc->block_size = block_size;
> 
> block_size is only used in this assignment, so maybe do away with the automatic
> variable altogether?

It's also used by PANTHOR_PERF_SET_BLOCK_DESC_DATA to set the region of
the FW sample that is occupied by a block of a specific type.

> 
> desc->block_size = info->counters_per_block * sizeof(u32);
> 
> Also, where is desc->available_blocks being set?
> 
It's being set in PANTHOR_PERF_SET_BLOCK_DESC_DATA, depending on the
availability of blocks.
> 
>> +	for (enum drm_panthor_perf_block_type type = 0; type < DRM_PANTHOR_PERF_BLOCK_MAX; type++) {
>> +		switch (type) {
>> +		case DRM_PANTHOR_PERF_BLOCK_METADATA:
>> +			if (info->flags & DRM_PANTHOR_PERF_BLOCK_STATES_SUPPORT)
>> +				offset = PANTHOR_PERF_SET_BLOCK_DESC_DATA(desc,
>> +					DRM_PANTHOR_PERF_BLOCK_METADATA, 1, offset);
>> +			break;
>> +		case DRM_PANTHOR_PERF_BLOCK_FW:
>> +			offset = PANTHOR_PERF_SET_BLOCK_DESC_DATA(desc, type, info->fw_blocks,
>> +					offset);
>> +			break;
>> +		case DRM_PANTHOR_PERF_BLOCK_CSG:
>> +			offset = PANTHOR_PERF_SET_BLOCK_DESC_DATA(desc, type, info->csg_blocks,
>> +					offset);
>> +			break;
>> +		case DRM_PANTHOR_PERF_BLOCK_CSHW:
>> +			offset = PANTHOR_PERF_SET_BLOCK_DESC_DATA(desc, type, info->cshw_blocks,
>> +					offset);
>> +			break;
>> +		case DRM_PANTHOR_PERF_BLOCK_TILER:
>> +			offset = PANTHOR_PERF_SET_BLOCK_DESC_DATA(desc, type, info->tiler_blocks,
>> +					offset);
>> +			break;
>> +		case DRM_PANTHOR_PERF_BLOCK_MEMSYS:
>> +			offset = PANTHOR_PERF_SET_BLOCK_DESC_DATA(desc, type, info->memsys_blocks,
>> +					offset);
>> +			break;
>> +		case DRM_PANTHOR_PERF_BLOCK_SHADER:
>> +			offset = PANTHOR_PERF_SET_BLOCK_DESC_DATA(desc, type, info->shader_blocks,
>> +					offset);
>> +			break;
>> +		case DRM_PANTHOR_PERF_BLOCK_MAX:
>> +			drm_WARN_ON_ONCE(&ptdev->base,
>> +					"DRM_PANTHOR_PERF_BLOCK_MAX should be unreachable!");
>> +			break;
>> +		}
>> +	}
> 
> Maybe to spare some code, because 'type' gives you the u32 pointer offset minus one from
> drm_panthor_perf_info::fw_blocks, you could do as follows:
> 
>     for (enum drm_panthor_perf_block_type type = 0; type < DRM_PANTHOR_PERF_BLOCK_MAX; type++) {
> 	switch (type) {
> 	case DRM_PANTHOR_PERF_BLOCK_METADATA:
> 		if (info->flags & DRM_PANTHOR_PERF_BLOCK_STATES_SUPPORT)
> 			offset = PANTHOR_PERF_SET_BLOCK_DESC_DATA(desc,
> 				DRM_PANTHOR_PERF_BLOCK_METADATA, 1, offset);
> 		break;
> 	case DRM_PANTHOR_PERF_BLOCK_MAX:
> 		drm_WARN_ON_ONCE(&ptdev->base,
> 				"DRM_PANTHOR_PERF_BLOCK_MAX should be unreachable!");
> 		break;
> 	default:
> 		u32 blk_count = *((&info->fw_blocks)+(type-1));
> 		offset = PANTHOR_PERF_SET_BLOCK_DESC_DATA(desc, type, blk_count, offset);
> 		break;
> 	}
> }
> 
> I already suggested a similar thing in a previous patch review, but this would depend
> on the compiler not inserting funky padding space between consecutive structure members,
> but all of them being u32 I guess that's not possible?
> 

I think I would prefer to avoid it in this case. It does save on some
code, but it seems less clear than the current implementation.

>> +
>> +	/* Computed size is not the same as the reported size, so we should not proceed in
>> +	 * initializing the sampling session.
>> +	 */
>> +	if (offset != fw_sample_size)
>> +		return -EINVAL;
>> +
>> +	desc->buffer_size = offset;
>> +
>> +	return 0;
>> +}
>> +
>> +static int panthor_perf_fw_stop_sampling(struct panthor_device *ptdev)
>> +{
>> +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
>> +	u32 acked;
>> +	int ret;
>> +
>> +	if (~READ_ONCE(glb_iface->input->req) & GLB_PERFCNT_ENABLE)
>> +		return 0;
>> +
>> +	panthor_fw_update_reqs(glb_iface, req, 0, GLB_PERFCNT_ENABLE);
>> +	gpu_write(ptdev, CSF_DOORBELL(CSF_GLB_DOORBELL_ID), 1);
>> +	ret = panthor_fw_glb_wait_acks(ptdev, GLB_PERFCNT_ENABLE, &acked, 100);
>> +	if (ret)
>> +		drm_warn(&ptdev->base, "Could not disable performance counters");
> 
> After this, don't we have to set the selected counters to 0 through the fw's glb interface?
> 
>        	   glb_iface->input->* = 0;

We do, I should be setting it on the enable path rather than the disable
path.

> 
>> +
>> +	return ret;
>> +}
>> +
>> +static int panthor_perf_fw_start_sampling(struct panthor_device *ptdev)
>> +{
>> +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
>> +	u32 acked;
>> +	int ret;
>> +
>> +	if (READ_ONCE(glb_iface->input->req) & GLB_PERFCNT_ENABLE)
>> +		return 0;
>> +
>> +	panthor_fw_update_reqs(glb_iface, req, GLB_PERFCNT_ENABLE, GLB_PERFCNT_ENABLE);
>> +	gpu_write(ptdev, CSF_DOORBELL(CSF_GLB_DOORBELL_ID), 1);
>> +	ret = panthor_fw_glb_wait_acks(ptdev, GLB_PERFCNT_ENABLE, &acked, 100);
>> +	if (ret)
>> +		drm_warn(&ptdev->base, "Could not enable performance counters");
> 
> Wouldn't you have to enable counters in the FW's global input interface? As in
> calling panthor_perf_fw_write_em(sampler, sampler->em)
> 

That is the intent. The two actions are somewhat decoupled, because you
can program the interface and not immediately start sampling, or
you can start/stop sampling around events that do not clear the
FW's GLB interface programming.

>> +
>> +	return ret;
>> +}
>> +
>> +static void panthor_perf_fw_write_em(struct panthor_perf_sampler *sampler,
>> +		struct panthor_perf_enable_masks *em)
>> +{
>> +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(sampler->ptdev);
>> +	u32 perfcnt_config;
>> +
>> +	glb_iface->input->perfcnt_csf_enable =
>> +		compress_enable_mask(em->mask[DRM_PANTHOR_PERF_BLOCK_CSHW]);
> 
> Wouldn't it be enough to get the lowest 32 bits of every bitmap for the actual mask?
> 

How so? We are trying to provide one bit per counter for the user
to individually select counters. This is also somewhat forward-looking,
because the Mali-G715 and onwards have 128 counters for every block.

>> +	glb_iface->input->perfcnt_shader_enable =
>> +		compress_enable_mask(em->mask[DRM_PANTHOR_PERF_BLOCK_SHADER]);
>> +	glb_iface->input->perfcnt_mmu_l2_enable =
>> +		compress_enable_mask(em->mask[DRM_PANTHOR_PERF_BLOCK_MEMSYS]);
>> +	glb_iface->input->perfcnt_tiler_enable =
>> +		compress_enable_mask(em->mask[DRM_PANTHOR_PERF_BLOCK_TILER]);
>> +	glb_iface->input->perfcnt_fw_enable =
>> +		compress_enable_mask(em->mask[DRM_PANTHOR_PERF_BLOCK_FW]);
>> +	glb_iface->input->perfcnt_csg_enable =
>> +		compress_enable_mask(em->mask[DRM_PANTHOR_PERF_BLOCK_CSG]);
>> +
>> +	perfcnt_config = GLB_PRFCNT_CONFIG_SIZE(PANTHOR_PERF_FW_RINGBUF_SLOTS);
> 
> Maybe replace this with
> 	perfcnt_config = GLB_PRFCNT_CONFIG_SIZE(sampler->sample_slots);
> in case the number of slots is ever made a module parameter?
> 
>> +	perfcnt_config |= GLB_PRFCNT_CONFIG_SET(sampler->set_config);
>> +	glb_iface->input->perfcnt_config = perfcnt_config;
>> +
>> +	/**
>> +	 * The spec mandates that the host zero the PRFCNT_EXTRACT register before an enable
>> +	 * operation, and each (re-)enable will require an enable-disable pair to program
>> +	 * the new changes onto the FW interface.
>> +	 */
>> +	WRITE_ONCE(glb_iface->input->perfcnt_extract, 0);
> 
>   Wouldn't this better go in panthor_perf_fw_stop_sampling()?
> 

The spec says that it must be cleared to zero before enabling the
performance counters, so I moved the clearing to be present
in the enabling. This way, the same helper works for both
initialization and re-initialization.

>> +}
>> +
>> +static void session_populate_sample_header(struct panthor_perf_session *session,
>> +		struct drm_panthor_perf_sample_header *hdr)
>> +{
>> +	hdr->block_set = 0;
>> +	hdr->user_data = session->user_data;
>> +	hdr->timestamp_start_ns = session->sample_start_ns;
>> +	/**
>> +	 * TODO This should be changed to use the GPU clocks and the TIMESTAMP register,
>> +	 * when support is added.
>> +	 */
> 
> Timestamp register is already available and used in the driver.
> 
>> +	hdr->timestamp_end_ns = ktime_get_raw_ns();

With timestamp start, we want to capture when the kernel requested a
timestamp. With the timestamp end, we want to instead capture when
the FW actually created the sample. This would involve using the
TIMESTAMP_LO and TIMESTAMP_HI values, along with the TIMESTAMP_OFFSET
register to ensure that the FW timestamps are aligned with those
of the host.

>> +}
>> +
>> +/**
>> + * session_patch_sample - Update the PRFCNT_EN header counter and the counters exposed to the
>> + *                        userspace client to only contain requested counters.
>> + *
>> + * @ptdev: Panthor device
>> + * @session: Perf session
>> + * @sample: Starting offset of the sample in the userspace mapping.
>> + *
>> + * The hardware supports counter selection at the granularity of 1 bit per 4 counters, and there
>> + * is a single global FW frontend to program the counter requests from multiple sessions. This may
>> + * lead to a large disparity between the requested and provided counters for an individual client.
>> + * To remove this cross-talk, we patch out the counters that have not been requested by this
>> + * session and update the PRFCNT_EN, the header counter containing a bitmask of enabled counters,
>> + * accordingly.
>> + */
>> +static void session_patch_sample(struct panthor_device *ptdev,
>> +		struct panthor_perf_session *session, u8 *sample)
>> +{
>> +	const struct drm_panthor_perf_info *const perf_info = &ptdev->perf_info;
>> +
>> +	const size_t block_size = get_annotated_block_size(perf_info->counters_per_block);
>> +	const size_t sample_size = session_get_max_sample_size(perf_info);
> 
> I think maybe you want the fw ring sample size minus the user ring header:
> const size_t blocks_size = sample_size - sizeof(struct drm_panthor_perf_sample_header);
> 

Definitely. The sequence here will be somewhat changed for the next
patch set.

>> +	for (size_t i = 0; i < sample_size; i += block_size) {
> Same as above:
> 
> 	for (size_t i = 0; i < blocks_size; i += block_size) {
> 
>> +		size_t ctr_idx;
>> +		DECLARE_BITMAP(em_diff, PANTHOR_PERF_EM_BITS);
>> +		struct panthor_perf_counter_block *blk = (typeof(blk))(sample + block_size);
> Same as above:
>                  struct panthor_perf_counter_block *blk = (typeof(blk))(sample + i);
> Otherwise the iteration index isn't used.
>> +		enum drm_panthor_perf_block_type type = blk->header.block_type;
>> +		unsigned long *blk_em = session->enabled_counters->mask[type];
>> +
>> +		bitmap_from_arr64(em_diff, blk->header.enable_mask, PANTHOR_PERF_EM_BITS);
>> +
>> +		bitmap_andnot(em_diff, em_diff, blk_em, PANTHOR_PERF_EM_BITS);
>> +
>> +		for_each_set_bit(ctr_idx, em_diff, PANTHOR_PERF_EM_BITS)
>> +			blk->counters[ctr_idx] = 0;
>> +
>> +		bitmap_to_arr64(blk->header.enable_mask, blk_em, PANTHOR_PERF_EM_BITS);
>> +	}
>> +}
>> +
>> +static int session_copy_samplei(struct panthor_device *ptdev,
>> +		struct panthor_perf_session *session)
>> +{
>> +	struct panthor_perf *perf = ptdev->perf;
>> +	const size_t sample_size = session_get_max_sample_size(&ptdev->perf_info);
>> +	const u32 insert_idx = session_read_insert_idx(session);
>> +	const u32 extract_idx = session_read_extract_idx(session);
>> +	u8 *new_sample;
>> +
>> +	if (!CIRC_SPACE_TO_END(insert_idx, extract_idx, session->ringbuf_slots))
>> +		return -ENOSPC;
>> +
>> +	new_sample = session->samples + extract_idx * sample_size;
> 
> Wouldn' this have to insert_idx?
> 

It should indeed be.

>> +
>> +	memcpy(new_sample, perf->sampler.sample, sample_size);
> 
> So effectively we're doing two copies here, one from the FW ringbuffer into the
> sampler, and another one from the sampler into the user ringbuffer. This sounds
> expensive. This is in line with what I already mentioned above, essentially that
> I thougth the FW ringbuffer would be somehow made available to UM samplers so
> as to afford zero copy.
> 

I would argue that in the case of accumulation, i.e., if we have
multiple samples in the FW ring buffer at a time, we are saving on
quite a lot of copies. In particular, if we go through power
management state transitions, and automatic samples are generated,
we will only get a single copy per n samples, rather than n copies
per n samples.

Operating on a counter by counter basis, we only really need to
copy the sample once, and then we can accumulate on the counters
that we need, which effectively makes some sessions have very
little overhead in the copying.

Exposing the FW ring buffer does not necessarily provide
enough information about how to process the sample to userspace.
Besides the discussion points on the uAPI about the layout algorithm
, we also have the firmware sampling modes, which may either
be manual or manual_no_clear, where the counters may be cleared
after a manual sample or not. Exposing enough information to
userspace to make the processing simple is also somewhat
prohibitive, and definitely breaks compatibility guarantees.

>> +	session_populate_sample_header(session,
>> +			(struct drm_panthor_perf_sample_header *)new_sample);
>> +
>> +	session_patch_sample(ptdev, session, new_sample +
>> +			sizeof(struct drm_panthor_perf_sample_header));
>> +
>> +	session_write_insert_idx(session, (insert_idx + 1) % session->ringbuf_slots);
>> +
>> +	/* Since we are about to notify userspace, we must ensure that all changes to memory
>> +	 * are visible.
>> +	 */
>> +	wmb();
>> +
>> +	eventfd_signal(session->eventfd);
>> +
>> +	return 0;
>> +}
>> +
>> +#define PERFCNT_IRQS (GLB_PERFCNT_OVERFLOW | GLB_PERFCNT_SAMPLE | GLB_PERFCNT_THRESHOLD)
>> +
>> +void panthor_perf_report_irq(struct panthor_device *ptdev, u32 status)
>> +{
>> +	struct panthor_perf *const perf = ptdev->perf;
>> +	struct panthor_perf_sampler *sampler;
>> +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
>> +
>> +	if (!(status & JOB_INT_GLOBAL_IF))
>> +		return;
>> +
>> +	if (!perf)
>> +		return;
>> +
>> +	sampler = &perf->sampler;
>> +
>> +	/* TODO This needs locking. */
>> +	const u32 ack = READ_ONCE(glb_iface->output->ack);
>> +	const u32 fw_events = sampler->last_ack ^ ack;
> 
> I think I would do it like this:
> 
>   	const u32 req = READ_ONCE(glb_iface->input->req);
>   	const u32 ack = READ_ONCE(glb_iface->output->ack);
> 
>   	if (!(~(req ^ ack) & PERFCNT_IRQS))
>   		return;
> 

That did not work for me without first checking
whether the bit was set previously. I'll try to retest
and see if there was some other issue preventing
this from working.

>> +	sampler->last_ack = ack;
>> +
>> +	if (!(fw_events & PERFCNT_IRQS))
>> +		return;
>> +
>> +	/* TODO Fix up the error handling for overflow. */
>> +	if (fw_events & GLB_PERFCNT_OVERFLOW)
>> +		return;
>> +
>> +	if (fw_events & (GLB_PERFCNT_SAMPLE | GLB_PERFCNT_THRESHOLD)) {
>> +		const u32 extract_idx = READ_ONCE(glb_iface->input->perfcnt_extract);
>> +		const u32 insert_idx = READ_ONCE(glb_iface->output->perfcnt_insert);
>> +
>> +		WRITE_ONCE(glb_iface->input->perfcnt_extract,
>> +				panthor_perf_handle_sample(ptdev, extract_idx, insert_idx));
>> +	}
>> +
>> +	scoped_guard(mutex, &sampler->sampler_lock)
>> +	{
>> +		struct list_head *pos, *temp;
>> +
>> +		list_for_each_safe(pos, temp, &sampler->sampler_list) {
>> +			struct panthor_perf_session *session = list_entry(pos,
>> +					struct panthor_perf_session, waiting);
>> +
>> +			session_copy_sample(ptdev, session);
>> +			list_del_init(pos);
> 
> I guess in the case of samples being taken periodically, you would not delete
> the session from the sampler's list?
> 

Essentially.

>> +
>> +			session_put(session);
>> +		}
>> +	}
>> +
>> +	memset(sampler->sample, 0, session_get_max_sample_size(&ptdev->perf_info));
> 
> I don't think sampler->sample has the user sample header in it, so we might be
> zero'ing out more memory than really needed.
> 

Checked on this and this was one of the bugs that we found internally.
Thanks for the catch!

>> +	sampler->sample_requested = false;
>> +	complete(&sampler->sample_handled);
>> +}
>> +
>> +
>> +static int panthor_perf_sampler_init(struct panthor_perf_sampler *sampler,
>> +		struct panthor_device *ptdev)
>> +{
>> +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
>> +	struct panthor_kernel_bo *bo;
>> +	u8 *sample;
>> +	int ret;
>> +
>> +	ret = panthor_perf_setup_fw_buffer_desc(ptdev, sampler);
>> +	if (ret) {
>> +		drm_err(&ptdev->base,
>> +				"Failed to setup descriptor for FW ring buffer, err = %d", ret);
>> +		return ret;
>> +	}
>> +
>> +	bo = panthor_kernel_bo_create(ptdev, panthor_fw_vm(ptdev),
>> +			sampler->desc.buffer_size * PANTHOR_PERF_FW_RINGBUF_SLOTS,
>> +			DRM_PANTHOR_BO_NO_MMAP,
>> +			DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC | DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
>> +			PANTHOR_VM_KERNEL_AUTO_VA);
>> +
>> +	if (IS_ERR_OR_NULL(bo))
>> +		return IS_ERR(bo) ? PTR_ERR(bo) : -ENOMEM;
>> +
>> +	ret = panthor_kernel_bo_vmap(bo);
>> +	if (ret)
>> +		goto cleanup_bo;
>> +
>> +	sample = devm_kzalloc(ptdev->base.dev,
>> +			session_get_max_sample_size(&ptdev->perf_info), GFP_KERNEL);
>> +	if (ZERO_OR_NULL_PTR(sample)) {
>> +		ret = -ENOMEM;
>> +		goto cleanup_vmap;
>> +	}
>> +
>> +	glb_iface->input->perfcnt_as = panthor_vm_as(panthor_fw_vm(ptdev));
>> +	glb_iface->input->perfcnt_base = panthor_kernel_bo_gpuva(bo);
>> +	glb_iface->input->perfcnt_extract = 0;
>> +	glb_iface->input->perfcnt_csg_select = GENMASK(glb_iface->control->group_num, 0);
>> +
>> +	sampler->rb = bo;
>> +	sampler->sample = sample;
>> +	sampler->sample_slots = PANTHOR_PERF_FW_RINGBUF_SLOTS;
> 
> I already mentioned this, but I think it'd be nice to have this configurable
> through a module parameter.
> 
> Also, I suspect you might've forgotten this assignment right here:
> 
> 	sampler->sample_size = sampler->desc.buffer_size;
> 

I think I will remove it, since we already have the sample size
available via helper.

Before I forget, I thought module parameters for new
drivers were frowned upon. Are we okay with the module
parameter for this case?

>> +
>> +	sampler->em = panthor_perf_em_new();
>> +
>> +	mutex_init(&sampler->sampler_lock);
>> +	mutex_init(&sampler->config_lock);
>> +	INIT_LIST_HEAD(&sampler->sampler_list);
>> +	INIT_LIST_HEAD(&sampler->ems);
>> +	init_completion(&sampler->sample_handled);
>> +
>> +	sampler->ptdev = ptdev;
>> +
>> +	return 0;
>> +
>> +cleanup_vmap:
>> +	panthor_kernel_bo_vunmap(bo);
>> +
>> +cleanup_bo:
>> +	panthor_kernel_bo_destroy(bo);
>> +
>> +	return ret;
>> +}
>> +
>> +static void panthor_perf_sampler_term(struct panthor_perf_sampler *sampler)
>> +{
>> +	int ret;
>> +
>> +	if (sampler->sample_requested)
>> +		wait_for_completion_killable(&sampler->sample_handled);
> 
> Wouldn't it be better to wait until a certain deadline?
> 

Not entirely sure what a reasonable deadline could be. I would expect
not to hit this case in practice, since this would involve driver
teardown while a process is actively trying to sample, which
does not seem likely.

>> +	panthor_perf_fw_write_em(sampler, &(struct panthor_perf_enable_masks) {});
> 
> I would handle this through a little helper and call it panthor_perf_fw_zero_em
> because other calls of panthor_perf_fw_write_em() always feature the same arguments.
> 
>> +	ret = panthor_perf_fw_stop_sampling(sampler->ptdev);
>> +	if (ret)
>> +		drm_warn_once(&sampler->ptdev->base, "Sampler termination failed, ret = %d", ret);
>> +
>> +	devm_kfree(sampler->ptdev->base.dev, sampler->sample);
>> +
>> +	panthor_kernel_bo_destroy(sampler->rb);
>> +}
>> +
>> +static int panthor_perf_sampler_add(struct panthor_perf_sampler *sampler,
>> +		struct panthor_perf_enable_masks *const new_em,
>> +		u8 set)
>> +{
>> +	int ret = 0;
>> +
>> +	guard(mutex)(&sampler->config_lock);
>> +
>> +	/* Early check for whether a new set can be configured. */
>> +	if (!atomic_read(&sampler->enabled_clients))
>> +		sampler->set_config = set;
>> +	else
>> +		if (sampler->set_config != set)
>> +			return -EBUSY;
>> +
>> +	kref_get(&new_em->refs);
>> +	list_add_tail(&sampler->ems, &new_em->link);
>> +
>> +	panthor_perf_em_add(sampler->em, new_em);
>> +	pm_runtime_get_sync(sampler->ptdev->base.dev);
>> +
>> +	if (atomic_read(&sampler->enabled_clients)) {
>> +		ret = panthor_perf_fw_stop_sampling(sampler->ptdev);
>> +		if (ret)
>> +			return ret;
> 
> What happens in the case of a manual session sample? Does that mean it'll get
> interrupted? I guess if we use eventfd it's not a problem, because UM will be
> notified of when the sample it's made available?
> 

It should not get interrupted. We would lose a little bit of data in
the period of time between starting and stopping to sample, but
that is about the best we can do.

>> +	}
>> +
>> +	panthor_perf_fw_write_em(sampler, sampler->em);
> 
> Almost every single use of panthor_perf_fw_write_em is called with
> the same arguments, so I would define a helper that passes sampler->em underneath,
> maybe something like panthor_perf_fw_enable_sampler_mask
> 
>> +	ret = panthor_perf_fw_start_sampling(sampler->ptdev);
>> +	if (ret)
>> +		return ret;
> 
> iWwouldn't you do this only if if (atomic_read(&sampler->enabled_clients) > 0) ?
> Unless you want to start sampling now rather than waiting for session start.
> 
>> +	atomic_inc(&sampler->enabled_clients);
>> +
>> +	return 0;
>> +}
>> +
>> +static int panthor_perf_sampler_remove(struct panthor_perf_sampler *sampler,
>> +		struct panthor_perf_enable_masks *session_em)
> 
> I might prefer to call this sampler_remove_session.

Ack.
> 
>> +{
>> +	int ret;
>> +	struct list_head *em_node;
>> +
>> +	guard(mutex)(&sampler->config_lock);
>> +
>> +	list_del_init(&session_em->link);
>> +	kref_put(&session_em->refs, panthor_perf_destroy_em_kref);
>> +
>> +	panthor_perf_em_zero(sampler->em);
>> +	list_for_each(em_node, &sampler->ems)
>> +	{
>> +		struct panthor_perf_enable_masks *curr_em =
>> +			container_of(em_node, typeof(*curr_em), link);
>> +
>> +		panthor_perf_em_add(sampler->em, curr_em);
>> +	}
>> +
>> +	ret = panthor_perf_fw_stop_sampling(sampler->ptdev);
>> +	if (ret)
>> +		return ret;
>> +
>> +	atomic_dec(&sampler->enabled_clients);
>> +	pm_runtime_put_sync(sampler->ptdev->base.dev);
>> +
>> +	panthor_perf_fw_write_em(sampler, sampler->em);
>> +
>> +	if (atomic_read(&sampler->enabled_clients))
>> +		return panthor_perf_fw_start_sampling(sampler->ptdev);
>> +	return 0;
>> +}
>> +
>>   /**
>>    * panthor_perf_init - Initialize the performance counter subsystem.
>>    * @ptdev: Panthor device
>> @@ -370,6 +1140,7 @@ static size_t session_get_max_sample_size(const struct drm_panthor_perf_info *co
>>   int panthor_perf_init(struct panthor_device *ptdev)
>>   {
>>   	struct panthor_perf *perf;
>> +	int ret;
>>   
>>   	if (!ptdev)
>>   		return -EINVAL;
>> @@ -386,12 +1157,93 @@ int panthor_perf_init(struct panthor_device *ptdev)
>>   		.max = 1,
>>   	};
>>   
>> +	ret = panthor_perf_sampler_init(&perf->sampler, ptdev);
>> +	if (ret)
>> +		goto cleanup_perf;
>> +
>>   	drm_info(&ptdev->base, "Performance counter subsystem initialized");
>>   
>>   	ptdev->perf = perf;
>>   
>> -	return 0;
>> +	return ret;
>> +
>> +cleanup_perf:
>> +	devm_kfree(ptdev->base.dev, perf);
>> +
>> +	return ret;
>> +}
>> +
> 
> Nit: spurious blank line
> 
>> +static void panthor_perf_fw_request_sample(struct panthor_perf_sampler *sampler)
>> +{
>> +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(sampler->ptdev);
>> +
>> +	panthor_fw_toggle_reqs(glb_iface, req, ack, GLB_PERFCNT_SAMPLE);
>> +	gpu_write(sampler->ptdev, CSF_DOORBELL(CSF_GLB_DOORBELL_ID), 1);
>> +}
> 
> Is there a way to reset the counters to zero without requesting a clearing sample?
>

There is no other way to establish a baseline so that newly added
sessions get a consistent view of the world. Note that this
clearing sample is effectively just a regular sample that
discards counters accumulated up until this point.

>> +/**
>> + * panthor_perf_sampler_request_clearing - Request a clearing sample.
>> + * @sampler: Panthor sampler
>> + *
>> + * Perform a synchronous sample that gets immediately discarded. This sets a baseline at the point
>> + * of time a new session is started, to avoid having counters from before the session.
>> + *
>> + */
>> +static int panthor_perf_sampler_request_clearing(struct panthor_perf_sampler *sampler)
>> +{
>> +	scoped_guard(mutex, &sampler->sampler_lock) {
>> +		if (!sampler->sample_requested) {
>> +			panthor_perf_fw_request_sample(sampler);
>> +			sampler->sample_requested = true;
>> +		}
>> +	}
>> +
>> +	return wait_for_completion_timeout(&sampler->sample_handled,
>> +			msecs_to_jiffies(1000));
>> +}
>> +
>> +/**
>> + * panthor_perf_sampler_request_sample - Request a counter sample for the userspace client.
>> + * @sampler: Panthor sampler
>> + * @session: Target session
>> + *
>> + * A session that has already requested a sample cannot request another one until the previous
>> + * sample has been delivered.
>> + *
>> + * Return:
>> + * * %0       - The sample has been requested successfully.
>> + * * %-EBUSY  - The target session has already requested a sample and has not received it yet.
>> + */
>> +static int panthor_perf_sampler_request_sample(struct panthor_perf_sampler *sampler,
>> +		struct panthor_perf_session *session)
>> +{
>> +	struct list_head *head;
>> +
>> +	reinit_completion(&sampler->sample_handled);
> 
> Shouldn't you wait until checking whether a sample has already been requested?

To reinitialize the completion? I think so.

> 
>> +	guard(mutex)(&sampler->sampler_lock);
>> +
>> +	/*
>> +	 * If a previous sample has not been handled yet, the session cannot request another
>> +	 * sample. If this happens too often, the requested sample rate is too high.
>> +	 */
>> +	list_for_each(head, &sampler->sampler_list) {
>> +		struct panthor_perf_session *cur_session = list_entry(head,
>> +				typeof(*cur_session), waiting);
>> +
>> +		if (session == cur_session)
>> +			return -EBUSY;
>> +	}
>> +
>> +	if (list_empty(&sampler->sampler_list) && !sampler->sample_requested)
>> +		panthor_perf_fw_request_sample(sampler);
>>   
>> +	sampler->sample_requested = true;
>> +	list_add_tail(&session->waiting, &sampler->sampler_list);
>> +	session_get(session);
>> +
>> +	return 0;
>>   }
>>   
>>   static int session_validate_set(u8 set)
>> @@ -483,7 +1335,12 @@ int panthor_perf_session_setup(struct panthor_device *ptdev, struct panthor_perf
>>   		goto cleanup_eventfd;
>>   	}
>>   
>> +	ret = panthor_perf_sampler_add(&perf->sampler, em, setup_args->block_set);
>> +	if (ret)
>> +		goto cleanup_em;
>> +
>>   	INIT_LIST_HEAD(&session->waiting);
>> +
>>   	session->extract_idx = ctrl_map.vaddr;
>>   	*session->extract_idx = 0;
>>   	session->insert_idx = session->extract_idx + 1;
>> @@ -507,12 +1364,15 @@ int panthor_perf_session_setup(struct panthor_device *ptdev, struct panthor_perf
>>   	ret = xa_alloc_cyclic(&perf->sessions, &session_id, session, perf->session_range,
>>   			&perf->next_session, GFP_KERNEL);
>>   	if (ret < 0)
>> -		goto cleanup_em;
>> +		goto cleanup_sampler_add;
>>   
>>   	kref_init(&session->ref);
>>   
>>   	return session_id;
>>   
>> +cleanup_sampler_add:
>> +	panthor_perf_sampler_remove(&perf->sampler, em);
>> +
>>   cleanup_em:
>>   	kref_put(&em->refs, panthor_perf_destroy_em_kref);
>>   
>> @@ -540,6 +1400,8 @@ int panthor_perf_session_setup(struct panthor_device *ptdev, struct panthor_perf
>>   static int session_stop(struct panthor_perf *perf, struct panthor_perf_session *session,
>>   		u64 user_data)
>>   {
>> +	int ret;
>> +
>>   	if (!test_bit(PANTHOR_PERF_SESSION_ACTIVE, session->state))
>>   		return 0;
>>   
>> @@ -552,6 +1414,10 @@ static int session_stop(struct panthor_perf *perf, struct panthor_perf_session *
>>   
>>   	session->user_data = user_data;
>>   
>> +	ret = panthor_perf_sampler_request_sample(&perf->sampler, session);
>> +	if (ret)
>> +		return ret;
>> +
>>   	clear_bit(PANTHOR_PERF_SESSION_ACTIVE, session->state);
>>   
>>   	/* TODO Calls to the FW interface will go here in later patches. */
>> @@ -573,8 +1439,7 @@ static int session_start(struct panthor_perf *perf, struct panthor_perf_session
>>   	if (session->sample_freq_ns)
>>   		session->user_data = user_data;
>>   
>> -	/* TODO Calls to the FW interface will go here in later patches. */
>> -	return 0;
>> +	return panthor_perf_sampler_request_clearing(&perf->sampler);
>>   }
>>   
>>   static int session_sample(struct panthor_perf *perf, struct panthor_perf_session *session,
>> @@ -601,15 +1466,16 @@ static int session_sample(struct panthor_perf *perf, struct panthor_perf_session
>>   	session->sample_start_ns = ktime_get_raw_ns();
>>   	session->user_data = user_data;
>>   
>> -	/* TODO Calls to the FW interface will go here in later patches. */
>> -	return 0;
>> +	return panthor_perf_sampler_request_sample(&perf->sampler, session);
>>   }
>>   
>>   static int session_destroy(struct panthor_perf *perf, struct panthor_perf_session *session)
>>   {
>> +	int ret = panthor_perf_sampler_remove(&perf->sampler, session->enabled_counters);
>> +
>>   	session_put(session);
>>   
>> -	return 0;
>> +	return ret;
>>   }
>>   
>>   static int session_teardown(struct panthor_perf *perf, struct panthor_perf_session *session)
>> @@ -813,6 +1679,8 @@ void panthor_perf_unplug(struct panthor_device *ptdev)
>>   
>>   	xa_destroy(&perf->sessions);
>>   
>> +	panthor_perf_sampler_term(&perf->sampler);
>> +
>>   	devm_kfree(ptdev->base.dev, ptdev->perf);
>>   
>>   	ptdev->perf = NULL;
>> diff --git a/drivers/gpu/drm/panthor/panthor_perf.h b/drivers/gpu/drm/panthor/panthor_perf.h
>> index bfef8874068b..3485e4a55e15 100644
>> --- a/drivers/gpu/drm/panthor/panthor_perf.h
>> +++ b/drivers/gpu/drm/panthor/panthor_perf.h
>> @@ -31,4 +31,6 @@ int panthor_perf_session_sample(struct panthor_file *pfile, struct panthor_perf
>>   		u32 sid, u64 user_data);
>>   void panthor_perf_session_destroy(struct panthor_file *pfile, struct panthor_perf *perf);
>>   
>> +void panthor_perf_report_irq(struct panthor_device *ptdev, u32 status);
>> +
>>   #endif /* __PANTHOR_PERF_H__ */
>> diff --git a/include/uapi/drm/panthor_drm.h b/include/uapi/drm/panthor_drm.h
>> index 576d3ad46e6d..a29b755d6556 100644
>> --- a/include/uapi/drm/panthor_drm.h
>> +++ b/include/uapi/drm/panthor_drm.h
>> @@ -441,8 +441,11 @@ enum drm_panthor_perf_feat_flags {
>>    * enum drm_panthor_perf_block_type - Performance counter supported block types.
>>    */
>>   enum drm_panthor_perf_block_type {
>> +	/** DRM_PANTHOR_PERF_BLOCK_METADATA: Internal use only. */
>> +	DRM_PANTHOR_PERF_BLOCK_METADATA = 0,
>> +
>>   	/** @DRM_PANTHOR_PERF_BLOCK_FW: The FW counter block. */
>> -	DRM_PANTHOR_PERF_BLOCK_FW = 1,
>> +	DRM_PANTHOR_PERF_BLOCK_FW,
>>   
>>   	/** @DRM_PANTHOR_PERF_BLOCK_CSG: A CSG counter block. */
>>   	DRM_PANTHOR_PERF_BLOCK_CSG,
>> -- 
>> 2.25.1
> 
> 
> Adrian Larumbe


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC v2 7/8] drm/panthor: Add suspend/resume handling for the performance counters
  2025-01-27 20:06   ` Adrián Larumbe
@ 2025-03-27  8:57     ` Lukas Zapolskas
  0 siblings, 0 replies; 28+ messages in thread
From: Lukas Zapolskas @ 2025-03-27  8:57 UTC (permalink / raw)
  To: Adrián Larumbe
  Cc: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	dri-devel, linux-kernel, Mihail Atanassov, nd



On 27/01/2025 20:06, Adrián Larumbe wrote:
> On 11.12.2024 16:50, Lukas Zapolskas wrote:
>> Signed-off-by: Lukas Zapolskas <lukas.zapolskas@arm.com>
>> ---
>>   drivers/gpu/drm/panthor/panthor_device.c |  3 +
>>   drivers/gpu/drm/panthor/panthor_perf.c   | 86 ++++++++++++++++++++++++
>>   drivers/gpu/drm/panthor/panthor_perf.h   |  2 +
>>   3 files changed, 91 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
>> index 1a81a436143b..69536fbdb5ef 100644
>> --- a/drivers/gpu/drm/panthor/panthor_device.c
>> +++ b/drivers/gpu/drm/panthor/panthor_device.c
>> @@ -475,6 +475,7 @@ int panthor_device_resume(struct device *dev)
>>   		ret = drm_WARN_ON(&ptdev->base, panthor_fw_resume(ptdev));
>>   		if (!ret) {
>>   			panthor_sched_resume(ptdev);
>> +			panthor_perf_resume(ptdev);
>>   		} else {
>>   			panthor_mmu_suspend(ptdev);
>>   			panthor_gpu_suspend(ptdev);
>> @@ -543,6 +544,7 @@ int panthor_device_suspend(struct device *dev)
>>   	    drm_dev_enter(&ptdev->base, &cookie)) {
>>   		cancel_work_sync(&ptdev->reset.work);
>>   
>> +		panthor_perf_suspend(ptdev);
>>   		/* We prepare everything as if we were resetting the GPU.
>>   		 * The end of the reset will happen in the resume path though.
>>   		 */
>> @@ -561,6 +563,7 @@ int panthor_device_suspend(struct device *dev)
>>   			panthor_mmu_resume(ptdev);
>>   			drm_WARN_ON(&ptdev->base, panthor_fw_resume(ptdev));
>>   			panthor_sched_resume(ptdev);
>> +			panthor_perf_resume(ptdev);
>>   			drm_dev_exit(cookie);
>>   		}
>>   
>> diff --git a/drivers/gpu/drm/panthor/panthor_perf.c b/drivers/gpu/drm/panthor/panthor_perf.c
>> index d62d97c448da..727e66074eab 100644
>> --- a/drivers/gpu/drm/panthor/panthor_perf.c
>> +++ b/drivers/gpu/drm/panthor/panthor_perf.c
>> @@ -433,6 +433,17 @@ static void panthor_perf_em_zero(struct panthor_perf_enable_masks *em)
>>   		bitmap_zero(em->mask[i], PANTHOR_PERF_EM_BITS);
>>   }
>>   
>> +static bool panthor_perf_em_empty(const struct panthor_perf_enable_masks *const em)
>> +{
>> +	bool empty = true;
>> +	size_t i = 0;
>> +
>> +	for (i = DRM_PANTHOR_PERF_BLOCK_FW; i <= DRM_PANTHOR_PERF_BLOCK_LAST; i++)
>> +		empty &= bitmap_empty(em->mask[i], PANTHOR_PERF_EM_BITS);
>> +
>> +	return empty;
>> +}
>> +
>>   static void panthor_perf_destroy_em_kref(struct kref *em_kref)
>>   {
>>   	struct panthor_perf_enable_masks *em = container_of(em_kref, typeof(*em), refs);
>> @@ -1652,6 +1663,81 @@ void panthor_perf_session_destroy(struct panthor_file *pfile, struct panthor_per
>>   	}
>>   }
>>   
>> +static int panthor_perf_sampler_resume(struct panthor_perf_sampler *sampler)
>> +{
>> +	int ret;
>> +
>> +	if (!atomic_read(&sampler->enabled_clients))
>> +		return 0;
>> +
>> +	if (!panthor_perf_em_empty(sampler->em)) {
>> +		guard(mutex)(&sampler->config_lock);
>> +		panthor_perf_fw_write_em(sampler, sampler->em);
>> +	}
> 
> Aren't panthor_perf_em_empty(sampler->em) and !atomic_read(&sampler->enabled_clients) functionally equivalent?
> 
Hadn't thought about that before, but it may be the case. It
makes a slight difference for adding a new session to the
sampler, where we need to keep track of both the
previous and the current mask, as well as removing a session,
where the order of operation becomes a little awkward if
we use them to mean the same thing.

The sampler's enable mask is seen as somewhat disposable
in the case of removing a session, since we cannot just
remove the counters requested by that session and be done
with it. This would lead to counters that are requested
by other sessions being deleted. So we zero out the
enable mask and then recreate it using all of the enable
masks from the other sessions.

>> +
>> +	ret = panthor_perf_fw_start_sampling(sampler->ptdev);
>> +	if (ret)
>> +		return ret;
>> +
>> +	return 0;
>> +}
>> +
>> +static int panthor_perf_sampler_suspend(struct panthor_perf_sampler *sampler)
>> +{
>> +	int ret;
>> +
>> +	if (!atomic_read(&sampler->enabled_clients))
>> +		return 0;
>> +
>> +	ret = panthor_perf_fw_stop_sampling(sampler->ptdev);
>> +	if (ret)
>> +		return ret;
>> +
>> +	return 0;
>> +}
>> +
>> +/**
>> + * panthor_perf_suspend - Prepare the performance counter subsystem for system suspend.
>> + * @ptdev: Panthor device.
>> + *
>> + * Indicate to the performance counters that the system is suspending.
>> + *
>> + * This function must not be used to handle MCU power state transitions: just before MCU goes
>> + * from on to any inactive state, an automatic sample will be performed by the firmware, and
>> + * the performance counter firmware state will be restored on warm boot.
>> + *
>> + * Return: 0 on success, negative error code on failure.
>> + */
>> +int panthor_perf_suspend(struct panthor_device *ptdev)
>> +{
>> +	struct panthor_perf *perf = ptdev->perf;
>> +
>> +	if (!perf)
>> +		return 0;
>> +
>> +	return panthor_perf_sampler_suspend(&perf->sampler);
>> +}
>> +
>> +/**
>> + * panthor_perf_resume - Resume the performance counter subsystem after system resumption.
>> + * @ptdev: Panthor device.
>> + *
>> + * Indicate to the performance counters that the system has resumed. This must not be used
>> + * to handle MCU state transitions, for the same reasons as detailed in the kerneldoc for
>> + * @panthor_perf_suspend.
>> + *
>> + * Return: 0 on success, negative error code on failure.
>> + */
>> +int panthor_perf_resume(struct panthor_device *ptdev)
>> +{
>> +	struct panthor_perf *perf = ptdev->perf;
>> +
>> +	if (!perf)
>> +		return 0;
>> +
>> +	return panthor_perf_sampler_resume(&perf->sampler);
>> +}
>> +
>>   /**
>>    * panthor_perf_unplug - Terminate the performance counter subsystem.
>>    * @ptdev: Panthor device.
>> diff --git a/drivers/gpu/drm/panthor/panthor_perf.h b/drivers/gpu/drm/panthor/panthor_perf.h
>> index 3485e4a55e15..a22a511a0809 100644
>> --- a/drivers/gpu/drm/panthor/panthor_perf.h
>> +++ b/drivers/gpu/drm/panthor/panthor_perf.h
>> @@ -16,6 +16,8 @@ struct panthor_perf;
>>   void panthor_perf_info_init(struct panthor_device *ptdev);
>>   
>>   int panthor_perf_init(struct panthor_device *ptdev);
>> +int panthor_perf_suspend(struct panthor_device *ptdev);
>> +int panthor_perf_resume(struct panthor_device *ptdev);
>>   void panthor_perf_unplug(struct panthor_device *ptdev);
>>   
>>   int panthor_perf_session_setup(struct panthor_device *ptdev, struct panthor_perf *perf,
>> -- 
>> 2.25.1
> 
> Adrian Larumbe


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC v2 8/8] drm/panthor: Expose the panthor perf ioctls
  2025-01-27 20:14   ` Adrián Larumbe
@ 2025-03-27  8:58     ` Lukas Zapolskas
  0 siblings, 0 replies; 28+ messages in thread
From: Lukas Zapolskas @ 2025-03-27  8:58 UTC (permalink / raw)
  To: Adrián Larumbe
  Cc: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	dri-devel, linux-kernel, Mihail Atanassov, nd



On 27/01/2025 20:14, Adrián Larumbe wrote:
> I don't know what the usual practice is when adding a new DRM driver ioctl(), but wouldn't it make
> more sense to add the PERF_CONTROL one to the panthor_drm_driver_ioctls array in this patch instead?
>

That does make more sense, I'll shuffle the patches around.


> Other than that:
> 
> Reviewed-by: Adrián Larumbe <adrian.larumbe@collabora.com>
> 
> On 11.12.2024 16:50, Lukas Zapolskas wrote:
>> Signed-off-by: Lukas Zapolskas <lukas.zapolskas@arm.com>
>> ---
>>   drivers/gpu/drm/panthor/panthor_drv.c | 4 +++-
>>   1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
>> index 2848ab442d10..ef081a383fa9 100644
>> --- a/drivers/gpu/drm/panthor/panthor_drv.c
>> +++ b/drivers/gpu/drm/panthor/panthor_drv.c
>> @@ -1654,6 +1654,8 @@ static void panthor_debugfs_init(struct drm_minor *minor)
>>    * - 1.1 - adds DEV_QUERY_TIMESTAMP_INFO query
>>    * - 1.2 - adds DEV_QUERY_GROUP_PRIORITIES_INFO query
>>    *       - adds PANTHOR_GROUP_PRIORITY_REALTIME priority
>> + * - 1.3 - adds DEV_QUERY_PERF_INFO query
>> + *         adds PERF_CONTROL ioctl
>>    */
>>   static const struct drm_driver panthor_drm_driver = {
>>   	.driver_features = DRIVER_RENDER | DRIVER_GEM | DRIVER_SYNCOBJ |
>> @@ -1667,7 +1669,7 @@ static const struct drm_driver panthor_drm_driver = {
>>   	.name = "panthor",
>>   	.desc = "Panthor DRM driver",
>>   	.major = 1,
>> -	.minor = 2,
>> +	.minor = 3,
>>   
>>   	.gem_create_object = panthor_gem_create_object,
>>   	.gem_prime_import_sg_table = drm_gem_shmem_prime_import_sg_table,
>> -- 
>> 2.25.1
> 
> 
> Adrian Larumbe

Thank you very much for taking a look! I'm working on addressing
the comments you left and hoping to get a v3 up soon.

Kind regards,
Lukas Zapolskas


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2025-03-27  8:59 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-12-11 16:50 [RFC v2 0/8] drm/panthor: Add performance counters with manual sampling mode Lukas Zapolskas
2024-12-11 16:50 ` [RFC v2 1/8] drm/panthor: Add performance counter uAPI Lukas Zapolskas
2025-01-27  9:47   ` Adrián Larumbe
2025-03-26 14:24     ` Lukas Zapolskas
2024-12-11 16:50 ` [RFC v2 2/8] drm/panthor: Add DEV_QUERY.PERF_INFO handling for Gx10 Lukas Zapolskas
2025-01-27  9:56   ` Adrián Larumbe
2025-01-27 22:17   ` Adrián Larumbe
2024-12-11 16:50 ` [RFC v2 3/8] drm/panthor: Add panthor_perf_init and panthor_perf_unplug Lukas Zapolskas
2025-01-27 12:46   ` Adrián Larumbe
2025-03-26 14:36     ` Lukas Zapolskas
2025-01-27 15:50   ` adrian.larumbe
2024-12-11 16:50 ` [RFC v2 4/8] drm/panthor: Add panthor perf ioctls Lukas Zapolskas
2025-01-27 14:06   ` Adrián Larumbe
2025-03-26 14:40     ` Lukas Zapolskas
2024-12-11 16:50 ` [RFC v2 5/8] drm/panthor: Introduce sampling sessions to handle userspace clients Lukas Zapolskas
2025-01-27 15:43   ` Adrián Larumbe
2025-03-26 15:14     ` Lukas Zapolskas
2025-01-27 21:39   ` Adrián Larumbe
2024-12-11 16:50 ` [RFC v2 6/8] drm/panthor: Implement the counter sampler and sample handling Lukas Zapolskas
2025-01-27 16:53   ` Adrián Larumbe
2025-03-27  8:53     ` Lukas Zapolskas
2025-01-27 21:09   ` Adrián Larumbe
2024-12-11 16:50 ` [RFC v2 7/8] drm/panthor: Add suspend/resume handling for the performance counters Lukas Zapolskas
2025-01-27 20:06   ` Adrián Larumbe
2025-03-27  8:57     ` Lukas Zapolskas
2024-12-11 16:50 ` [RFC v2 8/8] drm/panthor: Expose the panthor perf ioctls Lukas Zapolskas
2025-01-27 20:14   ` Adrián Larumbe
2025-03-27  8:58     ` Lukas Zapolskas

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox