* [PATCH v5 0/4] Support fdinfo runtime and memory stats on Panthor
@ 2024-09-03 20:25 Adrián Larumbe
2024-09-03 20:25 ` [PATCH v5 1/4] drm/panthor: introduce job cycle and timestamp accounting Adrián Larumbe
` (3 more replies)
0 siblings, 4 replies; 15+ messages in thread
From: Adrián Larumbe @ 2024-09-03 20:25 UTC (permalink / raw)
To: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Daniel Vetter,
Sumit Semwal, Christian König
Cc: kernel, Adrián Larumbe, dri-devel, linux-kernel, linux-media,
linaro-mm-sig
This patch series enables userspace utilities like gputop and nvtop to
query a render context's fdinfo file and figure out rates of engine
and memory utilisation.
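As context for how such utilities consume these stats: fdinfo exposes plain key-value lines (e.g. `drm-cycles-panthor:\t<count>`). A minimal, hypothetical parser for one such line might look like the sketch below; the helper name and return convention are illustrative assumptions, only the key names come from this series.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: extract the integer value of one fdinfo
 * key-value pair, e.g. "drm-cycles-panthor:\t1234". The key names
 * follow the drm-usage-stats format this series exports; everything
 * else here is an illustrative assumption, not part of the patches.
 * Returns 0 on success, -1 when the line doesn't match the key. */
static int parse_fdinfo_u64(const char *line, const char *key,
			    unsigned long long *val)
{
	size_t klen = strlen(key);

	if (strncmp(line, key, klen) != 0 || line[klen] != ':')
		return -1;

	/* %llu skips the tab/whitespace after the colon. */
	return sscanf(line + klen + 1, "%llu", val) == 1 ? 0 : -1;
}
```

A tool like gputop would sample these counters twice, some interval apart, and divide the deltas by the elapsed time to obtain utilisation rates.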
Previous discussion can be found at
https://lore.kernel.org/dri-devel/20240716201302.2939894-1-adrian.larumbe@collabora.com/
Changelog:
v5:
- Moved profiling information slots into a per-queue BO and away from syncobjs.
- Derive the size of the profiling slots BO from the CS size of a minimal profiled job
- Turn job and device profiling flag into a bit mask so that individual metrics
can be enabled separately.
- Shrunk ringbuffer slot size to that of a cache line.
- Track profiling slot indices separately from those of the job's queue ringbuffer
- Emit CS instructions one by one and tag them depending on profiling mask
- New helper for calculating job credits depending on profiling flags
- Add Documentation file for sysfs profiling knob
- fdinfo will only show the engine or cycle tags if these are respectively enabled.
v4:
- Fixed wrong assignment location for frequency values in Panthor's devfreq
- Removed the last two commits about registering size of internal BO's
- Rearranged patch series so that sysfs knob is done last and all the previous
time sampling and fdinfo show dependencies are already in place
v3:
- Fixed some nits and removed useless bounds check in panthor_sched.c
- Added support for sysfs profiling knob and optional job accounting
- Added new patches for calculating size of internal BO's
v2:
- Split original first patch in two, one for FW CS cycle and timestamp
calculations and job accounting memory management, and a second one
that enables fdinfo.
- Moved NUM_INSTRS_PER_SLOT to the file prelude
- Removed nelem variable from the group's struct definition.
- Precompute size of group's syncobj BO to avoid code duplication.
- Some minor nits.
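For reference, the v5 change that sizes the profiling slots BO from the CS size of a minimal profiled job boils down to dividing the ringbuffer size by the smallest profiled-job footprint. A hedged sketch of that arithmetic (names and constants are illustrative, not the driver's code):

```c
/* Sketch of the v5 profiling-slots sizing rule: give the per-queue
 * profiling BO one slot for every job that could fit in the CS
 * ringbuffer if all jobs were minimal profiled ones. The u64
 * instruction size and function name are assumptions for illustration. */
static unsigned int profiling_slot_count(unsigned int ringbuf_size,
					 unsigned int min_profiled_job_instrs)
{
	unsigned int job_bytes = min_profiled_job_instrs *
				 sizeof(unsigned long long);

	/* DIV_ROUND_UP(ringbuf_size, job_bytes) */
	return (ringbuf_size + job_bytes - 1) / job_bytes;
}
```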
Adrián Larumbe (4):
drm/panthor: introduce job cycle and timestamp accounting
drm/panthor: add DRM fdinfo support
drm/panthor: enable fdinfo for memory stats
drm/panthor: add sysfs knob for enabling job profiling
Documentation/gpu/panthor.rst | 46 +++
drivers/gpu/drm/panthor/panthor_devfreq.c | 18 +-
drivers/gpu/drm/panthor/panthor_device.h | 36 +++
drivers/gpu/drm/panthor/panthor_drv.c | 74 +++++
drivers/gpu/drm/panthor/panthor_gem.c | 12 +
drivers/gpu/drm/panthor/panthor_sched.c | 372 +++++++++++++++++++---
6 files changed, 513 insertions(+), 45 deletions(-)
create mode 100644 Documentation/gpu/panthor.rst
--
2.46.0
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH v5 1/4] drm/panthor: introduce job cycle and timestamp accounting
2024-09-03 20:25 [PATCH v5 0/4] Support fdinfo runtime and memory stats on Panthor Adrián Larumbe
@ 2024-09-03 20:25 ` Adrián Larumbe
2024-09-04 7:49 ` Boris Brezillon
` (3 more replies)
2024-09-03 20:25 ` [PATCH v5 2/4] drm/panthor: add DRM fdinfo support Adrián Larumbe
` (2 subsequent siblings)
3 siblings, 4 replies; 15+ messages in thread
From: Adrián Larumbe @ 2024-09-03 20:25 UTC (permalink / raw)
To: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Daniel Vetter,
Sumit Semwal, Christian König
Cc: kernel, Adrián Larumbe, dri-devel, linux-kernel, linux-media,
linaro-mm-sig
Enable calculation of job execution times in clock cycles and wall
time. This is done by expanding the boilerplate command stream when running
a job to include instructions that sample said times right before and after
a user CS.
A separate kernel BO is created per queue to store those values. Jobs
access their sampled data through an index specific to the slots buffer,
different from the queue's ringbuffer index. The reason for this is to save
memory in the profiling information kernel BO, since the number of
simultaneously profiled jobs that fit in the queue's ringbuffer may be much
smaller than for regular jobs, as the former take more CSF instructions.
This commit is done in preparation for enabling DRM fdinfo support in the
Panthor driver, which depends on the numbers calculated herein.
A profile mode mask has been added that will, in a future commit, allow UM
to toggle performance metric sampling behaviour; it is disabled by default
to save power. When a ringbuffer CS is constructed, timestamp and cycle
sampling instructions are added depending on the flags enabled in the
profiling mask.
A helper is provided that calculates the number of instructions for a given
enablement mask, and this count is passed as the number of credits when
initialising a DRM scheduler job.
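The per-mask instruction counting described above can be sketched as follows; the instruction table, flag names and counts are simplified assumptions rather than the driver's actual CS encoding:

```c
#include <stddef.h>

/* Illustrative model of the credit calculation: each CS instruction
 * carries a profiling mask, instructions whose mask is not enabled are
 * skipped, and the final count is padded up to a full cache line of
 * 64-bit instructions. All names and values here are simplified
 * stand-ins, not the driver's actual tables. */
#define PROFILE_CYCLES		(1u << 0)
#define PROFILE_TIMESTAMP	(1u << 1)
#define INSTRS_PER_CACHE_LINE	(64 / sizeof(unsigned long long))

struct instr { unsigned int profile_mask; };

static unsigned int count_job_credits(const struct instr *seq, size_t n,
				      unsigned int enabled_mask)
{
	unsigned int count = 0;

	for (size_t i = 0; i < n; i++) {
		/* A mask of 0 means "always emitted". */
		if (seq[i].profile_mask &&
		    !(seq[i].profile_mask & enabled_mask))
			continue;
		count++;
	}

	/* Pad to a cache-line multiple, as the prefetcher requires. */
	return (count + INSTRS_PER_CACHE_LINE - 1) /
	       INSTRS_PER_CACHE_LINE * INSTRS_PER_CACHE_LINE;
}
```

With this shape, a job profiled for both cycles and timestamps naturally consumes more ringbuffer credits than an unprofiled one, which is what the variable credit count passed to the DRM scheduler accounts for.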
Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
---
drivers/gpu/drm/panthor/panthor_device.h | 22 ++
drivers/gpu/drm/panthor/panthor_sched.c | 327 ++++++++++++++++++++---
2 files changed, 305 insertions(+), 44 deletions(-)
diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
index e388c0472ba7..a48e30d0af30 100644
--- a/drivers/gpu/drm/panthor/panthor_device.h
+++ b/drivers/gpu/drm/panthor/panthor_device.h
@@ -66,6 +66,25 @@ struct panthor_irq {
atomic_t suspended;
};
+/**
+ * enum panthor_device_profiling_flags - Device profiling flags
+ */
+enum panthor_device_profiling_flags {
+ /** @PANTHOR_DEVICE_PROFILING_DISABLED: Profiling is disabled. */
+ PANTHOR_DEVICE_PROFILING_DISABLED = 0,
+
+ /** @PANTHOR_DEVICE_PROFILING_CYCLES: Sampling job cycles. */
+ PANTHOR_DEVICE_PROFILING_CYCLES = BIT(0),
+
+ /** @PANTHOR_DEVICE_PROFILING_TIMESTAMP: Sampling job timestamp. */
+ PANTHOR_DEVICE_PROFILING_TIMESTAMP = BIT(1),
+
+ /** @PANTHOR_DEVICE_PROFILING_ALL: Sampling everything. */
+ PANTHOR_DEVICE_PROFILING_ALL =
+ PANTHOR_DEVICE_PROFILING_CYCLES |
+ PANTHOR_DEVICE_PROFILING_TIMESTAMP,
+};
+
/**
* struct panthor_device - Panthor device
*/
@@ -162,6 +181,9 @@ struct panthor_device {
*/
struct page *dummy_latest_flush;
} pm;
+
+ /** @profile_mask: User-set profiling flags for job accounting. */
+ u32 profile_mask;
};
/**
diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
index c426a392b081..b087648bf59a 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.c
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -93,6 +93,9 @@
#define MIN_CSGS 3
#define MAX_CSG_PRIO 0xf
+#define NUM_INSTRS_PER_CACHE_LINE (64 / sizeof(u64))
+#define MAX_INSTRS_PER_JOB 32
+
struct panthor_group;
/**
@@ -476,6 +479,18 @@ struct panthor_queue {
*/
struct list_head in_flight_jobs;
} fence_ctx;
+
+ /** @profiling_info: Job profiling data slots and access information. */
+ struct {
+ /** @slots: Kernel BO holding the slots. */
+ struct panthor_kernel_bo *slots;
+
+ /** @slot_count: Number of jobs the ringbuffer can hold at once. */
+ u32 slot_count;
+
+ /** @profiling_seqno: Index of the next available profiling information slot. */
+ u32 profiling_seqno;
+ } profiling_info;
};
/**
@@ -661,6 +676,18 @@ struct panthor_group {
struct list_head wait_node;
};
+struct panthor_job_profiling_data {
+ struct {
+ u64 before;
+ u64 after;
+ } cycles;
+
+ struct {
+ u64 before;
+ u64 after;
+ } time;
+};
+
/**
* group_queue_work() - Queue a group work
* @group: Group to queue the work for.
@@ -774,6 +801,12 @@ struct panthor_job {
/** @done_fence: Fence signaled when the job is finished or cancelled. */
struct dma_fence *done_fence;
+
+ /** @profile_mask: Current device job profiling enablement bitmask. */
+ u32 profile_mask;
+
+ /** @profiling_slot: Job profiling information index in the profiling slots BO. */
+ u32 profiling_slot;
};
static void
@@ -838,6 +871,7 @@ static void group_free_queue(struct panthor_group *group, struct panthor_queue *
panthor_kernel_bo_destroy(queue->ringbuf);
panthor_kernel_bo_destroy(queue->iface.mem);
+ panthor_kernel_bo_destroy(queue->profiling_info.slots);
/* Release the last_fence we were holding, if any. */
dma_fence_put(queue->fence_ctx.last_fence);
@@ -1982,8 +2016,6 @@ tick_ctx_init(struct panthor_scheduler *sched,
}
}
-#define NUM_INSTRS_PER_SLOT 16
-
static void
group_term_post_processing(struct panthor_group *group)
{
@@ -2815,65 +2847,211 @@ static void group_sync_upd_work(struct work_struct *work)
group_put(group);
}
-static struct dma_fence *
-queue_run_job(struct drm_sched_job *sched_job)
+struct panthor_job_ringbuf_instrs {
+ u64 buffer[MAX_INSTRS_PER_JOB];
+ u32 count;
+};
+
+struct panthor_job_instr {
+ u32 profile_mask;
+ u64 instr;
+};
+
+#define JOB_INSTR(__prof, __instr) \
+ { \
+ .profile_mask = __prof, \
+ .instr = __instr, \
+ }
+
+static void
+copy_instrs_to_ringbuf(struct panthor_queue *queue,
+ struct panthor_job *job,
+ struct panthor_job_ringbuf_instrs *instrs)
+{
+ ssize_t ringbuf_size = panthor_kernel_bo_size(queue->ringbuf);
+ u32 start = job->ringbuf.start & (ringbuf_size - 1);
+ ssize_t size, written;
+
+ /*
+ * We need to write a whole slot, including any trailing zeroes
+ * that may come at the end of it. Also, because instrs.buffer had
+ * been zero-initialised, there's no need to pad it with 0's
+ */
+ instrs->count = ALIGN(instrs->count, NUM_INSTRS_PER_CACHE_LINE);
+ size = instrs->count * sizeof(u64);
+ written = min(ringbuf_size - start, size);
+
+ memcpy(queue->ringbuf->kmap + start, instrs->buffer, written);
+
+ if (written < size)
+ memcpy(queue->ringbuf->kmap,
+ &instrs->buffer[(ringbuf_size - start)/sizeof(u64)],
+ size - written);
+}
+
+struct panthor_job_cs_params {
+ u32 profile_mask;
+ u64 addr_reg; u64 val_reg;
+ u64 cycle_reg; u64 time_reg;
+ u64 sync_addr; u64 times_addr;
+ u64 cs_start; u64 cs_size;
+ u32 last_flush; u32 waitall_mask;
+};
+
+static void
+get_job_cs_params(struct panthor_job *job, struct panthor_job_cs_params *params)
{
- struct panthor_job *job = container_of(sched_job, struct panthor_job, base);
struct panthor_group *group = job->group;
struct panthor_queue *queue = group->queues[job->queue_idx];
struct panthor_device *ptdev = group->ptdev;
struct panthor_scheduler *sched = ptdev->scheduler;
- u32 ringbuf_size = panthor_kernel_bo_size(queue->ringbuf);
- u32 ringbuf_insert = queue->iface.input->insert & (ringbuf_size - 1);
- u64 addr_reg = ptdev->csif_info.cs_reg_count -
- ptdev->csif_info.unpreserved_cs_reg_count;
- u64 val_reg = addr_reg + 2;
- u64 sync_addr = panthor_kernel_bo_gpuva(group->syncobjs) +
- job->queue_idx * sizeof(struct panthor_syncobj_64b);
- u32 waitall_mask = GENMASK(sched->sb_slot_count - 1, 0);
- struct dma_fence *done_fence;
- int ret;
- u64 call_instrs[NUM_INSTRS_PER_SLOT] = {
+ params->addr_reg = ptdev->csif_info.cs_reg_count -
+ ptdev->csif_info.unpreserved_cs_reg_count;
+ params->val_reg = params->addr_reg + 2;
+ params->cycle_reg = params->addr_reg;
+ params->time_reg = params->val_reg;
+
+ params->sync_addr = panthor_kernel_bo_gpuva(group->syncobjs) +
+ job->queue_idx * sizeof(struct panthor_syncobj_64b);
+ params->times_addr = panthor_kernel_bo_gpuva(queue->profiling_info.slots) +
+ (job->profiling_slot * sizeof(struct panthor_job_profiling_data));
+ params->waitall_mask = GENMASK(sched->sb_slot_count - 1, 0);
+
+ params->cs_start = job->call_info.start;
+ params->cs_size = job->call_info.size;
+ params->last_flush = job->call_info.latest_flush;
+
+ params->profile_mask = job->profile_mask;
+}
+
+static void
+prepare_job_instrs(const struct panthor_job_cs_params *params,
+ struct panthor_job_ringbuf_instrs *instrs)
+{
+ const struct panthor_job_instr instr_seq[] = {
/* MOV32 rX+2, cs.latest_flush */
- (2ull << 56) | (val_reg << 48) | job->call_info.latest_flush,
+ JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
+ (2ull << 56) | (params->val_reg << 48) | params->last_flush),
/* FLUSH_CACHE2.clean_inv_all.no_wait.signal(0) rX+2 */
- (36ull << 56) | (0ull << 48) | (val_reg << 40) | (0 << 16) | 0x233,
+ JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
+ (36ull << 56) | (0ull << 48) | (params->val_reg << 40) | (0 << 16) | 0x233),
+
+ /* MOV48 rX:rX+1, cycles_offset */
+ JOB_INSTR(PANTHOR_DEVICE_PROFILING_CYCLES,
+ (1ull << 56) | (params->cycle_reg << 48) |
+ (params->times_addr +
+ offsetof(struct panthor_job_profiling_data, cycles.before))),
+ /* STORE_STATE cycles */
+ JOB_INSTR(PANTHOR_DEVICE_PROFILING_CYCLES,
+ (40ull << 56) | (params->cycle_reg << 40) | (1ll << 32)),
+ /* MOV48 rX:rX+1, time_offset */
+ JOB_INSTR(PANTHOR_DEVICE_PROFILING_TIMESTAMP,
+ (1ull << 56) | (params->time_reg << 48) |
+ (params->times_addr +
+ offsetof(struct panthor_job_profiling_data, time.before))),
+ /* STORE_STATE timer */
+ JOB_INSTR(PANTHOR_DEVICE_PROFILING_TIMESTAMP,
+ (40ull << 56) | (params->time_reg << 40) | (0ll << 32)),
/* MOV48 rX:rX+1, cs.start */
- (1ull << 56) | (addr_reg << 48) | job->call_info.start,
-
+ JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
+ (1ull << 56) | (params->addr_reg << 48) | params->cs_start),
/* MOV32 rX+2, cs.size */
- (2ull << 56) | (val_reg << 48) | job->call_info.size,
-
+ JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
+ (2ull << 56) | (params->val_reg << 48) | params->cs_size),
/* WAIT(0) => waits for FLUSH_CACHE2 instruction */
- (3ull << 56) | (1 << 16),
-
+ JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED, (3ull << 56) | (1 << 16)),
/* CALL rX:rX+1, rX+2 */
- (32ull << 56) | (addr_reg << 40) | (val_reg << 32),
+ JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
+ (32ull << 56) | (params->addr_reg << 40) | (params->val_reg << 32)),
+
+ /* MOV48 rX:rX+1, cycles_offset */
+ JOB_INSTR(PANTHOR_DEVICE_PROFILING_CYCLES,
+ (1ull << 56) | (params->cycle_reg << 48) |
+ (params->times_addr +
+ offsetof(struct panthor_job_profiling_data, cycles.after))),
+ /* STORE_STATE cycles */
+ JOB_INSTR(PANTHOR_DEVICE_PROFILING_CYCLES,
+ (40ull << 56) | (params->cycle_reg << 40) | (1ll << 32)),
+
+ /* MOV48 rX:rX+1, time_offset */
+ JOB_INSTR(PANTHOR_DEVICE_PROFILING_TIMESTAMP,
+ (1ull << 56) | (params->time_reg << 48) |
+ (params->times_addr +
+ offsetof(struct panthor_job_profiling_data, time.after))),
+ /* STORE_STATE timer */
+ JOB_INSTR(PANTHOR_DEVICE_PROFILING_TIMESTAMP,
+ (40ull << 56) | (params->time_reg << 40) | (0ll << 32)),
/* MOV48 rX:rX+1, sync_addr */
- (1ull << 56) | (addr_reg << 48) | sync_addr,
-
+ JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
+ (1ull << 56) | (params->addr_reg << 48) | params->sync_addr),
/* MOV48 rX+2, #1 */
- (1ull << 56) | (val_reg << 48) | 1,
-
+ JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
+ (1ull << 56) | (params->val_reg << 48) | 1),
/* WAIT(all) */
- (3ull << 56) | (waitall_mask << 16),
-
+ JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
+ (3ull << 56) | (params->waitall_mask << 16)),
/* SYNC_ADD64.system_scope.propagate_err.nowait rX:rX+1, rX+2 */
- (51ull << 56) | (0ull << 48) | (addr_reg << 40) | (val_reg << 32) | (0 << 16) | 1,
-
+ JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
+ (51ull << 56) | (0ull << 48) | (params->addr_reg << 40) |
+ (params->val_reg << 32) | (0 << 16) | 1),
/* ERROR_BARRIER, so we can recover from faults at job
* boundaries.
*/
- (47ull << 56),
+ JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED, (47ull << 56)),
+ };
+ u32 pad;
+
+ /* NEED to be cacheline aligned to please the prefetcher. */
+ static_assert(sizeof(instrs->buffer) % 64 == 0,
+ "panthor_job_ringbuf_instrs::buffer is not aligned on a cacheline");
+
+ /* Make sure we have enough storage to store the whole sequence. */
+ static_assert(ALIGN(ARRAY_SIZE(instr_seq), NUM_INSTRS_PER_CACHE_LINE) <=
+ ARRAY_SIZE(instrs->buffer),
+ "instr_seq vs panthor_job_ringbuf_instrs::buffer size mismatch");
+
+ for (u32 i = 0; i < ARRAY_SIZE(instr_seq); i++) {
+ /* If the profile mask of this instruction is not enabled, skip it. */
+ if (instr_seq[i].profile_mask &&
+ !(instr_seq[i].profile_mask & params->profile_mask))
+ continue;
+
+ instrs->buffer[instrs->count++] = instr_seq[i].instr;
+ }
+
+ pad = ALIGN(instrs->count, NUM_INSTRS_PER_CACHE_LINE);
+ memset(&instrs->buffer[instrs->count], 0,
+ (pad - instrs->count) * sizeof(instrs->buffer[0]));
+ instrs->count = pad;
+}
+
+static u32 calc_job_credits(u32 profile_mask)
+{
+ struct panthor_job_ringbuf_instrs instrs;
+ struct panthor_job_cs_params params = {
+ .profile_mask = profile_mask,
};
- /* Need to be cacheline aligned to please the prefetcher. */
- static_assert(sizeof(call_instrs) % 64 == 0,
- "call_instrs is not aligned on a cacheline");
+ prepare_job_instrs(¶ms, &instrs);
+ return instrs.count;
+}
+
+static struct dma_fence *
+queue_run_job(struct drm_sched_job *sched_job)
+{
+ struct panthor_job *job = container_of(sched_job, struct panthor_job, base);
+ struct panthor_group *group = job->group;
+ struct panthor_queue *queue = group->queues[job->queue_idx];
+ struct panthor_device *ptdev = group->ptdev;
+ struct panthor_scheduler *sched = ptdev->scheduler;
+ struct panthor_job_ringbuf_instrs instrs;
+ struct panthor_job_cs_params cs_params;
+ struct dma_fence *done_fence;
+ int ret;
/* Stream size is zero, nothing to do except making sure all previously
* submitted jobs are done before we signal the
@@ -2900,17 +3078,23 @@ queue_run_job(struct drm_sched_job *sched_job)
queue->fence_ctx.id,
atomic64_inc_return(&queue->fence_ctx.seqno));
- memcpy(queue->ringbuf->kmap + ringbuf_insert,
- call_instrs, sizeof(call_instrs));
+ job->profiling_slot = queue->profiling_info.profiling_seqno++;
+ if (queue->profiling_info.profiling_seqno == queue->profiling_info.slot_count)
+ queue->profiling_info.profiling_seqno = 0;
+
+ job->ringbuf.start = queue->iface.input->insert;
+
+ get_job_cs_params(job, &cs_params);
+ prepare_job_instrs(&cs_params, &instrs);
+ copy_instrs_to_ringbuf(queue, job, &instrs);
+
+ job->ringbuf.end = job->ringbuf.start + (instrs.count * sizeof(u64));
panthor_job_get(&job->base);
spin_lock(&queue->fence_ctx.lock);
list_add_tail(&job->node, &queue->fence_ctx.in_flight_jobs);
spin_unlock(&queue->fence_ctx.lock);
- job->ringbuf.start = queue->iface.input->insert;
- job->ringbuf.end = job->ringbuf.start + sizeof(call_instrs);
-
/* Make sure the ring buffer is updated before the INSERT
* register.
*/
@@ -3003,6 +3187,24 @@ static const struct drm_sched_backend_ops panthor_queue_sched_ops = {
.free_job = queue_free_job,
};
+static u32 calc_profiling_ringbuf_num_slots(struct panthor_device *ptdev,
+ u32 cs_ringbuf_size)
+{
+ u32 min_profiled_job_instrs = U32_MAX;
+ u32 last_flag = fls(PANTHOR_DEVICE_PROFILING_ALL);
+
+ for (u32 i = 0; i < last_flag; i++) {
+ if (BIT(i) & PANTHOR_DEVICE_PROFILING_ALL)
+ min_profiled_job_instrs =
+ min(min_profiled_job_instrs, calc_job_credits(BIT(i)));
+ }
+
+ drm_WARN_ON(&ptdev->base,
+ !IS_ALIGNED(min_profiled_job_instrs, NUM_INSTRS_PER_CACHE_LINE));
+
+ return DIV_ROUND_UP(cs_ringbuf_size, min_profiled_job_instrs * sizeof(u64));
+}
+
static struct panthor_queue *
group_create_queue(struct panthor_group *group,
const struct drm_panthor_queue_create *args)
@@ -3056,9 +3258,38 @@ group_create_queue(struct panthor_group *group,
goto err_free_queue;
}
+ queue->profiling_info.slot_count =
+ calc_profiling_ringbuf_num_slots(group->ptdev, args->ringbuf_size);
+
+ queue->profiling_info.slots =
+ panthor_kernel_bo_create(group->ptdev, group->vm,
+ queue->profiling_info.slot_count *
+ sizeof(struct panthor_job_profiling_data),
+ DRM_PANTHOR_BO_NO_MMAP,
+ DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
+ DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
+ PANTHOR_VM_KERNEL_AUTO_VA);
+
+ if (IS_ERR(queue->profiling_info.slots)) {
+ ret = PTR_ERR(queue->profiling_info.slots);
+ goto err_free_queue;
+ }
+
+ ret = panthor_kernel_bo_vmap(queue->profiling_info.slots);
+ if (ret)
+ goto err_free_queue;
+
+ memset(queue->profiling_info.slots->kmap, 0,
+ queue->profiling_info.slot_count * sizeof(struct panthor_job_profiling_data));
+
+ /*
+ * Credit limit argument tells us the total number of instructions
+ * across all CS slots in the ringbuffer, with some jobs requiring
+ * twice as many as others, depending on their profiling status.
+ */
ret = drm_sched_init(&queue->scheduler, &panthor_queue_sched_ops,
group->ptdev->scheduler->wq, 1,
- args->ringbuf_size / (NUM_INSTRS_PER_SLOT * sizeof(u64)),
+ args->ringbuf_size / sizeof(u64),
0, msecs_to_jiffies(JOB_TIMEOUT_MS),
group->ptdev->reset.wq,
NULL, "panthor-queue", group->ptdev->base.dev);
@@ -3354,6 +3585,7 @@ panthor_job_create(struct panthor_file *pfile,
{
struct panthor_group_pool *gpool = pfile->groups;
struct panthor_job *job;
+ u32 credits;
int ret;
if (qsubmit->pad)
@@ -3407,9 +3639,16 @@ panthor_job_create(struct panthor_file *pfile,
}
}
+ job->profile_mask = pfile->ptdev->profile_mask;
+ credits = calc_job_credits(job->profile_mask);
+ if (credits == 0) {
+ ret = -EINVAL;
+ goto err_put_job;
+ }
+
ret = drm_sched_job_init(&job->base,
&job->group->queues[job->queue_idx]->entity,
- 1, job->group);
+ credits, job->group);
if (ret)
goto err_put_job;
--
2.46.0
* [PATCH v5 2/4] drm/panthor: add DRM fdinfo support
2024-09-03 20:25 [PATCH v5 0/4] Support fdinfo runtime and memory stats on Panthor Adrián Larumbe
2024-09-03 20:25 ` [PATCH v5 1/4] drm/panthor: introduce job cycle and timestamp accounting Adrián Larumbe
@ 2024-09-03 20:25 ` Adrián Larumbe
2024-09-04 8:01 ` Boris Brezillon
2024-09-04 17:45 ` kernel test robot
2024-09-03 20:25 ` [PATCH v5 3/4] drm/panthor: enable fdinfo for memory stats Adrián Larumbe
2024-09-03 20:25 ` [PATCH v5 4/4] drm/panthor: add sysfs knob for enabling job profiling Adrián Larumbe
3 siblings, 2 replies; 15+ messages in thread
From: Adrián Larumbe @ 2024-09-03 20:25 UTC (permalink / raw)
To: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Daniel Vetter,
Sumit Semwal, Christian König
Cc: kernel, Adrián Larumbe, dri-devel, linux-kernel, linux-media,
linaro-mm-sig
Drawing from the FW-calculated values in the previous commit, we can
update the per-file counters by collecting samples from finished jobs when
updating their group synchronisation objects.
Display of fdinfo key-value pairs is governed by a flag that is disabled by
default in the present commit; support for toggling it manually will be
added in a later commit.
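For the time figure, the fdinfo callback converts accumulated timer ticks into nanoseconds by scaling with the architected timer frequency. A minimal sketch of that conversion follows; the frequency is a plain parameter here, where the patch calls arch_timer_get_cntfrq():

```c
/* Sketch of the timestamp-to-nanoseconds conversion performed when
 * printing drm-engine-panthor. The rounding-up division mirrors
 * DIV_ROUND_UP_ULL in the patch. Note the intermediate multiply can
 * overflow 64 bits for very large tick counts, a constraint the
 * driver's expression shares. */
#define NSEC_PER_SEC 1000000000ULL

static unsigned long long ticks_to_ns(unsigned long long ticks,
				      unsigned long long cntfrq)
{
	return (ticks * NSEC_PER_SEC + cntfrq - 1) / cntfrq;
}
```

For example, 50 ticks of a 50 MHz architected timer correspond to 1000 ns of GPU time.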
Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
---
drivers/gpu/drm/panthor/panthor_devfreq.c | 18 ++++++++-
drivers/gpu/drm/panthor/panthor_device.h | 14 +++++++
drivers/gpu/drm/panthor/panthor_drv.c | 35 ++++++++++++++++++
drivers/gpu/drm/panthor/panthor_sched.c | 45 +++++++++++++++++++++++
4 files changed, 111 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/panthor/panthor_devfreq.c b/drivers/gpu/drm/panthor/panthor_devfreq.c
index c6d3c327cc24..9d0f891b9b53 100644
--- a/drivers/gpu/drm/panthor/panthor_devfreq.c
+++ b/drivers/gpu/drm/panthor/panthor_devfreq.c
@@ -62,14 +62,20 @@ static void panthor_devfreq_update_utilization(struct panthor_devfreq *pdevfreq)
static int panthor_devfreq_target(struct device *dev, unsigned long *freq,
u32 flags)
{
+ struct panthor_device *ptdev = dev_get_drvdata(dev);
struct dev_pm_opp *opp;
+ int err;
opp = devfreq_recommended_opp(dev, freq, flags);
if (IS_ERR(opp))
return PTR_ERR(opp);
dev_pm_opp_put(opp);
- return dev_pm_opp_set_rate(dev, *freq);
+ err = dev_pm_opp_set_rate(dev, *freq);
+ if (!err)
+ ptdev->current_frequency = *freq;
+
+ return err;
}
static void panthor_devfreq_reset(struct panthor_devfreq *pdevfreq)
@@ -130,6 +136,7 @@ int panthor_devfreq_init(struct panthor_device *ptdev)
struct panthor_devfreq *pdevfreq;
struct dev_pm_opp *opp;
unsigned long cur_freq;
+ unsigned long freq = ULONG_MAX;
int ret;
pdevfreq = drmm_kzalloc(&ptdev->base, sizeof(*ptdev->devfreq), GFP_KERNEL);
@@ -161,6 +168,7 @@ int panthor_devfreq_init(struct panthor_device *ptdev)
return PTR_ERR(opp);
panthor_devfreq_profile.initial_freq = cur_freq;
+ ptdev->current_frequency = cur_freq;
/* Regulator coupling only takes care of synchronizing/balancing voltage
* updates, but the coupled regulator needs to be enabled manually.
@@ -204,6 +212,14 @@ int panthor_devfreq_init(struct panthor_device *ptdev)
dev_pm_opp_put(opp);
+ /* Find the fastest defined rate */
+ opp = dev_pm_opp_find_freq_floor(dev, &freq);
+ if (IS_ERR(opp))
+ return PTR_ERR(opp);
+ ptdev->fast_rate = freq;
+
+ dev_pm_opp_put(opp);
+
/*
* Setup default thresholds for the simple_ondemand governor.
* The values are chosen based on experiments.
diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
index a48e30d0af30..0e68f5a70d20 100644
--- a/drivers/gpu/drm/panthor/panthor_device.h
+++ b/drivers/gpu/drm/panthor/panthor_device.h
@@ -184,6 +184,17 @@ struct panthor_device {
/** @profile_mask: User-set profiling flags for job accounting. */
u32 profile_mask;
+
+ /** @current_frequency: Device clock frequency at present. Set by DVFS. */
+ unsigned long current_frequency;
+
+ /** @fast_rate: Maximum device clock frequency. Set by DVFS. */
+ unsigned long fast_rate;
+};
+
+struct panthor_gpu_usage {
+ u64 time;
+ u64 cycles;
};
/**
@@ -198,6 +209,9 @@ struct panthor_file {
/** @groups: Scheduling group pool attached to this file. */
struct panthor_group_pool *groups;
+
+ /** @stats: cycle and timestamp measures for job execution. */
+ struct panthor_gpu_usage stats;
};
int panthor_device_init(struct panthor_device *ptdev);
diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
index b5e7b919f241..e18838754963 100644
--- a/drivers/gpu/drm/panthor/panthor_drv.c
+++ b/drivers/gpu/drm/panthor/panthor_drv.c
@@ -3,12 +3,17 @@
/* Copyright 2019 Linaro, Ltd., Rob Herring <robh@kernel.org> */
/* Copyright 2019 Collabora ltd. */
+#ifdef CONFIG_ARM_ARCH_TIMER
+#include <asm/arch_timer.h>
+#endif
+
#include <linux/list.h>
#include <linux/module.h>
#include <linux/of_platform.h>
#include <linux/pagemap.h>
#include <linux/platform_device.h>
#include <linux/pm_runtime.h>
+#include <linux/time64.h>
#include <drm/drm_debugfs.h>
#include <drm/drm_drv.h>
@@ -1351,6 +1356,34 @@ static int panthor_mmap(struct file *filp, struct vm_area_struct *vma)
return ret;
}
+static void panthor_gpu_show_fdinfo(struct panthor_device *ptdev,
+ struct panthor_file *pfile,
+ struct drm_printer *p)
+{
+ if (ptdev->profile_mask & PANTHOR_DEVICE_PROFILING_TIMESTAMP) {
+#ifdef CONFIG_ARM_ARCH_TIMER
+ drm_printf(p, "drm-engine-panthor:\t%llu ns\n",
+ DIV_ROUND_UP_ULL((pfile->stats.time * NSEC_PER_SEC),
+ arch_timer_get_cntfrq()));
+#endif
+ }
+ if (ptdev->profile_mask & PANTHOR_DEVICE_PROFILING_CYCLES)
+ drm_printf(p, "drm-cycles-panthor:\t%llu\n", pfile->stats.cycles);
+
+ drm_printf(p, "drm-maxfreq-panthor:\t%lu Hz\n", ptdev->fast_rate);
+ drm_printf(p, "drm-curfreq-panthor:\t%lu Hz\n", ptdev->current_frequency);
+}
+
+static void panthor_show_fdinfo(struct drm_printer *p, struct drm_file *file)
+{
+ struct drm_device *dev = file->minor->dev;
+ struct panthor_device *ptdev = container_of(dev, struct panthor_device, base);
+
+ panthor_gpu_show_fdinfo(ptdev, file->driver_priv, p);
+
+ drm_show_memory_stats(p, file);
+}
+
static const struct file_operations panthor_drm_driver_fops = {
.open = drm_open,
.release = drm_release,
@@ -1360,6 +1393,7 @@ static const struct file_operations panthor_drm_driver_fops = {
.read = drm_read,
.llseek = noop_llseek,
.mmap = panthor_mmap,
+ .show_fdinfo = drm_show_fdinfo,
};
#ifdef CONFIG_DEBUG_FS
@@ -1378,6 +1412,7 @@ static const struct drm_driver panthor_drm_driver = {
DRIVER_SYNCOBJ_TIMELINE | DRIVER_GEM_GPUVA,
.open = panthor_open,
.postclose = panthor_postclose,
+ .show_fdinfo = panthor_show_fdinfo,
.ioctls = panthor_drm_driver_ioctls,
.num_ioctls = ARRAY_SIZE(panthor_drm_driver_ioctls),
.fops = &panthor_drm_driver_fops,
diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
index b087648bf59a..e69ab5175ae8 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.c
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -619,6 +619,18 @@ struct panthor_group {
*/
struct panthor_kernel_bo *syncobjs;
+ /** @fdinfo: Per-file total cycle and timestamp values reference. */
+ struct {
+ /** @data: Pointer to actual per-file sample data. */
+ struct panthor_gpu_usage *data;
+
+ /**
+ * @lock: Mutex to govern concurrent access from drm file's fdinfo callback
+ * and job post-completion processing function
+ */
+ struct mutex lock;
+ } fdinfo;
+
/** @state: Group state. */
enum panthor_group_state state;
@@ -886,6 +898,8 @@ static void group_release_work(struct work_struct *work)
release_work);
u32 i;
+ mutex_destroy(&group->fdinfo.lock);
+
for (i = 0; i < group->queue_count; i++)
group_free_queue(group, group->queues[i]);
@@ -2808,6 +2822,28 @@ void panthor_sched_post_reset(struct panthor_device *ptdev, bool reset_failed)
}
}
+static void update_fdinfo_stats(struct panthor_job *job)
+{
+ struct panthor_group *group = job->group;
+ struct panthor_queue *queue = group->queues[job->queue_idx];
+ struct panthor_gpu_usage *fdinfo;
+ struct panthor_job_profiling_data *times;
+
+ times = (struct panthor_job_profiling_data *)
+ ((unsigned long) queue->profiling_info.slots->kmap +
+ (job->profiling_slot * sizeof(struct panthor_job_profiling_data)));
+
+ mutex_lock(&group->fdinfo.lock);
+ if (group->fdinfo.data) {
+ fdinfo = group->fdinfo.data;
+ if (job->profile_mask & PANTHOR_DEVICE_PROFILING_CYCLES)
+ fdinfo->cycles += times->cycles.after - times->cycles.before;
+ if (job->profile_mask & PANTHOR_DEVICE_PROFILING_TIMESTAMP)
+ fdinfo->time += times->time.after - times->time.before;
+ }
+ mutex_unlock(&group->fdinfo.lock);
+}
+
static void group_sync_upd_work(struct work_struct *work)
{
struct panthor_group *group =
@@ -2840,6 +2876,8 @@ static void group_sync_upd_work(struct work_struct *work)
dma_fence_end_signalling(cookie);
list_for_each_entry_safe(job, job_tmp, &done_jobs, node) {
+ if (job->profile_mask)
+ update_fdinfo_stats(job);
list_del_init(&job->node);
panthor_job_put(&job->base);
}
@@ -3430,6 +3468,9 @@ int panthor_group_create(struct panthor_file *pfile,
}
mutex_unlock(&sched->reset.lock);
+ group->fdinfo.data = &pfile->stats;
+ mutex_init(&group->fdinfo.lock);
+
return gid;
err_put_group:
@@ -3469,6 +3510,10 @@ int panthor_group_destroy(struct panthor_file *pfile, u32 group_handle)
mutex_unlock(&sched->lock);
mutex_unlock(&sched->reset.lock);
+ mutex_lock(&group->fdinfo.lock);
+ group->fdinfo.data = NULL;
+ mutex_unlock(&group->fdinfo.lock);
+
group_put(group);
return 0;
}
--
2.46.0
* [PATCH v5 3/4] drm/panthor: enable fdinfo for memory stats
2024-09-03 20:25 [PATCH v5 0/4] Support fdinfo runtime and memory stats on Panthor Adrián Larumbe
2024-09-03 20:25 ` [PATCH v5 1/4] drm/panthor: introduce job cycle and timestamp accounting Adrián Larumbe
2024-09-03 20:25 ` [PATCH v5 2/4] drm/panthor: add DRM fdinfo support Adrián Larumbe
@ 2024-09-03 20:25 ` Adrián Larumbe
2024-09-04 8:03 ` Boris Brezillon
2024-09-03 20:25 ` [PATCH v5 4/4] drm/panthor: add sysfs knob for enabling job profiling Adrián Larumbe
3 siblings, 1 reply; 15+ messages in thread
From: Adrián Larumbe @ 2024-09-03 20:25 UTC (permalink / raw)
To: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Daniel Vetter,
Sumit Semwal, Christian König
Cc: kernel, Adrián Larumbe, dri-devel, linux-kernel, linux-media,
linaro-mm-sig
Implement the DRM GEM object status callback.
Also, we consider a PRIME-imported BO to be resident if its matching
dma_buf has an open attachment, which means its backing storage has already
been allocated.
Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
Reviewed-by: Liviu Dudau <liviu.dudau@arm.com>
---
drivers/gpu/drm/panthor/panthor_gem.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/drivers/gpu/drm/panthor/panthor_gem.c b/drivers/gpu/drm/panthor/panthor_gem.c
index 38f560864879..c60b599665d8 100644
--- a/drivers/gpu/drm/panthor/panthor_gem.c
+++ b/drivers/gpu/drm/panthor/panthor_gem.c
@@ -145,6 +145,17 @@ panthor_gem_prime_export(struct drm_gem_object *obj, int flags)
return drm_gem_prime_export(obj, flags);
}
+static enum drm_gem_object_status panthor_gem_status(struct drm_gem_object *obj)
+{
+ struct panthor_gem_object *bo = to_panthor_bo(obj);
+ enum drm_gem_object_status res = 0;
+
+ if (bo->base.base.import_attach || bo->base.pages)
+ res |= DRM_GEM_OBJECT_RESIDENT;
+
+ return res;
+}
+
static const struct drm_gem_object_funcs panthor_gem_funcs = {
.free = panthor_gem_free_object,
.print_info = drm_gem_shmem_object_print_info,
@@ -154,6 +165,7 @@ static const struct drm_gem_object_funcs panthor_gem_funcs = {
.vmap = drm_gem_shmem_object_vmap,
.vunmap = drm_gem_shmem_object_vunmap,
.mmap = panthor_gem_mmap,
+ .status = panthor_gem_status,
.export = panthor_gem_prime_export,
.vm_ops = &drm_gem_shmem_vm_ops,
};
--
2.46.0
* [PATCH v5 4/4] drm/panthor: add sysfs knob for enabling job profiling
2024-09-03 20:25 [PATCH v5 0/4] Support fdinfo runtime and memory stats on Panthor Adrián Larumbe
` (2 preceding siblings ...)
2024-09-03 20:25 ` [PATCH v5 3/4] drm/panthor: enable fdinfo for memory stats Adrián Larumbe
@ 2024-09-03 20:25 ` Adrián Larumbe
2024-09-04 8:07 ` Boris Brezillon
3 siblings, 1 reply; 15+ messages in thread
From: Adrián Larumbe @ 2024-09-03 20:25 UTC (permalink / raw)
To: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Daniel Vetter,
Sumit Semwal, Christian König
Cc: kernel, Adrián Larumbe, dri-devel, linux-kernel, linux-media,
linaro-mm-sig
This commit introduces a DRM device sysfs attribute that lets user space
control the job accounting status of the device. The knob variable was
introduced in a previous commit, and now it can be set manually.
As sysfs files are part of a driver's uAPI, describe their legitimate input
values and output format in a documentation file.
Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
---
Documentation/gpu/panthor.rst | 46 +++++++++++++++++++++++++++
drivers/gpu/drm/panthor/panthor_drv.c | 39 +++++++++++++++++++++++
2 files changed, 85 insertions(+)
create mode 100644 Documentation/gpu/panthor.rst
diff --git a/Documentation/gpu/panthor.rst b/Documentation/gpu/panthor.rst
new file mode 100644
index 000000000000..cbf5c4429a2d
--- /dev/null
+++ b/Documentation/gpu/panthor.rst
@@ -0,0 +1,46 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+=========================
+ drm/Panthor CSF driver
+=========================
+
+.. _panthor-usage-stats:
+
+Panthor DRM client usage stats implementation
+==============================================
+
+The drm/Panthor driver implements the DRM client usage stats specification as
+documented in :ref:`drm-client-usage-stats`.
+
+Example output showing the implemented key-value pairs and the entirety of
+the currently possible format options:
+
+::
+ pos: 0
+ flags: 02400002
+ mnt_id: 29
+ ino: 491
+ drm-driver: panthor
+ drm-client-id: 10
+ drm-engine-panthor: 111110952750 ns
+ drm-cycles-panthor: 94439687187
+ drm-maxfreq-panthor: 1000000000 Hz
+ drm-curfreq-panthor: 1000000000 Hz
+ drm-total-memory: 16480 KiB
+ drm-shared-memory: 0
+ drm-active-memory: 16200 KiB
+ drm-resident-memory: 16480 KiB
+ drm-purgeable-memory: 0
+
+Possible `drm-engine-` key names are: `panthor`.
+`drm-curfreq-` values convey the current operating frequency for that engine.
+
+Users must bear in mind that engine and cycle sampling are disabled by default
+for power-saving reasons. `fdinfo` consumers and benchmark applications that
+query the fdinfo file must first enable job profiling in the driver by writing
+to the appropriate sysfs node::
+
+ echo <N> > /sys/bus/platform/drivers/panthor/[a-f0-9]*.gpu/profiling
+
+Where `N` is a bit mask where cycle and timestamp sampling are respectively
+enabled by the first and second bits.
diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
index e18838754963..26475db96c41 100644
--- a/drivers/gpu/drm/panthor/panthor_drv.c
+++ b/drivers/gpu/drm/panthor/panthor_drv.c
@@ -1450,6 +1450,44 @@ static void panthor_remove(struct platform_device *pdev)
panthor_device_unplug(ptdev);
}
+static ssize_t profiling_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct panthor_device *ptdev = dev_get_drvdata(dev);
+
+ return sysfs_emit(buf, "%d\n", ptdev->profile_mask);
+}
+
+static ssize_t profiling_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t len)
+{
+ struct panthor_device *ptdev = dev_get_drvdata(dev);
+ u32 value;
+ int err;
+
+ err = kstrtou32(buf, 0, &value);
+ if (err)
+ return err;
+
+ if ((value & ~PANTHOR_DEVICE_PROFILING_ALL) != 0)
+ return -EINVAL;
+
+ ptdev->profile_mask = value;
+
+ return len;
+}
+
+static DEVICE_ATTR_RW(profiling);
+
+static struct attribute *panthor_attrs[] = {
+ &dev_attr_profiling.attr,
+ NULL,
+};
+
+ATTRIBUTE_GROUPS(panthor);
+
static const struct of_device_id dt_match[] = {
{ .compatible = "rockchip,rk3588-mali" },
{ .compatible = "arm,mali-valhall-csf" },
@@ -1469,6 +1507,7 @@ static struct platform_driver panthor_driver = {
.name = "panthor",
.pm = pm_ptr(&panthor_pm_ops),
.of_match_table = dt_match,
+ .dev_groups = panthor_groups,
},
};
--
2.46.0
* Re: [PATCH v5 1/4] drm/panthor: introduce job cycle and timestamp accounting
2024-09-03 20:25 ` [PATCH v5 1/4] drm/panthor: introduce job cycle and timestamp accounting Adrián Larumbe
@ 2024-09-04 7:49 ` Boris Brezillon
2024-09-12 15:03 ` Adrián Larumbe
2024-09-04 16:10 ` kernel test robot
` (2 subsequent siblings)
3 siblings, 1 reply; 15+ messages in thread
From: Boris Brezillon @ 2024-09-04 7:49 UTC (permalink / raw)
To: Adrián Larumbe
Cc: Steven Price, Liviu Dudau, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, David Airlie, Daniel Vetter, Sumit Semwal,
Christian König, kernel, dri-devel, linux-kernel,
linux-media, linaro-mm-sig
On Tue, 3 Sep 2024 21:25:35 +0100
Adrián Larumbe <adrian.larumbe@collabora.com> wrote:
> Enable calculations of job submission times in clock cycles and wall
> time. This is done by expanding the boilerplate command stream when running
> a job to include instructions that compute said times right before and
> after a user CS.
>
> A separate kernel BO is created per queue to store those values. Jobs can
> access their sampled data through a slots buffer-specific index different
> from that of the queue's ringbuffer. The reason for this is saving memory
> on the profiling information kernel BO, since the amount of simultaneous
> profiled jobs we can write into the queue's ringbuffer might be much
> smaller than for regular jobs, as the former take more CSF instructions.
>
> This commit is done in preparation for enabling DRM fdinfo support in the
> Panthor driver, which depends on the numbers calculated herein.
>
> A profile mode mask has been added that will, in a future commit, allow
> user space to toggle performance metric sampling behaviour, which is
> disabled by default to save power. When a ringbuffer CS is constructed,
> timestamp and cycle sampling instructions are added depending on the
> enabled flags in the profiling mask.
>
> A helper was provided that calculates the number of instructions for a
> given enablement mask, and this count is passed as the number of credits
> when initialising a DRM scheduler job.
>
> Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
> ---
> drivers/gpu/drm/panthor/panthor_device.h | 22 ++
> drivers/gpu/drm/panthor/panthor_sched.c | 327 ++++++++++++++++++++---
> 2 files changed, 305 insertions(+), 44 deletions(-)
>
> diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> index e388c0472ba7..a48e30d0af30 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.h
> +++ b/drivers/gpu/drm/panthor/panthor_device.h
> @@ -66,6 +66,25 @@ struct panthor_irq {
> atomic_t suspended;
> };
>
> +/**
> + * enum panthor_device_profiling_mode - Profiling state
> + */
> +enum panthor_device_profiling_flags {
> + /** @PANTHOR_DEVICE_PROFILING_DISABLED: Profiling is disabled. */
> + PANTHOR_DEVICE_PROFILING_DISABLED = 0,
> +
> + /** @PANTHOR_DEVICE_PROFILING_CYCLES: Sampling job cycles. */
> + PANTHOR_DEVICE_PROFILING_CYCLES = BIT(0),
> +
> + /** @PANTHOR_DEVICE_PROFILING_TIMESTAMP: Sampling job timestamp. */
> + PANTHOR_DEVICE_PROFILING_TIMESTAMP = BIT(1),
> +
> + /** @PANTHOR_DEVICE_PROFILING_ALL: Sampling everything. */
> + PANTHOR_DEVICE_PROFILING_ALL =
> + PANTHOR_DEVICE_PROFILING_CYCLES |
> + PANTHOR_DEVICE_PROFILING_TIMESTAMP,
> +};
> +
> /**
> * struct panthor_device - Panthor device
> */
> @@ -162,6 +181,9 @@ struct panthor_device {
> */
> struct page *dummy_latest_flush;
> } pm;
> +
> + /** @profile_mask: User-set profiling flags for job accounting. */
> + u32 profile_mask;
> };
>
> /**
> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> index c426a392b081..b087648bf59a 100644
> --- a/drivers/gpu/drm/panthor/panthor_sched.c
> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> @@ -93,6 +93,9 @@
> #define MIN_CSGS 3
> #define MAX_CSG_PRIO 0xf
>
> +#define NUM_INSTRS_PER_CACHE_LINE (64 / sizeof(u64))
> +#define MAX_INSTRS_PER_JOB 32
> +
> struct panthor_group;
>
> /**
> @@ -476,6 +479,18 @@ struct panthor_queue {
> */
> struct list_head in_flight_jobs;
> } fence_ctx;
> +
> + /** @profiling_info: Job profiling data slots and access information. */
> + struct {
> + /** @slots: Kernel BO holding the slots. */
> + struct panthor_kernel_bo *slots;
> +
> + /** @slot_count: Number of jobs ringbuffer can hold at once. */
> + u32 slot_count;
> +
> + /** @profiling_seqno: Index of the next available profiling information slot. */
> + u32 profiling_seqno;
Nit: no need to repeat profiling as it's under the profiling_info
struct. I would simply name that one 'seqno'.
> + } profiling_info;
s/profiling_info/profiling/ ?
> };
>
> /**
> @@ -661,6 +676,18 @@ struct panthor_group {
> struct list_head wait_node;
> };
>
> +struct panthor_job_profiling_data {
> + struct {
> + u64 before;
> + u64 after;
> + } cycles;
> +
> + struct {
> + u64 before;
> + u64 after;
> + } time;
> +};
> +
> /**
> * group_queue_work() - Queue a group work
> * @group: Group to queue the work for.
> @@ -774,6 +801,12 @@ struct panthor_job {
>
> /** @done_fence: Fence signaled when the job is finished or cancelled. */
> struct dma_fence *done_fence;
> +
> + /** @profile_mask: Current device job profiling enablement bitmask. */
> + u32 profile_mask;
> +
> + /** @profile_slot: Job profiling information index in the profiling slots BO. */
> + u32 profiling_slot;
Nit: we tend to group fields together under sub-structs, so I'd say:
struct {
u32 mask; // or flags
u32 slot;
} profiling;
> };
>
> static void
> @@ -838,6 +871,7 @@ static void group_free_queue(struct panthor_group *group, struct panthor_queue *
>
> panthor_kernel_bo_destroy(queue->ringbuf);
> panthor_kernel_bo_destroy(queue->iface.mem);
> + panthor_kernel_bo_destroy(queue->profiling_info.slots);
>
> /* Release the last_fence we were holding, if any. */
> dma_fence_put(queue->fence_ctx.last_fence);
> @@ -1982,8 +2016,6 @@ tick_ctx_init(struct panthor_scheduler *sched,
> }
> }
>
> -#define NUM_INSTRS_PER_SLOT 16
> -
> static void
> group_term_post_processing(struct panthor_group *group)
> {
> @@ -2815,65 +2847,211 @@ static void group_sync_upd_work(struct work_struct *work)
> group_put(group);
> }
>
> -static struct dma_fence *
> -queue_run_job(struct drm_sched_job *sched_job)
> +struct panthor_job_ringbuf_instrs {
> + u64 buffer[MAX_INSTRS_PER_JOB];
> + u32 count;
> +};
> +
> +struct panthor_job_instr {
> + u32 profile_mask;
> + u64 instr;
> +};
> +
> +#define JOB_INSTR(__prof, __instr) \
> + { \
> + .profile_mask = __prof, \
> + .instr = __instr, \
> + }
> +
> +static void
> +copy_instrs_to_ringbuf(struct panthor_queue *queue,
> + struct panthor_job *job,
> + struct panthor_job_ringbuf_instrs *instrs)
> +{
> + ssize_t ringbuf_size = panthor_kernel_bo_size(queue->ringbuf);
> + u32 start = job->ringbuf.start & (ringbuf_size - 1);
> + ssize_t size, written;
> +
> + /*
> + * We need to write a whole slot, including any trailing zeroes
> + * that may come at the end of it. Also, because instrs.buffer had
> + * been zero-initialised, there's no need to pad it with 0's
> + */
> + instrs->count = ALIGN(instrs->count, NUM_INSTRS_PER_CACHE_LINE);
> + size = instrs->count * sizeof(u64);
> + written = min(ringbuf_size - start, size);
> +
> + memcpy(queue->ringbuf->kmap + start, instrs->buffer, written);
> +
> + if (written < size)
> + memcpy(queue->ringbuf->kmap,
> + &instrs->buffer[(ringbuf_size - start)/sizeof(u64)],
> + size - written);
> +}
> +
> +struct panthor_job_cs_params {
> + u32 profile_mask;
> + u64 addr_reg; u64 val_reg;
> + u64 cycle_reg; u64 time_reg;
> + u64 sync_addr; u64 times_addr;
> + u64 cs_start; u64 cs_size;
> + u32 last_flush; u32 waitall_mask;
> +};
> +
> +static void
> +get_job_cs_params(struct panthor_job *job, struct panthor_job_cs_params *params)
> {
> - struct panthor_job *job = container_of(sched_job, struct panthor_job, base);
> struct panthor_group *group = job->group;
> struct panthor_queue *queue = group->queues[job->queue_idx];
> struct panthor_device *ptdev = group->ptdev;
> struct panthor_scheduler *sched = ptdev->scheduler;
> - u32 ringbuf_size = panthor_kernel_bo_size(queue->ringbuf);
> - u32 ringbuf_insert = queue->iface.input->insert & (ringbuf_size - 1);
> - u64 addr_reg = ptdev->csif_info.cs_reg_count -
> - ptdev->csif_info.unpreserved_cs_reg_count;
> - u64 val_reg = addr_reg + 2;
> - u64 sync_addr = panthor_kernel_bo_gpuva(group->syncobjs) +
> - job->queue_idx * sizeof(struct panthor_syncobj_64b);
> - u32 waitall_mask = GENMASK(sched->sb_slot_count - 1, 0);
> - struct dma_fence *done_fence;
> - int ret;
>
> - u64 call_instrs[NUM_INSTRS_PER_SLOT] = {
> + params->addr_reg = ptdev->csif_info.cs_reg_count -
> + ptdev->csif_info.unpreserved_cs_reg_count;
> + params->val_reg = params->addr_reg + 2;
> + params->cycle_reg = params->addr_reg;
> + params->time_reg = params->val_reg;
> +
> + params->sync_addr = panthor_kernel_bo_gpuva(group->syncobjs) +
> + job->queue_idx * sizeof(struct panthor_syncobj_64b);
> + params->times_addr = panthor_kernel_bo_gpuva(queue->profiling_info.slots) +
> + (job->profiling_slot * sizeof(struct panthor_job_profiling_data));
> + params->waitall_mask = GENMASK(sched->sb_slot_count - 1, 0);
> +
> + params->cs_start = job->call_info.start;
> + params->cs_size = job->call_info.size;
> + params->last_flush = job->call_info.latest_flush;
> +
> + params->profile_mask = job->profile_mask;
> +}
> +
> +static void
> +prepare_job_instrs(const struct panthor_job_cs_params *params,
> + struct panthor_job_ringbuf_instrs *instrs)
> +{
> + const struct panthor_job_instr instr_seq[] = {
> /* MOV32 rX+2, cs.latest_flush */
> - (2ull << 56) | (val_reg << 48) | job->call_info.latest_flush,
> + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
> + (2ull << 56) | (params->val_reg << 48) | params->last_flush),
>
> /* FLUSH_CACHE2.clean_inv_all.no_wait.signal(0) rX+2 */
> - (36ull << 56) | (0ull << 48) | (val_reg << 40) | (0 << 16) | 0x233,
> + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
> + (36ull << 56) | (0ull << 48) | (params->val_reg << 40) | (0 << 16) | 0x233),
> +
> + /* MOV48 rX:rX+1, cycles_offset */
> + JOB_INSTR(PANTHOR_DEVICE_PROFILING_CYCLES,
> + (1ull << 56) | (params->cycle_reg << 48) |
> + (params->times_addr +
> + offsetof(struct panthor_job_profiling_data, cycles.before))),
> + /* STORE_STATE cycles */
> + JOB_INSTR(PANTHOR_DEVICE_PROFILING_CYCLES,
> + (40ull << 56) | (params->cycle_reg << 40) | (1ll << 32)),
> + /* MOV48 rX:rX+1, time_offset */
> + JOB_INSTR(PANTHOR_DEVICE_PROFILING_TIMESTAMP,
> + (1ull << 56) | (params->time_reg << 48) |
> + (params->times_addr +
> + offsetof(struct panthor_job_profiling_data, time.before))),
> + /* STORE_STATE timer */
> + JOB_INSTR(PANTHOR_DEVICE_PROFILING_TIMESTAMP,
> + (40ull << 56) | (params->time_reg << 40) | (0ll << 32)),
>
> /* MOV48 rX:rX+1, cs.start */
> - (1ull << 56) | (addr_reg << 48) | job->call_info.start,
> -
> + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
> + (1ull << 56) | (params->addr_reg << 48) | params->cs_start),
> /* MOV32 rX+2, cs.size */
> - (2ull << 56) | (val_reg << 48) | job->call_info.size,
> -
> + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
> + (2ull << 56) | (params->val_reg << 48) | params->cs_size),
> /* WAIT(0) => waits for FLUSH_CACHE2 instruction */
> - (3ull << 56) | (1 << 16),
> -
> + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED, (3ull << 56) | (1 << 16)),
> /* CALL rX:rX+1, rX+2 */
> - (32ull << 56) | (addr_reg << 40) | (val_reg << 32),
> + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
> + (32ull << 56) | (params->addr_reg << 40) | (params->val_reg << 32)),
> +
> + /* MOV48 rX:rX+1, cycles_offset */
> + JOB_INSTR(PANTHOR_DEVICE_PROFILING_CYCLES,
> + (1ull << 56) | (params->cycle_reg << 48) |
> + (params->times_addr +
> + offsetof(struct panthor_job_profiling_data, cycles.after))),
> + /* STORE_STATE cycles */
> + JOB_INSTR(PANTHOR_DEVICE_PROFILING_CYCLES,
> + (40ull << 56) | (params->cycle_reg << 40) | (1ll << 32)),
> +
> + /* MOV48 rX:rX+1, time_offset */
> + JOB_INSTR(PANTHOR_DEVICE_PROFILING_TIMESTAMP,
> + (1ull << 56) | (params->time_reg << 48) |
> + (params->times_addr +
> + offsetof(struct panthor_job_profiling_data, time.after))),
> + /* STORE_STATE timer */
> + JOB_INSTR(PANTHOR_DEVICE_PROFILING_TIMESTAMP,
> + (40ull << 56) | (params->time_reg << 40) | (0ll << 32)),
>
> /* MOV48 rX:rX+1, sync_addr */
> - (1ull << 56) | (addr_reg << 48) | sync_addr,
> -
> + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
> + (1ull << 56) | (params->addr_reg << 48) | params->sync_addr),
> /* MOV48 rX+2, #1 */
> - (1ull << 56) | (val_reg << 48) | 1,
> -
> + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
> + (1ull << 56) | (params->val_reg << 48) | 1),
> /* WAIT(all) */
> - (3ull << 56) | (waitall_mask << 16),
> -
> + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
> + (3ull << 56) | (params->waitall_mask << 16)),
> /* SYNC_ADD64.system_scope.propage_err.nowait rX:rX+1, rX+2*/
> - (51ull << 56) | (0ull << 48) | (addr_reg << 40) | (val_reg << 32) | (0 << 16) | 1,
> -
> + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
> + (51ull << 56) | (0ull << 48) | (params->addr_reg << 40) |
> + (params->val_reg << 32) | (0 << 16) | 1),
> /* ERROR_BARRIER, so we can recover from faults at job
> * boundaries.
> */
> - (47ull << 56),
> + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED, (47ull << 56)),
> + };
> + u32 pad;
> +
> + /* NEED to be cacheline aligned to please the prefetcher. */
> + static_assert(sizeof(instrs->buffer) % 64 == 0,
> + "panthor_job_ringbuf_instrs::buffer is not aligned on a cacheline");
> +
> + /* Make sure we have enough storage to store the whole sequence. */
> + static_assert(ALIGN(ARRAY_SIZE(instr_seq), NUM_INSTRS_PER_CACHE_LINE) <=
> + ARRAY_SIZE(instrs->buffer),
> + "instr_seq vs panthor_job_ringbuf_instrs::buffer size mismatch");
We probably want to catch situations where instrs->buffer has gone
bigger than needed (say we found a way to drop instructions), so I
would turn the '<=' condition into an '=='.
> +
> + for (u32 i = 0; i < ARRAY_SIZE(instr_seq); i++) {
> + /* If the profile mask of this instruction is not enabled, skip it. */
> + if (instr_seq[i].profile_mask &&
> + !(instr_seq[i].profile_mask & params->profile_mask))
> + continue;
> +
> + instrs->buffer[instrs->count++] = instr_seq[i].instr;
> + }
> +
> + pad = ALIGN(instrs->count, NUM_INSTRS_PER_CACHE_LINE);
> + memset(&instrs->buffer[instrs->count], 0,
> + (pad - instrs->count) * sizeof(instrs->buffer[0]));
> + instrs->count = pad;
> +}
> +
> +static u32 calc_job_credits(u32 profile_mask)
> +{
> + struct panthor_job_ringbuf_instrs instrs;
> + struct panthor_job_cs_params params = {
> + .profile_mask = profile_mask,
> };
>
> - /* Need to be cacheline aligned to please the prefetcher. */
> - static_assert(sizeof(call_instrs) % 64 == 0,
> - "call_instrs is not aligned on a cacheline");
> + prepare_job_instrs(¶ms, &instrs);
> + return instrs.count;
> +}
> +
> +static struct dma_fence *
> +queue_run_job(struct drm_sched_job *sched_job)
> +{
> + struct panthor_job *job = container_of(sched_job, struct panthor_job, base);
> + struct panthor_group *group = job->group;
> + struct panthor_queue *queue = group->queues[job->queue_idx];
> + struct panthor_device *ptdev = group->ptdev;
> + struct panthor_scheduler *sched = ptdev->scheduler;
> + struct panthor_job_ringbuf_instrs instrs;
> + struct panthor_job_cs_params cs_params;
> + struct dma_fence *done_fence;
> + int ret;
>
> /* Stream size is zero, nothing to do except making sure all previously
> * submitted jobs are done before we signal the
> @@ -2900,17 +3078,23 @@ queue_run_job(struct drm_sched_job *sched_job)
> queue->fence_ctx.id,
> atomic64_inc_return(&queue->fence_ctx.seqno));
>
> - memcpy(queue->ringbuf->kmap + ringbuf_insert,
> - call_instrs, sizeof(call_instrs));
> + job->profiling_slot = queue->profiling_info.profiling_seqno++;
> + if (queue->profiling_info.profiling_seqno == queue->profiling_info.slot_count)
> + queue->profiling_info.profiling_seqno = 0;
> +
> + job->ringbuf.start = queue->iface.input->insert;
> +
> + get_job_cs_params(job, &cs_params);
> + prepare_job_instrs(&cs_params, &instrs);
> + copy_instrs_to_ringbuf(queue, job, &instrs);
> +
> + job->ringbuf.end = job->ringbuf.start + (instrs.count * sizeof(u64));
>
> panthor_job_get(&job->base);
> spin_lock(&queue->fence_ctx.lock);
> list_add_tail(&job->node, &queue->fence_ctx.in_flight_jobs);
> spin_unlock(&queue->fence_ctx.lock);
>
> - job->ringbuf.start = queue->iface.input->insert;
> - job->ringbuf.end = job->ringbuf.start + sizeof(call_instrs);
> -
> /* Make sure the ring buffer is updated before the INSERT
> * register.
> */
> @@ -3003,6 +3187,24 @@ static const struct drm_sched_backend_ops panthor_queue_sched_ops = {
> .free_job = queue_free_job,
> };
>
> +static u32 calc_profiling_ringbuf_num_slots(struct panthor_device *ptdev,
> + u32 cs_ringbuf_size)
> +{
> + u32 min_profiled_job_instrs = U32_MAX;
> + u32 last_flag = fls(PANTHOR_DEVICE_PROFILING_ALL);
> +
> + for (u32 i = 0; i < last_flag; i++) {
> + if (BIT(i) & PANTHOR_DEVICE_PROFILING_ALL)
> + min_profiled_job_instrs =
> + min(min_profiled_job_instrs, calc_job_credits(BIT(i)));
> + }
Okay, I think this loop deserves an explanation. The goal is to
calculate the minimal size of a profiled job so we can deduce the
maximum number of profiling slots that will be used simultaneously. We
ignore PANTHOR_DEVICE_PROFILING_DISABLED, because those jobs won't use
a profiling slot in the first place.
> +
> + drm_WARN_ON(&ptdev->base,
> + !IS_ALIGNED(min_profiled_job_instrs, NUM_INSTRS_PER_CACHE_LINE));
We can probably drop this WARN_ON(), it's supposed to be checked in
calc_job_credits().
> +
> + return DIV_ROUND_UP(cs_ringbuf_size, min_profiled_job_instrs * sizeof(u64));
> +}
> +
> static struct panthor_queue *
> group_create_queue(struct panthor_group *group,
> const struct drm_panthor_queue_create *args)
> @@ -3056,9 +3258,38 @@ group_create_queue(struct panthor_group *group,
> goto err_free_queue;
> }
>
> + queue->profiling_info.slot_count =
> + calc_profiling_ringbuf_num_slots(group->ptdev, args->ringbuf_size);
> +
> + queue->profiling_info.slots =
> + panthor_kernel_bo_create(group->ptdev, group->vm,
> + queue->profiling_info.slot_count *
> + sizeof(struct panthor_job_profiling_data),
> + DRM_PANTHOR_BO_NO_MMAP,
> + DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
> + DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
> + PANTHOR_VM_KERNEL_AUTO_VA);
> +
> + if (IS_ERR(queue->profiling_info.slots)) {
> + ret = PTR_ERR(queue->profiling_info.slots);
> + goto err_free_queue;
> + }
> +
> + ret = panthor_kernel_bo_vmap(queue->profiling_info.slots);
> + if (ret)
> + goto err_free_queue;
> +
> + memset(queue->profiling_info.slots->kmap, 0,
> + queue->profiling_info.slot_count * sizeof(struct panthor_job_profiling_data));
I don't think we need to memset() the profiling buffer.
> +
> + /*
> + * Credit limit argument tells us the total number of instructions
> + * across all CS slots in the ringbuffer, with some jobs requiring
> + * twice as many as others, depending on their profiling status.
> + */
> ret = drm_sched_init(&queue->scheduler, &panthor_queue_sched_ops,
> group->ptdev->scheduler->wq, 1,
> - args->ringbuf_size / (NUM_INSTRS_PER_SLOT * sizeof(u64)),
> + args->ringbuf_size / sizeof(u64),
> 0, msecs_to_jiffies(JOB_TIMEOUT_MS),
> group->ptdev->reset.wq,
> NULL, "panthor-queue", group->ptdev->base.dev);
> @@ -3354,6 +3585,7 @@ panthor_job_create(struct panthor_file *pfile,
> {
> struct panthor_group_pool *gpool = pfile->groups;
> struct panthor_job *job;
> + u32 credits;
> int ret;
>
> if (qsubmit->pad)
> @@ -3407,9 +3639,16 @@ panthor_job_create(struct panthor_file *pfile,
> }
> }
>
> + job->profile_mask = pfile->ptdev->profile_mask;
> + credits = calc_job_credits(job->profile_mask);
> + if (credits == 0) {
> + ret = -EINVAL;
> + goto err_put_job;
> + }
> +
> ret = drm_sched_job_init(&job->base,
> &job->group->queues[job->queue_idx]->entity,
> - 1, job->group);
> + credits, job->group);
> if (ret)
> goto err_put_job;
>
Just add a bunch of minor comments (mostly cosmetic changes), but the
implementation looks good to me.
Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
* Re: [PATCH v5 2/4] drm/panthor: add DRM fdinfo support
2024-09-03 20:25 ` [PATCH v5 2/4] drm/panthor: add DRM fdinfo support Adrián Larumbe
@ 2024-09-04 8:01 ` Boris Brezillon
2024-09-04 17:45 ` kernel test robot
1 sibling, 0 replies; 15+ messages in thread
From: Boris Brezillon @ 2024-09-04 8:01 UTC (permalink / raw)
To: Adrián Larumbe
Cc: Steven Price, Liviu Dudau, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, David Airlie, Daniel Vetter, Sumit Semwal,
Christian König, kernel, dri-devel, linux-kernel,
linux-media, linaro-mm-sig
On Tue, 3 Sep 2024 21:25:36 +0100
Adrián Larumbe <adrian.larumbe@collabora.com> wrote:
> Drawing from the FW-calculated values in the previous commit, we can
> increase the numbers for an open file by collecting them from finished jobs
> when updating their group synchronisation objects.
>
> Display of fdinfo key-value pairs is governed by a flag that is by default
> disabled in the present commit, and supporting manual toggle of it will be
> the matter of a later commit.
>
> Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
> ---
> drivers/gpu/drm/panthor/panthor_devfreq.c | 18 ++++++++-
> drivers/gpu/drm/panthor/panthor_device.h | 14 +++++++
> drivers/gpu/drm/panthor/panthor_drv.c | 35 ++++++++++++++++++
> drivers/gpu/drm/panthor/panthor_sched.c | 45 +++++++++++++++++++++++
> 4 files changed, 111 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/panthor/panthor_devfreq.c b/drivers/gpu/drm/panthor/panthor_devfreq.c
> index c6d3c327cc24..9d0f891b9b53 100644
> --- a/drivers/gpu/drm/panthor/panthor_devfreq.c
> +++ b/drivers/gpu/drm/panthor/panthor_devfreq.c
> @@ -62,14 +62,20 @@ static void panthor_devfreq_update_utilization(struct panthor_devfreq *pdevfreq)
> static int panthor_devfreq_target(struct device *dev, unsigned long *freq,
> u32 flags)
> {
> + struct panthor_device *ptdev = dev_get_drvdata(dev);
> struct dev_pm_opp *opp;
> + int err;
>
> opp = devfreq_recommended_opp(dev, freq, flags);
> if (IS_ERR(opp))
> return PTR_ERR(opp);
> dev_pm_opp_put(opp);
>
> - return dev_pm_opp_set_rate(dev, *freq);
> + err = dev_pm_opp_set_rate(dev, *freq);
> + if (!err)
> + ptdev->current_frequency = *freq;
> +
> + return err;
> }
>
> static void panthor_devfreq_reset(struct panthor_devfreq *pdevfreq)
> @@ -130,6 +136,7 @@ int panthor_devfreq_init(struct panthor_device *ptdev)
> struct panthor_devfreq *pdevfreq;
> struct dev_pm_opp *opp;
> unsigned long cur_freq;
> + unsigned long freq = ULONG_MAX;
> int ret;
>
> pdevfreq = drmm_kzalloc(&ptdev->base, sizeof(*ptdev->devfreq), GFP_KERNEL);
> @@ -161,6 +168,7 @@ int panthor_devfreq_init(struct panthor_device *ptdev)
> return PTR_ERR(opp);
>
> panthor_devfreq_profile.initial_freq = cur_freq;
> + ptdev->current_frequency = cur_freq;
>
> /* Regulator coupling only takes care of synchronizing/balancing voltage
> * updates, but the coupled regulator needs to be enabled manually.
> @@ -204,6 +212,14 @@ int panthor_devfreq_init(struct panthor_device *ptdev)
>
> dev_pm_opp_put(opp);
>
> + /* Find the fastest defined rate */
> + opp = dev_pm_opp_find_freq_floor(dev, &freq);
> + if (IS_ERR(opp))
> + return PTR_ERR(opp);
> + ptdev->fast_rate = freq;
> +
> + dev_pm_opp_put(opp);
> +
> /*
> * Setup default thresholds for the simple_ondemand governor.
> * The values are chosen based on experiments.
> diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> index a48e30d0af30..0e68f5a70d20 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.h
> +++ b/drivers/gpu/drm/panthor/panthor_device.h
> @@ -184,6 +184,17 @@ struct panthor_device {
>
> /** @profile_mask: User-set profiling flags for job accounting. */
> u32 profile_mask;
> +
> + /** @current_frequency: Device clock frequency at present. Set by DVFS*/
> + unsigned long current_frequency;
> +
> + /** @fast_rate: Maximum device clock frequency. Set by DVFS */
> + unsigned long fast_rate;
> +};
Can we move the current_frequency/fast_rate retrieval in a separate
patch?
> +
> +struct panthor_gpu_usage {
> + u64 time;
> + u64 cycles;
> };
>
> /**
> @@ -198,6 +209,9 @@ struct panthor_file {
>
> /** @groups: Scheduling group pool attached to this file. */
> struct panthor_group_pool *groups;
> +
> + /** @stats: cycle and timestamp measures for job execution. */
> + struct panthor_gpu_usage stats;
> };
>
> int panthor_device_init(struct panthor_device *ptdev);
> diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
> index b5e7b919f241..e18838754963 100644
> --- a/drivers/gpu/drm/panthor/panthor_drv.c
> +++ b/drivers/gpu/drm/panthor/panthor_drv.c
> @@ -3,12 +3,17 @@
> /* Copyright 2019 Linaro, Ltd., Rob Herring <robh@kernel.org> */
> /* Copyright 2019 Collabora ltd. */
>
> +#ifdef CONFIG_ARM_ARCH_TIMER
> +#include <asm/arch_timer.h>
> +#endif
> +
> #include <linux/list.h>
> #include <linux/module.h>
> #include <linux/of_platform.h>
> #include <linux/pagemap.h>
> #include <linux/platform_device.h>
> #include <linux/pm_runtime.h>
> +#include <linux/time64.h>
>
> #include <drm/drm_debugfs.h>
> #include <drm/drm_drv.h>
> @@ -1351,6 +1356,34 @@ static int panthor_mmap(struct file *filp, struct vm_area_struct *vma)
> return ret;
> }
>
> +static void panthor_gpu_show_fdinfo(struct panthor_device *ptdev,
> + struct panthor_file *pfile,
> + struct drm_printer *p)
> +{
> + if (ptdev->profile_mask & PANTHOR_DEVICE_PROFILING_TIMESTAMP) {
> +#ifdef CONFIG_ARM_ARCH_TIMER
> + drm_printf(p, "drm-engine-panthor:\t%llu ns\n",
> + DIV_ROUND_UP_ULL((pfile->stats.time * NSEC_PER_SEC),
> + arch_timer_get_cntfrq()));
> +#endif
> + }
> + if (ptdev->profile_mask & PANTHOR_DEVICE_PROFILING_CYCLES)
> + drm_printf(p, "drm-cycles-panthor:\t%llu\n", pfile->stats.cycles);
Don't know if that's an issue, but another thread might be updating the
stats while show_fdinfo() is run, which means the data you return might
be coming from two different sampling points.
> +
> + drm_printf(p, "drm-maxfreq-panthor:\t%lu Hz\n", ptdev->fast_rate);
> + drm_printf(p, "drm-curfreq-panthor:\t%lu Hz\n", ptdev->current_frequency);
> +}
> +
> +static void panthor_show_fdinfo(struct drm_printer *p, struct drm_file *file)
> +{
> + struct drm_device *dev = file->minor->dev;
> + struct panthor_device *ptdev = container_of(dev, struct panthor_device, base);
> +
> + panthor_gpu_show_fdinfo(ptdev, file->driver_priv, p);
> +
> + drm_show_memory_stats(p, file);
> +}
> +
> static const struct file_operations panthor_drm_driver_fops = {
> .open = drm_open,
> .release = drm_release,
> @@ -1360,6 +1393,7 @@ static const struct file_operations panthor_drm_driver_fops = {
> .read = drm_read,
> .llseek = noop_llseek,
> .mmap = panthor_mmap,
> + .show_fdinfo = drm_show_fdinfo,
> };
>
> #ifdef CONFIG_DEBUG_FS
> @@ -1378,6 +1412,7 @@ static const struct drm_driver panthor_drm_driver = {
> DRIVER_SYNCOBJ_TIMELINE | DRIVER_GEM_GPUVA,
> .open = panthor_open,
> .postclose = panthor_postclose,
> + .show_fdinfo = panthor_show_fdinfo,
> .ioctls = panthor_drm_driver_ioctls,
> .num_ioctls = ARRAY_SIZE(panthor_drm_driver_ioctls),
> .fops = &panthor_drm_driver_fops,
> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> index b087648bf59a..e69ab5175ae8 100644
> --- a/drivers/gpu/drm/panthor/panthor_sched.c
> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> @@ -619,6 +619,18 @@ struct panthor_group {
> */
> struct panthor_kernel_bo *syncobjs;
>
> + /** @fdinfo: Per-file total cycle and timestamp values reference. */
> + struct {
> + /** @data: Pointer to actual per-file sample data. */
> + struct panthor_gpu_usage *data;
> +
> + /**
> + * @lock: Mutex to govern concurrent access from drm file's fdinfo callback
> + * and job post-completion processing function
> + */
> + struct mutex lock;
> + } fdinfo;
> +
> /** @state: Group state. */
> enum panthor_group_state state;
>
> @@ -886,6 +898,8 @@ static void group_release_work(struct work_struct *work)
> release_work);
> u32 i;
>
> + mutex_destroy(&group->fdinfo.lock);
> +
> for (i = 0; i < group->queue_count; i++)
> group_free_queue(group, group->queues[i]);
>
> @@ -2808,6 +2822,28 @@ void panthor_sched_post_reset(struct panthor_device *ptdev, bool reset_failed)
> }
> }
>
> +static void update_fdinfo_stats(struct panthor_job *job)
> +{
> + struct panthor_group *group = job->group;
> + struct panthor_queue *queue = group->queues[job->queue_idx];
> + struct panthor_gpu_usage *fdinfo;
> + struct panthor_job_profiling_data *times;
> +
> + times = (struct panthor_job_profiling_data *)
> + ((unsigned long) queue->profiling_info.slots->kmap +
> + (job->profiling_slot * sizeof(struct panthor_job_profiling_data)));
> +
> + mutex_lock(&group->fdinfo.lock);
> + if ((group->fdinfo.data)) {
> + fdinfo = group->fdinfo.data;
> + if (job->profile_mask & PANTHOR_DEVICE_PROFILING_CYCLES)
> + fdinfo->cycles += times->cycles.after - times->cycles.before;
> + if (job->profile_mask & PANTHOR_DEVICE_PROFILING_TIMESTAMP)
> + fdinfo->time += times->time.after - times->time.before;
> + }
> + mutex_unlock(&group->fdinfo.lock);
> +}
> +
> static void group_sync_upd_work(struct work_struct *work)
> {
> struct panthor_group *group =
> @@ -2840,6 +2876,8 @@ static void group_sync_upd_work(struct work_struct *work)
> dma_fence_end_signalling(cookie);
>
> list_for_each_entry_safe(job, job_tmp, &done_jobs, node) {
> + if (job->profile_mask)
> + update_fdinfo_stats(job);
> list_del_init(&job->node);
> panthor_job_put(&job->base);
> }
> @@ -3430,6 +3468,9 @@ int panthor_group_create(struct panthor_file *pfile,
> }
> mutex_unlock(&sched->reset.lock);
>
> + group->fdinfo.data = &pfile->stats;
> + mutex_init(&group->fdinfo.lock);
> +
> return gid;
>
> err_put_group:
> @@ -3469,6 +3510,10 @@ int panthor_group_destroy(struct panthor_file *pfile, u32 group_handle)
> mutex_unlock(&sched->lock);
> mutex_unlock(&sched->reset.lock);
>
> + mutex_lock(&group->fdinfo.lock);
> + group->fdinfo.data = NULL;
> + mutex_unlock(&group->fdinfo.lock);
> +
> group_put(group);
> return 0;
> }
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v5 3/4] drm/panthor: enable fdinfo for memory stats
2024-09-03 20:25 ` [PATCH v5 3/4] drm/panthor: enable fdinfo for memory stats Adrián Larumbe
@ 2024-09-04 8:03 ` Boris Brezillon
0 siblings, 0 replies; 15+ messages in thread
From: Boris Brezillon @ 2024-09-04 8:03 UTC (permalink / raw)
To: Adrián Larumbe
Cc: Steven Price, Liviu Dudau, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, David Airlie, Daniel Vetter, Sumit Semwal,
Christian König, kernel, dri-devel, linux-kernel,
linux-media, linaro-mm-sig
On Tue, 3 Sep 2024 21:25:37 +0100
Adrián Larumbe <adrian.larumbe@collabora.com> wrote:
> Implement drm object's status callback.
>
> Also, we consider a PRIME-imported BO to be resident if its matching
> dma_buf has an open attachment, which means its backing storage has
> already been allocated.
>
> Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
> Reviewed-by: Liviu Dudau <liviu.dudau@arm.com>
Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
> ---
> drivers/gpu/drm/panthor/panthor_gem.c | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/drivers/gpu/drm/panthor/panthor_gem.c b/drivers/gpu/drm/panthor/panthor_gem.c
> index 38f560864879..c60b599665d8 100644
> --- a/drivers/gpu/drm/panthor/panthor_gem.c
> +++ b/drivers/gpu/drm/panthor/panthor_gem.c
> @@ -145,6 +145,17 @@ panthor_gem_prime_export(struct drm_gem_object *obj, int flags)
> return drm_gem_prime_export(obj, flags);
> }
>
> +static enum drm_gem_object_status panthor_gem_status(struct drm_gem_object *obj)
> +{
> + struct panthor_gem_object *bo = to_panthor_bo(obj);
> + enum drm_gem_object_status res = 0;
> +
> + if (bo->base.base.import_attach || bo->base.pages)
> + res |= DRM_GEM_OBJECT_RESIDENT;
> +
> + return res;
> +}
> +
> static const struct drm_gem_object_funcs panthor_gem_funcs = {
> .free = panthor_gem_free_object,
> .print_info = drm_gem_shmem_object_print_info,
> @@ -154,6 +165,7 @@ static const struct drm_gem_object_funcs panthor_gem_funcs = {
> .vmap = drm_gem_shmem_object_vmap,
> .vunmap = drm_gem_shmem_object_vunmap,
> .mmap = panthor_gem_mmap,
> + .status = panthor_gem_status,
> .export = panthor_gem_prime_export,
> .vm_ops = &drm_gem_shmem_vm_ops,
> };
* Re: [PATCH v5 4/4] drm/panthor: add sysfs knob for enabling job profiling
2024-09-03 20:25 ` [PATCH v5 4/4] drm/panthor: add sysfs knob for enabling job profiling Adrián Larumbe
@ 2024-09-04 8:07 ` Boris Brezillon
0 siblings, 0 replies; 15+ messages in thread
From: Boris Brezillon @ 2024-09-04 8:07 UTC (permalink / raw)
To: Adrián Larumbe
Cc: Steven Price, Liviu Dudau, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, David Airlie, Daniel Vetter, Sumit Semwal,
Christian König, kernel, dri-devel, linux-kernel,
linux-media, linaro-mm-sig
On Tue, 3 Sep 2024 21:25:38 +0100
Adrián Larumbe <adrian.larumbe@collabora.com> wrote:
> This commit introduces a DRM device sysfs attribute that lets user space
> control the job accounting status in the device. The knob variable was
> brought in as part of a previous commit, but now we're able to set it
> manually.
>
> As sysfs files are part of a driver's uAPI, describe its legitimate input
> values and output format in a documentation file.
>
> Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
> ---
> Documentation/gpu/panthor.rst | 46 +++++++++++++++++++++++++++
> drivers/gpu/drm/panthor/panthor_drv.c | 39 +++++++++++++++++++++++
> 2 files changed, 85 insertions(+)
> create mode 100644 Documentation/gpu/panthor.rst
>
> diff --git a/Documentation/gpu/panthor.rst b/Documentation/gpu/panthor.rst
> new file mode 100644
> index 000000000000..cbf5c4429a2d
> --- /dev/null
> +++ b/Documentation/gpu/panthor.rst
> @@ -0,0 +1,46 @@
> +.. SPDX-License-Identifier: GPL-2.0+
> +
> +=========================
> + drm/Panthor CSF driver
> +=========================
> +
> +.. _panthor-usage-stats:
> +
> +Panthor DRM client usage stats implementation
> +==============================================
> +
> +The drm/Panthor driver implements the DRM client usage stats specification as
> +documented in :ref:`drm-client-usage-stats`.
> +
> +Example of the output showing the implemented key value pairs and entirety of
> +the currently possible format options:
> +
> +::
> +
> + pos: 0
> + flags: 02400002
> + mnt_id: 29
> + ino: 491
> + drm-driver: panthor
> + drm-client-id: 10
> + drm-engine-panthor: 111110952750 ns
> + drm-cycles-panthor: 94439687187
> + drm-maxfreq-panthor: 1000000000 Hz
> + drm-curfreq-panthor: 1000000000 Hz
> + drm-total-memory: 16480 KiB
> + drm-shared-memory: 0
> + drm-active-memory: 16200 KiB
> + drm-resident-memory: 16480 KiB
> + drm-purgeable-memory: 0
> +
> +Possible `drm-engine-` key names are: `panthor`.
> +`drm-curfreq-` values convey the current operating frequency for that engine.
> +
> +Engine and cycle sampling are disabled by default because of power-saving
> +concerns. `fdinfo` consumers and benchmark applications that query the
> +fdinfo file must therefore first enable job profiling in the driver by
> +writing into the appropriate sysfs node::
> +
> + echo <N> > /sys/bus/platform/drivers/panthor/[a-f0-9]*.gpu/profiling
> +
> +Where `N` is a bit mask in which bit 0 enables cycle sampling and bit 1
> +enables timestamp sampling.
This should probably be documented in
Documentation/ABI/testing/sysfs-driver-panthor too.
> diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
> index e18838754963..26475db96c41 100644
> --- a/drivers/gpu/drm/panthor/panthor_drv.c
> +++ b/drivers/gpu/drm/panthor/panthor_drv.c
> @@ -1450,6 +1450,44 @@ static void panthor_remove(struct platform_device *pdev)
> panthor_device_unplug(ptdev);
> }
>
> +static ssize_t profiling_show(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + struct panthor_device *ptdev = dev_get_drvdata(dev);
> +
> + return sysfs_emit(buf, "%u\n", ptdev->profile_mask);
> +}
> +
> +static ssize_t profiling_store(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf, size_t len)
> +{
> + struct panthor_device *ptdev = dev_get_drvdata(dev);
> + u32 value;
> + int err;
> +
> + err = kstrtou32(buf, 0, &value);
> + if (err)
> + return err;
> +
> + if ((value & ~PANTHOR_DEVICE_PROFILING_ALL) != 0)
> + return -EINVAL;
> +
> + ptdev->profile_mask = value;
> +
> + return len;
> +}
> +
> +static DEVICE_ATTR_RW(profiling);
> +
> +static struct attribute *panthor_attrs[] = {
> + &dev_attr_profiling.attr,
> + NULL,
> +};
> +
> +ATTRIBUTE_GROUPS(panthor);
> +
> static const struct of_device_id dt_match[] = {
> { .compatible = "rockchip,rk3588-mali" },
> { .compatible = "arm,mali-valhall-csf" },
> @@ -1469,6 +1507,7 @@ static struct platform_driver panthor_driver = {
> .name = "panthor",
> .pm = pm_ptr(&panthor_pm_ops),
> .of_match_table = dt_match,
> + .dev_groups = panthor_groups,
> },
> };
>
* Re: [PATCH v5 1/4] drm/panthor: introduce job cycle and timestamp accounting
2024-09-03 20:25 ` [PATCH v5 1/4] drm/panthor: introduce job cycle and timestamp accounting Adrián Larumbe
2024-09-04 7:49 ` Boris Brezillon
@ 2024-09-04 16:10 ` kernel test robot
2024-09-04 17:02 ` kernel test robot
2024-09-04 18:55 ` Liviu Dudau
3 siblings, 0 replies; 15+ messages in thread
From: kernel test robot @ 2024-09-04 16:10 UTC (permalink / raw)
To: Adrián Larumbe, Boris Brezillon, Steven Price, Liviu Dudau,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
Daniel Vetter, Sumit Semwal, Christian König
Cc: oe-kbuild-all, kernel, Adrián Larumbe, dri-devel,
linux-kernel, linux-media, linaro-mm-sig
Hi Adrián,
kernel test robot noticed the following build warnings:
[auto build test WARNING on drm-misc/drm-misc-next]
[also build test WARNING on linus/master v6.11-rc6 next-20240904]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Adri-n-Larumbe/drm-panthor-introduce-job-cycle-and-timestamp-accounting/20240904-042645
base: git://anongit.freedesktop.org/drm/drm-misc drm-misc-next
patch link: https://lore.kernel.org/r/20240903202541.430225-2-adrian.larumbe%40collabora.com
patch subject: [PATCH v5 1/4] drm/panthor: introduce job cycle and timestamp accounting
config: x86_64-buildonly-randconfig-002-20240904 (https://download.01.org/0day-ci/archive/20240904/202409042317.CRCMb6bs-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240904/202409042317.CRCMb6bs-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202409042317.CRCMb6bs-lkp@intel.com/
All warnings (new ones prefixed by >>):
drivers/gpu/drm/panthor/panthor_sched.c:322: warning: Excess struct member 'runnable' description in 'panthor_scheduler'
drivers/gpu/drm/panthor/panthor_sched.c:322: warning: Excess struct member 'idle' description in 'panthor_scheduler'
drivers/gpu/drm/panthor/panthor_sched.c:322: warning: Excess struct member 'waiting' description in 'panthor_scheduler'
drivers/gpu/drm/panthor/panthor_sched.c:322: warning: Excess struct member 'has_ref' description in 'panthor_scheduler'
drivers/gpu/drm/panthor/panthor_sched.c:322: warning: Excess struct member 'in_progress' description in 'panthor_scheduler'
drivers/gpu/drm/panthor/panthor_sched.c:322: warning: Excess struct member 'stopped_groups' description in 'panthor_scheduler'
drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'mem' description in 'panthor_queue'
drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'input' description in 'panthor_queue'
drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'output' description in 'panthor_queue'
drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'input_fw_va' description in 'panthor_queue'
drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'output_fw_va' description in 'panthor_queue'
drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'gpu_va' description in 'panthor_queue'
drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'ref' description in 'panthor_queue'
drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'gt' description in 'panthor_queue'
drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'sync64' description in 'panthor_queue'
drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'bo' description in 'panthor_queue'
drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'offset' description in 'panthor_queue'
drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'kmap' description in 'panthor_queue'
drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'lock' description in 'panthor_queue'
drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'id' description in 'panthor_queue'
drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'seqno' description in 'panthor_queue'
drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'last_fence' description in 'panthor_queue'
drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'in_flight_jobs' description in 'panthor_queue'
>> drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'slots' description in 'panthor_queue'
>> drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'slot_count' description in 'panthor_queue'
>> drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'profiling_seqno' description in 'panthor_queue'
>> drivers/gpu/drm/panthor/panthor_sched.c:810: warning: Function parameter or struct member 'profiling_slot' not described in 'panthor_job'
drivers/gpu/drm/panthor/panthor_sched.c:810: warning: Excess struct member 'start' description in 'panthor_job'
drivers/gpu/drm/panthor/panthor_sched.c:810: warning: Excess struct member 'size' description in 'panthor_job'
drivers/gpu/drm/panthor/panthor_sched.c:810: warning: Excess struct member 'latest_flush' description in 'panthor_job'
drivers/gpu/drm/panthor/panthor_sched.c:810: warning: Excess struct member 'start' description in 'panthor_job'
drivers/gpu/drm/panthor/panthor_sched.c:810: warning: Excess struct member 'end' description in 'panthor_job'
>> drivers/gpu/drm/panthor/panthor_sched.c:810: warning: Excess struct member 'profile_slot' description in 'panthor_job'
drivers/gpu/drm/panthor/panthor_sched.c:1731: warning: Function parameter or struct member 'ptdev' not described in 'panthor_sched_report_fw_events'
drivers/gpu/drm/panthor/panthor_sched.c:1731: warning: Function parameter or struct member 'events' not described in 'panthor_sched_report_fw_events'
drivers/gpu/drm/panthor/panthor_sched.c:2623: warning: Function parameter or struct member 'ptdev' not described in 'panthor_sched_report_mmu_fault'
vim +494 drivers/gpu/drm/panthor/panthor_sched.c
de85488138247d Boris Brezillon 2024-02-29 397
de85488138247d Boris Brezillon 2024-02-29 398 /** @ringbuf: Command stream ring-buffer. */
de85488138247d Boris Brezillon 2024-02-29 399 struct panthor_kernel_bo *ringbuf;
de85488138247d Boris Brezillon 2024-02-29 400
de85488138247d Boris Brezillon 2024-02-29 401 /** @iface: Firmware interface. */
de85488138247d Boris Brezillon 2024-02-29 402 struct {
de85488138247d Boris Brezillon 2024-02-29 403 /** @mem: FW memory allocated for this interface. */
de85488138247d Boris Brezillon 2024-02-29 404 struct panthor_kernel_bo *mem;
de85488138247d Boris Brezillon 2024-02-29 405
de85488138247d Boris Brezillon 2024-02-29 406 /** @input: Input interface. */
de85488138247d Boris Brezillon 2024-02-29 407 struct panthor_fw_ringbuf_input_iface *input;
de85488138247d Boris Brezillon 2024-02-29 408
de85488138247d Boris Brezillon 2024-02-29 409 /** @output: Output interface. */
de85488138247d Boris Brezillon 2024-02-29 410 const struct panthor_fw_ringbuf_output_iface *output;
de85488138247d Boris Brezillon 2024-02-29 411
de85488138247d Boris Brezillon 2024-02-29 412 /** @input_fw_va: FW virtual address of the input interface buffer. */
de85488138247d Boris Brezillon 2024-02-29 413 u32 input_fw_va;
de85488138247d Boris Brezillon 2024-02-29 414
de85488138247d Boris Brezillon 2024-02-29 415 /** @output_fw_va: FW virtual address of the output interface buffer. */
de85488138247d Boris Brezillon 2024-02-29 416 u32 output_fw_va;
de85488138247d Boris Brezillon 2024-02-29 417 } iface;
de85488138247d Boris Brezillon 2024-02-29 418
de85488138247d Boris Brezillon 2024-02-29 419 /**
de85488138247d Boris Brezillon 2024-02-29 420 * @syncwait: Stores information about the synchronization object this
de85488138247d Boris Brezillon 2024-02-29 421 * queue is waiting on.
de85488138247d Boris Brezillon 2024-02-29 422 */
de85488138247d Boris Brezillon 2024-02-29 423 struct {
de85488138247d Boris Brezillon 2024-02-29 424 /** @gpu_va: GPU address of the synchronization object. */
de85488138247d Boris Brezillon 2024-02-29 425 u64 gpu_va;
de85488138247d Boris Brezillon 2024-02-29 426
de85488138247d Boris Brezillon 2024-02-29 427 /** @ref: Reference value to compare against. */
de85488138247d Boris Brezillon 2024-02-29 428 u64 ref;
de85488138247d Boris Brezillon 2024-02-29 429
de85488138247d Boris Brezillon 2024-02-29 430 /** @gt: True if this is a greater-than test. */
de85488138247d Boris Brezillon 2024-02-29 431 bool gt;
de85488138247d Boris Brezillon 2024-02-29 432
de85488138247d Boris Brezillon 2024-02-29 433 /** @sync64: True if this is a 64-bit sync object. */
de85488138247d Boris Brezillon 2024-02-29 434 bool sync64;
de85488138247d Boris Brezillon 2024-02-29 435
de85488138247d Boris Brezillon 2024-02-29 436 /** @bo: Buffer object holding the synchronization object. */
de85488138247d Boris Brezillon 2024-02-29 437 struct drm_gem_object *obj;
de85488138247d Boris Brezillon 2024-02-29 438
de85488138247d Boris Brezillon 2024-02-29 439 /** @offset: Offset of the synchronization object inside @bo. */
de85488138247d Boris Brezillon 2024-02-29 440 u64 offset;
de85488138247d Boris Brezillon 2024-02-29 441
de85488138247d Boris Brezillon 2024-02-29 442 /**
de85488138247d Boris Brezillon 2024-02-29 443 * @kmap: Kernel mapping of the buffer object holding the
de85488138247d Boris Brezillon 2024-02-29 444 * synchronization object.
de85488138247d Boris Brezillon 2024-02-29 445 */
de85488138247d Boris Brezillon 2024-02-29 446 void *kmap;
de85488138247d Boris Brezillon 2024-02-29 447 } syncwait;
de85488138247d Boris Brezillon 2024-02-29 448
de85488138247d Boris Brezillon 2024-02-29 449 /** @fence_ctx: Fence context fields. */
de85488138247d Boris Brezillon 2024-02-29 450 struct {
de85488138247d Boris Brezillon 2024-02-29 451 /** @lock: Used to protect access to all fences allocated by this context. */
de85488138247d Boris Brezillon 2024-02-29 452 spinlock_t lock;
de85488138247d Boris Brezillon 2024-02-29 453
de85488138247d Boris Brezillon 2024-02-29 454 /**
de85488138247d Boris Brezillon 2024-02-29 455 * @id: Fence context ID.
de85488138247d Boris Brezillon 2024-02-29 456 *
de85488138247d Boris Brezillon 2024-02-29 457 * Allocated with dma_fence_context_alloc().
de85488138247d Boris Brezillon 2024-02-29 458 */
de85488138247d Boris Brezillon 2024-02-29 459 u64 id;
de85488138247d Boris Brezillon 2024-02-29 460
de85488138247d Boris Brezillon 2024-02-29 461 /** @seqno: Sequence number of the last initialized fence. */
de85488138247d Boris Brezillon 2024-02-29 462 atomic64_t seqno;
de85488138247d Boris Brezillon 2024-02-29 463
7b6f9ec6ad5112 Boris Brezillon 2024-07-03 464 /**
7b6f9ec6ad5112 Boris Brezillon 2024-07-03 465 * @last_fence: Fence of the last submitted job.
7b6f9ec6ad5112 Boris Brezillon 2024-07-03 466 *
7b6f9ec6ad5112 Boris Brezillon 2024-07-03 467 * We return this fence when we get an empty command stream.
7b6f9ec6ad5112 Boris Brezillon 2024-07-03 468 * This way, we are guaranteed that all earlier jobs have completed
7b6f9ec6ad5112 Boris Brezillon 2024-07-03 469 * when drm_sched_job::s_fence::finished without having to feed
7b6f9ec6ad5112 Boris Brezillon 2024-07-03 470 * the CS ring buffer with a dummy job that only signals the fence.
7b6f9ec6ad5112 Boris Brezillon 2024-07-03 471 */
7b6f9ec6ad5112 Boris Brezillon 2024-07-03 472 struct dma_fence *last_fence;
7b6f9ec6ad5112 Boris Brezillon 2024-07-03 473
de85488138247d Boris Brezillon 2024-02-29 474 /**
de85488138247d Boris Brezillon 2024-02-29 475 * @in_flight_jobs: List containing all in-flight jobs.
de85488138247d Boris Brezillon 2024-02-29 476 *
de85488138247d Boris Brezillon 2024-02-29 477 * Used to keep track and signal panthor_job::done_fence when the
de85488138247d Boris Brezillon 2024-02-29 478 * synchronization object attached to the queue is signaled.
de85488138247d Boris Brezillon 2024-02-29 479 */
de85488138247d Boris Brezillon 2024-02-29 480 struct list_head in_flight_jobs;
de85488138247d Boris Brezillon 2024-02-29 481 } fence_ctx;
6f64890b41a576 Adrián Larumbe 2024-09-03 482
6f64890b41a576 Adrián Larumbe 2024-09-03 483 /** @profiling_info: Job profiling data slots and access information. */
6f64890b41a576 Adrián Larumbe 2024-09-03 484 struct {
6f64890b41a576 Adrián Larumbe 2024-09-03 485 /** @slots: Kernel BO holding the slots. */
6f64890b41a576 Adrián Larumbe 2024-09-03 486 struct panthor_kernel_bo *slots;
6f64890b41a576 Adrián Larumbe 2024-09-03 487
6f64890b41a576 Adrián Larumbe 2024-09-03 488 /** @slot_count: Number of jobs ringbuffer can hold at once. */
6f64890b41a576 Adrián Larumbe 2024-09-03 489 u32 slot_count;
6f64890b41a576 Adrián Larumbe 2024-09-03 490
6f64890b41a576 Adrián Larumbe 2024-09-03 491 /** @profiling_seqno: Index of the next available profiling information slot. */
6f64890b41a576 Adrián Larumbe 2024-09-03 492 u32 profiling_seqno;
6f64890b41a576 Adrián Larumbe 2024-09-03 493 } profiling_info;
de85488138247d Boris Brezillon 2024-02-29 @494 };
de85488138247d Boris Brezillon 2024-02-29 495
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH v5 1/4] drm/panthor: introduce job cycle and timestamp accounting
2024-09-03 20:25 ` [PATCH v5 1/4] drm/panthor: introduce job cycle and timestamp accounting Adrián Larumbe
2024-09-04 7:49 ` Boris Brezillon
2024-09-04 16:10 ` kernel test robot
@ 2024-09-04 17:02 ` kernel test robot
2024-09-04 18:55 ` Liviu Dudau
3 siblings, 0 replies; 15+ messages in thread
From: kernel test robot @ 2024-09-04 17:02 UTC (permalink / raw)
To: Adrián Larumbe, Boris Brezillon, Steven Price, Liviu Dudau,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
Daniel Vetter, Sumit Semwal, Christian König
Cc: oe-kbuild-all, kernel, Adrián Larumbe, dri-devel,
linux-kernel, linux-media, linaro-mm-sig
Hi Adrián,
kernel test robot noticed the following build errors:
[auto build test ERROR on drm-misc/drm-misc-next]
[also build test ERROR on linus/master v6.11-rc6 next-20240904]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Adri-n-Larumbe/drm-panthor-introduce-job-cycle-and-timestamp-accounting/20240904-042645
base: git://anongit.freedesktop.org/drm/drm-misc drm-misc-next
patch link: https://lore.kernel.org/r/20240903202541.430225-2-adrian.larumbe%40collabora.com
patch subject: [PATCH v5 1/4] drm/panthor: introduce job cycle and timestamp accounting
config: arc-allmodconfig (https://download.01.org/0day-ci/archive/20240905/202409050054.oRwtzLQ4-lkp@intel.com/config)
compiler: arceb-elf-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240905/202409050054.oRwtzLQ4-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202409050054.oRwtzLQ4-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from <command-line>:
In function 'copy_instrs_to_ringbuf',
inlined from 'queue_run_job' at drivers/gpu/drm/panthor/panthor_sched.c:3089:2:
>> include/linux/compiler_types.h:510:45: error: call to '__compiletime_assert_435' declared with attribute error: min(ringbuf_size - start, size) signedness error
510 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^
include/linux/compiler_types.h:491:25: note: in definition of macro '__compiletime_assert'
491 | prefix ## suffix(); \
| ^~~~~~
include/linux/compiler_types.h:510:9: note: in expansion of macro '_compiletime_assert'
510 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^~~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
| ^~~~~~~~~~~~~~~~~~
include/linux/minmax.h:100:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
100 | BUILD_BUG_ON_MSG(!__types_ok(x,y,ux,uy), \
| ^~~~~~~~~~~~~~~~
include/linux/minmax.h:105:9: note: in expansion of macro '__careful_cmp_once'
105 | __careful_cmp_once(op, x, y, __UNIQUE_ID(x_), __UNIQUE_ID(y_))
| ^~~~~~~~~~~~~~~~~~
include/linux/minmax.h:129:25: note: in expansion of macro '__careful_cmp'
129 | #define min(x, y) __careful_cmp(min, x, y)
| ^~~~~~~~~~~~~
drivers/gpu/drm/panthor/panthor_sched.c:2882:19: note: in expansion of macro 'min'
2882 | written = min(ringbuf_size - start, size);
| ^~~
vim +/__compiletime_assert_435 +510 include/linux/compiler_types.h
eb5c2d4b45e3d2 Will Deacon 2020-07-21 496
eb5c2d4b45e3d2 Will Deacon 2020-07-21 497 #define _compiletime_assert(condition, msg, prefix, suffix) \
eb5c2d4b45e3d2 Will Deacon 2020-07-21 498 __compiletime_assert(condition, msg, prefix, suffix)
eb5c2d4b45e3d2 Will Deacon 2020-07-21 499
eb5c2d4b45e3d2 Will Deacon 2020-07-21 500 /**
eb5c2d4b45e3d2 Will Deacon 2020-07-21 501 * compiletime_assert - break build and emit msg if condition is false
eb5c2d4b45e3d2 Will Deacon 2020-07-21 502 * @condition: a compile-time constant condition to check
eb5c2d4b45e3d2 Will Deacon 2020-07-21 503 * @msg: a message to emit if condition is false
eb5c2d4b45e3d2 Will Deacon 2020-07-21 504 *
eb5c2d4b45e3d2 Will Deacon 2020-07-21 505 * In tradition of POSIX assert, this macro will break the build if the
eb5c2d4b45e3d2 Will Deacon 2020-07-21 506 * supplied condition is *false*, emitting the supplied error message if the
eb5c2d4b45e3d2 Will Deacon 2020-07-21 507 * compiler has support to do so.
eb5c2d4b45e3d2 Will Deacon 2020-07-21 508 */
eb5c2d4b45e3d2 Will Deacon 2020-07-21 509 #define compiletime_assert(condition, msg) \
eb5c2d4b45e3d2 Will Deacon 2020-07-21 @510 _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
eb5c2d4b45e3d2 Will Deacon 2020-07-21 511
* Re: [PATCH v5 2/4] drm/panthor: add DRM fdinfo support
2024-09-03 20:25 ` [PATCH v5 2/4] drm/panthor: add DRM fdinfo support Adrián Larumbe
2024-09-04 8:01 ` Boris Brezillon
@ 2024-09-04 17:45 ` kernel test robot
1 sibling, 0 replies; 15+ messages in thread
From: kernel test robot @ 2024-09-04 17:45 UTC (permalink / raw)
To: Adrián Larumbe, Boris Brezillon, Steven Price, Liviu Dudau,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
Daniel Vetter, Sumit Semwal, Christian König
Cc: oe-kbuild-all, kernel, Adrián Larumbe, dri-devel,
linux-kernel, linux-media, linaro-mm-sig
Hi Adrián,
kernel test robot noticed the following build warnings:
[auto build test WARNING on drm-misc/drm-misc-next]
[also build test WARNING on linus/master v6.11-rc6 next-20240904]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Adri-n-Larumbe/drm-panthor-introduce-job-cycle-and-timestamp-accounting/20240904-042645
base: git://anongit.freedesktop.org/drm/drm-misc drm-misc-next
patch link: https://lore.kernel.org/r/20240903202541.430225-3-adrian.larumbe%40collabora.com
patch subject: [PATCH v5 2/4] drm/panthor: add DRM fdinfo support
config: x86_64-buildonly-randconfig-002-20240904 (https://download.01.org/0day-ci/archive/20240905/202409050134.uxrIkhqc-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240905/202409050134.uxrIkhqc-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202409050134.uxrIkhqc-lkp@intel.com/
All warnings (new ones prefixed by >>):
drivers/gpu/drm/panthor/panthor_sched.c:322: warning: Excess struct member 'runnable' description in 'panthor_scheduler'
drivers/gpu/drm/panthor/panthor_sched.c:322: warning: Excess struct member 'idle' description in 'panthor_scheduler'
drivers/gpu/drm/panthor/panthor_sched.c:322: warning: Excess struct member 'waiting' description in 'panthor_scheduler'
drivers/gpu/drm/panthor/panthor_sched.c:322: warning: Excess struct member 'has_ref' description in 'panthor_scheduler'
drivers/gpu/drm/panthor/panthor_sched.c:322: warning: Excess struct member 'in_progress' description in 'panthor_scheduler'
drivers/gpu/drm/panthor/panthor_sched.c:322: warning: Excess struct member 'stopped_groups' description in 'panthor_scheduler'
drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'mem' description in 'panthor_queue'
drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'input' description in 'panthor_queue'
drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'output' description in 'panthor_queue'
drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'input_fw_va' description in 'panthor_queue'
drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'output_fw_va' description in 'panthor_queue'
drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'gpu_va' description in 'panthor_queue'
drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'ref' description in 'panthor_queue'
drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'gt' description in 'panthor_queue'
drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'sync64' description in 'panthor_queue'
drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'bo' description in 'panthor_queue'
drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'offset' description in 'panthor_queue'
drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'kmap' description in 'panthor_queue'
drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'lock' description in 'panthor_queue'
drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'id' description in 'panthor_queue'
drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'seqno' description in 'panthor_queue'
drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'last_fence' description in 'panthor_queue'
drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'in_flight_jobs' description in 'panthor_queue'
drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'slots' description in 'panthor_queue'
drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'slot_count' description in 'panthor_queue'
drivers/gpu/drm/panthor/panthor_sched.c:494: warning: Excess struct member 'profiling_seqno' description in 'panthor_queue'
>> drivers/gpu/drm/panthor/panthor_sched.c:689: warning: Excess struct member 'data' description in 'panthor_group'
>> drivers/gpu/drm/panthor/panthor_sched.c:689: warning: Excess struct member 'lock' description in 'panthor_group'
drivers/gpu/drm/panthor/panthor_sched.c:822: warning: Function parameter or struct member 'profiling_slot' not described in 'panthor_job'
drivers/gpu/drm/panthor/panthor_sched.c:822: warning: Excess struct member 'start' description in 'panthor_job'
drivers/gpu/drm/panthor/panthor_sched.c:822: warning: Excess struct member 'size' description in 'panthor_job'
drivers/gpu/drm/panthor/panthor_sched.c:822: warning: Excess struct member 'latest_flush' description in 'panthor_job'
drivers/gpu/drm/panthor/panthor_sched.c:822: warning: Excess struct member 'start' description in 'panthor_job'
drivers/gpu/drm/panthor/panthor_sched.c:822: warning: Excess struct member 'end' description in 'panthor_job'
drivers/gpu/drm/panthor/panthor_sched.c:822: warning: Excess struct member 'profile_slot' description in 'panthor_job'
drivers/gpu/drm/panthor/panthor_sched.c:1745: warning: Function parameter or struct member 'ptdev' not described in 'panthor_sched_report_fw_events'
drivers/gpu/drm/panthor/panthor_sched.c:1745: warning: Function parameter or struct member 'events' not described in 'panthor_sched_report_fw_events'
drivers/gpu/drm/panthor/panthor_sched.c:2637: warning: Function parameter or struct member 'ptdev' not described in 'panthor_sched_report_mmu_fault'
vim +689 drivers/gpu/drm/panthor/panthor_sched.c
de85488138247d0 Boris Brezillon 2024-02-29 531
de85488138247d0 Boris Brezillon 2024-02-29 532 /**
de85488138247d0 Boris Brezillon 2024-02-29 533 * struct panthor_group - Scheduling group object
de85488138247d0 Boris Brezillon 2024-02-29 534 */
de85488138247d0 Boris Brezillon 2024-02-29 535 struct panthor_group {
de85488138247d0 Boris Brezillon 2024-02-29 536 /** @refcount: Reference count */
de85488138247d0 Boris Brezillon 2024-02-29 537 struct kref refcount;
de85488138247d0 Boris Brezillon 2024-02-29 538
de85488138247d0 Boris Brezillon 2024-02-29 539 /** @ptdev: Device. */
de85488138247d0 Boris Brezillon 2024-02-29 540 struct panthor_device *ptdev;
de85488138247d0 Boris Brezillon 2024-02-29 541
de85488138247d0 Boris Brezillon 2024-02-29 542 /** @vm: VM bound to the group. */
de85488138247d0 Boris Brezillon 2024-02-29 543 struct panthor_vm *vm;
de85488138247d0 Boris Brezillon 2024-02-29 544
de85488138247d0 Boris Brezillon 2024-02-29 545 /** @compute_core_mask: Mask of shader cores that can be used for compute jobs. */
de85488138247d0 Boris Brezillon 2024-02-29 546 u64 compute_core_mask;
de85488138247d0 Boris Brezillon 2024-02-29 547
de85488138247d0 Boris Brezillon 2024-02-29 548 /** @fragment_core_mask: Mask of shader cores that can be used for fragment jobs. */
de85488138247d0 Boris Brezillon 2024-02-29 549 u64 fragment_core_mask;
de85488138247d0 Boris Brezillon 2024-02-29 550
de85488138247d0 Boris Brezillon 2024-02-29 551 /** @tiler_core_mask: Mask of tiler cores that can be used for tiler jobs. */
de85488138247d0 Boris Brezillon 2024-02-29 552 u64 tiler_core_mask;
de85488138247d0 Boris Brezillon 2024-02-29 553
de85488138247d0 Boris Brezillon 2024-02-29 554 /** @max_compute_cores: Maximum number of shader cores used for compute jobs. */
de85488138247d0 Boris Brezillon 2024-02-29 555 u8 max_compute_cores;
de85488138247d0 Boris Brezillon 2024-02-29 556
be7ffc821f5fc2e Liviu Dudau 2024-04-02 557 /** @max_fragment_cores: Maximum number of shader cores used for fragment jobs. */
de85488138247d0 Boris Brezillon 2024-02-29 558 u8 max_fragment_cores;
de85488138247d0 Boris Brezillon 2024-02-29 559
de85488138247d0 Boris Brezillon 2024-02-29 560 /** @max_tiler_cores: Maximum number of tiler cores used for tiler jobs. */
de85488138247d0 Boris Brezillon 2024-02-29 561 u8 max_tiler_cores;
de85488138247d0 Boris Brezillon 2024-02-29 562
de85488138247d0 Boris Brezillon 2024-02-29 563 /** @priority: Group priority (check panthor_csg_priority). */
de85488138247d0 Boris Brezillon 2024-02-29 564 u8 priority;
de85488138247d0 Boris Brezillon 2024-02-29 565
de85488138247d0 Boris Brezillon 2024-02-29 566 /** @blocked_queues: Bitmask reflecting the blocked queues. */
de85488138247d0 Boris Brezillon 2024-02-29 567 u32 blocked_queues;
de85488138247d0 Boris Brezillon 2024-02-29 568
de85488138247d0 Boris Brezillon 2024-02-29 569 /** @idle_queues: Bitmask reflecting the idle queues. */
de85488138247d0 Boris Brezillon 2024-02-29 570 u32 idle_queues;
de85488138247d0 Boris Brezillon 2024-02-29 571
de85488138247d0 Boris Brezillon 2024-02-29 572 /** @fatal_lock: Lock used to protect access to fatal fields. */
de85488138247d0 Boris Brezillon 2024-02-29 573 spinlock_t fatal_lock;
de85488138247d0 Boris Brezillon 2024-02-29 574
de85488138247d0 Boris Brezillon 2024-02-29 575 /** @fatal_queues: Bitmask reflecting the queues that hit a fatal exception. */
de85488138247d0 Boris Brezillon 2024-02-29 576 u32 fatal_queues;
de85488138247d0 Boris Brezillon 2024-02-29 577
de85488138247d0 Boris Brezillon 2024-02-29 578 /** @tiler_oom: Mask of queues that have a tiler OOM event to process. */
de85488138247d0 Boris Brezillon 2024-02-29 579 atomic_t tiler_oom;
de85488138247d0 Boris Brezillon 2024-02-29 580
de85488138247d0 Boris Brezillon 2024-02-29 581 /** @queue_count: Number of queues in this group. */
de85488138247d0 Boris Brezillon 2024-02-29 582 u32 queue_count;
de85488138247d0 Boris Brezillon 2024-02-29 583
de85488138247d0 Boris Brezillon 2024-02-29 584 /** @queues: Queues owned by this group. */
de85488138247d0 Boris Brezillon 2024-02-29 585 struct panthor_queue *queues[MAX_CS_PER_CSG];
de85488138247d0 Boris Brezillon 2024-02-29 586
de85488138247d0 Boris Brezillon 2024-02-29 587 /**
de85488138247d0 Boris Brezillon 2024-02-29 588 * @csg_id: ID of the FW group slot.
de85488138247d0 Boris Brezillon 2024-02-29 589 *
de85488138247d0 Boris Brezillon 2024-02-29 590 * -1 when the group is not scheduled/active.
de85488138247d0 Boris Brezillon 2024-02-29 591 */
de85488138247d0 Boris Brezillon 2024-02-29 592 int csg_id;
de85488138247d0 Boris Brezillon 2024-02-29 593
de85488138247d0 Boris Brezillon 2024-02-29 594 /**
de85488138247d0 Boris Brezillon 2024-02-29 595 * @destroyed: True when the group has been destroyed.
de85488138247d0 Boris Brezillon 2024-02-29 596 *
de85488138247d0 Boris Brezillon 2024-02-29 597 * If a group is destroyed it becomes useless: no further jobs can be submitted
de85488138247d0 Boris Brezillon 2024-02-29 598 * to its queues. We simply wait for all references to be dropped so we can
de85488138247d0 Boris Brezillon 2024-02-29 599 * release the group object.
de85488138247d0 Boris Brezillon 2024-02-29 600 */
de85488138247d0 Boris Brezillon 2024-02-29 601 bool destroyed;
de85488138247d0 Boris Brezillon 2024-02-29 602
de85488138247d0 Boris Brezillon 2024-02-29 603 /**
de85488138247d0 Boris Brezillon 2024-02-29 604 * @timedout: True when a timeout occurred on any of the queues owned by
de85488138247d0 Boris Brezillon 2024-02-29 605 * this group.
de85488138247d0 Boris Brezillon 2024-02-29 606 *
de85488138247d0 Boris Brezillon 2024-02-29 607 * Timeouts can be reported by drm_sched or by the FW. In any case, any
de85488138247d0 Boris Brezillon 2024-02-29 608 * timeout situation is unrecoverable, and the group becomes useless.
de85488138247d0 Boris Brezillon 2024-02-29 609 * We simply wait for all references to be dropped so we can release the
de85488138247d0 Boris Brezillon 2024-02-29 610 * group object.
de85488138247d0 Boris Brezillon 2024-02-29 611 */
de85488138247d0 Boris Brezillon 2024-02-29 612 bool timedout;
de85488138247d0 Boris Brezillon 2024-02-29 613
de85488138247d0 Boris Brezillon 2024-02-29 614 /**
de85488138247d0 Boris Brezillon 2024-02-29 615 * @syncobjs: Pool of per-queue synchronization objects.
de85488138247d0 Boris Brezillon 2024-02-29 616 *
de85488138247d0 Boris Brezillon 2024-02-29 617 * One sync object per queue. The position of the sync object is
de85488138247d0 Boris Brezillon 2024-02-29 618 * determined by the queue index.
de85488138247d0 Boris Brezillon 2024-02-29 619 */
de85488138247d0 Boris Brezillon 2024-02-29 620 struct panthor_kernel_bo *syncobjs;
de85488138247d0 Boris Brezillon 2024-02-29 621
d7baaf2591f58fc Adrián Larumbe 2024-09-03 622 /** @fdinfo: Per-file total cycle and timestamp values reference. */
d7baaf2591f58fc Adrián Larumbe 2024-09-03 623 struct {
d7baaf2591f58fc Adrián Larumbe 2024-09-03 624 /** @data: Pointer to actual per-file sample data. */
d7baaf2591f58fc Adrián Larumbe 2024-09-03 625 struct panthor_gpu_usage *data;
d7baaf2591f58fc Adrián Larumbe 2024-09-03 626
d7baaf2591f58fc Adrián Larumbe 2024-09-03 627 /**
d7baaf2591f58fc Adrián Larumbe 2024-09-03 628 * @lock: Mutex to govern concurrent access from drm file's fdinfo callback
d7baaf2591f58fc Adrián Larumbe 2024-09-03 629 * and job post-completion processing function
d7baaf2591f58fc Adrián Larumbe 2024-09-03 630 */
d7baaf2591f58fc Adrián Larumbe 2024-09-03 631 struct mutex lock;
d7baaf2591f58fc Adrián Larumbe 2024-09-03 632 } fdinfo;
d7baaf2591f58fc Adrián Larumbe 2024-09-03 633
de85488138247d0 Boris Brezillon 2024-02-29 634 /** @state: Group state. */
de85488138247d0 Boris Brezillon 2024-02-29 635 enum panthor_group_state state;
de85488138247d0 Boris Brezillon 2024-02-29 636
de85488138247d0 Boris Brezillon 2024-02-29 637 /**
de85488138247d0 Boris Brezillon 2024-02-29 638 * @suspend_buf: Suspend buffer.
de85488138247d0 Boris Brezillon 2024-02-29 639 *
de85488138247d0 Boris Brezillon 2024-02-29 640 * Stores the state of the group and its queues when a group is suspended.
de85488138247d0 Boris Brezillon 2024-02-29 641 * Used at resume time to restore the group in its previous state.
de85488138247d0 Boris Brezillon 2024-02-29 642 *
de85488138247d0 Boris Brezillon 2024-02-29 643 * The size of the suspend buffer is exposed through the FW interface.
de85488138247d0 Boris Brezillon 2024-02-29 644 */
de85488138247d0 Boris Brezillon 2024-02-29 645 struct panthor_kernel_bo *suspend_buf;
de85488138247d0 Boris Brezillon 2024-02-29 646
de85488138247d0 Boris Brezillon 2024-02-29 647 /**
de85488138247d0 Boris Brezillon 2024-02-29 648 * @protm_suspend_buf: Protection mode suspend buffer.
de85488138247d0 Boris Brezillon 2024-02-29 649 *
de85488138247d0 Boris Brezillon 2024-02-29 650 * Stores the state of the group and its queues when a group that's in
de85488138247d0 Boris Brezillon 2024-02-29 651 * protection mode is suspended.
de85488138247d0 Boris Brezillon 2024-02-29 652 *
de85488138247d0 Boris Brezillon 2024-02-29 653 * Used at resume time to restore the group in its previous state.
de85488138247d0 Boris Brezillon 2024-02-29 654 *
de85488138247d0 Boris Brezillon 2024-02-29 655 * The size of the protection mode suspend buffer is exposed through the
de85488138247d0 Boris Brezillon 2024-02-29 656 * FW interface.
de85488138247d0 Boris Brezillon 2024-02-29 657 */
de85488138247d0 Boris Brezillon 2024-02-29 658 struct panthor_kernel_bo *protm_suspend_buf;
de85488138247d0 Boris Brezillon 2024-02-29 659
de85488138247d0 Boris Brezillon 2024-02-29 660 /** @sync_upd_work: Work used to check/signal job fences. */
de85488138247d0 Boris Brezillon 2024-02-29 661 struct work_struct sync_upd_work;
de85488138247d0 Boris Brezillon 2024-02-29 662
de85488138247d0 Boris Brezillon 2024-02-29 663 /** @tiler_oom_work: Work used to process tiler OOM events happening on this group. */
de85488138247d0 Boris Brezillon 2024-02-29 664 struct work_struct tiler_oom_work;
de85488138247d0 Boris Brezillon 2024-02-29 665
de85488138247d0 Boris Brezillon 2024-02-29 666 /** @term_work: Work used to finish the group termination procedure. */
de85488138247d0 Boris Brezillon 2024-02-29 667 struct work_struct term_work;
de85488138247d0 Boris Brezillon 2024-02-29 668
de85488138247d0 Boris Brezillon 2024-02-29 669 /**
de85488138247d0 Boris Brezillon 2024-02-29 670 * @release_work: Work used to release group resources.
de85488138247d0 Boris Brezillon 2024-02-29 671 *
de85488138247d0 Boris Brezillon 2024-02-29 672 * We need to postpone the group release to avoid a deadlock when
de85488138247d0 Boris Brezillon 2024-02-29 673 * the last ref is released in the tick work.
de85488138247d0 Boris Brezillon 2024-02-29 674 */
de85488138247d0 Boris Brezillon 2024-02-29 675 struct work_struct release_work;
de85488138247d0 Boris Brezillon 2024-02-29 676
de85488138247d0 Boris Brezillon 2024-02-29 677 /**
de85488138247d0 Boris Brezillon 2024-02-29 678 * @run_node: Node used to insert the group in the
de85488138247d0 Boris Brezillon 2024-02-29 679 * panthor_group::groups::{runnable,idle} and
de85488138247d0 Boris Brezillon 2024-02-29 680 * panthor_group::reset.stopped_groups lists.
de85488138247d0 Boris Brezillon 2024-02-29 681 */
de85488138247d0 Boris Brezillon 2024-02-29 682 struct list_head run_node;
de85488138247d0 Boris Brezillon 2024-02-29 683
de85488138247d0 Boris Brezillon 2024-02-29 684 /**
de85488138247d0 Boris Brezillon 2024-02-29 685 * @wait_node: Node used to insert the group in the
de85488138247d0 Boris Brezillon 2024-02-29 686 * panthor_group::groups::waiting list.
de85488138247d0 Boris Brezillon 2024-02-29 687 */
de85488138247d0 Boris Brezillon 2024-02-29 688 struct list_head wait_node;
de85488138247d0 Boris Brezillon 2024-02-29 @689 };
de85488138247d0 Boris Brezillon 2024-02-29 690
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v5 1/4] drm/panthor: introduce job cycle and timestamp accounting
2024-09-03 20:25 ` [PATCH v5 1/4] drm/panthor: introduce job cycle and timestamp accounting Adrián Larumbe
` (2 preceding siblings ...)
2024-09-04 17:02 ` kernel test robot
@ 2024-09-04 18:55 ` Liviu Dudau
3 siblings, 0 replies; 15+ messages in thread
From: Liviu Dudau @ 2024-09-04 18:55 UTC (permalink / raw)
To: Adrián Larumbe
Cc: Boris Brezillon, Steven Price, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, David Airlie, Daniel Vetter, Sumit Semwal,
Christian König, kernel, dri-devel, linux-kernel,
linux-media, linaro-mm-sig
On Tue, Sep 03, 2024 at 09:25:35PM +0100, Adrián Larumbe wrote:
> Enable calculations of job submission times in clock cycles and wall
> time. This is done by expanding the boilerplate command stream when running
> a job to include instructions that compute said times right before an after
s/an/and/
> a user CS.
>
> A separate kernel BO is created per queue to store those values. Jobs can
> access their sampled data through a slots buffer-specific index different
s/slots/slot's/ ?
> from that of the queue's ringbuffer. The reason for this is saving memory
> on the profiling information kernel BO, since the amount of simultaneous
> profiled jobs we can write into the queue's ringbuffer might be much
> smaller than for regular jobs, as the former take more CSF instructions.
>
> This commit is done in preparation for enabling DRM fdinfo support in the
> Panthor driver, which depends on the numbers calculated herein.
>
> A profile mode mask has been added that will in a future commit allow UM to
> toggle performance metric sampling behaviour, which is disabled by default
> to save power. When a ringbuffer CS is constructed, timestamp and cycling
> sampling instructions are added depending on the enabled flags in the
> profiling mask.
>
> A helper was provided that calculates the number of instructions for a
> given set of enablement mask, and these are passed as the number of credits
> when initialising a DRM scheduler job.
>
> Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
> ---
> drivers/gpu/drm/panthor/panthor_device.h | 22 ++
> drivers/gpu/drm/panthor/panthor_sched.c | 327 ++++++++++++++++++++---
> 2 files changed, 305 insertions(+), 44 deletions(-)
>
> diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> index e388c0472ba7..a48e30d0af30 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.h
> +++ b/drivers/gpu/drm/panthor/panthor_device.h
> @@ -66,6 +66,25 @@ struct panthor_irq {
> atomic_t suspended;
> };
>
> +/**
> + * enum panthor_device_profiling_mode - Profiling state
> + */
> +enum panthor_device_profiling_flags {
> + /** @PANTHOR_DEVICE_PROFILING_DISABLED: Profiling is disabled. */
> + PANTHOR_DEVICE_PROFILING_DISABLED = 0,
> +
> + /** @PANTHOR_DEVICE_PROFILING_CYCLES: Sampling job cycles. */
> + PANTHOR_DEVICE_PROFILING_CYCLES = BIT(0),
> +
> + /** @PANTHOR_DEVICE_PROFILING_TIMESTAMP: Sampling job timestamp. */
> + PANTHOR_DEVICE_PROFILING_TIMESTAMP = BIT(1),
> +
> + /** @PANTHOR_DEVICE_PROFILING_ALL: Sampling everything. */
> + PANTHOR_DEVICE_PROFILING_ALL =
> + PANTHOR_DEVICE_PROFILING_CYCLES |
> + PANTHOR_DEVICE_PROFILING_TIMESTAMP,
> +};
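As a standalone sketch of how these flags compose (with the kernel's BIT() macro expanded, and a hypothetical instr_enabled() helper mirroring the mask test used later in prepare_job_instrs()):

```c
#include <assert.h>
#include <stdint.h>

/* Standalone mirror of the profiling flags; BIT(n) is the kernel macro (1u << n). */
#define BIT(n) (1u << (n))

enum panthor_device_profiling_flags {
	PANTHOR_DEVICE_PROFILING_DISABLED  = 0,
	PANTHOR_DEVICE_PROFILING_CYCLES    = BIT(0),
	PANTHOR_DEVICE_PROFILING_TIMESTAMP = BIT(1),
	PANTHOR_DEVICE_PROFILING_ALL       =
		PANTHOR_DEVICE_PROFILING_CYCLES |
		PANTHOR_DEVICE_PROFILING_TIMESTAMP,
};

/* An instruction is emitted when its mask is 0 (always emitted) or when it
 * overlaps the device-wide profiling mask. */
static int instr_enabled(uint32_t instr_mask, uint32_t dev_mask)
{
	return !instr_mask || (instr_mask & dev_mask);
}
```

Since each metric is its own bit, userspace can enable cycle sampling without timestamp sampling and vice versa.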
> +
> /**
> * struct panthor_device - Panthor device
> */
> @@ -162,6 +181,9 @@ struct panthor_device {
> */
> struct page *dummy_latest_flush;
> } pm;
> +
> + /** @profile_mask: User-set profiling flags for job accounting. */
> + u32 profile_mask;
> };
>
> /**
> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> index c426a392b081..b087648bf59a 100644
> --- a/drivers/gpu/drm/panthor/panthor_sched.c
> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> @@ -93,6 +93,9 @@
> #define MIN_CSGS 3
> #define MAX_CSG_PRIO 0xf
>
> +#define NUM_INSTRS_PER_CACHE_LINE (64 / sizeof(u64))
> +#define MAX_INSTRS_PER_JOB 32
> +
> struct panthor_group;
>
> /**
> @@ -476,6 +479,18 @@ struct panthor_queue {
> */
> struct list_head in_flight_jobs;
> } fence_ctx;
> +
> + /** @profiling_info: Job profiling data slots and access information. */
> + struct {
> + /** @slots: Kernel BO holding the slots. */
> + struct panthor_kernel_bo *slots;
> +
> + /** @slot_count: Number of jobs ringbuffer can hold at once. */
> + u32 slot_count;
> +
> + /** @profiling_seqno: Index of the next available profiling information slot. */
> + u32 profiling_seqno;
> + } profiling_info;
> };
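For reference, the way profiling_seqno hands out slots (as done later in queue_run_job()) can be modelled in isolation — a hypothetical standalone sketch, not the driver code:

```c
#include <assert.h>
#include <stdint.h>

/* Minimal model of the per-queue profiling slot allocator: slots are handed
 * out round-robin and the sequence number wraps at slot_count. */
struct profiling_info {
	uint32_t slot_count;
	uint32_t profiling_seqno;
};

static uint32_t get_profiling_slot(struct profiling_info *pi)
{
	uint32_t slot = pi->profiling_seqno++;

	if (pi->profiling_seqno == pi->slot_count)
		pi->profiling_seqno = 0;
	return slot;
}
```

Overwriting old slots is safe because a slot cannot be reused before its job has left the ringbuffer, which holds fewer profiled jobs than there are slots.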
>
> /**
> @@ -661,6 +676,18 @@ struct panthor_group {
> struct list_head wait_node;
> };
>
> +struct panthor_job_profiling_data {
> + struct {
> + u64 before;
> + u64 after;
> + } cycles;
> +
> + struct {
> + u64 before;
> + u64 after;
> + } time;
> +};
> +
> /**
> * group_queue_work() - Queue a group work
> * @group: Group to queue the work for.
> @@ -774,6 +801,12 @@ struct panthor_job {
>
> /** @done_fence: Fence signaled when the job is finished or cancelled. */
> struct dma_fence *done_fence;
> +
> + /** @profile_mask: Current device job profiling enablement bitmask. */
> + u32 profile_mask;
> +
> + /** @profile_slot: Job profiling information index in the profiling slots BO. */
> + u32 profiling_slot;
> };
>
> static void
> @@ -838,6 +871,7 @@ static void group_free_queue(struct panthor_group *group, struct panthor_queue *
>
> panthor_kernel_bo_destroy(queue->ringbuf);
> panthor_kernel_bo_destroy(queue->iface.mem);
> + panthor_kernel_bo_destroy(queue->profiling_info.slots);
>
> /* Release the last_fence we were holding, if any. */
> dma_fence_put(queue->fence_ctx.last_fence);
> @@ -1982,8 +2016,6 @@ tick_ctx_init(struct panthor_scheduler *sched,
> }
> }
>
> -#define NUM_INSTRS_PER_SLOT 16
> -
> static void
> group_term_post_processing(struct panthor_group *group)
> {
> @@ -2815,65 +2847,211 @@ static void group_sync_upd_work(struct work_struct *work)
> group_put(group);
> }
>
> -static struct dma_fence *
> -queue_run_job(struct drm_sched_job *sched_job)
> +struct panthor_job_ringbuf_instrs {
> + u64 buffer[MAX_INSTRS_PER_JOB];
> + u32 count;
> +};
> +
> +struct panthor_job_instr {
> + u32 profile_mask;
> + u64 instr;
> +};
> +
> +#define JOB_INSTR(__prof, __instr) \
> + { \
> + .profile_mask = __prof, \
> + .instr = __instr, \
> + }
> +
> +static void
> +copy_instrs_to_ringbuf(struct panthor_queue *queue,
> + struct panthor_job *job,
> + struct panthor_job_ringbuf_instrs *instrs)
> +{
> + ssize_t ringbuf_size = panthor_kernel_bo_size(queue->ringbuf);
> + u32 start = job->ringbuf.start & (ringbuf_size - 1);
> + ssize_t size, written;
> +
> + /*
> + * We need to write a whole slot, including any trailing zeroes
> + * that may come at the end of it. Also, because instrs.buffer had
s/had/has/
> + * been zero-initialised, there's no need to pad it with 0's
> + */
> + instrs->count = ALIGN(instrs->count, NUM_INSTRS_PER_CACHE_LINE);
> + size = instrs->count * sizeof(u64);
> + written = min(ringbuf_size - start, size);
> +
> + memcpy(queue->ringbuf->kmap + start, instrs->buffer, written);
> +
> + if (written < size)
> + memcpy(queue->ringbuf->kmap,
> + &instrs->buffer[(ringbuf_size - start)/sizeof(u64)],
> + size - written);
> +}
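The split memcpy above relies on the ringbuffer size being a power of two. A standalone model of the wraparound (hypothetical names, word-granular rather than byte-granular for brevity):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of copy_instrs_to_ringbuf(): ring_words is a power of two, so
 * `start & (ring_words - 1)` wraps the insert offset; anything that does not
 * fit before the end of the ring lands at the buffer head. */
static void ring_copy(uint64_t *ring, size_t ring_words,
		      uint64_t start_word, const uint64_t *src, size_t n)
{
	size_t start = start_word & (ring_words - 1);
	size_t first = n < ring_words - start ? n : ring_words - start;

	memcpy(ring + start, src, first * sizeof(*src));
	if (first < n)
		memcpy(ring, src + first, (n - first) * sizeof(*src));
}
```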
> +
> +struct panthor_job_cs_params {
> + u32 profile_mask;
> + u64 addr_reg; u64 val_reg;
> + u64 cycle_reg; u64 time_reg;
> + u64 sync_addr; u64 times_addr;
> + u64 cs_start; u64 cs_size;
> + u32 last_flush; u32 waitall_mask;
> +};
> +
> +static void
> +get_job_cs_params(struct panthor_job *job, struct panthor_job_cs_params *params)
> {
> - struct panthor_job *job = container_of(sched_job, struct panthor_job, base);
> struct panthor_group *group = job->group;
> struct panthor_queue *queue = group->queues[job->queue_idx];
> struct panthor_device *ptdev = group->ptdev;
> struct panthor_scheduler *sched = ptdev->scheduler;
> - u32 ringbuf_size = panthor_kernel_bo_size(queue->ringbuf);
> - u32 ringbuf_insert = queue->iface.input->insert & (ringbuf_size - 1);
> - u64 addr_reg = ptdev->csif_info.cs_reg_count -
> - ptdev->csif_info.unpreserved_cs_reg_count;
> - u64 val_reg = addr_reg + 2;
> - u64 sync_addr = panthor_kernel_bo_gpuva(group->syncobjs) +
> - job->queue_idx * sizeof(struct panthor_syncobj_64b);
> - u32 waitall_mask = GENMASK(sched->sb_slot_count - 1, 0);
> - struct dma_fence *done_fence;
> - int ret;
>
> - u64 call_instrs[NUM_INSTRS_PER_SLOT] = {
> + params->addr_reg = ptdev->csif_info.cs_reg_count -
> + ptdev->csif_info.unpreserved_cs_reg_count;
> + params->val_reg = params->addr_reg + 2;
> + params->cycle_reg = params->addr_reg;
> + params->time_reg = params->val_reg;
> +
> + params->sync_addr = panthor_kernel_bo_gpuva(group->syncobjs) +
> + job->queue_idx * sizeof(struct panthor_syncobj_64b);
> + params->times_addr = panthor_kernel_bo_gpuva(queue->profiling_info.slots) +
> + (job->profiling_slot * sizeof(struct panthor_job_profiling_data));
> + params->waitall_mask = GENMASK(sched->sb_slot_count - 1, 0);
> +
> + params->cs_start = job->call_info.start;
> + params->cs_size = job->call_info.size;
> + params->last_flush = job->call_info.latest_flush;
> +
> + params->profile_mask = job->profile_mask;
> +}
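The times_addr computation indexes the profiling BO by slot; each slot holds four u64 samples (cycles/time, before/after). A minimal standalone sketch of that arithmetic, under the layout defined by panthor_job_profiling_data:

```c
#include <assert.h>
#include <stdint.h>

/* Standalone mirror of the per-slot sample layout. */
struct job_profiling_data {
	struct { uint64_t before, after; } cycles;
	struct { uint64_t before, after; } time;
};

/* GPU VA of a job's sample slot, given the BO base and the job's slot index. */
static uint64_t times_addr(uint64_t slots_gpuva, uint32_t profiling_slot)
{
	return slots_gpuva +
	       (uint64_t)profiling_slot * sizeof(struct job_profiling_data);
}
```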
> +
> +static void
> +prepare_job_instrs(const struct panthor_job_cs_params *params,
> + struct panthor_job_ringbuf_instrs *instrs)
> +{
> + const struct panthor_job_instr instr_seq[] = {
> /* MOV32 rX+2, cs.latest_flush */
> - (2ull << 56) | (val_reg << 48) | job->call_info.latest_flush,
> + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
> + (2ull << 56) | (params->val_reg << 48) | params->last_flush),
>
> /* FLUSH_CACHE2.clean_inv_all.no_wait.signal(0) rX+2 */
> - (36ull << 56) | (0ull << 48) | (val_reg << 40) | (0 << 16) | 0x233,
> + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
> + (36ull << 56) | (0ull << 48) | (params->val_reg << 40) | (0 << 16) | 0x233),
> +
> + /* MOV48 rX:rX+1, cycles_offset */
> + JOB_INSTR(PANTHOR_DEVICE_PROFILING_CYCLES,
> + (1ull << 56) | (params->cycle_reg << 48) |
> + (params->times_addr +
> + offsetof(struct panthor_job_profiling_data, cycles.before))),
> + /* STORE_STATE cycles */
> + JOB_INSTR(PANTHOR_DEVICE_PROFILING_CYCLES,
> + (40ull << 56) | (params->cycle_reg << 40) | (1ll << 32)),
> + /* MOV48 rX:rX+1, time_offset */
> + JOB_INSTR(PANTHOR_DEVICE_PROFILING_TIMESTAMP,
> + (1ull << 56) | (params->time_reg << 48) |
> + (params->times_addr +
> + offsetof(struct panthor_job_profiling_data, time.before))),
> + /* STORE_STATE timer */
> + JOB_INSTR(PANTHOR_DEVICE_PROFILING_TIMESTAMP,
> + (40ull << 56) | (params->time_reg << 40) | (0ll << 32)),
>
> /* MOV48 rX:rX+1, cs.start */
> - (1ull << 56) | (addr_reg << 48) | job->call_info.start,
> -
> + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
> + (1ull << 56) | (params->addr_reg << 48) | params->cs_start),
> /* MOV32 rX+2, cs.size */
> - (2ull << 56) | (val_reg << 48) | job->call_info.size,
> -
> + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
> + (2ull << 56) | (params->val_reg << 48) | params->cs_size),
> /* WAIT(0) => waits for FLUSH_CACHE2 instruction */
> - (3ull << 56) | (1 << 16),
> -
> + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED, (3ull << 56) | (1 << 16)),
> /* CALL rX:rX+1, rX+2 */
> - (32ull << 56) | (addr_reg << 40) | (val_reg << 32),
> + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
> + (32ull << 56) | (params->addr_reg << 40) | (params->val_reg << 32)),
> +
> + /* MOV48 rX:rX+1, cycles_offset */
> + JOB_INSTR(PANTHOR_DEVICE_PROFILING_CYCLES,
> + (1ull << 56) | (params->cycle_reg << 48) |
> + (params->times_addr +
> + offsetof(struct panthor_job_profiling_data, cycles.after))),
> + /* STORE_STATE cycles */
> + JOB_INSTR(PANTHOR_DEVICE_PROFILING_CYCLES,
> + (40ull << 56) | (params->cycle_reg << 40) | (1ll << 32)),
> +
> + /* MOV48 rX:rX+1, time_offset */
> + JOB_INSTR(PANTHOR_DEVICE_PROFILING_TIMESTAMP,
> + (1ull << 56) | (params->time_reg << 48) |
> + (params->times_addr +
> + offsetof(struct panthor_job_profiling_data, time.after))),
> + /* STORE_STATE timer */
> + JOB_INSTR(PANTHOR_DEVICE_PROFILING_TIMESTAMP,
> + (40ull << 56) | (params->time_reg << 40) | (0ll << 32)),
>
> /* MOV48 rX:rX+1, sync_addr */
> - (1ull << 56) | (addr_reg << 48) | sync_addr,
> -
> + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
> + (1ull << 56) | (params->addr_reg << 48) | params->sync_addr),
> /* MOV48 rX+2, #1 */
> - (1ull << 56) | (val_reg << 48) | 1,
> -
> + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
> + (1ull << 56) | (params->val_reg << 48) | 1),
> /* WAIT(all) */
> - (3ull << 56) | (waitall_mask << 16),
> -
> + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
> + (3ull << 56) | (params->waitall_mask << 16)),
> /* SYNC_ADD64.system_scope.propage_err.nowait rX:rX+1, rX+2*/
> - (51ull << 56) | (0ull << 48) | (addr_reg << 40) | (val_reg << 32) | (0 << 16) | 1,
> -
> + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
> + (51ull << 56) | (0ull << 48) | (params->addr_reg << 40) |
> + (params->val_reg << 32) | (0 << 16) | 1),
> /* ERROR_BARRIER, so we can recover from faults at job
> * boundaries.
> */
> - (47ull << 56),
> + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED, (47ull << 56)),
> + };
> + u32 pad;
> +
> + /* NEED to be cacheline aligned to please the prefetcher. */
> + static_assert(sizeof(instrs->buffer) % 64 == 0,
> + "panthor_job_ringbuf_instrs::buffer is not aligned on a cacheline");
> +
> + /* Make sure we have enough storage to store the whole sequence. */
> + static_assert(ALIGN(ARRAY_SIZE(instr_seq), NUM_INSTRS_PER_CACHE_LINE) <=
> + ARRAY_SIZE(instrs->buffer),
> + "instr_seq vs panthor_job_ringbuf_instrs::buffer size mismatch");
> +
> + for (u32 i = 0; i < ARRAY_SIZE(instr_seq); i++) {
> + /* If the profile mask of this instruction is not enabled, skip it. */
> + if (instr_seq[i].profile_mask &&
> + !(instr_seq[i].profile_mask & params->profile_mask))
> + continue;
> +
> + instrs->buffer[instrs->count++] = instr_seq[i].instr;
> + }
> +
> + pad = ALIGN(instrs->count, NUM_INSTRS_PER_CACHE_LINE);
> + memset(&instrs->buffer[instrs->count], 0,
> + (pad - instrs->count) * sizeof(instrs->buffer[0]));
> + instrs->count = pad;
> +}
> +
> +static u32 calc_job_credits(u32 profile_mask)
> +{
> + struct panthor_job_ringbuf_instrs instrs;
> + struct panthor_job_cs_params params = {
> + .profile_mask = profile_mask,
> };
>
> - /* Need to be cacheline aligned to please the prefetcher. */
> - static_assert(sizeof(call_instrs) % 64 == 0,
> - "call_instrs is not aligned on a cacheline");
> + prepare_job_instrs(¶ms, &instrs);
> + return instrs.count;
> +}
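To make the credit accounting concrete: the sequence above has 11 always-emitted instructions, 4 gated on CYCLES and 4 on TIMESTAMP, and the selected count is padded up to a whole cache line of u64s. A standalone sketch of that filter-and-pad logic (ALIGN_UP is a stand-in for the kernel's ALIGN()):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_INSTRS_PER_CACHE_LINE (64 / sizeof(uint64_t))
#define ALIGN_UP(x, a) (((x) + (a) - 1) / (a) * (a))

/* Count the instructions whose profile mask is enabled (mask 0 == always
 * emitted), then pad the total to a cache line, as prepare_job_instrs()
 * does; the result is the job's credit count. */
static uint32_t calc_credits(const uint32_t *instr_masks, size_t n,
			     uint32_t profile_mask)
{
	uint32_t count = 0;

	for (size_t i = 0; i < n; i++)
		if (!instr_masks[i] || (instr_masks[i] & profile_mask))
			count++;
	return ALIGN_UP(count, NUM_INSTRS_PER_CACHE_LINE);
}
```

With all metrics off a job costs 16 credits (11 padded to two cache lines); with both on it costs 24 (19 padded to three).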
> +
> +static struct dma_fence *
> +queue_run_job(struct drm_sched_job *sched_job)
> +{
> + struct panthor_job *job = container_of(sched_job, struct panthor_job, base);
> + struct panthor_group *group = job->group;
> + struct panthor_queue *queue = group->queues[job->queue_idx];
> + struct panthor_device *ptdev = group->ptdev;
> + struct panthor_scheduler *sched = ptdev->scheduler;
> + struct panthor_job_ringbuf_instrs instrs;
> + struct panthor_job_cs_params cs_params;
> + struct dma_fence *done_fence;
> + int ret;
>
> /* Stream size is zero, nothing to do except making sure all previously
> * submitted jobs are done before we signal the
> @@ -2900,17 +3078,23 @@ queue_run_job(struct drm_sched_job *sched_job)
> queue->fence_ctx.id,
> atomic64_inc_return(&queue->fence_ctx.seqno));
>
> - memcpy(queue->ringbuf->kmap + ringbuf_insert,
> - call_instrs, sizeof(call_instrs));
> + job->profiling_slot = queue->profiling_info.profiling_seqno++;
> + if (queue->profiling_info.profiling_seqno == queue->profiling_info.slot_count)
> + queue->profiling_info.profiling_seqno = 0;
> +
> + job->ringbuf.start = queue->iface.input->insert;
> +
> + get_job_cs_params(job, &cs_params);
> + prepare_job_instrs(&cs_params, &instrs);
> + copy_instrs_to_ringbuf(queue, job, &instrs);
> +
> + job->ringbuf.end = job->ringbuf.start + (instrs.count * sizeof(u64));
>
> panthor_job_get(&job->base);
> spin_lock(&queue->fence_ctx.lock);
> list_add_tail(&job->node, &queue->fence_ctx.in_flight_jobs);
> spin_unlock(&queue->fence_ctx.lock);
>
> - job->ringbuf.start = queue->iface.input->insert;
> - job->ringbuf.end = job->ringbuf.start + sizeof(call_instrs);
> -
> /* Make sure the ring buffer is updated before the INSERT
> * register.
> */
> @@ -3003,6 +3187,24 @@ static const struct drm_sched_backend_ops panthor_queue_sched_ops = {
> .free_job = queue_free_job,
> };
>
> +static u32 calc_profiling_ringbuf_num_slots(struct panthor_device *ptdev,
> + u32 cs_ringbuf_size)
> +{
> + u32 min_profiled_job_instrs = U32_MAX;
> + u32 last_flag = fls(PANTHOR_DEVICE_PROFILING_ALL);
> +
> + for (u32 i = 0; i < last_flag; i++) {
> + if (BIT(i) & PANTHOR_DEVICE_PROFILING_ALL)
> + min_profiled_job_instrs =
> + min(min_profiled_job_instrs, calc_job_credits(BIT(i)));
> + }
> +
> + drm_WARN_ON(&ptdev->base,
> + !IS_ALIGNED(min_profiled_job_instrs, NUM_INSTRS_PER_CACHE_LINE));
> +
> + return DIV_ROUND_UP(cs_ringbuf_size, min_profiled_job_instrs * sizeof(u64));
> +}
> +
> static struct panthor_queue *
> group_create_queue(struct panthor_group *group,
> const struct drm_panthor_queue_create *args)
> @@ -3056,9 +3258,38 @@ group_create_queue(struct panthor_group *group,
> goto err_free_queue;
> }
>
> + queue->profiling_info.slot_count =
> + calc_profiling_ringbuf_num_slots(group->ptdev, args->ringbuf_size);
> +
> + queue->profiling_info.slots =
> + panthor_kernel_bo_create(group->ptdev, group->vm,
> + queue->profiling_info.slot_count *
> + sizeof(struct panthor_job_profiling_data),
> + DRM_PANTHOR_BO_NO_MMAP,
> + DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
> + DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
> + PANTHOR_VM_KERNEL_AUTO_VA);
> +
> + if (IS_ERR(queue->profiling_info.slots)) {
> + ret = PTR_ERR(queue->profiling_info.slots);
> + goto err_free_queue;
> + }
> +
> + ret = panthor_kernel_bo_vmap(queue->profiling_info.slots);
> + if (ret)
> + goto err_free_queue;
> +
> + memset(queue->profiling_info.slots->kmap, 0,
> + queue->profiling_info.slot_count * sizeof(struct panthor_job_profiling_data));
> +
> + /*
> + * Credit limit argument tells us the total number of instructions
> + * across all CS slots in the ringbuffer, with some jobs requiring
> + * twice as many as others, depending on their profiling status.
> + */
> ret = drm_sched_init(&queue->scheduler, &panthor_queue_sched_ops,
> group->ptdev->scheduler->wq, 1,
> - args->ringbuf_size / (NUM_INSTRS_PER_SLOT * sizeof(u64)),
> + args->ringbuf_size / sizeof(u64),
> 0, msecs_to_jiffies(JOB_TIMEOUT_MS),
> group->ptdev->reset.wq,
> NULL, "panthor-queue", group->ptdev->base.dev);
> @@ -3354,6 +3585,7 @@ panthor_job_create(struct panthor_file *pfile,
> {
> struct panthor_group_pool *gpool = pfile->groups;
> struct panthor_job *job;
> + u32 credits;
> int ret;
>
> if (qsubmit->pad)
> @@ -3407,9 +3639,16 @@ panthor_job_create(struct panthor_file *pfile,
> }
> }
>
> + job->profile_mask = pfile->ptdev->profile_mask;
> + credits = calc_job_credits(job->profile_mask);
> + if (credits == 0) {
> + ret = -EINVAL;
> + goto err_put_job;
> + }
> +
> ret = drm_sched_job_init(&job->base,
> &job->group->queues[job->queue_idx]->entity,
> - 1, job->group);
> + credits, job->group);
> if (ret)
> goto err_put_job;
>
> --
> 2.46.0
>
Reviewed-by: Liviu Dudau <liviu.dudau@arm.com>
Best regards,
Liviu
--
====================
| I would like to |
| fix the world, |
| but they're not |
| giving me the |
\ source code! /
---------------
¯\_(ツ)_/¯
* Re: [PATCH v5 1/4] drm/panthor: introduce job cycle and timestamp accounting
2024-09-04 7:49 ` Boris Brezillon
@ 2024-09-12 15:03 ` Adrián Larumbe
2024-09-12 15:10 ` Boris Brezillon
0 siblings, 1 reply; 15+ messages in thread
From: Adrián Larumbe @ 2024-09-12 15:03 UTC (permalink / raw)
To: Boris Brezillon
Cc: Steven Price, Liviu Dudau, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, David Airlie, Daniel Vetter, Sumit Semwal,
Christian König, kernel, dri-devel, linux-kernel,
linux-media, linaro-mm-sig
Hi Boris, thanks for the remarks,
On 04.09.2024 09:49, Boris Brezillon wrote:
> On Tue, 3 Sep 2024 21:25:35 +0100
> Adrián Larumbe <adrian.larumbe@collabora.com> wrote:
>
> > Enable calculations of job submission times in clock cycles and wall
> > time. This is done by expanding the boilerplate command stream when running
> > a job to include instructions that compute said times right before and after
> > a user CS.
> >
> > A separate kernel BO is created per queue to store those values. Jobs can
> > access their sampled data through an index specific to the slots buffer,
> > distinct from the queue ringbuffer's. The reason for this is to save memory
> > in the profiling information kernel BO, since the number of simultaneously
> > profiled jobs we can write into the queue's ringbuffer might be much
> > smaller than for regular jobs, as the former take more CSF instructions.
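As an aside, the decoupled slot index can be modelled in plain C. The sketch below is a standalone userspace illustration of the allocation done in queue_run_job(); the struct and function names are made up, not the driver's:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the per-queue profiling slot allocator:
 * the slot seqno wraps on its own, independently of the ringbuffer
 * insert pointer, because the profiling BO may hold fewer slots
 * than the ringbuffer holds jobs. */
struct profiling_alloc {
	uint32_t slot_count; /* slots in the profiling BO */
	uint32_t seqno;      /* next slot to hand out */
};

static uint32_t grab_profiling_slot(struct profiling_alloc *p)
{
	uint32_t slot = p->seqno++;

	if (p->seqno == p->slot_count)
		p->seqno = 0;
	return slot;
}
```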
> >
> > This commit is done in preparation for enabling DRM fdinfo support in the
> > Panthor driver, which depends on the numbers calculated herein.
> >
> > A profile mode mask has been added that will in a future commit allow UM to
> > toggle performance metric sampling behaviour, which is disabled by default
> > to save power. When a ringbuffer CS is constructed, timestamp and cycle
> > sampling instructions are added depending on the enabled flags in the
> > profiling mask.
> >
> > A helper was provided that calculates the number of instructions for a
> > given enablement mask, and this is passed as the number of credits
> > when initialising a DRM scheduler job.
> >
> > Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
> > ---
> > drivers/gpu/drm/panthor/panthor_device.h | 22 ++
> > drivers/gpu/drm/panthor/panthor_sched.c | 327 ++++++++++++++++++++---
> > 2 files changed, 305 insertions(+), 44 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> > index e388c0472ba7..a48e30d0af30 100644
> > --- a/drivers/gpu/drm/panthor/panthor_device.h
> > +++ b/drivers/gpu/drm/panthor/panthor_device.h
> > @@ -66,6 +66,25 @@ struct panthor_irq {
> > atomic_t suspended;
> > };
> >
> > +/**
> > + * enum panthor_device_profiling_mode - Profiling state
> > + */
> > +enum panthor_device_profiling_flags {
> > + /** @PANTHOR_DEVICE_PROFILING_DISABLED: Profiling is disabled. */
> > + PANTHOR_DEVICE_PROFILING_DISABLED = 0,
> > +
> > + /** @PANTHOR_DEVICE_PROFILING_CYCLES: Sampling job cycles. */
> > + PANTHOR_DEVICE_PROFILING_CYCLES = BIT(0),
> > +
> > + /** @PANTHOR_DEVICE_PROFILING_TIMESTAMP: Sampling job timestamp. */
> > + PANTHOR_DEVICE_PROFILING_TIMESTAMP = BIT(1),
> > +
> > + /** @PANTHOR_DEVICE_PROFILING_ALL: Sampling everything. */
> > + PANTHOR_DEVICE_PROFILING_ALL =
> > + PANTHOR_DEVICE_PROFILING_CYCLES |
> > + PANTHOR_DEVICE_PROFILING_TIMESTAMP,
> > +};
> > +
> > /**
> > * struct panthor_device - Panthor device
> > */
> > @@ -162,6 +181,9 @@ struct panthor_device {
> > */
> > struct page *dummy_latest_flush;
> > } pm;
> > +
> > + /** @profile_mask: User-set profiling flags for job accounting. */
> > + u32 profile_mask;
> > };
> >
> > /**
> > diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> > index c426a392b081..b087648bf59a 100644
> > --- a/drivers/gpu/drm/panthor/panthor_sched.c
> > +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> > @@ -93,6 +93,9 @@
> > #define MIN_CSGS 3
> > #define MAX_CSG_PRIO 0xf
> >
> > +#define NUM_INSTRS_PER_CACHE_LINE (64 / sizeof(u64))
> > +#define MAX_INSTRS_PER_JOB 32
> > +
> > struct panthor_group;
> >
> > /**
> > @@ -476,6 +479,18 @@ struct panthor_queue {
> > */
> > struct list_head in_flight_jobs;
> > } fence_ctx;
> > +
> > + /** @profiling_info: Job profiling data slots and access information. */
> > + struct {
> > + /** @slots: Kernel BO holding the slots. */
> > + struct panthor_kernel_bo *slots;
> > +
> > + /** @slot_count: Number of jobs ringbuffer can hold at once. */
> > + u32 slot_count;
> > +
> > + /** @profiling_seqno: Index of the next available profiling information slot. */
> > + u32 profiling_seqno;
>
> Nit: no need to repeat profiling as it's under the profiling_info
> struct. I would simply name that one 'seqno'.
>
> > + } profiling_info;
>
> s/profiling_info/profiling/ ?
>
> > };
> >
> > /**
> > @@ -661,6 +676,18 @@ struct panthor_group {
> > struct list_head wait_node;
> > };
> >
> > +struct panthor_job_profiling_data {
> > + struct {
> > + u64 before;
> > + u64 after;
> > + } cycles;
> > +
> > + struct {
> > + u64 before;
> > + u64 after;
> > + } time;
> > +};
> > +
> > /**
> > * group_queue_work() - Queue a group work
> > * @group: Group to queue the work for.
> > @@ -774,6 +801,12 @@ struct panthor_job {
> >
> > /** @done_fence: Fence signaled when the job is finished or cancelled. */
> > struct dma_fence *done_fence;
> > +
> > + /** @profile_mask: Current device job profiling enablement bitmask. */
> > + u32 profile_mask;
> > +
> > + /** @profile_slot: Job profiling information index in the profiling slots BO. */
> > + u32 profiling_slot;
>
> Nit: we tend to group fields together under sub-structs, so I'd say:
>
> struct {
> u32 mask; // or flags
> u32 slot;
> } profiling;
>
> > };
> >
> > static void
> > @@ -838,6 +871,7 @@ static void group_free_queue(struct panthor_group *group, struct panthor_queue *
> >
> > panthor_kernel_bo_destroy(queue->ringbuf);
> > panthor_kernel_bo_destroy(queue->iface.mem);
> > + panthor_kernel_bo_destroy(queue->profiling_info.slots);
> >
> > /* Release the last_fence we were holding, if any. */
> > dma_fence_put(queue->fence_ctx.last_fence);
> > @@ -1982,8 +2016,6 @@ tick_ctx_init(struct panthor_scheduler *sched,
> > }
> > }
> >
> > -#define NUM_INSTRS_PER_SLOT 16
> > -
> > static void
> > group_term_post_processing(struct panthor_group *group)
> > {
> > @@ -2815,65 +2847,211 @@ static void group_sync_upd_work(struct work_struct *work)
> > group_put(group);
> > }
> >
> > -static struct dma_fence *
> > -queue_run_job(struct drm_sched_job *sched_job)
> > +struct panthor_job_ringbuf_instrs {
> > + u64 buffer[MAX_INSTRS_PER_JOB];
> > + u32 count;
> > +};
> > +
> > +struct panthor_job_instr {
> > + u32 profile_mask;
> > + u64 instr;
> > +};
> > +
> > +#define JOB_INSTR(__prof, __instr) \
> > + { \
> > + .profile_mask = __prof, \
> > + .instr = __instr, \
> > + }
> > +
> > +static void
> > +copy_instrs_to_ringbuf(struct panthor_queue *queue,
> > + struct panthor_job *job,
> > + struct panthor_job_ringbuf_instrs *instrs)
> > +{
> > + ssize_t ringbuf_size = panthor_kernel_bo_size(queue->ringbuf);
> > + u32 start = job->ringbuf.start & (ringbuf_size - 1);
> > + ssize_t size, written;
> > +
> > + /*
> > + * We need to write a whole slot, including any trailing zeroes
> > + * that may come at the end of it. Also, because instrs.buffer had
> > + * been zero-initialised, there's no need to pad it with 0's
> > + */
> > + instrs->count = ALIGN(instrs->count, NUM_INSTRS_PER_CACHE_LINE);
> > + size = instrs->count * sizeof(u64);
> > + written = min(ringbuf_size - start, size);
> > +
> > + memcpy(queue->ringbuf->kmap + start, instrs->buffer, written);
> > +
> > + if (written < size)
> > + memcpy(queue->ringbuf->kmap,
> > + &instrs->buffer[(ringbuf_size - start)/sizeof(u64)],
> > + size - written);
> > +}
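As a side note, the wrap-around copy above is easy to exercise in isolation. The sketch below restates the split memcpy() as standalone userspace C over a hypothetical 8-word ring (power-of-two sized, as the real ringbuffer is); it is an illustration, not the driver code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RING_WORDS 8 /* hypothetical; must be a power of two */

/* Mirrors the copy in copy_instrs_to_ringbuf(): write 'count' u64
 * words starting at the byte offset derived from 'insert', splitting
 * the copy in two when the slot straddles the end of the ring. */
static void ring_copy(uint64_t ring[RING_WORDS], uint64_t insert,
		      const uint64_t *src, size_t count)
{
	size_t ring_size = RING_WORDS * sizeof(uint64_t);
	size_t start = (size_t)(insert * sizeof(uint64_t)) & (ring_size - 1);
	size_t size = count * sizeof(uint64_t);
	size_t written = size < ring_size - start ? size : ring_size - start;

	memcpy((char *)ring + start, src, written);
	if (written < size)
		memcpy(ring, src + written / sizeof(uint64_t), size - written);
}
```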
> > +
> > +struct panthor_job_cs_params {
> > + u32 profile_mask;
> > + u64 addr_reg; u64 val_reg;
> > + u64 cycle_reg; u64 time_reg;
> > + u64 sync_addr; u64 times_addr;
> > + u64 cs_start; u64 cs_size;
> > + u32 last_flush; u32 waitall_mask;
> > +};
> > +
> > +static void
> > +get_job_cs_params(struct panthor_job *job, struct panthor_job_cs_params *params)
> > {
> > - struct panthor_job *job = container_of(sched_job, struct panthor_job, base);
> > struct panthor_group *group = job->group;
> > struct panthor_queue *queue = group->queues[job->queue_idx];
> > struct panthor_device *ptdev = group->ptdev;
> > struct panthor_scheduler *sched = ptdev->scheduler;
> > - u32 ringbuf_size = panthor_kernel_bo_size(queue->ringbuf);
> > - u32 ringbuf_insert = queue->iface.input->insert & (ringbuf_size - 1);
> > - u64 addr_reg = ptdev->csif_info.cs_reg_count -
> > - ptdev->csif_info.unpreserved_cs_reg_count;
> > - u64 val_reg = addr_reg + 2;
> > - u64 sync_addr = panthor_kernel_bo_gpuva(group->syncobjs) +
> > - job->queue_idx * sizeof(struct panthor_syncobj_64b);
> > - u32 waitall_mask = GENMASK(sched->sb_slot_count - 1, 0);
> > - struct dma_fence *done_fence;
> > - int ret;
> >
> > - u64 call_instrs[NUM_INSTRS_PER_SLOT] = {
> > + params->addr_reg = ptdev->csif_info.cs_reg_count -
> > + ptdev->csif_info.unpreserved_cs_reg_count;
> > + params->val_reg = params->addr_reg + 2;
> > + params->cycle_reg = params->addr_reg;
> > + params->time_reg = params->val_reg;
> > +
> > + params->sync_addr = panthor_kernel_bo_gpuva(group->syncobjs) +
> > + job->queue_idx * sizeof(struct panthor_syncobj_64b);
> > + params->times_addr = panthor_kernel_bo_gpuva(queue->profiling_info.slots) +
> > + (job->profiling_slot * sizeof(struct panthor_job_profiling_data));
> > + params->waitall_mask = GENMASK(sched->sb_slot_count - 1, 0);
> > +
> > + params->cs_start = job->call_info.start;
> > + params->cs_size = job->call_info.size;
> > + params->last_flush = job->call_info.latest_flush;
> > +
> > + params->profile_mask = job->profile_mask;
> > +}
> > +
> > +static void
> > +prepare_job_instrs(const struct panthor_job_cs_params *params,
> > + struct panthor_job_ringbuf_instrs *instrs)
> > +{
> > + const struct panthor_job_instr instr_seq[] = {
> > /* MOV32 rX+2, cs.latest_flush */
> > - (2ull << 56) | (val_reg << 48) | job->call_info.latest_flush,
> > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
> > + (2ull << 56) | (params->val_reg << 48) | params->last_flush),
> >
> > /* FLUSH_CACHE2.clean_inv_all.no_wait.signal(0) rX+2 */
> > - (36ull << 56) | (0ull << 48) | (val_reg << 40) | (0 << 16) | 0x233,
> > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
> > + (36ull << 56) | (0ull << 48) | (params->val_reg << 40) | (0 << 16) | 0x233),
> > +
> > + /* MOV48 rX:rX+1, cycles_offset */
> > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_CYCLES,
> > + (1ull << 56) | (params->cycle_reg << 48) |
> > + (params->times_addr +
> > + offsetof(struct panthor_job_profiling_data, cycles.before))),
> > + /* STORE_STATE cycles */
> > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_CYCLES,
> > + (40ull << 56) | (params->cycle_reg << 40) | (1ll << 32)),
> > + /* MOV48 rX:rX+1, time_offset */
> > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_TIMESTAMP,
> > + (1ull << 56) | (params->time_reg << 48) |
> > + (params->times_addr +
> > + offsetof(struct panthor_job_profiling_data, time.before))),
> > + /* STORE_STATE timer */
> > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_TIMESTAMP,
> > + (40ull << 56) | (params->time_reg << 40) | (0ll << 32)),
> >
> > /* MOV48 rX:rX+1, cs.start */
> > - (1ull << 56) | (addr_reg << 48) | job->call_info.start,
> > -
> > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
> > + (1ull << 56) | (params->addr_reg << 48) | params->cs_start),
> > /* MOV32 rX+2, cs.size */
> > - (2ull << 56) | (val_reg << 48) | job->call_info.size,
> > -
> > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
> > + (2ull << 56) | (params->val_reg << 48) | params->cs_size),
> > /* WAIT(0) => waits for FLUSH_CACHE2 instruction */
> > - (3ull << 56) | (1 << 16),
> > -
> > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED, (3ull << 56) | (1 << 16)),
> > /* CALL rX:rX+1, rX+2 */
> > - (32ull << 56) | (addr_reg << 40) | (val_reg << 32),
> > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
> > + (32ull << 56) | (params->addr_reg << 40) | (params->val_reg << 32)),
> > +
> > + /* MOV48 rX:rX+1, cycles_offset */
> > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_CYCLES,
> > + (1ull << 56) | (params->cycle_reg << 48) |
> > + (params->times_addr +
> > + offsetof(struct panthor_job_profiling_data, cycles.after))),
> > + /* STORE_STATE cycles */
> > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_CYCLES,
> > + (40ull << 56) | (params->cycle_reg << 40) | (1ll << 32)),
> > +
> > + /* MOV48 rX:rX+1, time_offset */
> > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_TIMESTAMP,
> > + (1ull << 56) | (params->time_reg << 48) |
> > + (params->times_addr +
> > + offsetof(struct panthor_job_profiling_data, time.after))),
> > + /* STORE_STATE timer */
> > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_TIMESTAMP,
> > + (40ull << 56) | (params->time_reg << 40) | (0ll << 32)),
> >
> > /* MOV48 rX:rX+1, sync_addr */
> > - (1ull << 56) | (addr_reg << 48) | sync_addr,
> > -
> > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
> > + (1ull << 56) | (params->addr_reg << 48) | params->sync_addr),
> > /* MOV48 rX+2, #1 */
> > - (1ull << 56) | (val_reg << 48) | 1,
> > -
> > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
> > + (1ull << 56) | (params->val_reg << 48) | 1),
> > /* WAIT(all) */
> > - (3ull << 56) | (waitall_mask << 16),
> > -
> > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
> > + (3ull << 56) | (params->waitall_mask << 16)),
> > /* SYNC_ADD64.system_scope.propage_err.nowait rX:rX+1, rX+2*/
> > - (51ull << 56) | (0ull << 48) | (addr_reg << 40) | (val_reg << 32) | (0 << 16) | 1,
> > -
> > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
> > + (51ull << 56) | (0ull << 48) | (params->addr_reg << 40) |
> > + (params->val_reg << 32) | (0 << 16) | 1),
> > /* ERROR_BARRIER, so we can recover from faults at job
> > * boundaries.
> > */
> > - (47ull << 56),
> > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED, (47ull << 56)),
> > + };
> > + u32 pad;
> > +
> > + /* NEED to be cacheline aligned to please the prefetcher. */
> > + static_assert(sizeof(instrs->buffer) % 64 == 0,
> > + "panthor_job_ringbuf_instrs::buffer is not aligned on a cacheline");
> > +
> > + /* Make sure we have enough storage to store the whole sequence. */
> > + static_assert(ALIGN(ARRAY_SIZE(instr_seq), NUM_INSTRS_PER_CACHE_LINE) <=
> > + ARRAY_SIZE(instrs->buffer),
> > + "instr_seq vs panthor_job_ringbuf_instrs::buffer size mismatch");
>
> We probably want to catch situations where instrs->buffer has gone
> bigger than needed (say we found a way to drop instructions), so I
> would turn the '<=' condition into an '=='.
I did this, but it triggers the static assert: the instr_seq array has 19 elements,
which when aligned to NUM_INSTRS_PER_CACHE_LINE (8, I think) yields 24, still less than the
32 slots in instrs->buffer. Having a slightly oversized instrs.buffer array isn't a problem
though, because instrs.count is what gets used when copying the instructions into the
ringbuffer, so I think leaving this assert as an <= should be alright.
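For the record, the arithmetic is easy to double-check with a userspace restatement of ALIGN(). The macro below mimics the kernel one for power-of-two alignments; the constants are copied from the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Userspace re-statement of the kernel's ALIGN() for powers of two. */
#define ALIGN_POW2(x, a) (((x) + (a) - 1) & ~((a) - 1))

/* Constants from the patch under review. */
#define NUM_INSTRS_PER_CACHE_LINE (64 / sizeof(uint64_t)) /* 8 */
#define MAX_INSTRS_PER_JOB 32
```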
> > +
> > + for (u32 i = 0; i < ARRAY_SIZE(instr_seq); i++) {
> > + /* If the profile mask of this instruction is not enabled, skip it. */
> > + if (instr_seq[i].profile_mask &&
> > + !(instr_seq[i].profile_mask & params->profile_mask))
> > + continue;
> > +
> > + instrs->buffer[instrs->count++] = instr_seq[i].instr;
> > + }
> > +
> > + pad = ALIGN(instrs->count, NUM_INSTRS_PER_CACHE_LINE);
> > + memset(&instrs->buffer[instrs->count], 0,
> > + (pad - instrs->count) * sizeof(instrs->buffer[0]));
> > + instrs->count = pad;
> > +}
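The filter-then-pad behaviour can be modelled outside the kernel as well. In this sketch the instruction words are arbitrary placeholders; the entries carry the same (mask, instr) pairing as JOB_INSTR in the patch, and a mask of 0 marks an unconditional instruction:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INSTRS_PER_LINE 8 /* 64 bytes / sizeof(u64) */

struct job_instr {
	uint32_t profile_mask; /* 0 means: always emitted */
	uint64_t instr;
};

/* Mirror of the emission loop in prepare_job_instrs(): keep an entry
 * when it is unconditional or its mask intersects the enabled mask,
 * then zero-pad the count up to a whole cache line. */
static uint32_t emit_instrs(const struct job_instr *seq, size_t n,
			    uint32_t enabled, uint64_t *out)
{
	uint32_t count = 0, pad;

	for (size_t i = 0; i < n; i++) {
		if (seq[i].profile_mask && !(seq[i].profile_mask & enabled))
			continue;
		out[count++] = seq[i].instr;
	}
	pad = (count + INSTRS_PER_LINE - 1) & ~(INSTRS_PER_LINE - 1u);
	while (count < pad)
		out[count++] = 0;
	return count;
}
```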
> > +
> > +static u32 calc_job_credits(u32 profile_mask)
> > +{
> > + struct panthor_job_ringbuf_instrs instrs;
> > + struct panthor_job_cs_params params = {
> > + .profile_mask = profile_mask,
> > };
> >
> > - /* Need to be cacheline aligned to please the prefetcher. */
> > - static_assert(sizeof(call_instrs) % 64 == 0,
> > - "call_instrs is not aligned on a cacheline");
> > > + prepare_job_instrs(&params, &instrs);
> > + return instrs.count;
> > +}
> > +
> > +static struct dma_fence *
> > +queue_run_job(struct drm_sched_job *sched_job)
> > +{
> > + struct panthor_job *job = container_of(sched_job, struct panthor_job, base);
> > + struct panthor_group *group = job->group;
> > + struct panthor_queue *queue = group->queues[job->queue_idx];
> > + struct panthor_device *ptdev = group->ptdev;
> > + struct panthor_scheduler *sched = ptdev->scheduler;
> > + struct panthor_job_ringbuf_instrs instrs;
> > + struct panthor_job_cs_params cs_params;
> > + struct dma_fence *done_fence;
> > + int ret;
> >
> > /* Stream size is zero, nothing to do except making sure all previously
> > * submitted jobs are done before we signal the
> > @@ -2900,17 +3078,23 @@ queue_run_job(struct drm_sched_job *sched_job)
> > queue->fence_ctx.id,
> > atomic64_inc_return(&queue->fence_ctx.seqno));
> >
> > - memcpy(queue->ringbuf->kmap + ringbuf_insert,
> > - call_instrs, sizeof(call_instrs));
> > + job->profiling_slot = queue->profiling_info.profiling_seqno++;
> > + if (queue->profiling_info.profiling_seqno == queue->profiling_info.slot_count)
> > + queue->profiling_info.profiling_seqno = 0;
> > +
> > + job->ringbuf.start = queue->iface.input->insert;
> > +
> > + get_job_cs_params(job, &cs_params);
> > + prepare_job_instrs(&cs_params, &instrs);
> > + copy_instrs_to_ringbuf(queue, job, &instrs);
> > +
> > + job->ringbuf.end = job->ringbuf.start + (instrs.count * sizeof(u64));
> >
> > panthor_job_get(&job->base);
> > spin_lock(&queue->fence_ctx.lock);
> > list_add_tail(&job->node, &queue->fence_ctx.in_flight_jobs);
> > spin_unlock(&queue->fence_ctx.lock);
> >
> > - job->ringbuf.start = queue->iface.input->insert;
> > - job->ringbuf.end = job->ringbuf.start + sizeof(call_instrs);
> > -
> > /* Make sure the ring buffer is updated before the INSERT
> > * register.
> > */
> > @@ -3003,6 +3187,24 @@ static const struct drm_sched_backend_ops panthor_queue_sched_ops = {
> > .free_job = queue_free_job,
> > };
> >
> > +static u32 calc_profiling_ringbuf_num_slots(struct panthor_device *ptdev,
> > + u32 cs_ringbuf_size)
> > +{
> > + u32 min_profiled_job_instrs = U32_MAX;
> > + u32 last_flag = fls(PANTHOR_DEVICE_PROFILING_ALL);
> > +
> > + for (u32 i = 0; i < last_flag; i++) {
> > + if (BIT(i) & PANTHOR_DEVICE_PROFILING_ALL)
> > + min_profiled_job_instrs =
> > + min(min_profiled_job_instrs, calc_job_credits(BIT(i)));
> > + }
>
> Okay, I think this loop deserves an explanation. The goal is to
> calculate the minimal size of a profile job so we can deduce the
> maximum number of profiling slots that will be used simultaneously. We
> ignore PANTHOR_DEVICE_PROFILING_DISABLED, because those jobs won't use
> a profiling slot in the first place.
>
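For reference, the computation described here can be sketched as follows. This is a standalone userspace illustration, not the driver code: fake_job_credits() stands in for calc_job_credits() with assumed per-mask instruction counts, already padded to a whole cache line of 8 u64 words:

```c
#include <assert.h>
#include <stdint.h>

#define PROFILING_CYCLES    (1u << 0)
#define PROFILING_TIMESTAMP (1u << 1)
#define PROFILING_ALL       (PROFILING_CYCLES | PROFILING_TIMESTAMP)

/* Stand-in for calc_job_credits(): hypothetical instruction counts. */
static uint32_t fake_job_credits(uint32_t mask)
{
	return mask == PROFILING_ALL ? 24 : 16;
}

/* Mirror of calc_profiling_ringbuf_num_slots(): the smallest profiled
 * job bounds how many profiling slots can be in flight at once, so
 * divide the CS ringbuffer size by that job's footprint, rounding up. */
static uint32_t num_profiling_slots(uint32_t cs_ringbuf_size)
{
	uint32_t min_instrs = UINT32_MAX;

	for (uint32_t bit = 1; bit <= PROFILING_ALL; bit <<= 1) {
		if (bit & PROFILING_ALL) {
			uint32_t c = fake_job_credits(bit);

			if (c < min_instrs)
				min_instrs = c;
		}
	}
	/* DIV_ROUND_UP(cs_ringbuf_size, min_instrs * sizeof(u64)) */
	return (cs_ringbuf_size + min_instrs * 8 - 1) / (min_instrs * 8);
}
```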
> > +
> > + drm_WARN_ON(&ptdev->base,
> > + !IS_ALIGNED(min_profiled_job_instrs, NUM_INSTRS_PER_CACHE_LINE));
>
> We can probably drop this WARN_ON(), it's supposed to be checked in
> calc_job_credits().
>
> > +
> > + return DIV_ROUND_UP(cs_ringbuf_size, min_profiled_job_instrs * sizeof(u64));
> > +}
> > +
> > static struct panthor_queue *
> > group_create_queue(struct panthor_group *group,
> > const struct drm_panthor_queue_create *args)
> > @@ -3056,9 +3258,38 @@ group_create_queue(struct panthor_group *group,
> > goto err_free_queue;
> > }
> >
> > + queue->profiling_info.slot_count =
> > + calc_profiling_ringbuf_num_slots(group->ptdev, args->ringbuf_size);
> > +
> > + queue->profiling_info.slots =
> > + panthor_kernel_bo_create(group->ptdev, group->vm,
> > + queue->profiling_info.slot_count *
> > + sizeof(struct panthor_job_profiling_data),
> > + DRM_PANTHOR_BO_NO_MMAP,
> > + DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
> > + DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
> > + PANTHOR_VM_KERNEL_AUTO_VA);
> > +
> > + if (IS_ERR(queue->profiling_info.slots)) {
> > + ret = PTR_ERR(queue->profiling_info.slots);
> > + goto err_free_queue;
> > + }
> > +
> > + ret = panthor_kernel_bo_vmap(queue->profiling_info.slots);
> > + if (ret)
> > + goto err_free_queue;
> > +
> > + memset(queue->profiling_info.slots->kmap, 0,
> > + queue->profiling_info.slot_count * sizeof(struct panthor_job_profiling_data));
>
> I don't think we need to memset() the profiling buffer.
>
> > +
> > + /*
> > + * Credit limit argument tells us the total number of instructions
> > + * across all CS slots in the ringbuffer, with some jobs requiring
> > + * twice as many as others, depending on their profiling status.
> > + */
> > ret = drm_sched_init(&queue->scheduler, &panthor_queue_sched_ops,
> > group->ptdev->scheduler->wq, 1,
> > - args->ringbuf_size / (NUM_INSTRS_PER_SLOT * sizeof(u64)),
> > + args->ringbuf_size / sizeof(u64),
> > 0, msecs_to_jiffies(JOB_TIMEOUT_MS),
> > group->ptdev->reset.wq,
> > NULL, "panthor-queue", group->ptdev->base.dev);
> > @@ -3354,6 +3585,7 @@ panthor_job_create(struct panthor_file *pfile,
> > {
> > struct panthor_group_pool *gpool = pfile->groups;
> > struct panthor_job *job;
> > + u32 credits;
> > int ret;
> >
> > if (qsubmit->pad)
> > @@ -3407,9 +3639,16 @@ panthor_job_create(struct panthor_file *pfile,
> > }
> > }
> >
> > + job->profile_mask = pfile->ptdev->profile_mask;
> > + credits = calc_job_credits(job->profile_mask);
> > + if (credits == 0) {
> > + ret = -EINVAL;
> > + goto err_put_job;
> > + }
> > +
> > ret = drm_sched_job_init(&job->base,
> > &job->group->queues[job->queue_idx]->entity,
> > - 1, job->group);
> > + credits, job->group);
> > if (ret)
> > goto err_put_job;
> >
>
> Just add a bunch of minor comments (mostly cosmetic changes), but the
> implementation looks good to me.
>
> Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
Adrian Larumbe
* Re: [PATCH v5 1/4] drm/panthor: introduce job cycle and timestamp accounting
2024-09-12 15:03 ` Adrián Larumbe
@ 2024-09-12 15:10 ` Boris Brezillon
0 siblings, 0 replies; 15+ messages in thread
From: Boris Brezillon @ 2024-09-12 15:10 UTC (permalink / raw)
To: Adrián Larumbe
Cc: Steven Price, Liviu Dudau, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, David Airlie, Daniel Vetter, Sumit Semwal,
Christian König, kernel, dri-devel, linux-kernel,
linux-media, linaro-mm-sig
On Thu, 12 Sep 2024 16:03:37 +0100
Adrián Larumbe <adrian.larumbe@collabora.com> wrote:
> Hi Boris, thanks for the remarks,
>
> On 04.09.2024 09:49, Boris Brezillon wrote:
> > On Tue, 3 Sep 2024 21:25:35 +0100
> > Adrián Larumbe <adrian.larumbe@collabora.com> wrote:
> >
> > > Enable calculations of job submission times in clock cycles and wall
> > > time. This is done by expanding the boilerplate command stream when running
> > > a job to include instructions that compute said times right before and after
> > > a user CS.
> > >
> > > A separate kernel BO is created per queue to store those values. Jobs can
> > > access their sampled data through an index specific to the slots buffer,
> > > distinct from the queue ringbuffer's. The reason for this is to save memory
> > > in the profiling information kernel BO, since the number of simultaneously
> > > profiled jobs we can write into the queue's ringbuffer might be much
> > > smaller than for regular jobs, as the former take more CSF instructions.
> > >
> > > This commit is done in preparation for enabling DRM fdinfo support in the
> > > Panthor driver, which depends on the numbers calculated herein.
> > >
> > > A profile mode mask has been added that will in a future commit allow UM to
> > > toggle performance metric sampling behaviour, which is disabled by default
> > > to save power. When a ringbuffer CS is constructed, timestamp and cycle
> > > sampling instructions are added depending on the enabled flags in the
> > > profiling mask.
> > >
> > > A helper was provided that calculates the number of instructions for a
> > > given enablement mask, and this is passed as the number of credits
> > > when initialising a DRM scheduler job.
> > >
> > > Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
> > > ---
> > > drivers/gpu/drm/panthor/panthor_device.h | 22 ++
> > > drivers/gpu/drm/panthor/panthor_sched.c | 327 ++++++++++++++++++++---
> > > 2 files changed, 305 insertions(+), 44 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> > > index e388c0472ba7..a48e30d0af30 100644
> > > --- a/drivers/gpu/drm/panthor/panthor_device.h
> > > +++ b/drivers/gpu/drm/panthor/panthor_device.h
> > > @@ -66,6 +66,25 @@ struct panthor_irq {
> > > atomic_t suspended;
> > > };
> > >
> > > +/**
> > > + * enum panthor_device_profiling_mode - Profiling state
> > > + */
> > > +enum panthor_device_profiling_flags {
> > > + /** @PANTHOR_DEVICE_PROFILING_DISABLED: Profiling is disabled. */
> > > + PANTHOR_DEVICE_PROFILING_DISABLED = 0,
> > > +
> > > + /** @PANTHOR_DEVICE_PROFILING_CYCLES: Sampling job cycles. */
> > > + PANTHOR_DEVICE_PROFILING_CYCLES = BIT(0),
> > > +
> > > + /** @PANTHOR_DEVICE_PROFILING_TIMESTAMP: Sampling job timestamp. */
> > > + PANTHOR_DEVICE_PROFILING_TIMESTAMP = BIT(1),
> > > +
> > > + /** @PANTHOR_DEVICE_PROFILING_ALL: Sampling everything. */
> > > + PANTHOR_DEVICE_PROFILING_ALL =
> > > + PANTHOR_DEVICE_PROFILING_CYCLES |
> > > + PANTHOR_DEVICE_PROFILING_TIMESTAMP,
> > > +};
> > > +
> > > /**
> > > * struct panthor_device - Panthor device
> > > */
> > > @@ -162,6 +181,9 @@ struct panthor_device {
> > > */
> > > struct page *dummy_latest_flush;
> > > } pm;
> > > +
> > > + /** @profile_mask: User-set profiling flags for job accounting. */
> > > + u32 profile_mask;
> > > };
> > >
> > > /**
> > > diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> > > index c426a392b081..b087648bf59a 100644
> > > --- a/drivers/gpu/drm/panthor/panthor_sched.c
> > > +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> > > @@ -93,6 +93,9 @@
> > > #define MIN_CSGS 3
> > > #define MAX_CSG_PRIO 0xf
> > >
> > > +#define NUM_INSTRS_PER_CACHE_LINE (64 / sizeof(u64))
> > > +#define MAX_INSTRS_PER_JOB 32
> > > +
> > > struct panthor_group;
> > >
> > > /**
> > > @@ -476,6 +479,18 @@ struct panthor_queue {
> > > */
> > > struct list_head in_flight_jobs;
> > > } fence_ctx;
> > > +
> > > + /** @profiling_info: Job profiling data slots and access information. */
> > > + struct {
> > > + /** @slots: Kernel BO holding the slots. */
> > > + struct panthor_kernel_bo *slots;
> > > +
> > > + /** @slot_count: Number of jobs ringbuffer can hold at once. */
> > > + u32 slot_count;
> > > +
> > > + /** @profiling_seqno: Index of the next available profiling information slot. */
> > > + u32 profiling_seqno;
> >
> > Nit: no need to repeat profiling as it's under the profiling_info
> > struct. I would simply name that one 'seqno'.
> >
> > > + } profiling_info;
> >
> > s/profiling_info/profiling/ ?
> >
> > > };
> > >
> > > /**
> > > @@ -661,6 +676,18 @@ struct panthor_group {
> > > struct list_head wait_node;
> > > };
> > >
> > > +struct panthor_job_profiling_data {
> > > + struct {
> > > + u64 before;
> > > + u64 after;
> > > + } cycles;
> > > +
> > > + struct {
> > > + u64 before;
> > > + u64 after;
> > > + } time;
> > > +};
> > > +
> > > /**
> > > * group_queue_work() - Queue a group work
> > > * @group: Group to queue the work for.
> > > @@ -774,6 +801,12 @@ struct panthor_job {
> > >
> > > /** @done_fence: Fence signaled when the job is finished or cancelled. */
> > > struct dma_fence *done_fence;
> > > +
> > > + /** @profile_mask: Current device job profiling enablement bitmask. */
> > > + u32 profile_mask;
> > > +
> > > + /** @profile_slot: Job profiling information index in the profiling slots BO. */
> > > + u32 profiling_slot;
> >
> > Nit: we tend to group fields together under sub-structs, so I'd say:
> >
> > struct {
> > u32 mask; // or flags
> > u32 slot;
> > } profiling;
> >
> > > };
> > >
> > > static void
> > > @@ -838,6 +871,7 @@ static void group_free_queue(struct panthor_group *group, struct panthor_queue *
> > >
> > > panthor_kernel_bo_destroy(queue->ringbuf);
> > > panthor_kernel_bo_destroy(queue->iface.mem);
> > > + panthor_kernel_bo_destroy(queue->profiling_info.slots);
> > >
> > > /* Release the last_fence we were holding, if any. */
> > > dma_fence_put(queue->fence_ctx.last_fence);
> > > @@ -1982,8 +2016,6 @@ tick_ctx_init(struct panthor_scheduler *sched,
> > > }
> > > }
> > >
> > > -#define NUM_INSTRS_PER_SLOT 16
> > > -
> > > static void
> > > group_term_post_processing(struct panthor_group *group)
> > > {
> > > @@ -2815,65 +2847,211 @@ static void group_sync_upd_work(struct work_struct *work)
> > > group_put(group);
> > > }
> > >
> > > -static struct dma_fence *
> > > -queue_run_job(struct drm_sched_job *sched_job)
> > > +struct panthor_job_ringbuf_instrs {
> > > + u64 buffer[MAX_INSTRS_PER_JOB];
> > > + u32 count;
> > > +};
> > > +
> > > +struct panthor_job_instr {
> > > + u32 profile_mask;
> > > + u64 instr;
> > > +};
> > > +
> > > +#define JOB_INSTR(__prof, __instr) \
> > > + { \
> > > + .profile_mask = __prof, \
> > > + .instr = __instr, \
> > > + }
> > > +
> > > +static void
> > > +copy_instrs_to_ringbuf(struct panthor_queue *queue,
> > > + struct panthor_job *job,
> > > + struct panthor_job_ringbuf_instrs *instrs)
> > > +{
> > > + ssize_t ringbuf_size = panthor_kernel_bo_size(queue->ringbuf);
> > > + u32 start = job->ringbuf.start & (ringbuf_size - 1);
> > > + ssize_t size, written;
> > > +
> > > + /*
> > > + * We need to write a whole slot, including any trailing zeroes
> > > + * that may come at the end of it. Also, because instrs.buffer had
> > > + * been zero-initialised, there's no need to pad it with 0's
> > > + */
> > > + instrs->count = ALIGN(instrs->count, NUM_INSTRS_PER_CACHE_LINE);
> > > + size = instrs->count * sizeof(u64);
> > > + written = min(ringbuf_size - start, size);
> > > +
> > > + memcpy(queue->ringbuf->kmap + start, instrs->buffer, written);
> > > +
> > > + if (written < size)
> > > + memcpy(queue->ringbuf->kmap,
> > > + &instrs->buffer[(ringbuf_size - start)/sizeof(u64)],
> > > + size - written);
> > > +}
> > > +
> > > +struct panthor_job_cs_params {
> > > + u32 profile_mask;
> > > + u64 addr_reg; u64 val_reg;
> > > + u64 cycle_reg; u64 time_reg;
> > > + u64 sync_addr; u64 times_addr;
> > > + u64 cs_start; u64 cs_size;
> > > + u32 last_flush; u32 waitall_mask;
> > > +};
> > > +
> > > +static void
> > > +get_job_cs_params(struct panthor_job *job, struct panthor_job_cs_params *params)
> > > {
> > > - struct panthor_job *job = container_of(sched_job, struct panthor_job, base);
> > > struct panthor_group *group = job->group;
> > > struct panthor_queue *queue = group->queues[job->queue_idx];
> > > struct panthor_device *ptdev = group->ptdev;
> > > struct panthor_scheduler *sched = ptdev->scheduler;
> > > - u32 ringbuf_size = panthor_kernel_bo_size(queue->ringbuf);
> > > - u32 ringbuf_insert = queue->iface.input->insert & (ringbuf_size - 1);
> > > - u64 addr_reg = ptdev->csif_info.cs_reg_count -
> > > - ptdev->csif_info.unpreserved_cs_reg_count;
> > > - u64 val_reg = addr_reg + 2;
> > > - u64 sync_addr = panthor_kernel_bo_gpuva(group->syncobjs) +
> > > - job->queue_idx * sizeof(struct panthor_syncobj_64b);
> > > - u32 waitall_mask = GENMASK(sched->sb_slot_count - 1, 0);
> > > - struct dma_fence *done_fence;
> > > - int ret;
> > >
> > > - u64 call_instrs[NUM_INSTRS_PER_SLOT] = {
> > > + params->addr_reg = ptdev->csif_info.cs_reg_count -
> > > + ptdev->csif_info.unpreserved_cs_reg_count;
> > > + params->val_reg = params->addr_reg + 2;
> > > + params->cycle_reg = params->addr_reg;
> > > + params->time_reg = params->val_reg;
> > > +
> > > + params->sync_addr = panthor_kernel_bo_gpuva(group->syncobjs) +
> > > + job->queue_idx * sizeof(struct panthor_syncobj_64b);
> > > + params->times_addr = panthor_kernel_bo_gpuva(queue->profiling_info.slots) +
> > > + (job->profiling_slot * sizeof(struct panthor_job_profiling_data));
> > > + params->waitall_mask = GENMASK(sched->sb_slot_count - 1, 0);
> > > +
> > > + params->cs_start = job->call_info.start;
> > > + params->cs_size = job->call_info.size;
> > > + params->last_flush = job->call_info.latest_flush;
> > > +
> > > + params->profile_mask = job->profile_mask;
> > > +}
> > > +
> > > +static void
> > > +prepare_job_instrs(const struct panthor_job_cs_params *params,
> > > + struct panthor_job_ringbuf_instrs *instrs)
> > > +{
> > > + const struct panthor_job_instr instr_seq[] = {
> > > /* MOV32 rX+2, cs.latest_flush */
> > > - (2ull << 56) | (val_reg << 48) | job->call_info.latest_flush,
> > > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
> > > + (2ull << 56) | (params->val_reg << 48) | params->last_flush),
> > >
> > > /* FLUSH_CACHE2.clean_inv_all.no_wait.signal(0) rX+2 */
> > > - (36ull << 56) | (0ull << 48) | (val_reg << 40) | (0 << 16) | 0x233,
> > > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
> > > + (36ull << 56) | (0ull << 48) | (params->val_reg << 40) | (0 << 16) | 0x233),
> > > +
> > > + /* MOV48 rX:rX+1, cycles_offset */
> > > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_CYCLES,
> > > + (1ull << 56) | (params->cycle_reg << 48) |
> > > + (params->times_addr +
> > > + offsetof(struct panthor_job_profiling_data, cycles.before))),
> > > + /* STORE_STATE cycles */
> > > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_CYCLES,
> > > + (40ull << 56) | (params->cycle_reg << 40) | (1ll << 32)),
> > > + /* MOV48 rX:rX+1, time_offset */
> > > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_TIMESTAMP,
> > > + (1ull << 56) | (params->time_reg << 48) |
> > > + (params->times_addr +
> > > + offsetof(struct panthor_job_profiling_data, time.before))),
> > > + /* STORE_STATE timer */
> > > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_TIMESTAMP,
> > > + (40ull << 56) | (params->time_reg << 40) | (0ll << 32)),
> > >
> > > /* MOV48 rX:rX+1, cs.start */
> > > - (1ull << 56) | (addr_reg << 48) | job->call_info.start,
> > > -
> > > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
> > > + (1ull << 56) | (params->addr_reg << 48) | params->cs_start),
> > > /* MOV32 rX+2, cs.size */
> > > - (2ull << 56) | (val_reg << 48) | job->call_info.size,
> > > -
> > > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
> > > + (2ull << 56) | (params->val_reg << 48) | params->cs_size),
> > > /* WAIT(0) => waits for FLUSH_CACHE2 instruction */
> > > - (3ull << 56) | (1 << 16),
> > > -
> > > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED, (3ull << 56) | (1 << 16)),
> > > /* CALL rX:rX+1, rX+2 */
> > > - (32ull << 56) | (addr_reg << 40) | (val_reg << 32),
> > > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
> > > + (32ull << 56) | (params->addr_reg << 40) | (params->val_reg << 32)),
> > > +
> > > + /* MOV48 rX:rX+1, cycles_offset */
> > > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_CYCLES,
> > > + (1ull << 56) | (params->cycle_reg << 48) |
> > > + (params->times_addr +
> > > + offsetof(struct panthor_job_profiling_data, cycles.after))),
> > > + /* STORE_STATE cycles */
> > > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_CYCLES,
> > > + (40ull << 56) | (params->cycle_reg << 40) | (1ll << 32)),
> > > +
> > > + /* MOV48 rX:rX+1, time_offset */
> > > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_TIMESTAMP,
> > > + (1ull << 56) | (params->time_reg << 48) |
> > > + (params->times_addr +
> > > + offsetof(struct panthor_job_profiling_data, time.after))),
> > > + /* STORE_STATE timer */
> > > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_TIMESTAMP,
> > > + (40ull << 56) | (params->time_reg << 40) | (0ll << 32)),
> > >
> > > /* MOV48 rX:rX+1, sync_addr */
> > > - (1ull << 56) | (addr_reg << 48) | sync_addr,
> > > -
> > > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
> > > + (1ull << 56) | (params->addr_reg << 48) | params->sync_addr),
> > > /* MOV48 rX+2, #1 */
> > > - (1ull << 56) | (val_reg << 48) | 1,
> > > -
> > > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
> > > + (1ull << 56) | (params->val_reg << 48) | 1),
> > > /* WAIT(all) */
> > > - (3ull << 56) | (waitall_mask << 16),
> > > -
> > > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
> > > + (3ull << 56) | (params->waitall_mask << 16)),
> > > /* SYNC_ADD64.system_scope.propage_err.nowait rX:rX+1, rX+2*/
> > > - (51ull << 56) | (0ull << 48) | (addr_reg << 40) | (val_reg << 32) | (0 << 16) | 1,
> > > -
> > > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED,
> > > + (51ull << 56) | (0ull << 48) | (params->addr_reg << 40) |
> > > + (params->val_reg << 32) | (0 << 16) | 1),
> > > /* ERROR_BARRIER, so we can recover from faults at job
> > > * boundaries.
> > > */
> > > - (47ull << 56),
> > > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED, (47ull << 56)),
> > > + };
> > > + u32 pad;
> > > +
> > > + /* NEED to be cacheline aligned to please the prefetcher. */
> > > + static_assert(sizeof(instrs->buffer) % 64 == 0,
> > > + "panthor_job_ringbuf_instrs::buffer is not aligned on a cacheline");
> > > +
> > > + /* Make sure we have enough storage to store the whole sequence. */
> > > + static_assert(ALIGN(ARRAY_SIZE(instr_seq), NUM_INSTRS_PER_CACHE_LINE) <=
> > > + ARRAY_SIZE(instrs->buffer),
> > > + "instr_seq vs panthor_job_ringbuf_instrs::buffer size mismatch");
> >
> > We probably want to catch situations where instrs->buffer has gone
> > bigger than needed (say we found a way to drop instructions), so I
> > would turn the '<=' condition into an '=='.
>
> I did this but it triggers the static assert: the instr_seq array has 19 elements,
> which when aligned to NUM_INSTRS_PER_CACHE_LINE (8, I think) rounds up to 24, still
> less than the 32 slots in instrs->buffer. A slightly oversized instrs.buffer array
> isn't a problem though, because instrs.count is what's used when copying the
> instructions into the ring buffer, so I think leaving this assert as an <= should be alright.
The whole point is to have a buffer that's the right size, rather than
bigger than needed. So I'd suggest changing the MAX_INSTRS_PER_JOB
definition to make it 24 instead.
Thread overview: 15+ messages
2024-09-03 20:25 [PATCH v5 0/4] Support fdinfo runtime and memory stats on Panthor Adrián Larumbe
2024-09-03 20:25 ` [PATCH v5 1/4] drm/panthor: introduce job cycle and timestamp accounting Adrián Larumbe
2024-09-04 7:49 ` Boris Brezillon
2024-09-12 15:03 ` Adrián Larumbe
2024-09-12 15:10 ` Boris Brezillon
2024-09-04 16:10 ` kernel test robot
2024-09-04 17:02 ` kernel test robot
2024-09-04 18:55 ` Liviu Dudau
2024-09-03 20:25 ` [PATCH v5 2/4] drm/panthor: add DRM fdinfo support Adrián Larumbe
2024-09-04 8:01 ` Boris Brezillon
2024-09-04 17:45 ` kernel test robot
2024-09-03 20:25 ` [PATCH v5 3/4] drm/panthor: enable fdinfo for memory stats Adrián Larumbe
2024-09-04 8:03 ` Boris Brezillon
2024-09-03 20:25 ` [PATCH v5 4/4] drm/panthor: add sysfs knob for enabling job profiling Adrián Larumbe
2024-09-04 8:07 ` Boris Brezillon