* [PATCH v4 0/4] Support fdinfo runtime and memory stats on Panthor
@ 2024-07-16 20:11 Adrián Larumbe
2024-07-16 20:11 ` [PATCH v4 1/4] drm/panthor: introduce job cycle and timestamp accounting Adrián Larumbe
` (3 more replies)
0 siblings, 4 replies; 13+ messages in thread
From: Adrián Larumbe @ 2024-07-16 20:11 UTC (permalink / raw)
To: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Daniel Vetter
Cc: kernel, Adrián Larumbe, dri-devel, linux-kernel
This patch series enables userspace utilities like gputop and nvtop to
query a render context's fdinfo file and figure out rates of engine
and memory utilisation.
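The fdinfo keys this series exports (drm-engine-panthor in nanoseconds, drm-cycles-panthor, drm-curfreq-panthor) are what a sampler like gputop turns into a utilisation figure by diffing two snapshots against wall-clock time. A minimal sketch of that arithmetic; the struct and function names are illustrative, not gputop's actual code:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical snapshot: the drm-engine-panthor value (ns of GPU time)
 * read from fdinfo, plus the CLOCK_MONOTONIC wall time of the read. */
struct fdinfo_sample {
	uint64_t engine_ns;	/* drm-engine-panthor */
	uint64_t wall_ns;	/* wall clock at sample time */
};

/* Busy percentage over the sampling window: delta(engine)/delta(wall). */
static unsigned int busy_percent(const struct fdinfo_sample *a,
				 const struct fdinfo_sample *b)
{
	uint64_t engine_delta = b->engine_ns - a->engine_ns;
	uint64_t wall_delta = b->wall_ns - a->wall_ns;

	if (!wall_delta)
		return 0;
	return (unsigned int)((engine_delta * 100) / wall_delta);
}
```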
Previous discussion can be found at
https://lore.kernel.org/dri-devel/dqhnxhgho6spfh7xhw6yvs2iiqeqzeg63e6jqqpw2g7gkrfphn@dojsixyl4esv/
Changelog:
v4:
- Fixed wrong assignment location for frequency values in Panthor's devfreq
- Removed the last two commits about registering size of internal BO's
- Rearranged patch series so that sysfs knob is done last and all the previous
time sampling and fdinfo show dependencies are already in place
v3:
- Fixed some nits and removed useless bounds check in panthor_sched.c
- Added support for sysfs profiling knob and optional job accounting
- Added new patches for calculating size of internal BO's
v2:
- Split original first patch in two, one for FW CS cycle and timestamp
calculations and job accounting memory management, and a second one
that enables fdinfo.
- Moved NUM_INSTRS_PER_SLOT to the file prelude
- Removed nelem variable from the group's struct definition.
- Precompute size of group's syncobj BO to avoid code duplication.
- Some minor nits.
Adrián Larumbe (4):
drm/panthor: introduce job cycle and timestamp accounting
drm/panthor: add DRM fdinfo support
drm/panthor: enable fdinfo for memory stats
drm/panthor: add sysfs knob for enabling job profiling
drivers/gpu/drm/panthor/panthor_devfreq.c | 18 +-
drivers/gpu/drm/panthor/panthor_device.h | 12 +
drivers/gpu/drm/panthor/panthor_drv.c | 69 +++++
drivers/gpu/drm/panthor/panthor_gem.c | 12 +
drivers/gpu/drm/panthor/panthor_sched.c | 291 +++++++++++++++++++---
5 files changed, 371 insertions(+), 31 deletions(-)
--
2.45.1
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH v4 1/4] drm/panthor: introduce job cycle and timestamp accounting
2024-07-16 20:11 [PATCH v4 0/4] Support fdinfo runtime and memory stats on Panthor Adrián Larumbe
@ 2024-07-16 20:11 ` Adrián Larumbe
2024-07-19 14:14 ` Steven Price
2024-08-19 12:22 ` Boris Brezillon
2024-07-16 20:11 ` [PATCH v4 2/4] drm/panthor: add DRM fdinfo support Adrián Larumbe
` (2 subsequent siblings)
3 siblings, 2 replies; 13+ messages in thread
From: Adrián Larumbe @ 2024-07-16 20:11 UTC (permalink / raw)
To: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Daniel Vetter
Cc: kernel, Adrián Larumbe, dri-devel, linux-kernel
Enable calculations of job submission times in clock cycles and wall
time. This is done by expanding the boilerplate command stream when running
a job to include instructions that compute said times right before and after
a user CS.
Those numbers are stored in the queue's group's sync objects BO, right
after the sync objects themselves. Because the queues in a group might have
different numbers of slots, one must keep a running tally of slots when
computing the offset of a queue's time sample structs, one for each slot.
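The layout sketched above (all sync objects first, then the per-slot panthor_job_times samples, queue by queue) implies a simple offset computation. A standalone sketch; the syncobj struct layout here is an illustrative stand-in, not the driver's actual panthor_syncobj_64b definition:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the patch's panthor_job_times: before/after cycle counts and
 * before/after timestamps, 32 bytes in total. */
struct job_times {
	struct { uint64_t before, after; } cycles;
	struct { uint64_t before, after; } time;
};

/* Illustrative stand-in for panthor_syncobj_64b. */
struct syncobj_64b {
	uint64_t seqno;
	uint64_t status;
};

/* Offset, inside the group's syncobjs BO, of the time sample struct for a
 * given ring buffer slot of a given queue. "slots_so_far" is the sum of
 * the slot counts of all lower-indexed queues in the group. */
static size_t job_times_offset(uint32_t queue_count,
			       uint32_t slots_so_far,
			       uint32_t ringbuf_idx)
{
	return queue_count * sizeof(struct syncobj_64b) +
	       (slots_so_far + ringbuf_idx) * sizeof(struct job_times);
}
```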
This commit is done in preparation for enabling DRM fdinfo support in the
Panthor driver, which depends on the numbers calculated herein.
A profile mode device flag has been added that will, in a future commit,
allow user space to toggle time sampling behaviour; it is disabled by
default to save power. It also enables marking jobs as being profiled and
picks one of two call instruction arrays to insert into the ring buffer.
One of them includes FW logic to sample the timestamp and cycle counter
registers and write them into the job's syncobj, and the other does not.
A profiled job's call sequence takes up two ring buffer slots, and this is
reflected when initialising the DRM scheduler for each queue, with a
profiled job contributing twice as many credits.
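With the drm_sched credit limit set to ringbuf_size / sizeof(u64) and a profiled job emitting two slots' worth of instructions, the number of jobs that fit in a ring buffer halves when profiling is on. A rough sketch of that accounting, using the same constants as the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_INSTRS_PER_SLOT 16

/* Credits one job charges against the scheduler, per the patch: a
 * profiled job emits twice as many instructions as a plain one. */
static uint32_t job_credits(bool profiled)
{
	return profiled ? NUM_INSTRS_PER_SLOT * 2 : NUM_INSTRS_PER_SLOT;
}

/* How many same-kind jobs fit in a ring buffer of a given byte size,
 * given the credit limit of ringbuf_size / sizeof(u64). */
static uint32_t max_in_flight(uint32_t ringbuf_size, bool profiled)
{
	uint32_t credit_limit = ringbuf_size / sizeof(uint64_t);

	return credit_limit / job_credits(profiled);
}
```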
Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
---
drivers/gpu/drm/panthor/panthor_device.h | 2 +
drivers/gpu/drm/panthor/panthor_sched.c | 244 ++++++++++++++++++++---
2 files changed, 216 insertions(+), 30 deletions(-)
diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
index e388c0472ba7..3ede2f80df73 100644
--- a/drivers/gpu/drm/panthor/panthor_device.h
+++ b/drivers/gpu/drm/panthor/panthor_device.h
@@ -162,6 +162,8 @@ struct panthor_device {
*/
struct page *dummy_latest_flush;
} pm;
+
+ bool profile_mode;
};
/**
diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
index 79ffcbc41d78..6438e5ea1f2b 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.c
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -93,6 +93,9 @@
#define MIN_CSGS 3
#define MAX_CSG_PRIO 0xf
+#define NUM_INSTRS_PER_SLOT 16
+#define SLOTSIZE (NUM_INSTRS_PER_SLOT * sizeof(u64))
+
struct panthor_group;
/**
@@ -466,6 +469,9 @@ struct panthor_queue {
*/
struct list_head in_flight_jobs;
} fence_ctx;
+
+ /** @time_offset: Offset of panthor_job_times structs in group's syncobj bo. */
+ unsigned long time_offset;
};
/**
@@ -592,7 +598,17 @@ struct panthor_group {
* One sync object per queue. The position of the sync object is
* determined by the queue index.
*/
- struct panthor_kernel_bo *syncobjs;
+
+ struct {
+ /** @bo: Kernel BO holding the sync objects. */
+ struct panthor_kernel_bo *bo;
+
+ /**
+ * @job_times_offset: Beginning of panthor_job_times struct samples after
+ * the group's array of sync objects.
+ */
+ size_t job_times_offset;
+ } syncobjs;
/** @state: Group state. */
enum panthor_group_state state;
@@ -651,6 +667,18 @@ struct panthor_group {
struct list_head wait_node;
};
+struct panthor_job_times {
+ struct {
+ u64 before;
+ u64 after;
+ } cycles;
+
+ struct {
+ u64 before;
+ u64 after;
+ } time;
+};
+
/**
* group_queue_work() - Queue a group work
* @group: Group to queue the work for.
@@ -730,6 +758,9 @@ struct panthor_job {
/** @queue_idx: Index of the queue inside @group. */
u32 queue_idx;
+ /** @ringbuf_idx: Index of the ringbuffer inside @queue. */
+ u32 ringbuf_idx;
+
/** @call_info: Information about the userspace command stream call. */
struct {
/** @start: GPU address of the userspace command stream. */
@@ -764,6 +795,9 @@ struct panthor_job {
/** @done_fence: Fence signaled when the job is finished or cancelled. */
struct dma_fence *done_fence;
+
+ /** @is_profiled: Whether timestamp and cycle numbers were gathered for this job */
+ bool is_profiled;
};
static void
@@ -844,7 +878,7 @@ static void group_release_work(struct work_struct *work)
panthor_kernel_bo_destroy(group->suspend_buf);
panthor_kernel_bo_destroy(group->protm_suspend_buf);
- panthor_kernel_bo_destroy(group->syncobjs);
+ panthor_kernel_bo_destroy(group->syncobjs.bo);
panthor_vm_put(group->vm);
kfree(group);
@@ -1969,8 +2003,6 @@ tick_ctx_init(struct panthor_scheduler *sched,
}
}
-#define NUM_INSTRS_PER_SLOT 16
-
static void
group_term_post_processing(struct panthor_group *group)
{
@@ -2007,7 +2039,7 @@ group_term_post_processing(struct panthor_group *group)
spin_unlock(&queue->fence_ctx.lock);
/* Manually update the syncobj seqno to unblock waiters. */
- syncobj = group->syncobjs->kmap + (i * sizeof(*syncobj));
+ syncobj = group->syncobjs.bo->kmap + (i * sizeof(*syncobj));
syncobj->status = ~0;
syncobj->seqno = atomic64_read(&queue->fence_ctx.seqno);
sched_queue_work(group->ptdev->scheduler, sync_upd);
@@ -2780,7 +2812,7 @@ static void group_sync_upd_work(struct work_struct *work)
if (!queue)
continue;
- syncobj = group->syncobjs->kmap + (queue_idx * sizeof(*syncobj));
+ syncobj = group->syncobjs.bo->kmap + (queue_idx * sizeof(*syncobj));
spin_lock(&queue->fence_ctx.lock);
list_for_each_entry_safe(job, job_tmp, &queue->fence_ctx.in_flight_jobs, node) {
@@ -2815,22 +2847,81 @@ queue_run_job(struct drm_sched_job *sched_job)
struct panthor_scheduler *sched = ptdev->scheduler;
u32 ringbuf_size = panthor_kernel_bo_size(queue->ringbuf);
u32 ringbuf_insert = queue->iface.input->insert & (ringbuf_size - 1);
+ u32 ringbuf_index = ringbuf_insert / (SLOTSIZE);
+ bool ringbuf_wraparound =
+ job->is_profiled && ((ringbuf_size/SLOTSIZE) == ringbuf_index + 1);
u64 addr_reg = ptdev->csif_info.cs_reg_count -
ptdev->csif_info.unpreserved_cs_reg_count;
u64 val_reg = addr_reg + 2;
- u64 sync_addr = panthor_kernel_bo_gpuva(group->syncobjs) +
- job->queue_idx * sizeof(struct panthor_syncobj_64b);
+ u64 cycle_reg = addr_reg;
+ u64 time_reg = val_reg;
+ u64 sync_addr = panthor_kernel_bo_gpuva(group->syncobjs.bo) +
+ job->queue_idx * sizeof(struct panthor_syncobj_64b);
+ u64 times_addr = panthor_kernel_bo_gpuva(group->syncobjs.bo) + queue->time_offset +
+ (ringbuf_index * sizeof(struct panthor_job_times));
+ size_t call_insrt_size;
+ u64 *call_instrs;
+
u32 waitall_mask = GENMASK(sched->sb_slot_count - 1, 0);
struct dma_fence *done_fence;
int ret;
- u64 call_instrs[NUM_INSTRS_PER_SLOT] = {
+ u64 call_instrs_simple[NUM_INSTRS_PER_SLOT] = {
+ /* MOV32 rX+2, cs.latest_flush */
+ (2ull << 56) | (val_reg << 48) | job->call_info.latest_flush,
+
+ /* FLUSH_CACHE2.clean_inv_all.no_wait.signal(0) rX+2 */
+ (36ull << 56) | (0ull << 48) | (val_reg << 40) | (0 << 16) | 0x233,
+
+ /* MOV48 rX:rX+1, cs.start */
+ (1ull << 56) | (addr_reg << 48) | job->call_info.start,
+
+ /* MOV32 rX+2, cs.size */
+ (2ull << 56) | (val_reg << 48) | job->call_info.size,
+
+ /* WAIT(0) => waits for FLUSH_CACHE2 instruction */
+ (3ull << 56) | (1 << 16),
+
+ /* CALL rX:rX+1, rX+2 */
+ (32ull << 56) | (addr_reg << 40) | (val_reg << 32),
+
+ /* MOV48 rX:rX+1, sync_addr */
+ (1ull << 56) | (addr_reg << 48) | sync_addr,
+
+ /* MOV48 rX+2, #1 */
+ (1ull << 56) | (val_reg << 48) | 1,
+
+ /* WAIT(all) */
+ (3ull << 56) | (waitall_mask << 16),
+
+ /* SYNC_ADD64.system_scope.propage_err.nowait rX:rX+1, rX+2*/
+ (51ull << 56) | (0ull << 48) | (addr_reg << 40) | (val_reg << 32) | (0 << 16) | 1,
+
+ /* ERROR_BARRIER, so we can recover from faults at job
+ * boundaries.
+ */
+ (47ull << 56),
+ };
+
+ u64 call_instrs_profile[NUM_INSTRS_PER_SLOT*2] = {
/* MOV32 rX+2, cs.latest_flush */
(2ull << 56) | (val_reg << 48) | job->call_info.latest_flush,
/* FLUSH_CACHE2.clean_inv_all.no_wait.signal(0) rX+2 */
(36ull << 56) | (0ull << 48) | (val_reg << 40) | (0 << 16) | 0x233,
+ /* MOV48 rX:rX+1, cycles_offset */
+ (1ull << 56) | (cycle_reg << 48) | (times_addr + offsetof(struct panthor_job_times, cycles.before)),
+
+ /* MOV48 rX:rX+1, time_offset */
+ (1ull << 56) | (time_reg << 48) | (times_addr + offsetof(struct panthor_job_times, time.before)),
+
+ /* STORE_STATE cycles */
+ (40ull << 56) | (cycle_reg << 40) | (1ll << 32),
+
+ /* STORE_STATE timer */
+ (40ull << 56) | (time_reg << 40) | (0ll << 32),
+
/* MOV48 rX:rX+1, cs.start */
(1ull << 56) | (addr_reg << 48) | job->call_info.start,
@@ -2843,6 +2934,18 @@ queue_run_job(struct drm_sched_job *sched_job)
/* CALL rX:rX+1, rX+2 */
(32ull << 56) | (addr_reg << 40) | (val_reg << 32),
+ /* MOV48 rX:rX+1, cycles_offset */
+ (1ull << 56) | (cycle_reg << 48) | (times_addr + offsetof(struct panthor_job_times, cycles.after)),
+
+ /* MOV48 rX:rX+1, time_offset */
+ (1ull << 56) | (time_reg << 48) | (times_addr + offsetof(struct panthor_job_times, time.after)),
+
+ /* STORE_STATE cycles */
+ (40ull << 56) | (cycle_reg << 40) | (1ll << 32),
+
+ /* STORE_STATE timer */
+ (40ull << 56) | (time_reg << 40) | (0ll << 32),
+
/* MOV48 rX:rX+1, sync_addr */
(1ull << 56) | (addr_reg << 48) | sync_addr,
@@ -2862,9 +2965,18 @@ queue_run_job(struct drm_sched_job *sched_job)
};
/* Need to be cacheline aligned to please the prefetcher. */
- static_assert(sizeof(call_instrs) % 64 == 0,
+ static_assert(sizeof(call_instrs_simple) % 64 == 0 && sizeof(call_instrs_profile) % 64 == 0,
"call_instrs is not aligned on a cacheline");
+ if (job->is_profiled) {
+ call_instrs = call_instrs_profile;
+ call_insrt_size = sizeof(call_instrs_profile);
+
+ } else {
+ call_instrs = call_instrs_simple;
+ call_insrt_size = sizeof(call_instrs_simple);
+ }
+
/* Stream size is zero, nothing to do => return a NULL fence and let
* drm_sched signal the parent.
*/
@@ -2887,8 +2999,23 @@ queue_run_job(struct drm_sched_job *sched_job)
queue->fence_ctx.id,
atomic64_inc_return(&queue->fence_ctx.seqno));
- memcpy(queue->ringbuf->kmap + ringbuf_insert,
- call_instrs, sizeof(call_instrs));
+ /*
+ * Need to handle the wrap-around case when copying profiled instructions
+ * from an odd-indexed slot. The reason this can happen is user space is
+ * able to control the profiling status of the driver through a sysfs
+ * knob, so this might lead to a timestamp and cycles-profiling call
+ * instruction stream beginning at an odd-number slot. The GPU should
+ * be able to gracefully handle this.
+ */
+ if (!ringbuf_wraparound) {
+ memcpy(queue->ringbuf->kmap + ringbuf_insert,
+ call_instrs, call_insrt_size);
+ } else {
+ memcpy(queue->ringbuf->kmap + ringbuf_insert,
+ call_instrs, call_insrt_size/2);
+ memcpy(queue->ringbuf->kmap, call_instrs +
+ NUM_INSTRS_PER_SLOT, call_insrt_size/2);
+ }
panthor_job_get(&job->base);
spin_lock(&queue->fence_ctx.lock);
@@ -2896,7 +3023,8 @@ queue_run_job(struct drm_sched_job *sched_job)
spin_unlock(&queue->fence_ctx.lock);
job->ringbuf.start = queue->iface.input->insert;
- job->ringbuf.end = job->ringbuf.start + sizeof(call_instrs);
+ job->ringbuf.end = job->ringbuf.start + call_insrt_size;
+ job->ringbuf_idx = ringbuf_index;
/* Make sure the ring buffer is updated before the INSERT
* register.
@@ -2987,7 +3115,8 @@ static const struct drm_sched_backend_ops panthor_queue_sched_ops = {
static struct panthor_queue *
group_create_queue(struct panthor_group *group,
- const struct drm_panthor_queue_create *args)
+ const struct drm_panthor_queue_create *args,
+ unsigned int slots_so_far)
{
struct drm_gpu_scheduler *drm_sched;
struct panthor_queue *queue;
@@ -3038,9 +3167,17 @@ group_create_queue(struct panthor_group *group,
goto err_free_queue;
}
+ queue->time_offset = group->syncobjs.job_times_offset +
+ (slots_so_far * sizeof(struct panthor_job_times));
+
+ /*
+ * Credit limit argument tells us the total number of instructions
+ * across all CS slots in the ringbuffer, with some jobs requiring
+ * twice as many as others, depending on their profiling status.
+ */
ret = drm_sched_init(&queue->scheduler, &panthor_queue_sched_ops,
group->ptdev->scheduler->wq, 1,
- args->ringbuf_size / (NUM_INSTRS_PER_SLOT * sizeof(u64)),
+ args->ringbuf_size / sizeof(u64),
0, msecs_to_jiffies(JOB_TIMEOUT_MS),
group->ptdev->reset.wq,
NULL, "panthor-queue", group->ptdev->base.dev);
@@ -3068,7 +3205,9 @@ int panthor_group_create(struct panthor_file *pfile,
struct panthor_scheduler *sched = ptdev->scheduler;
struct panthor_fw_csg_iface *csg_iface = panthor_fw_get_csg_iface(ptdev, 0);
struct panthor_group *group = NULL;
+ unsigned int total_slots;
u32 gid, i, suspend_size;
+ size_t syncobj_bo_size;
int ret;
if (group_args->pad)
@@ -3134,33 +3273,75 @@ int panthor_group_create(struct panthor_file *pfile,
goto err_put_group;
}
- group->syncobjs = panthor_kernel_bo_create(ptdev, group->vm,
- group_args->queues.count *
- sizeof(struct panthor_syncobj_64b),
- DRM_PANTHOR_BO_NO_MMAP,
- DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
- DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
- PANTHOR_VM_KERNEL_AUTO_VA);
- if (IS_ERR(group->syncobjs)) {
- ret = PTR_ERR(group->syncobjs);
+ /*
+ * Need to add size for the panthor_job_times structs, as many as the sum
+ * of the number of job slots for every single queue ringbuffer.
+ */
+ for (i = 0, total_slots = 0; i < group_args->queues.count; i++)
+ total_slots += (queue_args[i].ringbuf_size / (SLOTSIZE));
+
+ syncobj_bo_size = (group_args->queues.count * sizeof(struct panthor_syncobj_64b))
+ + (total_slots * sizeof(struct panthor_job_times));
+
+ /*
+ * Memory layout of group's syncobjs BO
+ * group->syncobjs.bo {
+ * struct panthor_syncobj_64b sync1;
+ * struct panthor_syncobj_64b sync2;
+ * ...
+ * As many as group_args->queues.count
+ * ...
+ * struct panthor_syncobj_64b syncn;
+ * struct panthor_job_times queue1_slot1
+ * struct panthor_job_times queue1_slot2
+ * ...
+ * As many as queue[i].ringbuf_size / SLOTSIZE
+ * ...
+ * struct panthor_job_times queue1_slotP
+ * ...
+ * As many as group_args->queues.count
+ * ...
+ * struct panthor_job_times queueN_slot1
+ * struct panthor_job_times queueN_slot2
+ * ...
+ * As many as queue[n].ringbuf_size / SLOTSIZE
+ * struct panthor_job_times queueN_slotQ
+ *
+ * Linearly, group->syncobjs.bo = {syncojb1,..,syncobjN,
+ * {queue1 = {js1,..,jsP},..,queueN = {js1,..,jsQ}}}
+ * }
+ *
+ */
+
+ group->syncobjs.bo = panthor_kernel_bo_create(ptdev, group->vm,
+ syncobj_bo_size,
+ DRM_PANTHOR_BO_NO_MMAP,
+ DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
+ DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
+ PANTHOR_VM_KERNEL_AUTO_VA);
+ if (IS_ERR(group->syncobjs.bo)) {
+ ret = PTR_ERR(group->syncobjs.bo);
goto err_put_group;
}
- ret = panthor_kernel_bo_vmap(group->syncobjs);
+ ret = panthor_kernel_bo_vmap(group->syncobjs.bo);
if (ret)
goto err_put_group;
- memset(group->syncobjs->kmap, 0,
- group_args->queues.count * sizeof(struct panthor_syncobj_64b));
+ memset(group->syncobjs.bo->kmap, 0, syncobj_bo_size);
- for (i = 0; i < group_args->queues.count; i++) {
- group->queues[i] = group_create_queue(group, &queue_args[i]);
+ group->syncobjs.job_times_offset =
+ group_args->queues.count * sizeof(struct panthor_syncobj_64b);
+
+ for (i = 0, total_slots = 0; i < group_args->queues.count; i++) {
+ group->queues[i] = group_create_queue(group, &queue_args[i], total_slots);
if (IS_ERR(group->queues[i])) {
ret = PTR_ERR(group->queues[i]);
group->queues[i] = NULL;
goto err_put_group;
}
+ total_slots += (queue_args[i].ringbuf_size / (SLOTSIZE));
group->queue_count++;
}
@@ -3384,9 +3565,12 @@ panthor_job_create(struct panthor_file *pfile,
goto err_put_job;
}
+ job->is_profiled = pfile->ptdev->profile_mode;
+
ret = drm_sched_job_init(&job->base,
&job->group->queues[job->queue_idx]->entity,
- 1, job->group);
+ job->is_profiled ? NUM_INSTRS_PER_SLOT * 2 :
+ NUM_INSTRS_PER_SLOT, job->group);
if (ret)
goto err_put_job;
--
2.45.1
* [PATCH v4 2/4] drm/panthor: add DRM fdinfo support
2024-07-16 20:11 [PATCH v4 0/4] Support fdinfo runtime and memory stats on Panthor Adrián Larumbe
2024-07-16 20:11 ` [PATCH v4 1/4] drm/panthor: introduce job cycle and timestamp accounting Adrián Larumbe
@ 2024-07-16 20:11 ` Adrián Larumbe
2024-07-19 14:14 ` Steven Price
2024-07-16 20:11 ` [PATCH v4 3/4] drm/panthor: enable fdinfo for memory stats Adrián Larumbe
2024-07-16 20:11 ` [PATCH v4 4/4] drm/panthor: add sysfs knob for enabling job profiling Adrián Larumbe
3 siblings, 1 reply; 13+ messages in thread
From: Adrián Larumbe @ 2024-07-16 20:11 UTC (permalink / raw)
To: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Daniel Vetter
Cc: kernel, Adrián Larumbe, dri-devel, linux-kernel
Drawing on the FW-calculated values from the previous commit, we can
accumulate per-file totals by collecting the samples of finished jobs
when updating their group's synchronisation objects.
Display of the fdinfo key-value pairs is governed by a flag that is
disabled by default in the present commit; support for toggling it
manually will be added in a later commit.
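On arm64 the sampled timestamps are arch-timer ticks, so the fdinfo callback converts them to nanoseconds using the counter frequency from arch_timer_get_cntfrq(). A user-space flavoured sketch of the same conversion, with DIV_ROUND_UP_ULL() semantics reproduced by hand; the 25 MHz frequency in the example is only illustrative:

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Convert arch-timer ticks to nanoseconds, rounding up like the
 * kernel's DIV_ROUND_UP_ULL(). */
static uint64_t ticks_to_ns(uint64_t ticks, uint64_t cntfrq_hz)
{
	return (ticks * NSEC_PER_SEC + cntfrq_hz - 1) / cntfrq_hz;
}
```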
Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
---
drivers/gpu/drm/panthor/panthor_devfreq.c | 18 ++++++++-
drivers/gpu/drm/panthor/panthor_device.h | 10 +++++
drivers/gpu/drm/panthor/panthor_drv.c | 33 ++++++++++++++++
drivers/gpu/drm/panthor/panthor_sched.c | 47 +++++++++++++++++++++++
4 files changed, 107 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/panthor/panthor_devfreq.c b/drivers/gpu/drm/panthor/panthor_devfreq.c
index c6d3c327cc24..9d0f891b9b53 100644
--- a/drivers/gpu/drm/panthor/panthor_devfreq.c
+++ b/drivers/gpu/drm/panthor/panthor_devfreq.c
@@ -62,14 +62,20 @@ static void panthor_devfreq_update_utilization(struct panthor_devfreq *pdevfreq)
static int panthor_devfreq_target(struct device *dev, unsigned long *freq,
u32 flags)
{
+ struct panthor_device *ptdev = dev_get_drvdata(dev);
struct dev_pm_opp *opp;
+ int err;
opp = devfreq_recommended_opp(dev, freq, flags);
if (IS_ERR(opp))
return PTR_ERR(opp);
dev_pm_opp_put(opp);
- return dev_pm_opp_set_rate(dev, *freq);
+ err = dev_pm_opp_set_rate(dev, *freq);
+ if (!err)
+ ptdev->current_frequency = *freq;
+
+ return err;
}
static void panthor_devfreq_reset(struct panthor_devfreq *pdevfreq)
@@ -130,6 +136,7 @@ int panthor_devfreq_init(struct panthor_device *ptdev)
struct panthor_devfreq *pdevfreq;
struct dev_pm_opp *opp;
unsigned long cur_freq;
+ unsigned long freq = ULONG_MAX;
int ret;
pdevfreq = drmm_kzalloc(&ptdev->base, sizeof(*ptdev->devfreq), GFP_KERNEL);
@@ -161,6 +168,7 @@ int panthor_devfreq_init(struct panthor_device *ptdev)
return PTR_ERR(opp);
panthor_devfreq_profile.initial_freq = cur_freq;
+ ptdev->current_frequency = cur_freq;
/* Regulator coupling only takes care of synchronizing/balancing voltage
* updates, but the coupled regulator needs to be enabled manually.
@@ -204,6 +212,14 @@ int panthor_devfreq_init(struct panthor_device *ptdev)
dev_pm_opp_put(opp);
+ /* Find the fastest defined rate */
+ opp = dev_pm_opp_find_freq_floor(dev, &freq);
+ if (IS_ERR(opp))
+ return PTR_ERR(opp);
+ ptdev->fast_rate = freq;
+
+ dev_pm_opp_put(opp);
+
/*
* Setup default thresholds for the simple_ondemand governor.
* The values are chosen based on experiments.
diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
index 3ede2f80df73..4536fbf43a4e 100644
--- a/drivers/gpu/drm/panthor/panthor_device.h
+++ b/drivers/gpu/drm/panthor/panthor_device.h
@@ -163,9 +163,16 @@ struct panthor_device {
struct page *dummy_latest_flush;
} pm;
+ unsigned long current_frequency;
+ unsigned long fast_rate;
bool profile_mode;
};
+struct panthor_gpu_usage {
+ u64 time;
+ u64 cycles;
+};
+
/**
* struct panthor_file - Panthor file
*/
@@ -178,6 +185,9 @@ struct panthor_file {
/** @groups: Scheduling group pool attached to this file. */
struct panthor_group_pool *groups;
+
+ /** @stats: cycle and timestamp measures for job execution. */
+ struct panthor_gpu_usage stats;
};
int panthor_device_init(struct panthor_device *ptdev);
diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
index b8a84f26b3ef..6a0c1a06a709 100644
--- a/drivers/gpu/drm/panthor/panthor_drv.c
+++ b/drivers/gpu/drm/panthor/panthor_drv.c
@@ -3,12 +3,17 @@
/* Copyright 2019 Linaro, Ltd., Rob Herring <robh@kernel.org> */
/* Copyright 2019 Collabora ltd. */
+#ifdef CONFIG_ARM_ARCH_TIMER
+#include <asm/arch_timer.h>
+#endif
+
#include <linux/list.h>
#include <linux/module.h>
#include <linux/of_platform.h>
#include <linux/pagemap.h>
#include <linux/platform_device.h>
#include <linux/pm_runtime.h>
+#include <linux/time64.h>
#include <drm/drm_debugfs.h>
#include <drm/drm_drv.h>
@@ -1351,6 +1356,32 @@ static int panthor_mmap(struct file *filp, struct vm_area_struct *vma)
return ret;
}
+static void panthor_gpu_show_fdinfo(struct panthor_device *ptdev,
+ struct panthor_file *pfile,
+ struct drm_printer *p)
+{
+ if (ptdev->profile_mode) {
+#ifdef CONFIG_ARM_ARCH_TIMER
+ drm_printf(p, "drm-engine-panthor:\t%llu ns\n",
+ DIV_ROUND_UP_ULL((pfile->stats.time * NSEC_PER_SEC),
+ arch_timer_get_cntfrq()));
+#endif
+ drm_printf(p, "drm-cycles-panthor:\t%llu\n", pfile->stats.cycles);
+ }
+ drm_printf(p, "drm-maxfreq-panthor:\t%lu Hz\n", ptdev->fast_rate);
+ drm_printf(p, "drm-curfreq-panthor:\t%lu Hz\n", ptdev->current_frequency);
+}
+
+static void panthor_show_fdinfo(struct drm_printer *p, struct drm_file *file)
+{
+ struct drm_device *dev = file->minor->dev;
+ struct panthor_device *ptdev = container_of(dev, struct panthor_device, base);
+
+ panthor_gpu_show_fdinfo(ptdev, file->driver_priv, p);
+
+ drm_show_memory_stats(p, file);
+}
+
static const struct file_operations panthor_drm_driver_fops = {
.open = drm_open,
.release = drm_release,
@@ -1360,6 +1391,7 @@ static const struct file_operations panthor_drm_driver_fops = {
.read = drm_read,
.llseek = noop_llseek,
.mmap = panthor_mmap,
+ .show_fdinfo = drm_show_fdinfo,
};
#ifdef CONFIG_DEBUG_FS
@@ -1378,6 +1410,7 @@ static const struct drm_driver panthor_drm_driver = {
DRIVER_SYNCOBJ_TIMELINE | DRIVER_GEM_GPUVA,
.open = panthor_open,
.postclose = panthor_postclose,
+ .show_fdinfo = panthor_show_fdinfo,
.ioctls = panthor_drm_driver_ioctls,
.num_ioctls = ARRAY_SIZE(panthor_drm_driver_ioctls),
.fops = &panthor_drm_driver_fops,
diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
index 6438e5ea1f2b..4fb6fc5c2314 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.c
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -610,6 +610,18 @@ struct panthor_group {
size_t job_times_offset;
} syncobjs;
+ /** @fdinfo: Per-file total cycle and timestamp values reference. */
+ struct {
+ /** @data: Pointer to actual per-file sample data. */
+ struct panthor_gpu_usage *data;
+
+ /**
+ * @lock: Mutex to govern concurrent access from drm file's fdinfo callback
+ * and job post-completion processing function
+ */
+ struct mutex lock;
+ } fdinfo;
+
/** @state: Group state. */
enum panthor_group_state state;
@@ -873,6 +885,8 @@ static void group_release_work(struct work_struct *work)
release_work);
u32 i;
+ mutex_destroy(&group->fdinfo.lock);
+
for (i = 0; i < group->queue_count; i++)
group_free_queue(group, group->queues[i]);
@@ -2795,6 +2809,30 @@ void panthor_sched_post_reset(struct panthor_device *ptdev, bool reset_failed)
}
}
+static void update_fdinfo_stats(struct panthor_job *job)
+{
+ struct panthor_group *group = job->group;
+ struct panthor_queue *queue = group->queues[job->queue_idx];
+ struct panthor_device *ptdev = group->ptdev;
+ struct panthor_gpu_usage *fdinfo;
+ struct panthor_job_times *times;
+
+ drm_WARN_ON(&ptdev->base, job->ringbuf_idx >=
+ panthor_kernel_bo_size(queue->ringbuf) / (SLOTSIZE));
+
+ times = (struct panthor_job_times *)
+ ((unsigned long)group->syncobjs.bo->kmap + queue->time_offset +
+ (job->ringbuf_idx * sizeof(struct panthor_job_times)));
+
+ mutex_lock(&group->fdinfo.lock);
+ if ((group->fdinfo.data)) {
+ fdinfo = group->fdinfo.data;
+ fdinfo->cycles += times->cycles.after - times->cycles.before;
+ fdinfo->time += times->time.after - times->time.before;
+ }
+ mutex_unlock(&group->fdinfo.lock);
+}
+
static void group_sync_upd_work(struct work_struct *work)
{
struct panthor_group *group =
@@ -2830,6 +2868,8 @@ static void group_sync_upd_work(struct work_struct *work)
dma_fence_end_signalling(cookie);
list_for_each_entry_safe(job, job_tmp, &done_jobs, node) {
+ if (job->is_profiled)
+ update_fdinfo_stats(job);
list_del_init(&job->node);
panthor_job_put(&job->base);
}
@@ -3362,6 +3402,9 @@ int panthor_group_create(struct panthor_file *pfile,
}
mutex_unlock(&sched->reset.lock);
+ group->fdinfo.data = &pfile->stats;
+ mutex_init(&group->fdinfo.lock);
+
return gid;
err_put_group:
@@ -3401,6 +3444,10 @@ int panthor_group_destroy(struct panthor_file *pfile, u32 group_handle)
mutex_unlock(&sched->lock);
mutex_unlock(&sched->reset.lock);
+ mutex_lock(&group->fdinfo.lock);
+ group->fdinfo.data = NULL;
+ mutex_unlock(&group->fdinfo.lock);
+
group_put(group);
return 0;
}
--
2.45.1
* [PATCH v4 3/4] drm/panthor: enable fdinfo for memory stats
2024-07-16 20:11 [PATCH v4 0/4] Support fdinfo runtime and memory stats on Panthor Adrián Larumbe
2024-07-16 20:11 ` [PATCH v4 1/4] drm/panthor: introduce job cycle and timestamp accounting Adrián Larumbe
2024-07-16 20:11 ` [PATCH v4 2/4] drm/panthor: add DRM fdinfo support Adrián Larumbe
@ 2024-07-16 20:11 ` Adrián Larumbe
2024-07-19 14:14 ` Steven Price
2024-07-16 20:11 ` [PATCH v4 4/4] drm/panthor: add sysfs knob for enabling job profiling Adrián Larumbe
3 siblings, 1 reply; 13+ messages in thread
From: Adrián Larumbe @ 2024-07-16 20:11 UTC (permalink / raw)
To: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Daniel Vetter
Cc: kernel, Adrián Larumbe, dri-devel, linux-kernel
Implement the DRM GEM object's status callback.
Also, we consider a PRIME-imported BO to be resident if its matching
dma_buf has an open attachment, which means its backing storage has
already been allocated.
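The status callback boils down to a two-condition residency test. A minimal standalone model of that logic; the field names mirror the patch, but the flag value and struct are simplified stand-ins for the DRM definitions:

```c
#include <assert.h>
#include <stdbool.h>

enum gem_status {
	GEM_OBJECT_RESIDENT = 1 << 0,	/* stand-in for DRM_GEM_OBJECT_RESIDENT */
};

/* Simplified view of the state the callback inspects. */
struct bo_state {
	bool has_import_attach;	/* PRIME import with a live attachment */
	bool has_pages;		/* shmem backing pages allocated */
};

/* A BO counts as resident if it is an attached PRIME import or its
 * shmem pages have been allocated. */
static enum gem_status bo_status(const struct bo_state *bo)
{
	enum gem_status res = 0;

	if (bo->has_import_attach || bo->has_pages)
		res |= GEM_OBJECT_RESIDENT;
	return res;
}
```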
Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
Reviewed-by: Liviu Dudau <liviu.dudau@arm.com>
---
drivers/gpu/drm/panthor/panthor_gem.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/drivers/gpu/drm/panthor/panthor_gem.c b/drivers/gpu/drm/panthor/panthor_gem.c
index 38f560864879..c60b599665d8 100644
--- a/drivers/gpu/drm/panthor/panthor_gem.c
+++ b/drivers/gpu/drm/panthor/panthor_gem.c
@@ -145,6 +145,17 @@ panthor_gem_prime_export(struct drm_gem_object *obj, int flags)
return drm_gem_prime_export(obj, flags);
}
+static enum drm_gem_object_status panthor_gem_status(struct drm_gem_object *obj)
+{
+ struct panthor_gem_object *bo = to_panthor_bo(obj);
+ enum drm_gem_object_status res = 0;
+
+ if (bo->base.base.import_attach || bo->base.pages)
+ res |= DRM_GEM_OBJECT_RESIDENT;
+
+ return res;
+}
+
static const struct drm_gem_object_funcs panthor_gem_funcs = {
.free = panthor_gem_free_object,
.print_info = drm_gem_shmem_object_print_info,
@@ -154,6 +165,7 @@ static const struct drm_gem_object_funcs panthor_gem_funcs = {
.vmap = drm_gem_shmem_object_vmap,
.vunmap = drm_gem_shmem_object_vunmap,
.mmap = panthor_gem_mmap,
+ .status = panthor_gem_status,
.export = panthor_gem_prime_export,
.vm_ops = &drm_gem_shmem_vm_ops,
};
--
2.45.1
* [PATCH v4 4/4] drm/panthor: add sysfs knob for enabling job profiling
2024-07-16 20:11 [PATCH v4 0/4] Support fdinfo runtime and memory stats on Panthor Adrián Larumbe
` (2 preceding siblings ...)
2024-07-16 20:11 ` [PATCH v4 3/4] drm/panthor: enable fdinfo for memory stats Adrián Larumbe
@ 2024-07-16 20:11 ` Adrián Larumbe
2024-07-19 14:15 ` Steven Price
3 siblings, 1 reply; 13+ messages in thread
From: Adrián Larumbe @ 2024-07-16 20:11 UTC (permalink / raw)
To: Boris Brezillon, Steven Price, Liviu Dudau, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Daniel Vetter
Cc: kernel, Adrián Larumbe, dri-devel, linux-kernel
This commit introduces a DRM device sysfs attribute that lets user space
control the job accounting status of the device. The knob variable was
brought in as part of a previous commit; now its value can be toggled
manually.
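The store handler relies on kstrtobool(), which accepts the usual 0/1/y/n/on/off spellings. A rough user-space approximation of the subset this knob needs; the real kernel helper handles more input forms:

```c
#include <assert.h>
#include <stdbool.h>

/* Parse "0"/"1"/"y"/"n" (optionally newline-terminated, as sysfs writes
 * usually are) into a bool; returns -1 on unrecognised input. */
static int parse_bool(const char *s, bool *out)
{
	switch (s[0]) {
	case '1': case 'y': case 'Y':
		*out = true;
		return 0;
	case '0': case 'n': case 'N':
		*out = false;
		return 0;
	default:
		return -1;
	}
}
```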
Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
---
drivers/gpu/drm/panthor/panthor_drv.c | 36 +++++++++++++++++++++++++++
1 file changed, 36 insertions(+)
diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
index 6a0c1a06a709..a2876310856f 100644
--- a/drivers/gpu/drm/panthor/panthor_drv.c
+++ b/drivers/gpu/drm/panthor/panthor_drv.c
@@ -1448,6 +1448,41 @@ static void panthor_remove(struct platform_device *pdev)
panthor_device_unplug(ptdev);
}
+static ssize_t profiling_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct panthor_device *ptdev = dev_get_drvdata(dev);
+
+ return sysfs_emit(buf, "%d\n", ptdev->profile_mode);
+}
+
+static ssize_t profiling_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t len)
+{
+ struct panthor_device *ptdev = dev_get_drvdata(dev);
+ bool value;
+ int err;
+
+ err = kstrtobool(buf, &value);
+ if (err)
+ return err;
+
+ ptdev->profile_mode = value;
+
+ return len;
+}
+
+static DEVICE_ATTR_RW(profiling);
+
+static struct attribute *panthor_attrs[] = {
+ &dev_attr_profiling.attr,
+ NULL,
+};
+
+ATTRIBUTE_GROUPS(panthor);
+
static const struct of_device_id dt_match[] = {
{ .compatible = "rockchip,rk3588-mali" },
{ .compatible = "arm,mali-valhall-csf" },
@@ -1467,6 +1502,7 @@ static struct platform_driver panthor_driver = {
.name = "panthor",
.pm = pm_ptr(&panthor_pm_ops),
.of_match_table = dt_match,
+ .dev_groups = panthor_groups,
},
};
--
2.45.1
* Re: [PATCH v4 1/4] drm/panthor: introduce job cycle and timestamp accounting
2024-07-16 20:11 ` [PATCH v4 1/4] drm/panthor: introduce job cycle and timestamp accounting Adrián Larumbe
@ 2024-07-19 14:14 ` Steven Price
2024-07-31 12:41 ` Adrián Larumbe
2024-08-19 12:22 ` Boris Brezillon
1 sibling, 1 reply; 13+ messages in thread
From: Steven Price @ 2024-07-19 14:14 UTC (permalink / raw)
To: Adrián Larumbe, Boris Brezillon, Liviu Dudau,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
Daniel Vetter
Cc: kernel, dri-devel, linux-kernel
On 16/07/2024 21:11, Adrián Larumbe wrote:
> Enable calculations of job submission times in clock cycles and wall
> time. This is done by expanding the boilerplate command stream when running
> a job to include instructions that compute said times right before and after
> a user CS.
>
> Those numbers are stored in the queue's group's sync objects BO, right
> after them. Because the queues in a group might have a different number of
> slots, one must keep track of the overall slot tally when reckoning the
> offset of a queue's time sample structs, one for each slot.
>
> This commit is done in preparation for enabling DRM fdinfo support in the
> Panthor driver, which depends on the numbers calculated herein.
>
> A profile mode device flag has been added that will in a future commit
> allow UM to toggle time sampling behaviour, which is disabled by default to
> save power. It also enables marking jobs as being profiled and picks one of
> two call instruction arrays to insert into the ring buffer. One of them
> includes FW logic to sample the timestamp and cycle counter registers and
> write them into the job's syncobj, and the other does not.
>
> A profiled job's call sequence takes up two ring buffer slots, and this is
> reflected when initialising the DRM scheduler for each queue, with a
> profiled job contributing twice as many credits.
>
> Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
Thanks for the updates, this looks better. A few minor comments below.
> ---
> drivers/gpu/drm/panthor/panthor_device.h | 2 +
> drivers/gpu/drm/panthor/panthor_sched.c | 244 ++++++++++++++++++++---
> 2 files changed, 216 insertions(+), 30 deletions(-)
>
> diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> index e388c0472ba7..3ede2f80df73 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.h
> +++ b/drivers/gpu/drm/panthor/panthor_device.h
> @@ -162,6 +162,8 @@ struct panthor_device {
> */
> struct page *dummy_latest_flush;
> } pm;
> +
> + bool profile_mode;
> };
>
> /**
> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> index 79ffcbc41d78..6438e5ea1f2b 100644
> --- a/drivers/gpu/drm/panthor/panthor_sched.c
> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> @@ -93,6 +93,9 @@
> #define MIN_CSGS 3
> #define MAX_CSG_PRIO 0xf
>
> +#define NUM_INSTRS_PER_SLOT 16
> +#define SLOTSIZE (NUM_INSTRS_PER_SLOT * sizeof(u64))
> +
> struct panthor_group;
>
> /**
> @@ -466,6 +469,9 @@ struct panthor_queue {
> */
> struct list_head in_flight_jobs;
> } fence_ctx;
> +
> + /** @time_offset: Offset of panthor_job_times structs in group's syncobj bo. */
> + unsigned long time_offset;
AFAICT this doesn't need to be stored. We could just pass this value
into group_create_queue() as an extra parameter where it's used.
> };
>
> /**
> @@ -592,7 +598,17 @@ struct panthor_group {
> * One sync object per queue. The position of the sync object is
> * determined by the queue index.
> */
> - struct panthor_kernel_bo *syncobjs;
> +
> + struct {
> + /** @bo: Kernel BO holding the sync objects. */
> + struct panthor_kernel_bo *bo;
> +
> + /**
> + * @job_times_offset: Beginning of panthor_job_times struct samples after
> + * the group's array of sync objects.
> + */
> + size_t job_times_offset;
> + } syncobjs;
>
> /** @state: Group state. */
> enum panthor_group_state state;
> @@ -651,6 +667,18 @@ struct panthor_group {
> struct list_head wait_node;
> };
>
> +struct panthor_job_times {
> + struct {
> + u64 before;
> + u64 after;
> + } cycles;
> +
> + struct {
> + u64 before;
> + u64 after;
> + } time;
> +};
> +
> /**
> * group_queue_work() - Queue a group work
> * @group: Group to queue the work for.
> @@ -730,6 +758,9 @@ struct panthor_job {
> /** @queue_idx: Index of the queue inside @group. */
> u32 queue_idx;
>
> + /** @ringbuf_idx: Index of the ringbuffer inside @queue. */
> + u32 ringbuf_idx;
> +
> /** @call_info: Information about the userspace command stream call. */
> struct {
> /** @start: GPU address of the userspace command stream. */
> @@ -764,6 +795,9 @@ struct panthor_job {
>
> /** @done_fence: Fence signaled when the job is finished or cancelled. */
> struct dma_fence *done_fence;
> +
> + /** @is_profiled: Whether timestamp and cycle numbers were gathered for this job */
> + bool is_profiled;
> };
>
> static void
> @@ -844,7 +878,7 @@ static void group_release_work(struct work_struct *work)
>
> panthor_kernel_bo_destroy(group->suspend_buf);
> panthor_kernel_bo_destroy(group->protm_suspend_buf);
> - panthor_kernel_bo_destroy(group->syncobjs);
> + panthor_kernel_bo_destroy(group->syncobjs.bo);
>
> panthor_vm_put(group->vm);
> kfree(group);
> @@ -1969,8 +2003,6 @@ tick_ctx_init(struct panthor_scheduler *sched,
> }
> }
>
> -#define NUM_INSTRS_PER_SLOT 16
> -
> static void
> group_term_post_processing(struct panthor_group *group)
> {
> @@ -2007,7 +2039,7 @@ group_term_post_processing(struct panthor_group *group)
> spin_unlock(&queue->fence_ctx.lock);
>
> /* Manually update the syncobj seqno to unblock waiters. */
> - syncobj = group->syncobjs->kmap + (i * sizeof(*syncobj));
> + syncobj = group->syncobjs.bo->kmap + (i * sizeof(*syncobj));
> syncobj->status = ~0;
> syncobj->seqno = atomic64_read(&queue->fence_ctx.seqno);
> sched_queue_work(group->ptdev->scheduler, sync_upd);
> @@ -2780,7 +2812,7 @@ static void group_sync_upd_work(struct work_struct *work)
> if (!queue)
> continue;
>
> - syncobj = group->syncobjs->kmap + (queue_idx * sizeof(*syncobj));
> + syncobj = group->syncobjs.bo->kmap + (queue_idx * sizeof(*syncobj));
>
> spin_lock(&queue->fence_ctx.lock);
> list_for_each_entry_safe(job, job_tmp, &queue->fence_ctx.in_flight_jobs, node) {
> @@ -2815,22 +2847,81 @@ queue_run_job(struct drm_sched_job *sched_job)
> struct panthor_scheduler *sched = ptdev->scheduler;
> u32 ringbuf_size = panthor_kernel_bo_size(queue->ringbuf);
> u32 ringbuf_insert = queue->iface.input->insert & (ringbuf_size - 1);
> + u32 ringbuf_index = ringbuf_insert / (SLOTSIZE);
> + bool ringbuf_wraparound =
> + job->is_profiled && ((ringbuf_size/SLOTSIZE) == ringbuf_index + 1);
> u64 addr_reg = ptdev->csif_info.cs_reg_count -
> ptdev->csif_info.unpreserved_cs_reg_count;
> u64 val_reg = addr_reg + 2;
> - u64 sync_addr = panthor_kernel_bo_gpuva(group->syncobjs) +
> - job->queue_idx * sizeof(struct panthor_syncobj_64b);
> + u64 cycle_reg = addr_reg;
> + u64 time_reg = val_reg;
> + u64 sync_addr = panthor_kernel_bo_gpuva(group->syncobjs.bo) +
> + job->queue_idx * sizeof(struct panthor_syncobj_64b);
> + u64 times_addr = panthor_kernel_bo_gpuva(group->syncobjs.bo) + queue->time_offset +
> + (ringbuf_index * sizeof(struct panthor_job_times));
> + size_t call_insrt_size;
NIT: s/insrt/instr/
> + u64 *call_instrs;
> +
> u32 waitall_mask = GENMASK(sched->sb_slot_count - 1, 0);
> struct dma_fence *done_fence;
> int ret;
>
> - u64 call_instrs[NUM_INSTRS_PER_SLOT] = {
> + u64 call_instrs_simple[NUM_INSTRS_PER_SLOT] = {
> + /* MOV32 rX+2, cs.latest_flush */
> + (2ull << 56) | (val_reg << 48) | job->call_info.latest_flush,
> +
> + /* FLUSH_CACHE2.clean_inv_all.no_wait.signal(0) rX+2 */
> + (36ull << 56) | (0ull << 48) | (val_reg << 40) | (0 << 16) | 0x233,
> +
> + /* MOV48 rX:rX+1, cs.start */
> + (1ull << 56) | (addr_reg << 48) | job->call_info.start,
> +
> + /* MOV32 rX+2, cs.size */
> + (2ull << 56) | (val_reg << 48) | job->call_info.size,
> +
> + /* WAIT(0) => waits for FLUSH_CACHE2 instruction */
> + (3ull << 56) | (1 << 16),
> +
> + /* CALL rX:rX+1, rX+2 */
> + (32ull << 56) | (addr_reg << 40) | (val_reg << 32),
> +
> + /* MOV48 rX:rX+1, sync_addr */
> + (1ull << 56) | (addr_reg << 48) | sync_addr,
> +
> + /* MOV48 rX+2, #1 */
> + (1ull << 56) | (val_reg << 48) | 1,
> +
> + /* WAIT(all) */
> + (3ull << 56) | (waitall_mask << 16),
> +
> + /* SYNC_ADD64.system_scope.propage_err.nowait rX:rX+1, rX+2*/
> + (51ull << 56) | (0ull << 48) | (addr_reg << 40) | (val_reg << 32) | (0 << 16) | 1,
> +
> + /* ERROR_BARRIER, so we can recover from faults at job
> + * boundaries.
> + */
> + (47ull << 56),
> + };
> +
> + u64 call_instrs_profile[NUM_INSTRS_PER_SLOT*2] = {
> /* MOV32 rX+2, cs.latest_flush */
> (2ull << 56) | (val_reg << 48) | job->call_info.latest_flush,
>
> /* FLUSH_CACHE2.clean_inv_all.no_wait.signal(0) rX+2 */
> (36ull << 56) | (0ull << 48) | (val_reg << 40) | (0 << 16) | 0x233,
>
> + /* MOV48 rX:rX+1, cycles_offset */
> + (1ull << 56) | (cycle_reg << 48) | (times_addr + offsetof(struct panthor_job_times, cycles.before)),
> +
> + /* MOV48 rX:rX+1, time_offset */
> + (1ull << 56) | (time_reg << 48) | (times_addr + offsetof(struct panthor_job_times, time.before)),
> +
> + /* STORE_STATE cycles */
> + (40ull << 56) | (cycle_reg << 40) | (1ll << 32),
> +
> + /* STORE_STATE timer */
> + (40ull << 56) | (time_reg << 40) | (0ll << 32),
> +
> /* MOV48 rX:rX+1, cs.start */
> (1ull << 56) | (addr_reg << 48) | job->call_info.start,
>
> @@ -2843,6 +2934,18 @@ queue_run_job(struct drm_sched_job *sched_job)
> /* CALL rX:rX+1, rX+2 */
> (32ull << 56) | (addr_reg << 40) | (val_reg << 32),
>
> + /* MOV48 rX:rX+1, cycles_offset */
> + (1ull << 56) | (cycle_reg << 48) | (times_addr + offsetof(struct panthor_job_times, cycles.after)),
> +
> + /* MOV48 rX:rX+1, time_offset */
> + (1ull << 56) | (time_reg << 48) | (times_addr + offsetof(struct panthor_job_times, time.after)),
> +
> + /* STORE_STATE cycles */
> + (40ull << 56) | (cycle_reg << 40) | (1ll << 32),
> +
> + /* STORE_STATE timer */
> + (40ull << 56) | (time_reg << 40) | (0ll << 32),
> +
> /* MOV48 rX:rX+1, sync_addr */
> (1ull << 56) | (addr_reg << 48) | sync_addr,
>
> @@ -2862,9 +2965,18 @@ queue_run_job(struct drm_sched_job *sched_job)
> };
>
> /* Need to be cacheline aligned to please the prefetcher. */
> - static_assert(sizeof(call_instrs) % 64 == 0,
> + static_assert(sizeof(call_instrs_simple) % 64 == 0 && sizeof(call_instrs_profile) % 64 == 0,
> "call_instrs is not aligned on a cacheline");
>
> + if (job->is_profiled) {
> + call_instrs = call_instrs_profile;
> + call_insrt_size = sizeof(call_instrs_profile);
> +
> + } else {
> + call_instrs = call_instrs_simple;
> + call_insrt_size = sizeof(call_instrs_simple);
> + }
> +
> /* Stream size is zero, nothing to do => return a NULL fence and let
> * drm_sched signal the parent.
> */
> @@ -2887,8 +2999,23 @@ queue_run_job(struct drm_sched_job *sched_job)
> queue->fence_ctx.id,
> atomic64_inc_return(&queue->fence_ctx.seqno));
>
> - memcpy(queue->ringbuf->kmap + ringbuf_insert,
> - call_instrs, sizeof(call_instrs));
> + /*
> + * Need to handle the wrap-around case when copying profiled instructions
> + * from an odd-indexed slot. The reason this can happen is user space is
> + * able to control the profiling status of the driver through a sysfs
> + * knob, so this might lead to a timestamp and cycles-profiling call
> + * instruction stream beginning at an odd-number slot. The GPU should
> + * be able to gracefully handle this.
> + */
> + if (!ringbuf_wraparound) {
> + memcpy(queue->ringbuf->kmap + ringbuf_insert,
> + call_instrs, call_insrt_size);
> + } else {
> + memcpy(queue->ringbuf->kmap + ringbuf_insert,
> + call_instrs, call_insrt_size/2);
> + memcpy(queue->ringbuf->kmap, call_instrs +
> + NUM_INSTRS_PER_SLOT, call_insrt_size/2);
> + }
A minor point, but I think it would just be easier to always do the copy
in two parts:
memcpy(queue->ringbuf->kmap + ringbuf_insert,
call_instrs, SLOTSIZE);
if (profiling) {
ringbuf_insert += SLOTSIZE;
ringbuf_insert &= (ringbuf_size - 1);
memcpy(queue->ringbuf->kmap + ringbuf_insert,
call_instrs + NUM_INSTRS_PER_SLOT,
SLOTSIZE);
}
>
> panthor_job_get(&job->base);
> spin_lock(&queue->fence_ctx.lock);
> @@ -2896,7 +3023,8 @@ queue_run_job(struct drm_sched_job *sched_job)
> spin_unlock(&queue->fence_ctx.lock);
>
> job->ringbuf.start = queue->iface.input->insert;
> - job->ringbuf.end = job->ringbuf.start + sizeof(call_instrs);
> + job->ringbuf.end = job->ringbuf.start + call_insrt_size;
> + job->ringbuf_idx = ringbuf_index;
>
> /* Make sure the ring buffer is updated before the INSERT
> * register.
> @@ -2987,7 +3115,8 @@ static const struct drm_sched_backend_ops panthor_queue_sched_ops = {
>
> static struct panthor_queue *
> group_create_queue(struct panthor_group *group,
> - const struct drm_panthor_queue_create *args)
> + const struct drm_panthor_queue_create *args,
> + unsigned int slots_so_far)
I'm not sure I like the name "slots_so_far", "slot_offset" maybe?
Although my main gripe is below...
> {
> struct drm_gpu_scheduler *drm_sched;
> struct panthor_queue *queue;
> @@ -3038,9 +3167,17 @@ group_create_queue(struct panthor_group *group,
> goto err_free_queue;
> }
>
> + queue->time_offset = group->syncobjs.job_times_offset +
> + (slots_so_far * sizeof(struct panthor_job_times));
> +
> + /*
> + * Credit limit argument tells us the total number of instructions
> + * across all CS slots in the ringbuffer, with some jobs requiring
> + * twice as many as others, depending on their profiling status.
> + */
> ret = drm_sched_init(&queue->scheduler, &panthor_queue_sched_ops,
> group->ptdev->scheduler->wq, 1,
> - args->ringbuf_size / (NUM_INSTRS_PER_SLOT * sizeof(u64)),
> + args->ringbuf_size / sizeof(u64),
> 0, msecs_to_jiffies(JOB_TIMEOUT_MS),
> group->ptdev->reset.wq,
> NULL, "panthor-queue", group->ptdev->base.dev);
> @@ -3068,7 +3205,9 @@ int panthor_group_create(struct panthor_file *pfile,
> struct panthor_scheduler *sched = ptdev->scheduler;
> struct panthor_fw_csg_iface *csg_iface = panthor_fw_get_csg_iface(ptdev, 0);
> struct panthor_group *group = NULL;
> + unsigned int total_slots;
> u32 gid, i, suspend_size;
> + size_t syncobj_bo_size;
> int ret;
>
> if (group_args->pad)
> @@ -3134,33 +3273,75 @@ int panthor_group_create(struct panthor_file *pfile,
> goto err_put_group;
> }
>
> - group->syncobjs = panthor_kernel_bo_create(ptdev, group->vm,
> - group_args->queues.count *
> - sizeof(struct panthor_syncobj_64b),
> - DRM_PANTHOR_BO_NO_MMAP,
> - DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
> - DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
> - PANTHOR_VM_KERNEL_AUTO_VA);
> - if (IS_ERR(group->syncobjs)) {
> - ret = PTR_ERR(group->syncobjs);
> + /*
> + * Need to add size for the panthor_job_times structs, as many as the sum
> + * of the number of job slots for every single queue ringbuffer.
> + */
> + for (i = 0, total_slots = 0; i < group_args->queues.count; i++)
> + total_slots += (queue_args[i].ringbuf_size / (SLOTSIZE));
> +
> + syncobj_bo_size = (group_args->queues.count * sizeof(struct panthor_syncobj_64b))
> + + (total_slots * sizeof(struct panthor_job_times));
> +
> + /*
> + * Memory layout of group's syncobjs BO
> + * group->syncobjs.bo {
> + * struct panthor_syncobj_64b sync1;
> + * struct panthor_syncobj_64b sync2;
> + * ...
> + * As many as group_args->queues.count
> + * ...
> + * struct panthor_syncobj_64b syncn;
> + * struct panthor_job_times queue1_slot1
> + * struct panthor_job_times queue1_slot2
> + * ...
> + * As many as queue[i].ringbuf_size / SLOTSIZE
> + * ...
> + * struct panthor_job_times queue1_slotP
> + * ...
> + * As many as group_args->queues.count
> + * ...
> + * struct panthor_job_times queueN_slot1
> + * struct panthor_job_times queueN_slot2
> + * ...
> + * As many as queue[n].ringbuf_size / SLOTSIZE
> + * struct panthor_job_times queueN_slotQ
> + *
> + * Linearly, group->syncobjs.bo = {syncojb1,..,syncobjN,
> + * {queue1 = {js1,..,jsP},..,queueN = {js1,..,jsQ}}}
> + * }
> + *
> + */
> +
> + group->syncobjs.bo = panthor_kernel_bo_create(ptdev, group->vm,
> + syncobj_bo_size,
> + DRM_PANTHOR_BO_NO_MMAP,
> + DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
> + DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
> + PANTHOR_VM_KERNEL_AUTO_VA);
> + if (IS_ERR(group->syncobjs.bo)) {
> + ret = PTR_ERR(group->syncobjs.bo);
> goto err_put_group;
> }
>
> - ret = panthor_kernel_bo_vmap(group->syncobjs);
> + ret = panthor_kernel_bo_vmap(group->syncobjs.bo);
> if (ret)
> goto err_put_group;
>
> - memset(group->syncobjs->kmap, 0,
> - group_args->queues.count * sizeof(struct panthor_syncobj_64b));
> + memset(group->syncobjs.bo->kmap, 0, syncobj_bo_size);
>
> - for (i = 0; i < group_args->queues.count; i++) {
> - group->queues[i] = group_create_queue(group, &queue_args[i]);
> + group->syncobjs.job_times_offset =
> + group_args->queues.count * sizeof(struct panthor_syncobj_64b);
> +
> + for (i = 0, total_slots = 0; i < group_args->queues.count; i++) {
I definitely don't like this usage of total_slots. Here "slots_so_far"
would be a better name (which would justify the parameter name above),
and would avoid the confusion that total_slots already had a value
(which was indeed the total slots) but is now being reused for a running
counter.
Steve
> + group->queues[i] = group_create_queue(group, &queue_args[i], total_slots);
> if (IS_ERR(group->queues[i])) {
> ret = PTR_ERR(group->queues[i]);
> group->queues[i] = NULL;
> goto err_put_group;
> }
>
> + total_slots += (queue_args[i].ringbuf_size / (SLOTSIZE));
> group->queue_count++;
> }
>
> @@ -3384,9 +3565,12 @@ panthor_job_create(struct panthor_file *pfile,
> goto err_put_job;
> }
>
> + job->is_profiled = pfile->ptdev->profile_mode;
> +
> ret = drm_sched_job_init(&job->base,
> &job->group->queues[job->queue_idx]->entity,
> - 1, job->group);
> + job->is_profiled ? NUM_INSTRS_PER_SLOT * 2 :
> + NUM_INSTRS_PER_SLOT, job->group);
Is there a good reason to make the credits the number of instructions,
rather than the number of slots? If we were going to count the actual
number of non-NOP instructions then there would be some logic (although
I'm not convinced that makes much sense), but considering we only allow
submission in "slot granules" we might as well use that as the unit of
"credit".
Steve
> if (ret)
> goto err_put_job;
>
* Re: [PATCH v4 2/4] drm/panthor: add DRM fdinfo support
2024-07-16 20:11 ` [PATCH v4 2/4] drm/panthor: add DRM fdinfo support Adrián Larumbe
@ 2024-07-19 14:14 ` Steven Price
0 siblings, 0 replies; 13+ messages in thread
From: Steven Price @ 2024-07-19 14:14 UTC (permalink / raw)
To: Adrián Larumbe, Boris Brezillon, Liviu Dudau,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
Daniel Vetter
Cc: kernel, dri-devel, linux-kernel
On 16/07/2024 21:11, Adrián Larumbe wrote:
> Drawing from the FW-calculated values in the previous commit, we can
> increase the numbers for an open file by collecting them from finished jobs
> when updating their group synchronisation objects.
>
> Display of fdinfo key-value pairs is governed by a flag that is disabled by
> default in the present commit; support for toggling it manually will be
> added in a later commit.
>
> Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Steve
> ---
> drivers/gpu/drm/panthor/panthor_devfreq.c | 18 ++++++++-
> drivers/gpu/drm/panthor/panthor_device.h | 10 +++++
> drivers/gpu/drm/panthor/panthor_drv.c | 33 ++++++++++++++++
> drivers/gpu/drm/panthor/panthor_sched.c | 47 +++++++++++++++++++++++
> 4 files changed, 107 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/panthor/panthor_devfreq.c b/drivers/gpu/drm/panthor/panthor_devfreq.c
> index c6d3c327cc24..9d0f891b9b53 100644
> --- a/drivers/gpu/drm/panthor/panthor_devfreq.c
> +++ b/drivers/gpu/drm/panthor/panthor_devfreq.c
> @@ -62,14 +62,20 @@ static void panthor_devfreq_update_utilization(struct panthor_devfreq *pdevfreq)
> static int panthor_devfreq_target(struct device *dev, unsigned long *freq,
> u32 flags)
> {
> + struct panthor_device *ptdev = dev_get_drvdata(dev);
> struct dev_pm_opp *opp;
> + int err;
>
> opp = devfreq_recommended_opp(dev, freq, flags);
> if (IS_ERR(opp))
> return PTR_ERR(opp);
> dev_pm_opp_put(opp);
>
> - return dev_pm_opp_set_rate(dev, *freq);
> + err = dev_pm_opp_set_rate(dev, *freq);
> + if (!err)
> + ptdev->current_frequency = *freq;
> +
> + return err;
> }
>
> static void panthor_devfreq_reset(struct panthor_devfreq *pdevfreq)
> @@ -130,6 +136,7 @@ int panthor_devfreq_init(struct panthor_device *ptdev)
> struct panthor_devfreq *pdevfreq;
> struct dev_pm_opp *opp;
> unsigned long cur_freq;
> + unsigned long freq = ULONG_MAX;
> int ret;
>
> pdevfreq = drmm_kzalloc(&ptdev->base, sizeof(*ptdev->devfreq), GFP_KERNEL);
> @@ -161,6 +168,7 @@ int panthor_devfreq_init(struct panthor_device *ptdev)
> return PTR_ERR(opp);
>
> panthor_devfreq_profile.initial_freq = cur_freq;
> + ptdev->current_frequency = cur_freq;
>
> /* Regulator coupling only takes care of synchronizing/balancing voltage
> * updates, but the coupled regulator needs to be enabled manually.
> @@ -204,6 +212,14 @@ int panthor_devfreq_init(struct panthor_device *ptdev)
>
> dev_pm_opp_put(opp);
>
> + /* Find the fastest defined rate */
> + opp = dev_pm_opp_find_freq_floor(dev, &freq);
> + if (IS_ERR(opp))
> + return PTR_ERR(opp);
> + ptdev->fast_rate = freq;
> +
> + dev_pm_opp_put(opp);
> +
> /*
> * Setup default thresholds for the simple_ondemand governor.
> * The values are chosen based on experiments.
> diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> index 3ede2f80df73..4536fbf43a4e 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.h
> +++ b/drivers/gpu/drm/panthor/panthor_device.h
> @@ -163,9 +163,16 @@ struct panthor_device {
> struct page *dummy_latest_flush;
> } pm;
>
> + unsigned long current_frequency;
> + unsigned long fast_rate;
> bool profile_mode;
> };
>
> +struct panthor_gpu_usage {
> + u64 time;
> + u64 cycles;
> +};
> +
> /**
> * struct panthor_file - Panthor file
> */
> @@ -178,6 +185,9 @@ struct panthor_file {
>
> /** @groups: Scheduling group pool attached to this file. */
> struct panthor_group_pool *groups;
> +
> + /** @stats: cycle and timestamp measures for job execution. */
> + struct panthor_gpu_usage stats;
> };
>
> int panthor_device_init(struct panthor_device *ptdev);
> diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
> index b8a84f26b3ef..6a0c1a06a709 100644
> --- a/drivers/gpu/drm/panthor/panthor_drv.c
> +++ b/drivers/gpu/drm/panthor/panthor_drv.c
> @@ -3,12 +3,17 @@
> /* Copyright 2019 Linaro, Ltd., Rob Herring <robh@kernel.org> */
> /* Copyright 2019 Collabora ltd. */
>
> +#ifdef CONFIG_ARM_ARCH_TIMER
> +#include <asm/arch_timer.h>
> +#endif
> +
> #include <linux/list.h>
> #include <linux/module.h>
> #include <linux/of_platform.h>
> #include <linux/pagemap.h>
> #include <linux/platform_device.h>
> #include <linux/pm_runtime.h>
> +#include <linux/time64.h>
>
> #include <drm/drm_debugfs.h>
> #include <drm/drm_drv.h>
> @@ -1351,6 +1356,32 @@ static int panthor_mmap(struct file *filp, struct vm_area_struct *vma)
> return ret;
> }
>
> +static void panthor_gpu_show_fdinfo(struct panthor_device *ptdev,
> + struct panthor_file *pfile,
> + struct drm_printer *p)
> +{
> + if (ptdev->profile_mode) {
> +#ifdef CONFIG_ARM_ARCH_TIMER
> + drm_printf(p, "drm-engine-panthor:\t%llu ns\n",
> + DIV_ROUND_UP_ULL((pfile->stats.time * NSEC_PER_SEC),
> + arch_timer_get_cntfrq()));
> +#endif
> + drm_printf(p, "drm-cycles-panthor:\t%llu\n", pfile->stats.cycles);
> + }
> + drm_printf(p, "drm-maxfreq-panthor:\t%lu Hz\n", ptdev->fast_rate);
> + drm_printf(p, "drm-curfreq-panthor:\t%lu Hz\n", ptdev->current_frequency);
> +}
> +
> +static void panthor_show_fdinfo(struct drm_printer *p, struct drm_file *file)
> +{
> + struct drm_device *dev = file->minor->dev;
> + struct panthor_device *ptdev = container_of(dev, struct panthor_device, base);
> +
> + panthor_gpu_show_fdinfo(ptdev, file->driver_priv, p);
> +
> + drm_show_memory_stats(p, file);
> +}
> +
> static const struct file_operations panthor_drm_driver_fops = {
> .open = drm_open,
> .release = drm_release,
> @@ -1360,6 +1391,7 @@ static const struct file_operations panthor_drm_driver_fops = {
> .read = drm_read,
> .llseek = noop_llseek,
> .mmap = panthor_mmap,
> + .show_fdinfo = drm_show_fdinfo,
> };
>
> #ifdef CONFIG_DEBUG_FS
> @@ -1378,6 +1410,7 @@ static const struct drm_driver panthor_drm_driver = {
> DRIVER_SYNCOBJ_TIMELINE | DRIVER_GEM_GPUVA,
> .open = panthor_open,
> .postclose = panthor_postclose,
> + .show_fdinfo = panthor_show_fdinfo,
> .ioctls = panthor_drm_driver_ioctls,
> .num_ioctls = ARRAY_SIZE(panthor_drm_driver_ioctls),
> .fops = &panthor_drm_driver_fops,
> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> index 6438e5ea1f2b..4fb6fc5c2314 100644
> --- a/drivers/gpu/drm/panthor/panthor_sched.c
> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> @@ -610,6 +610,18 @@ struct panthor_group {
> size_t job_times_offset;
> } syncobjs;
>
> + /** @fdinfo: Per-file total cycle and timestamp values reference. */
> + struct {
> + /** @data: Pointer to actual per-file sample data. */
> + struct panthor_gpu_usage *data;
> +
> + /**
> + * @lock: Mutex to govern concurrent access from drm file's fdinfo callback
> + * and job post-completion processing function
> + */
> + struct mutex lock;
> + } fdinfo;
> +
> /** @state: Group state. */
> enum panthor_group_state state;
>
> @@ -873,6 +885,8 @@ static void group_release_work(struct work_struct *work)
> release_work);
> u32 i;
>
> + mutex_destroy(&group->fdinfo.lock);
> +
> for (i = 0; i < group->queue_count; i++)
> group_free_queue(group, group->queues[i]);
>
> @@ -2795,6 +2809,30 @@ void panthor_sched_post_reset(struct panthor_device *ptdev, bool reset_failed)
> }
> }
>
> +static void update_fdinfo_stats(struct panthor_job *job)
> +{
> + struct panthor_group *group = job->group;
> + struct panthor_queue *queue = group->queues[job->queue_idx];
> + struct panthor_device *ptdev = group->ptdev;
> + struct panthor_gpu_usage *fdinfo;
> + struct panthor_job_times *times;
> +
> + drm_WARN_ON(&ptdev->base, job->ringbuf_idx >=
> + panthor_kernel_bo_size(queue->ringbuf) / (SLOTSIZE));
> +
> + times = (struct panthor_job_times *)
> + ((unsigned long)group->syncobjs.bo->kmap + queue->time_offset +
> + (job->ringbuf_idx * sizeof(struct panthor_job_times)));
> +
> + mutex_lock(&group->fdinfo.lock);
> + if ((group->fdinfo.data)) {
> + fdinfo = group->fdinfo.data;
> + fdinfo->cycles += times->cycles.after - times->cycles.before;
> + fdinfo->time += times->time.after - times->time.before;
> + }
> + mutex_unlock(&group->fdinfo.lock);
> +}
> +
> static void group_sync_upd_work(struct work_struct *work)
> {
> struct panthor_group *group =
> @@ -2830,6 +2868,8 @@ static void group_sync_upd_work(struct work_struct *work)
> dma_fence_end_signalling(cookie);
>
> list_for_each_entry_safe(job, job_tmp, &done_jobs, node) {
> + if (job->is_profiled)
> + update_fdinfo_stats(job);
> list_del_init(&job->node);
> panthor_job_put(&job->base);
> }
> @@ -3362,6 +3402,9 @@ int panthor_group_create(struct panthor_file *pfile,
> }
> mutex_unlock(&sched->reset.lock);
>
> + group->fdinfo.data = &pfile->stats;
> + mutex_init(&group->fdinfo.lock);
> +
> return gid;
>
> err_put_group:
> @@ -3401,6 +3444,10 @@ int panthor_group_destroy(struct panthor_file *pfile, u32 group_handle)
> mutex_unlock(&sched->lock);
> mutex_unlock(&sched->reset.lock);
>
> + mutex_lock(&group->fdinfo.lock);
> + group->fdinfo.data = NULL;
> + mutex_unlock(&group->fdinfo.lock);
> +
> group_put(group);
> return 0;
> }
* Re: [PATCH v4 3/4] drm/panthor: enable fdinfo for memory stats
2024-07-16 20:11 ` [PATCH v4 3/4] drm/panthor: enable fdinfo for memory stats Adrián Larumbe
@ 2024-07-19 14:14 ` Steven Price
0 siblings, 0 replies; 13+ messages in thread
From: Steven Price @ 2024-07-19 14:14 UTC (permalink / raw)
To: Adrián Larumbe, Boris Brezillon, Liviu Dudau,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
Daniel Vetter
Cc: kernel, dri-devel, linux-kernel
On 16/07/2024 21:11, Adrián Larumbe wrote:
> Implement drm object's status callback.
>
> Also, we consider a PRIME imported BO to be resident if its matching
> dma_buf has an open attachment, which means its backing storage had already
> been allocated.
>
> Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
> Reviewed-by: Liviu Dudau <liviu.dudau@arm.com>
Reviewed-by: Steven Price <steven.price@arm.com>
> ---
> drivers/gpu/drm/panthor/panthor_gem.c | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/drivers/gpu/drm/panthor/panthor_gem.c b/drivers/gpu/drm/panthor/panthor_gem.c
> index 38f560864879..c60b599665d8 100644
> --- a/drivers/gpu/drm/panthor/panthor_gem.c
> +++ b/drivers/gpu/drm/panthor/panthor_gem.c
> @@ -145,6 +145,17 @@ panthor_gem_prime_export(struct drm_gem_object *obj, int flags)
> return drm_gem_prime_export(obj, flags);
> }
>
> +static enum drm_gem_object_status panthor_gem_status(struct drm_gem_object *obj)
> +{
> + struct panthor_gem_object *bo = to_panthor_bo(obj);
> + enum drm_gem_object_status res = 0;
> +
> + if (bo->base.base.import_attach || bo->base.pages)
> + res |= DRM_GEM_OBJECT_RESIDENT;
> +
> + return res;
> +}
> +
> static const struct drm_gem_object_funcs panthor_gem_funcs = {
> .free = panthor_gem_free_object,
> .print_info = drm_gem_shmem_object_print_info,
> @@ -154,6 +165,7 @@ static const struct drm_gem_object_funcs panthor_gem_funcs = {
> .vmap = drm_gem_shmem_object_vmap,
> .vunmap = drm_gem_shmem_object_vunmap,
> .mmap = panthor_gem_mmap,
> + .status = panthor_gem_status,
> .export = panthor_gem_prime_export,
> .vm_ops = &drm_gem_shmem_vm_ops,
> };
* Re: [PATCH v4 4/4] drm/panthor: add sysfs knob for enabling job profiling
2024-07-16 20:11 ` [PATCH v4 4/4] drm/panthor: add sysfs knob for enabling job profiling Adrián Larumbe
@ 2024-07-19 14:15 ` Steven Price
0 siblings, 0 replies; 13+ messages in thread
From: Steven Price @ 2024-07-19 14:15 UTC (permalink / raw)
To: Adrián Larumbe, Boris Brezillon, Liviu Dudau,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
Daniel Vetter
Cc: kernel, dri-devel, linux-kernel
On 16/07/2024 21:11, Adrián Larumbe wrote:
> This commit introduces a DRM device sysfs attribute that lets UM control
> the job accounting status of the device. The knob variable was brought in
> as part of a previous commit, and now user space is able to toggle it manually.
>
> Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Although we should probably copy/paste
Documentation/ABI/testing/sysfs-driver-panfrost-profiling - or at least
mention somewhere that the same knob is available for panthor.
Steve
> ---
> drivers/gpu/drm/panthor/panthor_drv.c | 36 +++++++++++++++++++++++++++
> 1 file changed, 36 insertions(+)
>
> diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
> index 6a0c1a06a709..a2876310856f 100644
> --- a/drivers/gpu/drm/panthor/panthor_drv.c
> +++ b/drivers/gpu/drm/panthor/panthor_drv.c
> @@ -1448,6 +1448,41 @@ static void panthor_remove(struct platform_device *pdev)
> panthor_device_unplug(ptdev);
> }
>
> +static ssize_t profiling_show(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + struct panthor_device *ptdev = dev_get_drvdata(dev);
> +
> + return sysfs_emit(buf, "%d\n", ptdev->profile_mode);
> +}
> +
> +static ssize_t profiling_store(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf, size_t len)
> +{
> + struct panthor_device *ptdev = dev_get_drvdata(dev);
> + bool value;
> + int err;
> +
> + err = kstrtobool(buf, &value);
> + if (err)
> + return err;
> +
> + ptdev->profile_mode = value;
> +
> + return len;
> +}
> +
> +static DEVICE_ATTR_RW(profiling);
> +
> +static struct attribute *panthor_attrs[] = {
> + &dev_attr_profiling.attr,
> + NULL,
> +};
> +
> +ATTRIBUTE_GROUPS(panthor);
> +
> static const struct of_device_id dt_match[] = {
> { .compatible = "rockchip,rk3588-mali" },
> { .compatible = "arm,mali-valhall-csf" },
> @@ -1467,6 +1502,7 @@ static struct platform_driver panthor_driver = {
> .name = "panthor",
> .pm = pm_ptr(&panthor_pm_ops),
> .of_match_table = dt_match,
> + .dev_groups = panthor_groups,
> },
> };
>
* Re: [PATCH v4 1/4] drm/panthor: introduce job cycle and timestamp accounting
2024-07-19 14:14 ` Steven Price
@ 2024-07-31 12:41 ` Adrián Larumbe
2024-08-19 7:48 ` Steven Price
0 siblings, 1 reply; 13+ messages in thread
From: Adrián Larumbe @ 2024-07-31 12:41 UTC (permalink / raw)
To: Steven Price
Cc: Boris Brezillon, Liviu Dudau, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, David Airlie, Daniel Vetter, kernel, dri-devel,
linux-kernel
Hi Steven, thanks for the remarks.
On 19.07.2024 15:14, Steven Price wrote:
> On 16/07/2024 21:11, Adrián Larumbe wrote:
> > Enable calculations of job submission times in clock cycles and wall
> > time. This is done by expanding the boilerplate command stream when running
> > a job to include instructions that compute said times right before and
> > after a user CS.
> >
> > Those numbers are stored in the queue's group's sync objects BO, right
> > after them. Because the queues in a group might have a different number of
> > slots, one must keep track of the overall slot tally when reckoning the
> > offset of a queue's time sample structs, one for each slot.
> >
> > This commit is done in preparation for enabling DRM fdinfo support in the
> > Panthor driver, which depends on the numbers calculated herein.
> >
> > A profile mode device flag has been added that will in a future commit
> > allow UM to toggle time sampling behaviour, which is disabled by default to
> > save power. It also enables marking jobs as being profiled and picks one of
> > two call instruction arrays to insert into the ring buffer. One of them
> > includes FW logic to sample the timestamp and cycle counter registers and
> > write them into the job's syncobj, and the other does not.
> >
> > A profiled job's call sequence takes up two ring buffer slots, and this is
> > reflected when initialising the DRM scheduler for each queue, with a
> > profiled job contributing twice as many credits.
> >
> > Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
>
> Thanks for the updates, this looks better. A few minor comments below.
>
> > ---
> > drivers/gpu/drm/panthor/panthor_device.h | 2 +
> > drivers/gpu/drm/panthor/panthor_sched.c | 244 ++++++++++++++++++++---
> > 2 files changed, 216 insertions(+), 30 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> > index e388c0472ba7..3ede2f80df73 100644
> > --- a/drivers/gpu/drm/panthor/panthor_device.h
> > +++ b/drivers/gpu/drm/panthor/panthor_device.h
> > @@ -162,6 +162,8 @@ struct panthor_device {
> > */
> > struct page *dummy_latest_flush;
> > } pm;
> > +
> > + bool profile_mode;
> > };
> >
> > /**
> > diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> > index 79ffcbc41d78..6438e5ea1f2b 100644
> > --- a/drivers/gpu/drm/panthor/panthor_sched.c
> > +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> > @@ -93,6 +93,9 @@
> > #define MIN_CSGS 3
> > #define MAX_CSG_PRIO 0xf
> >
> > +#define NUM_INSTRS_PER_SLOT 16
> > +#define SLOTSIZE (NUM_INSTRS_PER_SLOT * sizeof(u64))
> > +
> > struct panthor_group;
> >
> > /**
> > @@ -466,6 +469,9 @@ struct panthor_queue {
> > */
> > struct list_head in_flight_jobs;
> > } fence_ctx;
> > +
> > + /** @time_offset: Offset of panthor_job_times structs in group's syncobj bo. */
> > + unsigned long time_offset;
>
> AFAICT this doesn't need to be stored. We could just pass this value
> into group_create_queue() as an extra parameter where it's used.
I think we need to keep this offset value around, because queues within the same group
can have different numbers of slots. When fetching the sampled values from the
syncobjs BO in update_fdinfo_stats, we would otherwise have to traverse the entire
array of preceding queues and work out their size in slots to skip over the
corresponding struct panthor_job_times entries after the syncobj array.
> > };
> >
> > /**
> > @@ -592,7 +598,17 @@ struct panthor_group {
> > * One sync object per queue. The position of the sync object is
> > * determined by the queue index.
> > */
> > - struct panthor_kernel_bo *syncobjs;
> > +
> > + struct {
> > + /** @bo: Kernel BO holding the sync objects. */
> > + struct panthor_kernel_bo *bo;
> > +
> > + /**
> > + * @job_times_offset: Beginning of panthor_job_times struct samples after
> > + * the group's array of sync objects.
> > + */
> > + size_t job_times_offset;
> > + } syncobjs;
> >
> > /** @state: Group state. */
> > enum panthor_group_state state;
> > @@ -651,6 +667,18 @@ struct panthor_group {
> > struct list_head wait_node;
> > };
> >
> > +struct panthor_job_times {
> > + struct {
> > + u64 before;
> > + u64 after;
> > + } cycles;
> > +
> > + struct {
> > + u64 before;
> > + u64 after;
> > + } time;
> > +};
> > +
> > /**
> > * group_queue_work() - Queue a group work
> > * @group: Group to queue the work for.
> > @@ -730,6 +758,9 @@ struct panthor_job {
> > /** @queue_idx: Index of the queue inside @group. */
> > u32 queue_idx;
> >
> > + /** @ringbuf_idx: Index of the ringbuffer inside @queue. */
> > + u32 ringbuf_idx;
> > +
> > /** @call_info: Information about the userspace command stream call. */
> > struct {
> > /** @start: GPU address of the userspace command stream. */
> > @@ -764,6 +795,9 @@ struct panthor_job {
> >
> > /** @done_fence: Fence signaled when the job is finished or cancelled. */
> > struct dma_fence *done_fence;
> > +
> > + /** @is_profiled: Whether timestamp and cycle numbers were gathered for this job */
> > + bool is_profiled;
> > };
> >
> > static void
> > @@ -844,7 +878,7 @@ static void group_release_work(struct work_struct *work)
> >
> > panthor_kernel_bo_destroy(group->suspend_buf);
> > panthor_kernel_bo_destroy(group->protm_suspend_buf);
> > - panthor_kernel_bo_destroy(group->syncobjs);
> > + panthor_kernel_bo_destroy(group->syncobjs.bo);
> >
> > panthor_vm_put(group->vm);
> > kfree(group);
> > @@ -1969,8 +2003,6 @@ tick_ctx_init(struct panthor_scheduler *sched,
> > }
> > }
> >
> > -#define NUM_INSTRS_PER_SLOT 16
> > -
> > static void
> > group_term_post_processing(struct panthor_group *group)
> > {
> > @@ -2007,7 +2039,7 @@ group_term_post_processing(struct panthor_group *group)
> > spin_unlock(&queue->fence_ctx.lock);
> >
> > /* Manually update the syncobj seqno to unblock waiters. */
> > - syncobj = group->syncobjs->kmap + (i * sizeof(*syncobj));
> > + syncobj = group->syncobjs.bo->kmap + (i * sizeof(*syncobj));
> > syncobj->status = ~0;
> > syncobj->seqno = atomic64_read(&queue->fence_ctx.seqno);
> > sched_queue_work(group->ptdev->scheduler, sync_upd);
> > @@ -2780,7 +2812,7 @@ static void group_sync_upd_work(struct work_struct *work)
> > if (!queue)
> > continue;
> >
> > - syncobj = group->syncobjs->kmap + (queue_idx * sizeof(*syncobj));
> > + syncobj = group->syncobjs.bo->kmap + (queue_idx * sizeof(*syncobj));
> >
> > spin_lock(&queue->fence_ctx.lock);
> > list_for_each_entry_safe(job, job_tmp, &queue->fence_ctx.in_flight_jobs, node) {
> > @@ -2815,22 +2847,81 @@ queue_run_job(struct drm_sched_job *sched_job)
> > struct panthor_scheduler *sched = ptdev->scheduler;
> > u32 ringbuf_size = panthor_kernel_bo_size(queue->ringbuf);
> > u32 ringbuf_insert = queue->iface.input->insert & (ringbuf_size - 1);
> > + u32 ringbuf_index = ringbuf_insert / (SLOTSIZE);
> > + bool ringbuf_wraparound =
> > + job->is_profiled && ((ringbuf_size/SLOTSIZE) == ringbuf_index + 1);
> > u64 addr_reg = ptdev->csif_info.cs_reg_count -
> > ptdev->csif_info.unpreserved_cs_reg_count;
> > u64 val_reg = addr_reg + 2;
> > - u64 sync_addr = panthor_kernel_bo_gpuva(group->syncobjs) +
> > - job->queue_idx * sizeof(struct panthor_syncobj_64b);
> > + u64 cycle_reg = addr_reg;
> > + u64 time_reg = val_reg;
> > + u64 sync_addr = panthor_kernel_bo_gpuva(group->syncobjs.bo) +
> > + job->queue_idx * sizeof(struct panthor_syncobj_64b);
> > + u64 times_addr = panthor_kernel_bo_gpuva(group->syncobjs.bo) + queue->time_offset +
> > + (ringbuf_index * sizeof(struct panthor_job_times));
> > + size_t call_insrt_size;
>
> NIT: s/insrt/instr/
>
> > + u64 *call_instrs;
> > +
> > u32 waitall_mask = GENMASK(sched->sb_slot_count - 1, 0);
> > struct dma_fence *done_fence;
> > int ret;
> >
> > - u64 call_instrs[NUM_INSTRS_PER_SLOT] = {
> > + u64 call_instrs_simple[NUM_INSTRS_PER_SLOT] = {
> > + /* MOV32 rX+2, cs.latest_flush */
> > + (2ull << 56) | (val_reg << 48) | job->call_info.latest_flush,
> > +
> > + /* FLUSH_CACHE2.clean_inv_all.no_wait.signal(0) rX+2 */
> > + (36ull << 56) | (0ull << 48) | (val_reg << 40) | (0 << 16) | 0x233,
> > +
> > + /* MOV48 rX:rX+1, cs.start */
> > + (1ull << 56) | (addr_reg << 48) | job->call_info.start,
> > +
> > + /* MOV32 rX+2, cs.size */
> > + (2ull << 56) | (val_reg << 48) | job->call_info.size,
> > +
> > + /* WAIT(0) => waits for FLUSH_CACHE2 instruction */
> > + (3ull << 56) | (1 << 16),
> > +
> > + /* CALL rX:rX+1, rX+2 */
> > + (32ull << 56) | (addr_reg << 40) | (val_reg << 32),
> > +
> > + /* MOV48 rX:rX+1, sync_addr */
> > + (1ull << 56) | (addr_reg << 48) | sync_addr,
> > +
> > + /* MOV48 rX+2, #1 */
> > + (1ull << 56) | (val_reg << 48) | 1,
> > +
> > + /* WAIT(all) */
> > + (3ull << 56) | (waitall_mask << 16),
> > +
> > + /* SYNC_ADD64.system_scope.propage_err.nowait rX:rX+1, rX+2*/
> > + (51ull << 56) | (0ull << 48) | (addr_reg << 40) | (val_reg << 32) | (0 << 16) | 1,
> > +
> > + /* ERROR_BARRIER, so we can recover from faults at job
> > + * boundaries.
> > + */
> > + (47ull << 56),
> > + };
> > +
> > + u64 call_instrs_profile[NUM_INSTRS_PER_SLOT*2] = {
> > /* MOV32 rX+2, cs.latest_flush */
> > (2ull << 56) | (val_reg << 48) | job->call_info.latest_flush,
> >
> > /* FLUSH_CACHE2.clean_inv_all.no_wait.signal(0) rX+2 */
> > (36ull << 56) | (0ull << 48) | (val_reg << 40) | (0 << 16) | 0x233,
> >
> > + /* MOV48 rX:rX+1, cycles_offset */
> > + (1ull << 56) | (cycle_reg << 48) | (times_addr + offsetof(struct panthor_job_times, cycles.before)),
> > +
> > + /* MOV48 rX:rX+1, time_offset */
> > + (1ull << 56) | (time_reg << 48) | (times_addr + offsetof(struct panthor_job_times, time.before)),
> > +
> > + /* STORE_STATE cycles */
> > + (40ull << 56) | (cycle_reg << 40) | (1ll << 32),
> > +
> > + /* STORE_STATE timer */
> > + (40ull << 56) | (time_reg << 40) | (0ll << 32),
> > +
> > /* MOV48 rX:rX+1, cs.start */
> > (1ull << 56) | (addr_reg << 48) | job->call_info.start,
> >
> > @@ -2843,6 +2934,18 @@ queue_run_job(struct drm_sched_job *sched_job)
> > /* CALL rX:rX+1, rX+2 */
> > (32ull << 56) | (addr_reg << 40) | (val_reg << 32),
> >
> > + /* MOV48 rX:rX+1, cycles_offset */
> > + (1ull << 56) | (cycle_reg << 48) | (times_addr + offsetof(struct panthor_job_times, cycles.after)),
> > +
> > + /* MOV48 rX:rX+1, time_offset */
> > + (1ull << 56) | (time_reg << 48) | (times_addr + offsetof(struct panthor_job_times, time.after)),
> > +
> > + /* STORE_STATE cycles */
> > + (40ull << 56) | (cycle_reg << 40) | (1ll << 32),
> > +
> > + /* STORE_STATE timer */
> > + (40ull << 56) | (time_reg << 40) | (0ll << 32),
> > +
> > /* MOV48 rX:rX+1, sync_addr */
> > (1ull << 56) | (addr_reg << 48) | sync_addr,
> >
> > @@ -2862,9 +2965,18 @@ queue_run_job(struct drm_sched_job *sched_job)
> > };
> >
> > /* Need to be cacheline aligned to please the prefetcher. */
> > - static_assert(sizeof(call_instrs) % 64 == 0,
> > + static_assert(sizeof(call_instrs_simple) % 64 == 0 && sizeof(call_instrs_profile) % 64 == 0,
> > "call_instrs is not aligned on a cacheline");
> >
> > + if (job->is_profiled) {
> > + call_instrs = call_instrs_profile;
> > + call_insrt_size = sizeof(call_instrs_profile);
> > +
> > + } else {
> > + call_instrs = call_instrs_simple;
> > + call_insrt_size = sizeof(call_instrs_simple);
> > + }
> > +
> > /* Stream size is zero, nothing to do => return a NULL fence and let
> > * drm_sched signal the parent.
> > */
> > @@ -2887,8 +2999,23 @@ queue_run_job(struct drm_sched_job *sched_job)
> > queue->fence_ctx.id,
> > atomic64_inc_return(&queue->fence_ctx.seqno));
> >
> > - memcpy(queue->ringbuf->kmap + ringbuf_insert,
> > - call_instrs, sizeof(call_instrs));
> > + /*
> > + * Need to handle the wrap-around case when copying profiled instructions
> > + * from an odd-indexed slot. The reason this can happen is user space is
> > + * able to control the profiling status of the driver through a sysfs
> > + * knob, so this might lead to a timestamp and cycles-profiling call
> > + * instruction stream beginning at an odd-number slot. The GPU should
> > + * be able to gracefully handle this.
> > + */
> > + if (!ringbuf_wraparound) {
> > + memcpy(queue->ringbuf->kmap + ringbuf_insert,
> > + call_instrs, call_insrt_size);
> > + } else {
> > + memcpy(queue->ringbuf->kmap + ringbuf_insert,
> > + call_instrs, call_insrt_size/2);
> > + memcpy(queue->ringbuf->kmap, call_instrs +
> > + NUM_INSTRS_PER_SLOT, call_insrt_size/2);
> > + }
>
> A minor point, but I think it would just be easier to always do the copy
> in two parts:
>
> memcpy(queue->ringbuf->kmap + ringbuf_insert,
> call_instrs, NUM_INSTRS_PER_SLOT);
> if (profiling) {
> ringbuf_insert += NUM_INSTRS_PER_SLOT;
> ringbuf_insert &= (ringbuf_size - 1);
> memcpy(queue->ringbuf->kmap + ringbuf_insert,
> call_instrs + NUM_INSTRS_PER_SLOT,
> NUM_INSTRS_PER_SLOT);
> }
>
> >
> > panthor_job_get(&job->base);
> > spin_lock(&queue->fence_ctx.lock);
> > @@ -2896,7 +3023,8 @@ queue_run_job(struct drm_sched_job *sched_job)
> > spin_unlock(&queue->fence_ctx.lock);
> >
> > job->ringbuf.start = queue->iface.input->insert;
> > - job->ringbuf.end = job->ringbuf.start + sizeof(call_instrs);
> > + job->ringbuf.end = job->ringbuf.start + call_insrt_size;
> > + job->ringbuf_idx = ringbuf_index;
> >
> > /* Make sure the ring buffer is updated before the INSERT
> > * register.
> > @@ -2987,7 +3115,8 @@ static const struct drm_sched_backend_ops panthor_queue_sched_ops = {
> >
> > static struct panthor_queue *
> > group_create_queue(struct panthor_group *group,
> > - const struct drm_panthor_queue_create *args)
> > + const struct drm_panthor_queue_create *args,
> > + unsigned int slots_so_far)
>
> I'm not sure I like the name "slots_so_far", "slot_offset" maybe?
> Although my main gripe is below...
I agree, that would be better naming.
> > {
> > struct drm_gpu_scheduler *drm_sched;
> > struct panthor_queue *queue;
> > @@ -3038,9 +3167,17 @@ group_create_queue(struct panthor_group *group,
> > goto err_free_queue;
> > }
> >
> > + queue->time_offset = group->syncobjs.job_times_offset +
> > + (slots_so_far * sizeof(struct panthor_job_times));
> > +
> > + /*
> > + * Credit limit argument tells us the total number of instructions
> > + * across all CS slots in the ringbuffer, with some jobs requiring
> > + * twice as many as others, depending on their profiling status.
> > + */
> > ret = drm_sched_init(&queue->scheduler, &panthor_queue_sched_ops,
> > group->ptdev->scheduler->wq, 1,
> > - args->ringbuf_size / (NUM_INSTRS_PER_SLOT * sizeof(u64)),
> > + args->ringbuf_size / sizeof(u64),
> > 0, msecs_to_jiffies(JOB_TIMEOUT_MS),
> > group->ptdev->reset.wq,
> > NULL, "panthor-queue", group->ptdev->base.dev);
> > @@ -3068,7 +3205,9 @@ int panthor_group_create(struct panthor_file *pfile,
> > struct panthor_scheduler *sched = ptdev->scheduler;
> > struct panthor_fw_csg_iface *csg_iface = panthor_fw_get_csg_iface(ptdev, 0);
> > struct panthor_group *group = NULL;
> > + unsigned int total_slots;
> > u32 gid, i, suspend_size;
> > + size_t syncobj_bo_size;
> > int ret;
> >
> > if (group_args->pad)
> > @@ -3134,33 +3273,75 @@ int panthor_group_create(struct panthor_file *pfile,
> > goto err_put_group;
> > }
> >
> > - group->syncobjs = panthor_kernel_bo_create(ptdev, group->vm,
> > - group_args->queues.count *
> > - sizeof(struct panthor_syncobj_64b),
> > - DRM_PANTHOR_BO_NO_MMAP,
> > - DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
> > - DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
> > - PANTHOR_VM_KERNEL_AUTO_VA);
> > - if (IS_ERR(group->syncobjs)) {
> > - ret = PTR_ERR(group->syncobjs);
> > + /*
> > + * Need to add size for the panthor_job_times structs, as many as the sum
> > + * of the number of job slots for every single queue ringbuffer.
> > + */
> > + for (i = 0, total_slots = 0; i < group_args->queues.count; i++)
> > + total_slots += (queue_args[i].ringbuf_size / (SLOTSIZE));
> > +
> > + syncobj_bo_size = (group_args->queues.count * sizeof(struct panthor_syncobj_64b))
> > + + (total_slots * sizeof(struct panthor_job_times));
> > +
> > + /*
> > + * Memory layout of group's syncobjs BO
> > + * group->syncobjs.bo {
> > + * struct panthor_syncobj_64b sync1;
> > + * struct panthor_syncobj_64b sync2;
> > + * ...
> > + * As many as group_args->queues.count
> > + * ...
> > + * struct panthor_syncobj_64b syncn;
> > + * struct panthor_job_times queue1_slot1
> > + * struct panthor_job_times queue1_slot2
> > + * ...
> > + * As many as queue[i].ringbuf_size / SLOTSIZE
> > + * ...
> > + * struct panthor_job_times queue1_slotP
> > + * ...
> > + * As many as group_args->queues.count
> > + * ...
> > + * struct panthor_job_times queueN_slot1
> > + * struct panthor_job_times queueN_slot2
> > + * ...
> > + * As many as queue[n].ringbuf_size / SLOTSIZE
> > + * struct panthor_job_times queueN_slotQ
> > + *
> > + * Linearly, group->syncobjs.bo = {syncojb1,..,syncobjN,
> > + * {queue1 = {js1,..,jsP},..,queueN = {js1,..,jsQ}}}
> > + * }
> > + *
> > + */
> > +
> > + group->syncobjs.bo = panthor_kernel_bo_create(ptdev, group->vm,
> > + syncobj_bo_size,
> > + DRM_PANTHOR_BO_NO_MMAP,
> > + DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
> > + DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
> > + PANTHOR_VM_KERNEL_AUTO_VA);
> > + if (IS_ERR(group->syncobjs.bo)) {
> > + ret = PTR_ERR(group->syncobjs.bo);
> > goto err_put_group;
> > }
> >
> > - ret = panthor_kernel_bo_vmap(group->syncobjs);
> > + ret = panthor_kernel_bo_vmap(group->syncobjs.bo);
> > if (ret)
> > goto err_put_group;
> >
> > - memset(group->syncobjs->kmap, 0,
> > - group_args->queues.count * sizeof(struct panthor_syncobj_64b));
> > + memset(group->syncobjs.bo->kmap, 0, syncobj_bo_size);
> >
> > - for (i = 0; i < group_args->queues.count; i++) {
> > - group->queues[i] = group_create_queue(group, &queue_args[i]);
> > + group->syncobjs.job_times_offset =
> > + group_args->queues.count * sizeof(struct panthor_syncobj_64b);
> > +
> > + for (i = 0, total_slots = 0; i < group_args->queues.count; i++) {
>
> I definitely don't like this usage of total_slots. Here "slots_so_far"
would be a better name (which would justify the parameter name above),
> and would avoid the confusion that total_slots already had a value
> (which was indeed the total slots) but is now being reused for a running
> counter.
I agree, I'll change the naming in the next revision.
> Steve
>
> > + group->queues[i] = group_create_queue(group, &queue_args[i], total_slots);
> > if (IS_ERR(group->queues[i])) {
> > ret = PTR_ERR(group->queues[i]);
> > group->queues[i] = NULL;
> > goto err_put_group;
> > }
> >
> > + total_slots += (queue_args[i].ringbuf_size / (SLOTSIZE));
> > group->queue_count++;
> > }
> >
> > @@ -3384,9 +3565,12 @@ panthor_job_create(struct panthor_file *pfile,
> > goto err_put_job;
> > }
> >
> > + job->is_profiled = pfile->ptdev->profile_mode;
> > +
> > ret = drm_sched_job_init(&job->base,
> > &job->group->queues[job->queue_idx]->entity,
> > - 1, job->group);
> > + job->is_profiled ? NUM_INSTRS_PER_SLOT * 2 :
> > + NUM_INSTRS_PER_SLOT, job->group);
>
> Is there a good reason to make the credits the number of instructions,
> rather than the number of slots? If we were going to count the actual
> number of non-NOP instructions then there would be some logic (although
> I'm not convinced that makes much sense), but considering we only allow
> submission in "slot granules" we might as well use that as the unit of
> "credit".
In my initial pre-ML version of the patch series I was passing the number of
queue slots as the total credit count, but Boris was keener on setting it to
the total number of instructions instead.
I agree with you that both are equivalent, one just being an integer multiple
of the other, so I'm fine with either choice. Maybe Boris can pitch in, in
case he has a strong opinion about this.
> Steve
>
> > if (ret)
> > goto err_put_job;
> >
Adrian Larumbe
* Re: [PATCH v4 1/4] drm/panthor: introduce job cycle and timestamp accounting
2024-07-31 12:41 ` Adrián Larumbe
@ 2024-08-19 7:48 ` Steven Price
2024-08-19 10:58 ` Boris Brezillon
0 siblings, 1 reply; 13+ messages in thread
From: Steven Price @ 2024-08-19 7:48 UTC (permalink / raw)
To: Adrián Larumbe
Cc: Boris Brezillon, Liviu Dudau, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, David Airlie, Daniel Vetter, kernel, dri-devel,
linux-kernel
Hi Adrián,
On 31/07/2024 13:41, Adrián Larumbe wrote:
> Hi Steven, thanks for the remarks.
>
> On 19.07.2024 15:14, Steven Price wrote:
>> On 16/07/2024 21:11, Adrián Larumbe wrote:
>>> Enable calculations of job submission times in clock cycles and wall
>>> time. This is done by expanding the boilerplate command stream when running
>>> a job to include instructions that compute said times right before and
>>> after a user CS.
>>>
>>> Those numbers are stored in the queue's group's sync objects BO, right
>>> after them. Because the queues in a group might have a different number of
>>> slots, one must keep track of the overall slot tally when reckoning the
>>> offset of a queue's time sample structs, one for each slot.
>>>
>>> This commit is done in preparation for enabling DRM fdinfo support in the
>>> Panthor driver, which depends on the numbers calculated herein.
>>>
>>> A profile mode device flag has been added that will in a future commit
>>> allow UM to toggle time sampling behaviour, which is disabled by default to
>>> save power. It also enables marking jobs as being profiled and picks one of
>>> two call instruction arrays to insert into the ring buffer. One of them
>>> includes FW logic to sample the timestamp and cycle counter registers and
>>> write them into the job's syncobj, and the other does not.
>>>
>>> A profiled job's call sequence takes up two ring buffer slots, and this is
>>> reflected when initialising the DRM scheduler for each queue, with a
>>> profiled job contributing twice as many credits.
>>>
>>> Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
>>
>> Thanks for the updates, this looks better. A few minor comments below.
>>
>>> ---
>>> drivers/gpu/drm/panthor/panthor_device.h | 2 +
>>> drivers/gpu/drm/panthor/panthor_sched.c | 244 ++++++++++++++++++++---
>>> 2 files changed, 216 insertions(+), 30 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
>>> index e388c0472ba7..3ede2f80df73 100644
>>> --- a/drivers/gpu/drm/panthor/panthor_device.h
>>> +++ b/drivers/gpu/drm/panthor/panthor_device.h
>>> @@ -162,6 +162,8 @@ struct panthor_device {
>>> */
>>> struct page *dummy_latest_flush;
>>> } pm;
>>> +
>>> + bool profile_mode;
>>> };
>>>
>>> /**
>>> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
>>> index 79ffcbc41d78..6438e5ea1f2b 100644
>>> --- a/drivers/gpu/drm/panthor/panthor_sched.c
>>> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
>>> @@ -93,6 +93,9 @@
>>> #define MIN_CSGS 3
>>> #define MAX_CSG_PRIO 0xf
>>>
>>> +#define NUM_INSTRS_PER_SLOT 16
>>> +#define SLOTSIZE (NUM_INSTRS_PER_SLOT * sizeof(u64))
>>> +
>>> struct panthor_group;
>>>
>>> /**
>>> @@ -466,6 +469,9 @@ struct panthor_queue {
>>> */
>>> struct list_head in_flight_jobs;
>>> } fence_ctx;
>>> +
>>> + /** @time_offset: Offset of panthor_job_times structs in group's syncobj bo. */
>>> + unsigned long time_offset;
>>
>> AFAICT this doesn't need to be stored. We could just pass this value
>> into group_create_queue() as an extra parameter where it's used.
>
> I think we need to keep this offset value around, because queues within the same group
> can have different numbers of slots. When fetching the sampled values from the
> syncobjs BO in update_fdinfo_stats, we would otherwise have to traverse the entire
> array of preceding queues and work out their size in slots to skip over the
> corresponding struct panthor_job_times entries after the syncobj array.
Yes I think I was getting myself confused - for some reason I'd thought
the ring buffers in each queue were the same size. It makes sense to
keep this.
<snip>
>>> @@ -3384,9 +3565,12 @@ panthor_job_create(struct panthor_file *pfile,
>>> goto err_put_job;
>>> }
>>>
>>> + job->is_profiled = pfile->ptdev->profile_mode;
>>> +
>>> ret = drm_sched_job_init(&job->base,
>>> &job->group->queues[job->queue_idx]->entity,
>>> - 1, job->group);
>>> + job->is_profiled ? NUM_INSTRS_PER_SLOT * 2 :
>>> + NUM_INSTRS_PER_SLOT, job->group);
>>
>> Is there a good reason to make the credits the number of instructions,
>> rather than the number of slots? If we were going to count the actual
>> number of non-NOP instructions then there would be some logic (although
>> I'm not convinced that makes much sense), but considering we only allow
>> submission in "slot granules" we might as well use that as the unit of
>> "credit".
>
> In my initial pre-ML version of the patch series I was passing the number of
> queue slots as the total credit count, but Boris was keener on setting it to
> the total number of instructions instead.
>
> I agree with you that both are equivalent, one just being an integer multiple
> of the other, so I'm fine with either choice. Maybe Boris can pitch in, in
> case he has a strong opinion about this.
I wouldn't say I have a strong opinion, it just seems a little odd to be
multiplying the value by a constant everywhere. If you've got some
changes planned where the instructions could vary more dynamically, that
would be good to know about.
If Boris is "keener" on this approach then that's fine, we'll leave it
this way.
Steve
* Re: [PATCH v4 1/4] drm/panthor: introduce job cycle and timestamp accounting
2024-08-19 7:48 ` Steven Price
@ 2024-08-19 10:58 ` Boris Brezillon
0 siblings, 0 replies; 13+ messages in thread
From: Boris Brezillon @ 2024-08-19 10:58 UTC (permalink / raw)
To: Steven Price
Cc: Adrián Larumbe, Liviu Dudau, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, David Airlie, Daniel Vetter,
kernel, dri-devel, linux-kernel
Hi,
On Mon, 19 Aug 2024 08:48:24 +0100
Steven Price <steven.price@arm.com> wrote:
> Hi Adrián,
>
> On 31/07/2024 13:41, Adrián Larumbe wrote:
> > Hi Steven, thanks for the remarks.
> >
> > On 19.07.2024 15:14, Steven Price wrote:
> >> On 16/07/2024 21:11, Adrián Larumbe wrote:
> >>> Enable calculations of job submission times in clock cycles and wall
> >>> time. This is done by expanding the boilerplate command stream when running
> >>> a job to include instructions that compute said times right before and
> >>> after a user CS.
> >>>
> >>> Those numbers are stored in the queue's group's sync objects BO, right
> >>> after them. Because the queues in a group might have a different number of
> >>> slots, one must keep track of the overall slot tally when reckoning the
> >>> offset of a queue's time sample structs, one for each slot.
> >>>
> >>> This commit is done in preparation for enabling DRM fdinfo support in the
> >>> Panthor driver, which depends on the numbers calculated herein.
> >>>
> >>> A profile mode device flag has been added that will in a future commit
> >>> allow UM to toggle time sampling behaviour, which is disabled by default to
> >>> save power. It also enables marking jobs as being profiled and picks one of
> >>> two call instruction arrays to insert into the ring buffer. One of them
> >>> includes FW logic to sample the timestamp and cycle counter registers and
> >>> write them into the job's syncobj, and the other does not.
> >>>
> >>> A profiled job's call sequence takes up two ring buffer slots, and this is
> >>> reflected when initialising the DRM scheduler for each queue, with a
> >>> profiled job contributing twice as many credits.
> >>>
> >>> Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
> >>
> >> Thanks for the updates, this looks better. A few minor comments below.
> >>
> >>> ---
> >>> drivers/gpu/drm/panthor/panthor_device.h | 2 +
> >>> drivers/gpu/drm/panthor/panthor_sched.c | 244 ++++++++++++++++++++---
> >>> 2 files changed, 216 insertions(+), 30 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> >>> index e388c0472ba7..3ede2f80df73 100644
> >>> --- a/drivers/gpu/drm/panthor/panthor_device.h
> >>> +++ b/drivers/gpu/drm/panthor/panthor_device.h
> >>> @@ -162,6 +162,8 @@ struct panthor_device {
> >>> */
> >>> struct page *dummy_latest_flush;
> >>> } pm;
> >>> +
> >>> + bool profile_mode;
> >>> };
> >>>
> >>> /**
> >>> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> >>> index 79ffcbc41d78..6438e5ea1f2b 100644
> >>> --- a/drivers/gpu/drm/panthor/panthor_sched.c
> >>> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> >>> @@ -93,6 +93,9 @@
> >>> #define MIN_CSGS 3
> >>> #define MAX_CSG_PRIO 0xf
> >>>
> >>> +#define NUM_INSTRS_PER_SLOT 16
> >>> +#define SLOTSIZE (NUM_INSTRS_PER_SLOT * sizeof(u64))
> >>> +
> >>> struct panthor_group;
> >>>
> >>> /**
> >>> @@ -466,6 +469,9 @@ struct panthor_queue {
> >>> */
> >>> struct list_head in_flight_jobs;
> >>> } fence_ctx;
> >>> +
> >>> + /** @time_offset: Offset of panthor_job_times structs in group's syncobj bo. */
> >>> + unsigned long time_offset;
> >>
> >> AFAICT this doesn't need to be stored. We could just pass this value
> >> into group_create_queue() as an extra parameter where it's used.
> >
> > I think we need to keep this offset value around, because queues within the same group
> > can have different numbers of slots. When fetching the sampled values from the
> > syncobjs BO in update_fdinfo_stats, we would otherwise have to traverse the entire
> > array of preceding queues and work out their size in slots to skip over the
> > corresponding struct panthor_job_times entries after the syncobj array.
>
> Yes I think I was getting myself confused - for some reason I'd thought
> the ring buffers in each queue were the same size. It makes sense to
> keep this.
IIRC, I was the one suggesting to put all metadata in a single BO to
avoid losing too much space when ring buffers are small (because of the
4k granularity of BOs). But given the size of panthor_job_times (32
bytes per slot), a 4k chunk only covers 128 job slots, which is not
that big to be honest. So I'm no longer sure this optimization makes
sense. If we still want to allocate one big BO containing syncobjs
and timestamp slots for all queues, say, to optimize the GEM metadata vs
actual BO data overhead, I'd recommend having a CPU/GPU pointer
relative to the queue in panthor_queue which we can easily dereference
with time_slots[job_slot_id].
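For illustration, the queue-relative view described above could look like this; the struct and helper names are invented for the sketch, and the kernel types are reduced to standalone C so it compiles on its own:

```c
#include <stdint.h>

typedef uint64_t u64;
typedef uint32_t u32;

/* Mirrors struct panthor_job_times from the patch. */
struct panthor_job_times {
	struct { u64 before, after; } cycles;
	struct { u64 before, after; } time;
};

/* Per-queue pointers computed once at queue creation: the CPU and GPU
 * addresses of this queue's slice of the shared BO, so per-job code can
 * index the array directly instead of re-adding the group offset. */
struct queue_profiling_slots {
	struct panthor_job_times *cpu; /* bo->kmap + this queue's offset */
	u64 gpu;                       /* GPU VA of the same array */
};

static struct panthor_job_times *
job_times(struct queue_profiling_slots *slots, u32 job_slot_id)
{
	return &slots->cpu[job_slot_id];
}
```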
>
> <snip>
>
> >>> @@ -3384,9 +3565,12 @@ panthor_job_create(struct panthor_file *pfile,
> >>> goto err_put_job;
> >>> }
> >>>
> >>> + job->is_profiled = pfile->ptdev->profile_mode;
> >>> +
> >>> ret = drm_sched_job_init(&job->base,
> >>> &job->group->queues[job->queue_idx]->entity,
> >>> - 1, job->group);
> >>> + job->is_profiled ? NUM_INSTRS_PER_SLOT * 2 :
> >>> + NUM_INSTRS_PER_SLOT, job->group);
> >>
> >> Is there a good reason to make the credits the number of instructions,
> >> rather than the number of slots? If we were going to count the actual
> >> number of non-NOP instructions then there would be some logic (although
> >> I'm not convinced that makes much sense), but considering we only allow
> >> submission in "slot granules" we might as well use that as the unit of
> >> "credit".
> >
> > In my initial pre-ML version of the patch series I was passing the number of
> > queue slots as the total credit count, but Boris was keener on setting it to
> > the total number of instructions instead.
> >
> > I agree with you that both are equivalent, one just being an integer multiple
> > of the other, so I'm fine with either choice. Maybe Boris can pitch in, in
> > case he has a strong opinion about this.
>
> I wouldn't say I have a strong opinion, it just seems a little odd to be
> multiplying the value by a constant everywhere. If you'd got some
> changes planned where the instructions could vary more dynamically that
> would be good to know about.
Yeah, the long term plan is to make the number of instructions variable
based on the profiling parameters, such that we don't lose ringbuf
slots when profiling is completely disabled. Of course we always want to
keep it a multiple of a cache-line (AKA 8 instructions), but I find it
easier to reason in number of instructions rather than block of
instructions.
Regards,
Boris
* Re: [PATCH v4 1/4] drm/panthor: introduce job cycle and timestamp accounting
2024-07-16 20:11 ` [PATCH v4 1/4] drm/panthor: introduce job cycle and timestamp accounting Adrián Larumbe
2024-07-19 14:14 ` Steven Price
@ 2024-08-19 12:22 ` Boris Brezillon
1 sibling, 0 replies; 13+ messages in thread
From: Boris Brezillon @ 2024-08-19 12:22 UTC (permalink / raw)
To: Adrián Larumbe
Cc: Steven Price, Liviu Dudau, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, David Airlie, Daniel Vetter, kernel, dri-devel,
linux-kernel
Hi Adrian,
Sorry for chiming in so late, but I think I have a few concerns with this
patch. Nothing major, but I'd prefer to have it addressed now rather
than in a follow-up series.
On Tue, 16 Jul 2024 21:11:40 +0100
Adrián Larumbe <adrian.larumbe@collabora.com> wrote:
> Enable calculations of job submission times in clock cycles and wall
> time. This is done by expanding the boilerplate command stream when running
> a job to include instructions that compute said times right before and after
> a user CS.
>
> Those numbers are stored in the queue's group's sync objects BO, right
> after them. Because the queues in a group might have a different number of
> slots, one must keep track of the overall slot tally when reckoning the
> offset of a queue's time sample structs, one for each slot.
>
> This commit is done in preparation for enabling DRM fdinfo support in the
> Panthor driver, which depends on the numbers calculated herein.
>
> A profile mode device flag has been added that will in a future commit
> allow UM to toggle time sampling behaviour, which is disabled by default to
> save power. It also enables marking jobs as being profiled and picks one of
> two call instruction arrays to insert into the ring buffer. One of them
> includes FW logic to sample the timestamp and cycle counter registers and
> write them into the job's syncobj, and the other does not.
>
> A profiled job's call sequence takes up two ring buffer slots, and this is
> reflected when initialising the DRM scheduler for each queue, with a
> profiled job contributing twice as many credits.
>
> Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com>
> ---
> drivers/gpu/drm/panthor/panthor_device.h | 2 +
> drivers/gpu/drm/panthor/panthor_sched.c | 244 ++++++++++++++++++++---
> 2 files changed, 216 insertions(+), 30 deletions(-)
>
> diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> index e388c0472ba7..3ede2f80df73 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.h
> +++ b/drivers/gpu/drm/panthor/panthor_device.h
> @@ -162,6 +162,8 @@ struct panthor_device {
> */
> struct page *dummy_latest_flush;
> } pm;
> +
> + bool profile_mode;
Should we make it a u32 bitmask, with a panthor_device_profiling_mode
enum defining all the flags, so we can easily add extra profiling flags
in the future?
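As a sketch of that suggestion, the bitmask could look like the following; the flag names are invented for illustration, and BIT() stands in for the kernel macro:

```c
#include <stdint.h>

#define BIT(n) (1u << (n))

/* Hypothetical profiling flags; one bit per independently togglable
 * feature, leaving room for more modes later. */
enum panthor_device_profiling_flags {
	PANTHOR_DEVICE_PROFILING_DISABLED = 0,
	PANTHOR_DEVICE_PROFILING_CYCLES = BIT(0),
	PANTHOR_DEVICE_PROFILING_TIMESTAMP = BIT(1),
	PANTHOR_DEVICE_PROFILING_ALL =
		PANTHOR_DEVICE_PROFILING_CYCLES |
		PANTHOR_DEVICE_PROFILING_TIMESTAMP,
};

/* The device field then becomes a u32 mask instead of a bool. */
static inline int profiling_enabled(uint32_t profile_mask,
				    enum panthor_device_profiling_flags f)
{
	return (profile_mask & f) == f;
}
```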
> };
>
> /**
> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> index 79ffcbc41d78..6438e5ea1f2b 100644
> --- a/drivers/gpu/drm/panthor/panthor_sched.c
> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> @@ -93,6 +93,9 @@
> #define MIN_CSGS 3
> #define MAX_CSG_PRIO 0xf
>
> +#define NUM_INSTRS_PER_SLOT 16
> +#define SLOTSIZE (NUM_INSTRS_PER_SLOT * sizeof(u64))
> +
> struct panthor_group;
>
> /**
> @@ -466,6 +469,9 @@ struct panthor_queue {
> */
> struct list_head in_flight_jobs;
> } fence_ctx;
> +
> + /** @time_offset: Offset of panthor_job_times structs in group's syncobj bo. */
> + unsigned long time_offset;
As mentioned in my other reply, it's probably simpler if you have a
profiling BO per queue, or, at the very least, if you have something
like that so you don't have to re-add the per-queue offset every time:
struct {
struct panthor_job_times *cpu;
uint64_t gpu;
} job_profiling_slots;
> };
>
> /**
> @@ -592,7 +598,17 @@ struct panthor_group {
> * One sync object per queue. The position of the sync object is
> * determined by the queue index.
> */
> - struct panthor_kernel_bo *syncobjs;
> +
> + struct {
> + /** @bo: Kernel BO holding the sync objects. */
> + struct panthor_kernel_bo *bo;
> +
> + /**
> + * @job_times_offset: Beginning of panthor_job_times struct samples after
> + * the group's array of sync objects.
> + */
> + size_t job_times_offset;
> + } syncobjs;
You're about to add new stuff to the BO, so I don't think syncobjs is a
relevant name anymore.
>
> /** @state: Group state. */
> enum panthor_group_state state;
> @@ -651,6 +667,18 @@ struct panthor_group {
> struct list_head wait_node;
> };
>
> +struct panthor_job_times {
> + struct {
> + u64 before;
> + u64 after;
> + } cycles;
> +
> + struct {
> + u64 before;
> + u64 after;
> + } time;
> +};
> +
> /**
> * group_queue_work() - Queue a group work
> * @group: Group to queue the work for.
> @@ -730,6 +758,9 @@ struct panthor_job {
> /** @queue_idx: Index of the queue inside @group. */
> u32 queue_idx;
>
> + /** @ringbuf_idx: Index of the ringbuffer inside @queue. */
> + u32 ringbuf_idx;
The name is a bit confusing, how about job_slot_idx? BTW, if the job
slot size is fixed (as seems to be implied by the SLOTSIZE definition)
and the number of instructions per job is a power of two, the slot index
can be derived from the job's ring buffer position with something like:
static u32 get_job_slot_id(struct panthor_job *job)
{
	struct panthor_queue *queue =
		job->group->queues[job->queue_idx];
	u32 ringbuf_size = panthor_kernel_bo_size(queue->ringbuf);
	u32 first_instr_ofs = job->ringbuf.start & (ringbuf_size - 1);

	return first_instr_ofs / SLOTSIZE;
}
> +
> /** @call_info: Information about the userspace command stream call. */
> struct {
> /** @start: GPU address of the userspace command stream. */
> @@ -764,6 +795,9 @@ struct panthor_job {
>
> /** @done_fence: Fence signaled when the job is finished or cancelled. */
> struct dma_fence *done_fence;
> +
> + /** @is_profiled: Whether timestamp and cycle numbers were gathered for this job */
> + bool is_profiled;
As mentioned above, I'm tempted to make it a bitmask so we can enable
more profiling modes in the future. If we do that, maybe we should make
cycle count and timestamp two different flags, even if we're likely to
enable both.
> };
>
> static void
> @@ -844,7 +878,7 @@ static void group_release_work(struct work_struct *work)
>
> panthor_kernel_bo_destroy(group->suspend_buf);
> panthor_kernel_bo_destroy(group->protm_suspend_buf);
> - panthor_kernel_bo_destroy(group->syncobjs);
> + panthor_kernel_bo_destroy(group->syncobjs.bo);
>
> panthor_vm_put(group->vm);
> kfree(group);
> @@ -1969,8 +2003,6 @@ tick_ctx_init(struct panthor_scheduler *sched,
> }
> }
>
> -#define NUM_INSTRS_PER_SLOT 16
> -
> static void
> group_term_post_processing(struct panthor_group *group)
> {
> @@ -2007,7 +2039,7 @@ group_term_post_processing(struct panthor_group *group)
> spin_unlock(&queue->fence_ctx.lock);
>
> /* Manually update the syncobj seqno to unblock waiters. */
> - syncobj = group->syncobjs->kmap + (i * sizeof(*syncobj));
> + syncobj = group->syncobjs.bo->kmap + (i * sizeof(*syncobj));
> syncobj->status = ~0;
> syncobj->seqno = atomic64_read(&queue->fence_ctx.seqno);
> sched_queue_work(group->ptdev->scheduler, sync_upd);
> @@ -2780,7 +2812,7 @@ static void group_sync_upd_work(struct work_struct *work)
> if (!queue)
> continue;
>
> - syncobj = group->syncobjs->kmap + (queue_idx * sizeof(*syncobj));
> + syncobj = group->syncobjs.bo->kmap + (queue_idx * sizeof(*syncobj));
>
> spin_lock(&queue->fence_ctx.lock);
> list_for_each_entry_safe(job, job_tmp, &queue->fence_ctx.in_flight_jobs, node) {
> @@ -2815,22 +2847,81 @@ queue_run_job(struct drm_sched_job *sched_job)
> struct panthor_scheduler *sched = ptdev->scheduler;
> u32 ringbuf_size = panthor_kernel_bo_size(queue->ringbuf);
> u32 ringbuf_insert = queue->iface.input->insert & (ringbuf_size - 1);
> + u32 ringbuf_index = ringbuf_insert / (SLOTSIZE);
> + bool ringbuf_wraparound =
> + job->is_profiled && ((ringbuf_size/SLOTSIZE) == ringbuf_index + 1);
> u64 addr_reg = ptdev->csif_info.cs_reg_count -
> ptdev->csif_info.unpreserved_cs_reg_count;
> u64 val_reg = addr_reg + 2;
> - u64 sync_addr = panthor_kernel_bo_gpuva(group->syncobjs) +
> - job->queue_idx * sizeof(struct panthor_syncobj_64b);
> + u64 cycle_reg = addr_reg;
> + u64 time_reg = val_reg;
> + u64 sync_addr = panthor_kernel_bo_gpuva(group->syncobjs.bo) +
> + job->queue_idx * sizeof(struct panthor_syncobj_64b);
> + u64 times_addr = panthor_kernel_bo_gpuva(group->syncobjs.bo) + queue->time_offset +
> + (ringbuf_index * sizeof(struct panthor_job_times));
> + size_t call_insrt_size;
> + u64 *call_instrs;
> +
> u32 waitall_mask = GENMASK(sched->sb_slot_count - 1, 0);
> struct dma_fence *done_fence;
> int ret;
>
> - u64 call_instrs[NUM_INSTRS_PER_SLOT] = {
> + u64 call_instrs_simple[NUM_INSTRS_PER_SLOT] = {
> + /* MOV32 rX+2, cs.latest_flush */
> + (2ull << 56) | (val_reg << 48) | job->call_info.latest_flush,
> +
> + /* FLUSH_CACHE2.clean_inv_all.no_wait.signal(0) rX+2 */
> + (36ull << 56) | (0ull << 48) | (val_reg << 40) | (0 << 16) | 0x233,
> +
> + /* MOV48 rX:rX+1, cs.start */
> + (1ull << 56) | (addr_reg << 48) | job->call_info.start,
> +
> + /* MOV32 rX+2, cs.size */
> + (2ull << 56) | (val_reg << 48) | job->call_info.size,
> +
> + /* WAIT(0) => waits for FLUSH_CACHE2 instruction */
> + (3ull << 56) | (1 << 16),
> +
> + /* CALL rX:rX+1, rX+2 */
> + (32ull << 56) | (addr_reg << 40) | (val_reg << 32),
> +
> + /* MOV48 rX:rX+1, sync_addr */
> + (1ull << 56) | (addr_reg << 48) | sync_addr,
> +
> + /* MOV48 rX+2, #1 */
> + (1ull << 56) | (val_reg << 48) | 1,
> +
> + /* WAIT(all) */
> + (3ull << 56) | (waitall_mask << 16),
> +
> + /* SYNC_ADD64.system_scope.propage_err.nowait rX:rX+1, rX+2*/
> + (51ull << 56) | (0ull << 48) | (addr_reg << 40) | (val_reg << 32) | (0 << 16) | 1,
> +
> + /* ERROR_BARRIER, so we can recover from faults at job
> + * boundaries.
> + */
> + (47ull << 56),
> + };
> +
> + u64 call_instrs_profile[NUM_INSTRS_PER_SLOT*2] = {
Looks like I was wrong: NUM_INSTRS_PER_SLOT is not necessarily the
number of instructions per job, it's just the granularity of the
ring buffer.
This being said, I'm not a huge fan of how the specialization is done
here, as we're basically duplicating call_instrs_simple, and making it
error prone for any future fixes we might do on call_instrs_simple (it's
very easy to omit porting the fix to call_instrs_profile).
How about we define:
#define MAX_INSTRS_PER_JOB 32
struct panthor_job_ringbuf_instrs {
u64 buffer[MAX_INSTRS_PER_JOB];
u32 count;
};
static void
push_instr(struct panthor_job_ringbuf_instrs *instrs, u64 instr)
{
if (WARN_ON(instrs->count >= ARRAY_SIZE(instrs->buffer))) {
instrs->count = UINT32_MAX;
return;
}
instrs->buffer[instrs->count++] = instr;
}
Then you can emit the profiling prologue/epilogue and CS call sections
using dedicated emit_{profiling_prologue,profiling_epilogue,cs_call}()
helpers.
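A minimal, self-contained sketch of what such an emit helper could look like, reusing the push_instr() idea above; the opcode encodings are copied from the MOV48/STORE_STATE pairs in the patch, but the helper name, its parameters, and the hard-coded struct offsets are illustrative only:

```c
#include <stdint.h>

typedef uint64_t u64;
typedef uint32_t u32;

#define MAX_INSTRS_PER_JOB 32

struct panthor_job_ringbuf_instrs {
	u64 buffer[MAX_INSTRS_PER_JOB];
	u32 count;
};

static void
push_instr(struct panthor_job_ringbuf_instrs *instrs, u64 instr)
{
	/* WARN_ON in the kernel version; a plain check here. */
	if (instrs->count >= MAX_INSTRS_PER_JOB) {
		instrs->count = UINT32_MAX;
		return;
	}
	instrs->buffer[instrs->count++] = instr;
}

/* Hypothetical prologue emitter: samples the cycle counter and timer
 * into the "before" fields of this job's panthor_job_times slot.
 * times_addr is the GPU VA of that slot; time.before sits 16 bytes in,
 * matching the struct layout from the patch. */
static void
emit_profiling_prologue(struct panthor_job_ringbuf_instrs *instrs,
			u64 cycle_reg, u64 time_reg, u64 times_addr)
{
	/* MOV48 cycle_reg, &times->cycles.before */
	push_instr(instrs, (1ull << 56) | (cycle_reg << 48) | times_addr);
	/* MOV48 time_reg, &times->time.before */
	push_instr(instrs, (1ull << 56) | (time_reg << 48) | (times_addr + 16));
	/* STORE_STATE cycles */
	push_instr(instrs, (40ull << 56) | (cycle_reg << 40) | (1ull << 32));
	/* STORE_STATE timer */
	push_instr(instrs, (40ull << 56) | (time_reg << 40) | (0ull << 32));
}
```

An epilogue helper would be identical but target the "after" offsets, and queue_run_job() would then build one instruction stream by calling the helpers conditionally instead of keeping two near-duplicate arrays.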
> /* MOV32 rX+2, cs.latest_flush */
> (2ull << 56) | (val_reg << 48) | job->call_info.latest_flush,
>
> /* FLUSH_CACHE2.clean_inv_all.no_wait.signal(0) rX+2 */
> (36ull << 56) | (0ull << 48) | (val_reg << 40) | (0 << 16) | 0x233,
>
> + /* MOV48 rX:rX+1, cycles_offset */
> + (1ull << 56) | (cycle_reg << 48) | (times_addr + offsetof(struct panthor_job_times, cycles.before)),
> +
> + /* MOV48 rX:rX+1, time_offset */
> + (1ull << 56) | (time_reg << 48) | (times_addr + offsetof(struct panthor_job_times, time.before)),
> +
> + /* STORE_STATE cycles */
> + (40ull << 56) | (cycle_reg << 40) | (1ll << 32),
> +
> + /* STORE_STATE timer */
> + (40ull << 56) | (time_reg << 40) | (0ll << 32),
> +
> /* MOV48 rX:rX+1, cs.start */
> (1ull << 56) | (addr_reg << 48) | job->call_info.start,
>
> @@ -2843,6 +2934,18 @@ queue_run_job(struct drm_sched_job *sched_job)
> /* CALL rX:rX+1, rX+2 */
> (32ull << 56) | (addr_reg << 40) | (val_reg << 32),
>
> + /* MOV48 rX:rX+1, cycles_offset */
> + (1ull << 56) | (cycle_reg << 48) | (times_addr + offsetof(struct panthor_job_times, cycles.after)),
> +
> + /* MOV48 rX:rX+1, time_offset */
> + (1ull << 56) | (time_reg << 48) | (times_addr + offsetof(struct panthor_job_times, time.after)),
> +
> + /* STORE_STATE cycles */
> + (40ull << 56) | (cycle_reg << 40) | (1ll << 32),
> +
> + /* STORE_STATE timer */
> + (40ull << 56) | (time_reg << 40) | (0ll << 32),
> +
> /* MOV48 rX:rX+1, sync_addr */
> (1ull << 56) | (addr_reg << 48) | sync_addr,
>
> @@ -2862,9 +2965,18 @@ queue_run_job(struct drm_sched_job *sched_job)
> };
>
> /* Need to be cacheline aligned to please the prefetcher. */
> - static_assert(sizeof(call_instrs) % 64 == 0,
> + static_assert(sizeof(call_instrs_simple) % 64 == 0 && sizeof(call_instrs_profile) % 64 == 0,
> "call_instrs is not aligned on a cacheline");
>
> + if (job->is_profiled) {
> + call_instrs = call_instrs_profile;
> + call_insrt_size = sizeof(call_instrs_profile);
> +
> + } else {
> + call_instrs = call_instrs_simple;
> + call_insrt_size = sizeof(call_instrs_simple);
> + }
> +
> /* Stream size is zero, nothing to do => return a NULL fence and let
> * drm_sched signal the parent.
> */
> @@ -2887,8 +2999,23 @@ queue_run_job(struct drm_sched_job *sched_job)
> queue->fence_ctx.id,
> atomic64_inc_return(&queue->fence_ctx.seqno));
>
> - memcpy(queue->ringbuf->kmap + ringbuf_insert,
> - call_instrs, sizeof(call_instrs));
> + /*
> + * Need to handle the wrap-around case when copying profiled instructions
> + * from an odd-indexed slot. The reason this can happen is user space is
> + * able to control the profiling status of the driver through a sysfs
> + * knob, so this might lead to a timestamp and cycles-profiling call
> + * instruction stream beginning at an odd-number slot. The GPU should
> + * be able to gracefully handle this.
NUM_INSTRS_PER_SLOT*2 is still a power of two, so that shouldn't be a
problem until we need more than 32 instructions. Note that this
wraparound handling might be interesting to have if we make the
granularity 8 instructions (a cache-line), and the aligned amount of
instructions is not a power of two.
> + */
> + if (!ringbuf_wraparound) {
> + memcpy(queue->ringbuf->kmap + ringbuf_insert,
> + call_instrs, call_insrt_size);
> + } else {
> + memcpy(queue->ringbuf->kmap + ringbuf_insert,
> + call_instrs, call_insrt_size/2);
> + memcpy(queue->ringbuf->kmap, call_instrs +
> + NUM_INSTRS_PER_SLOT, call_insrt_size/2);
Uh, a lot of assumptions on SLOTSIZE are made here. I'd rather have
a copy_instrs_to_ringbuf() that does all of the wraparound handling in
a generic way based on the current position, the ringbuf size and the
size to copy.
static void
copy_instrs_to_ringbuf(struct panthor_queue *queue,
		       struct panthor_job_ringbuf_instrs *instrs)
{
	u32 ringbuf_size = panthor_kernel_bo_size(queue->ringbuf);
	u32 start = queue->iface.input->insert & (ringbuf_size - 1);
	u32 size, written;

	/* Make sure we are aligned on a cache line. */
	size = ALIGN(instrs->count * sizeof(u64), 64);

	if (start + size > ringbuf_size) {
		written = ringbuf_size - start;
		memcpy(queue->ringbuf->kmap + start, instrs->buffer, written);
		memcpy(queue->ringbuf->kmap,
		       (u8 *)instrs->buffer + written, size - written);
	} else {
		memcpy(queue->ringbuf->kmap + start, instrs->buffer, size);
	}
}
> + }
>
> panthor_job_get(&job->base);
> spin_lock(&queue->fence_ctx.lock);
> @@ -2896,7 +3023,8 @@ queue_run_job(struct drm_sched_job *sched_job)
> spin_unlock(&queue->fence_ctx.lock);
>
> job->ringbuf.start = queue->iface.input->insert;
> - job->ringbuf.end = job->ringbuf.start + sizeof(call_instrs);
> + job->ringbuf.end = job->ringbuf.start + call_insrt_size;
> + job->ringbuf_idx = ringbuf_index;
>
> /* Make sure the ring buffer is updated before the INSERT
> * register.
> @@ -2987,7 +3115,8 @@ static const struct drm_sched_backend_ops panthor_queue_sched_ops = {
>
> static struct panthor_queue *
> group_create_queue(struct panthor_group *group,
> - const struct drm_panthor_queue_create *args)
> + const struct drm_panthor_queue_create *args,
> + unsigned int slots_so_far)
> {
> struct drm_gpu_scheduler *drm_sched;
> struct panthor_queue *queue;
> @@ -3038,9 +3167,17 @@ group_create_queue(struct panthor_group *group,
> goto err_free_queue;
> }
>
> + queue->time_offset = group->syncobjs.job_times_offset +
> + (slots_so_far * sizeof(struct panthor_job_times));
Let's just pass an explicit GPU/CPU pointer to group_create_queue(), or
allocate a BO per-queue.
> +
> + /*
> + * Credit limit argument tells us the total number of instructions
> + * across all CS slots in the ringbuffer, with some jobs requiring
> + * twice as many as others, depending on their profiling status.
> + */
> ret = drm_sched_init(&queue->scheduler, &panthor_queue_sched_ops,
> group->ptdev->scheduler->wq, 1,
> - args->ringbuf_size / (NUM_INSTRS_PER_SLOT * sizeof(u64)),
> + args->ringbuf_size / sizeof(u64),
> 0, msecs_to_jiffies(JOB_TIMEOUT_MS),
> group->ptdev->reset.wq,
> NULL, "panthor-queue", group->ptdev->base.dev);
> @@ -3068,7 +3205,9 @@ int panthor_group_create(struct panthor_file *pfile,
> struct panthor_scheduler *sched = ptdev->scheduler;
> struct panthor_fw_csg_iface *csg_iface = panthor_fw_get_csg_iface(ptdev, 0);
> struct panthor_group *group = NULL;
> + unsigned int total_slots;
> u32 gid, i, suspend_size;
> + size_t syncobj_bo_size;
> int ret;
>
> if (group_args->pad)
> @@ -3134,33 +3273,75 @@ int panthor_group_create(struct panthor_file *pfile,
> goto err_put_group;
> }
>
> - group->syncobjs = panthor_kernel_bo_create(ptdev, group->vm,
> - group_args->queues.count *
> - sizeof(struct panthor_syncobj_64b),
> - DRM_PANTHOR_BO_NO_MMAP,
> - DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
> - DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
> - PANTHOR_VM_KERNEL_AUTO_VA);
> - if (IS_ERR(group->syncobjs)) {
> - ret = PTR_ERR(group->syncobjs);
> + /*
> + * Need to add size for the panthor_job_times structs, as many as the sum
> + * of the number of job slots for every single queue ringbuffer.
> + */
> + for (i = 0, total_slots = 0; i < group_args->queues.count; i++)
> + total_slots += (queue_args[i].ringbuf_size / (SLOTSIZE));
> +
> + syncobj_bo_size = (group_args->queues.count * sizeof(struct panthor_syncobj_64b))
> + + (total_slots * sizeof(struct panthor_job_times));
> +
> + /*
> + * Memory layout of group's syncobjs BO
> + * group->syncobjs.bo {
> + * struct panthor_syncobj_64b sync1;
> + * struct panthor_syncobj_64b sync2;
> + * ...
> + * As many as group_args->queues.count
> + * ...
> + * struct panthor_syncobj_64b syncn;
> + * struct panthor_job_times queue1_slot1
> + * struct panthor_job_times queue1_slot2
> + * ...
> + * As many as queue[i].ringbuf_size / SLOTSIZE
> + * ...
> + * struct panthor_job_times queue1_slotP
> + * ...
> + * As many as group_args->queues.count
> + * ...
> + * struct panthor_job_times queueN_slot1
> + * struct panthor_job_times queueN_slot2
> + * ...
> + * As many as queue[n].ringbuf_size / SLOTSIZE
> + * struct panthor_job_times queueN_slotQ
> + *
> + * Linearly, group->syncobjs.bo = {syncojb1,..,syncobjN,
> + * {queue1 = {js1,..,jsP},..,queueN = {js1,..,jsQ}}}
> + * }
> + *
> + */
> +
> + group->syncobjs.bo = panthor_kernel_bo_create(ptdev, group->vm,
> + syncobj_bo_size,
> + DRM_PANTHOR_BO_NO_MMAP,
> + DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
> + DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
> + PANTHOR_VM_KERNEL_AUTO_VA);
> + if (IS_ERR(group->syncobjs.bo)) {
> + ret = PTR_ERR(group->syncobjs.bo);
> goto err_put_group;
> }
>
> - ret = panthor_kernel_bo_vmap(group->syncobjs);
> + ret = panthor_kernel_bo_vmap(group->syncobjs.bo);
> if (ret)
> goto err_put_group;
>
> - memset(group->syncobjs->kmap, 0,
> - group_args->queues.count * sizeof(struct panthor_syncobj_64b));
> + memset(group->syncobjs.bo->kmap, 0, syncobj_bo_size);
>
> - for (i = 0; i < group_args->queues.count; i++) {
> - group->queues[i] = group_create_queue(group, &queue_args[i]);
> + group->syncobjs.job_times_offset =
> + group_args->queues.count * sizeof(struct panthor_syncobj_64b);
> +
> + for (i = 0, total_slots = 0; i < group_args->queues.count; i++) {
> + group->queues[i] = group_create_queue(group, &queue_args[i], total_slots);
> if (IS_ERR(group->queues[i])) {
> ret = PTR_ERR(group->queues[i]);
> group->queues[i] = NULL;
> goto err_put_group;
> }
>
> + total_slots += (queue_args[i].ringbuf_size / (SLOTSIZE));
> group->queue_count++;
> }
>
> @@ -3384,9 +3565,12 @@ panthor_job_create(struct panthor_file *pfile,
> goto err_put_job;
> }
>
> + job->is_profiled = pfile->ptdev->profile_mode;
> +
> ret = drm_sched_job_init(&job->base,
> &job->group->queues[job->queue_idx]->entity,
> - 1, job->group);
> + job->is_profiled ? NUM_INSTRS_PER_SLOT * 2 :
> + NUM_INSTRS_PER_SLOT, job->group);
Can we have an helper calculating the job credits instead of hardcoding
it here? This way, once we decide to make things more dynamic, we only
have one place to patch.
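A sketch of such a helper, with panthor_job reduced to the one field the calculation currently needs; once the profiling flags become finer-grained, only this function would have to change:

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_INSTRS_PER_SLOT 16

/* Reduced stand-in for the driver struct. */
struct panthor_job {
	bool is_profiled;
};

static uint32_t calc_job_credits(const struct panthor_job *job)
{
	/* A profiled job emits the timestamp/cycle prologue and
	 * epilogue around the CS call, doubling its ring buffer
	 * footprint. */
	return job->is_profiled ? NUM_INSTRS_PER_SLOT * 2
				: NUM_INSTRS_PER_SLOT;
}
```

drm_sched_job_init() would then receive calc_job_credits(job) instead of the inline ternary.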
> if (ret)
> goto err_put_job;
>
Regards,
Boris
Thread overview: 13+ messages
2024-07-16 20:11 [PATCH v4 0/4] Support fdinfo runtime and memory stats on Panthor Adrián Larumbe
2024-07-16 20:11 ` [PATCH v4 1/4] drm/panthor: introduce job cycle and timestamp accounting Adrián Larumbe
2024-07-19 14:14 ` Steven Price
2024-07-31 12:41 ` Adrián Larumbe
2024-08-19 7:48 ` Steven Price
2024-08-19 10:58 ` Boris Brezillon
2024-08-19 12:22 ` Boris Brezillon
2024-07-16 20:11 ` [PATCH v4 2/4] drm/panthor: add DRM fdinfo support Adrián Larumbe
2024-07-19 14:14 ` Steven Price
2024-07-16 20:11 ` [PATCH v4 3/4] drm/panthor: enable fdinfo for memory stats Adrián Larumbe
2024-07-19 14:14 ` Steven Price
2024-07-16 20:11 ` [PATCH v4 4/4] drm/panthor: add sysfs knob for enabling job profiling Adrián Larumbe
2024-07-19 14:15 ` Steven Price