From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from bali.collaboradmins.com (bali.collaboradmins.com [148.251.105.195]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2647F15573D; Wed, 4 Sep 2024 07:49:22 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=148.251.105.195 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725436165; cv=none; b=u9kr7WhWOmk9hHAj7ja8UFriOb5w9ruYVOsvxtnH1IVxVK6MJHitHF1BgGOFqwzQ/338OXLTi3IIxp0wOVo8worf7vU8C/lYR6YRZDwJOOqtbh3gsM8Hpzthl2elNHonX9vP0k1YYum8Sy2hQsaICfCEiX+z6/15kL2wi+ErgFQ= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725436165; c=relaxed/simple; bh=OED4sxa3rjYFiAWeWy9Q2YdnCP2YVqKB7b0Rc6pNONE=; h=Date:From:To:Cc:Subject:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=StShDSGzf1SWoqDlUzsrHM5XkAhREPzvKJ0mM2gyBXZpirTAdDMS6tq8KcOcX9wK5LT+e9YddlnXlQLUUdwg8oZmREaRlM5itXEIUyVtGb8rXr3CBHJ+gTMQq/VgtfSU81DukCGDdJcF6HWuJrFYI9/j3QtAQIxgBU8bUou1S34= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=collabora.com; spf=pass smtp.mailfrom=collabora.com; dkim=pass (2048-bit key) header.d=collabora.com header.i=@collabora.com header.b=cf8OetjE; arc=none smtp.client-ip=148.251.105.195 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=collabora.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=collabora.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=collabora.com header.i=@collabora.com header.b="cf8OetjE" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=collabora.com; s=mail; t=1725436161; bh=OED4sxa3rjYFiAWeWy9Q2YdnCP2YVqKB7b0Rc6pNONE=; h=Date:From:To:Cc:Subject:In-Reply-To:References:From; b=cf8OetjENdIZ9FeZSpqyQr4UEl+P54QZPE2tkUEOUjzUBnYKAsACaHrOZggEvs0fU /BzUUMOu0DPsgXkvLomyWiGcyw3iUrtC25NizJQTW9iT346c20Qg7qmWKXiOu3FOYo CRKtnApH9hUsX1NbI+eDtaEDQMbFcHnIf5bvCQxL/Rp04ttVXn5nfomwsxSRZba42X 7rPvdSXqlPKolqCaquoOuCYEsiwLoQpnpOHifXeXz1JFWTfS7L//o1mH/8xieanuO9 Jl3k/g1SRUBRZekpoajkbzYuCRSHVUYESHP3pvZMfOxaIwz4ZGTjNBnMa2dvDr50Zu Cj1cDTbYLw05g== Received: from localhost (unknown [IPv6:2a01:e0a:2c:6930:5cf4:84a1:2763:fe0d]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) (Authenticated sender: bbrezillon) by bali.collaboradmins.com (Postfix) with ESMTPSA id 6B48617E0F92; Wed, 4 Sep 2024 09:49:20 +0200 (CEST) Date: Wed, 4 Sep 2024 09:49:15 +0200 From: Boris Brezillon To: =?UTF-8?B?QWRyacOhbg==?= Larumbe Cc: Steven Price , Liviu Dudau , Maarten Lankhorst , Maxime Ripard , Thomas Zimmermann , David Airlie , Daniel Vetter , Sumit Semwal , Christian =?UTF-8?B?S8O2bmln?= , kernel@collabora.com, dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org, linux-media@vger.kernel.org, linaro-mm-sig@lists.linaro.org Subject: Re: [PATCH v5 1/4] drm/panthor: introduce job cycle and timestamp accounting Message-ID: <20240904094915.1d92661d@collabora.com> In-Reply-To: <20240903202541.430225-2-adrian.larumbe@collabora.com> References: <20240903202541.430225-1-adrian.larumbe@collabora.com> <20240903202541.430225-2-adrian.larumbe@collabora.com> Organization: Collabora X-Mailer: Claws Mail 4.3.0 (GTK 3.24.43; x86_64-redhat-linux-gnu) Precedence: bulk X-Mailing-List: linux-media@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable On Tue, 3 Sep 2024 21:25:35 +0100 Adri=C3=A1n Larumbe wrote: > Enable calculations of job submission times in clock cycles and wall > time. This is done by expanding the boilerplate command stream when runni= ng > a job to include instructions that compute said times right before an aft= er > a user CS. >=20 > A separate kernel BO is created per queue to store those values. Jobs can > access their sampled data through a slots buffer-specific index different > from that of the queue's ringbuffer. The reason for this is saving memory > on the profiling information kernel BO, since the amount of simultaneous > profiled jobs we can write into the queue's ringbuffer might be much > smaller than for regular jobs, as the former take more CSF instructions. >=20 > This commit is done in preparation for enabling DRM fdinfo support in the > Panthor driver, which depends on the numbers calculated herein. >=20 > A profile mode mask has been added that will in a future commit allow UM = to > toggle performance metric sampling behaviour, which is disabled by default > to save power. When a ringbuffer CS is constructed, timestamp and cycling > sampling instructions are added depending on the enabled flags in the > profiling mask. >=20 > A helper was provided that calculates the number of instructions for a > given set of enablement mask, and these are passed as the number of credi= ts > when initialising a DRM scheduler job. >=20 > Signed-off-by: Adri=C3=A1n Larumbe > --- > drivers/gpu/drm/panthor/panthor_device.h | 22 ++ > drivers/gpu/drm/panthor/panthor_sched.c | 327 ++++++++++++++++++++--- > 2 files changed, 305 insertions(+), 44 deletions(-) >=20 > diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/p= anthor/panthor_device.h > index e388c0472ba7..a48e30d0af30 100644 > --- a/drivers/gpu/drm/panthor/panthor_device.h > +++ b/drivers/gpu/drm/panthor/panthor_device.h > @@ -66,6 +66,25 @@ struct panthor_irq { > atomic_t suspended; > }; > =20 > +/** > + * enum panthor_device_profiling_mode - Profiling state > + */ > +enum panthor_device_profiling_flags { > + /** @PANTHOR_DEVICE_PROFILING_DISABLED: Profiling is disabled. */ > + PANTHOR_DEVICE_PROFILING_DISABLED =3D 0, > + > + /** @PANTHOR_DEVICE_PROFILING_CYCLES: Sampling job cycles. */ > + PANTHOR_DEVICE_PROFILING_CYCLES =3D BIT(0), > + > + /** @PANTHOR_DEVICE_PROFILING_TIMESTAMP: Sampling job timestamp. */ > + PANTHOR_DEVICE_PROFILING_TIMESTAMP =3D BIT(1), > + > + /** @PANTHOR_DEVICE_PROFILING_ALL: Sampling everything. */ > + PANTHOR_DEVICE_PROFILING_ALL =3D > + PANTHOR_DEVICE_PROFILING_CYCLES | > + PANTHOR_DEVICE_PROFILING_TIMESTAMP, > +}; > + > /** > * struct panthor_device - Panthor device > */ > @@ -162,6 +181,9 @@ struct panthor_device { > */ > struct page *dummy_latest_flush; > } pm; > + > + /** @profile_mask: User-set profiling flags for job accounting. */ > + u32 profile_mask; > }; > =20 > /** > diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/pa= nthor/panthor_sched.c > index c426a392b081..b087648bf59a 100644 > --- a/drivers/gpu/drm/panthor/panthor_sched.c > +++ b/drivers/gpu/drm/panthor/panthor_sched.c > @@ -93,6 +93,9 @@ > #define MIN_CSGS 3 > #define MAX_CSG_PRIO 0xf > =20 > +#define NUM_INSTRS_PER_CACHE_LINE (64 / sizeof(u64)) > +#define MAX_INSTRS_PER_JOB 32 > + > struct panthor_group; > =20 > /** > @@ -476,6 +479,18 @@ struct panthor_queue { > */ > struct list_head in_flight_jobs; > } fence_ctx; > + > + /** @profiling_info: Job profiling data slots and access information. */ > + struct { > + /** @slots: Kernel BO holding the slots. */ > + struct panthor_kernel_bo *slots; > + > + /** @slot_count: Number of jobs ringbuffer can hold at once. */ > + u32 slot_count; > + > + /** @profiling_seqno: Index of the next available profiling informatio= n slot. */ > + u32 profiling_seqno; Nit: no need to repeat profiling as it's under the profiling_info struct. I would simply name that one 'seqno'. > + } profiling_info; s/profiling_info/profiling/ ? > }; > =20 > /** > @@ -661,6 +676,18 @@ struct panthor_group { > struct list_head wait_node; > }; > =20 > +struct panthor_job_profiling_data { > + struct { > + u64 before; > + u64 after; > + } cycles; > + > + struct { > + u64 before; > + u64 after; > + } time; > +}; > + > /** > * group_queue_work() - Queue a group work > * @group: Group to queue the work for. > @@ -774,6 +801,12 @@ struct panthor_job { > =20 > /** @done_fence: Fence signaled when the job is finished or cancelled. = */ > struct dma_fence *done_fence; > + > + /** @profile_mask: Current device job profiling enablement bitmask. */ > + u32 profile_mask; > + > + /** @profile_slot: Job profiling information index in the profiling slo= ts BO. */ > + u32 profiling_slot; Nit: we tend to group fields together under sub-structs, so I'd say: struct { u32 mask; // or flags u32 slot; } profiling; > }; > =20 > static void > @@ -838,6 +871,7 @@ static void group_free_queue(struct panthor_group *gr= oup, struct panthor_queue * > =20 > panthor_kernel_bo_destroy(queue->ringbuf); > panthor_kernel_bo_destroy(queue->iface.mem); > + panthor_kernel_bo_destroy(queue->profiling_info.slots); > =20 > /* Release the last_fence we were holding, if any. */ > dma_fence_put(queue->fence_ctx.last_fence); > @@ -1982,8 +2016,6 @@ tick_ctx_init(struct panthor_scheduler *sched, > } > } > =20 > -#define NUM_INSTRS_PER_SLOT 16 > - > static void > group_term_post_processing(struct panthor_group *group) > { > @@ -2815,65 +2847,211 @@ static void group_sync_upd_work(struct work_stru= ct *work) > group_put(group); > } > =20 > -static struct dma_fence * > -queue_run_job(struct drm_sched_job *sched_job) > +struct panthor_job_ringbuf_instrs { > + u64 buffer[MAX_INSTRS_PER_JOB]; > + u32 count; > +}; > + > +struct panthor_job_instr { > + u32 profile_mask; > + u64 instr; > +}; > + > +#define JOB_INSTR(__prof, __instr) \ > + { \ > + .profile_mask =3D __prof, \ > + .instr =3D __instr, \ > + } > + > +static void > +copy_instrs_to_ringbuf(struct panthor_queue *queue, > + struct panthor_job *job, > + struct panthor_job_ringbuf_instrs *instrs) > +{ > + ssize_t ringbuf_size =3D panthor_kernel_bo_size(queue->ringbuf); > + u32 start =3D job->ringbuf.start & (ringbuf_size - 1); > + ssize_t size, written; > + > + /* > + * We need to write a whole slot, including any trailing zeroes > + * that may come at the end of it. Also, because instrs.buffer had > + * been zero-initialised, there's no need to pad it with 0's > + */ > + instrs->count =3D ALIGN(instrs->count, NUM_INSTRS_PER_CACHE_LINE); > + size =3D instrs->count * sizeof(u64); > + written =3D min(ringbuf_size - start, size); > + > + memcpy(queue->ringbuf->kmap + start, instrs->buffer, written); > + > + if (written < size) > + memcpy(queue->ringbuf->kmap, > + &instrs->buffer[(ringbuf_size - start)/sizeof(u64)], > + size - written); > +} > + > +struct panthor_job_cs_params { > + u32 profile_mask; > + u64 addr_reg; u64 val_reg; > + u64 cycle_reg; u64 time_reg; > + u64 sync_addr; u64 times_addr; > + u64 cs_start; u64 cs_size; > + u32 last_flush; u32 waitall_mask; > +}; > + > +static void > +get_job_cs_params(struct panthor_job *job, struct panthor_job_cs_params = *params) > { > - struct panthor_job *job =3D container_of(sched_job, struct panthor_job,= base); > struct panthor_group *group =3D job->group; > struct panthor_queue *queue =3D group->queues[job->queue_idx]; > struct panthor_device *ptdev =3D group->ptdev; > struct panthor_scheduler *sched =3D ptdev->scheduler; > - u32 ringbuf_size =3D panthor_kernel_bo_size(queue->ringbuf); > - u32 ringbuf_insert =3D queue->iface.input->insert & (ringbuf_size - 1); > - u64 addr_reg =3D ptdev->csif_info.cs_reg_count - > - ptdev->csif_info.unpreserved_cs_reg_count; > - u64 val_reg =3D addr_reg + 2; > - u64 sync_addr =3D panthor_kernel_bo_gpuva(group->syncobjs) + > - job->queue_idx * sizeof(struct panthor_syncobj_64b); > - u32 waitall_mask =3D GENMASK(sched->sb_slot_count - 1, 0); > - struct dma_fence *done_fence; > - int ret; > =20 > - u64 call_instrs[NUM_INSTRS_PER_SLOT] =3D { > + params->addr_reg =3D ptdev->csif_info.cs_reg_count - > + ptdev->csif_info.unpreserved_cs_reg_count; > + params->val_reg =3D params->addr_reg + 2; > + params->cycle_reg =3D params->addr_reg; > + params->time_reg =3D params->val_reg; > + > + params->sync_addr =3D panthor_kernel_bo_gpuva(group->syncobjs) + > + job->queue_idx * sizeof(struct panthor_syncobj_64b); > + params->times_addr =3D panthor_kernel_bo_gpuva(queue->profiling_info.sl= ots) + > + (job->profiling_slot * sizeof(struct panthor_job_profiling_data)= ); > + params->waitall_mask =3D GENMASK(sched->sb_slot_count - 1, 0); > + > + params->cs_start =3D job->call_info.start; > + params->cs_size =3D job->call_info.size; > + params->last_flush =3D job->call_info.latest_flush; > + > + params->profile_mask =3D job->profile_mask; > +} > + > +static void > +prepare_job_instrs(const struct panthor_job_cs_params *params, > + struct panthor_job_ringbuf_instrs *instrs) > +{ > + const struct panthor_job_instr instr_seq[] =3D { > /* MOV32 rX+2, cs.latest_flush */ > - (2ull << 56) | (val_reg << 48) | job->call_info.latest_flush, > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED, > + (2ull << 56) | (params->val_reg << 48) | params->last_flush), > =20 > /* FLUSH_CACHE2.clean_inv_all.no_wait.signal(0) rX+2 */ > - (36ull << 56) | (0ull << 48) | (val_reg << 40) | (0 << 16) | 0x233, > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED, > + (36ull << 56) | (0ull << 48) | (params->val_reg << 40) | (0 << 16) = | 0x233), > + > + /* MOV48 rX:rX+1, cycles_offset */ > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_CYCLES, > + (1ull << 56) | (params->cycle_reg << 48) | > + (params->times_addr + > + offsetof(struct panthor_job_profiling_data, cycles.before))), > + /* STORE_STATE cycles */ > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_CYCLES, > + (40ull << 56) | (params->cycle_reg << 40) | (1ll << 32)), > + /* MOV48 rX:rX+1, time_offset */ > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_TIMESTAMP, > + (1ull << 56) | (params->time_reg << 48) | > + (params->times_addr + > + offsetof(struct panthor_job_profiling_data, time.before))), > + /* STORE_STATE timer */ > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_TIMESTAMP, > + (40ull << 56) | (params->time_reg << 40) | (0ll << 32)), > =20 > /* MOV48 rX:rX+1, cs.start */ > - (1ull << 56) | (addr_reg << 48) | job->call_info.start, > - > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED, > + (1ull << 56) | (params->addr_reg << 48) | params->cs_start), > /* MOV32 rX+2, cs.size */ > - (2ull << 56) | (val_reg << 48) | job->call_info.size, > - > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED, > + (2ull << 56) | (params->val_reg << 48) | params->cs_size), > /* WAIT(0) =3D> waits for FLUSH_CACHE2 instruction */ > - (3ull << 56) | (1 << 16), > - > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED, (3ull << 56) | (1 << 16)), > /* CALL rX:rX+1, rX+2 */ > - (32ull << 56) | (addr_reg << 40) | (val_reg << 32), > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED, > + (32ull << 56) | (params->addr_reg << 40) | (params->val_reg << 32)), > + > + /* MOV48 rX:rX+1, cycles_offset */ > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_CYCLES, > + (1ull << 56) | (params->cycle_reg << 48) | > + (params->times_addr + > + offsetof(struct panthor_job_profiling_data, cycles.after))), > + /* STORE_STATE cycles */ > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_CYCLES, > + (40ull << 56) | (params->cycle_reg << 40) | (1ll << 32)), > + > + /* MOV48 rX:rX+1, time_offset */ > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_TIMESTAMP, > + (1ull << 56) | (params->time_reg << 48) | > + (params->times_addr + > + offsetof(struct panthor_job_profiling_data, time.after))), > + /* STORE_STATE timer */ > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_TIMESTAMP, > + (40ull << 56) | (params->time_reg << 40) | (0ll << 32)), > =20 > /* MOV48 rX:rX+1, sync_addr */ > - (1ull << 56) | (addr_reg << 48) | sync_addr, > - > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED, > + (1ull << 56) | (params->addr_reg << 48) | params->sync_addr), > /* MOV48 rX+2, #1 */ > - (1ull << 56) | (val_reg << 48) | 1, > - > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED, > + (1ull << 56) | (params->val_reg << 48) | 1), > /* WAIT(all) */ > - (3ull << 56) | (waitall_mask << 16), > - > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED, > + (3ull << 56) | (params->waitall_mask << 16)), > /* SYNC_ADD64.system_scope.propage_err.nowait rX:rX+1, rX+2*/ > - (51ull << 56) | (0ull << 48) | (addr_reg << 40) | (val_reg << 32) | (0= << 16) | 1, > - > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED, > + (51ull << 56) | (0ull << 48) | (params->addr_reg << 40) | > + (params->val_reg << 32) | (0 << 16) | 1), > /* ERROR_BARRIER, so we can recover from faults at job > * boundaries. > */ > - (47ull << 56), > + JOB_INSTR(PANTHOR_DEVICE_PROFILING_DISABLED, (47ull << 56)), > + }; > + u32 pad; > + > + /* NEED to be cacheline aligned to please the prefetcher. */ > + static_assert(sizeof(instrs->buffer) % 64 =3D=3D 0, > + "panthor_job_ringbuf_instrs::buffer is not aligned on a cachelin= e"); > + > + /* Make sure we have enough storage to store the whole sequence. */ > + static_assert(ALIGN(ARRAY_SIZE(instr_seq), NUM_INSTRS_PER_CACHE_LINE) <= =3D > + ARRAY_SIZE(instrs->buffer), > + "instr_seq vs panthor_job_ringbuf_instrs::buffer size mismatch"); We probably want to catch situations where instrs->buffer has gone bigger than needed (say we found a way to drop instructions), so I would turn the '<=3D' condition into an '=3D=3D'. > + > + for (u32 i =3D 0; i < ARRAY_SIZE(instr_seq); i++) { > + /* If the profile mask of this instruction is not enabled, skip it. */ > + if (instr_seq[i].profile_mask && > + !(instr_seq[i].profile_mask & params->profile_mask)) > + continue; > + > + instrs->buffer[instrs->count++] =3D instr_seq[i].instr; > + } > + > + pad =3D ALIGN(instrs->count, NUM_INSTRS_PER_CACHE_LINE); > + memset(&instrs->buffer[instrs->count], 0, > + (pad - instrs->count) * sizeof(instrs->buffer[0])); > + instrs->count =3D pad; > +} > + > +static u32 calc_job_credits(u32 profile_mask) > +{ > + struct panthor_job_ringbuf_instrs instrs; > + struct panthor_job_cs_params params =3D { > + .profile_mask =3D profile_mask, > }; > =20 > - /* Need to be cacheline aligned to please the prefetcher. */ > - static_assert(sizeof(call_instrs) % 64 =3D=3D 0, > - "call_instrs is not aligned on a cacheline"); > + prepare_job_instrs(¶ms, &instrs); > + return instrs.count; > +} > + > +static struct dma_fence * > +queue_run_job(struct drm_sched_job *sched_job) > +{ > + struct panthor_job *job =3D container_of(sched_job, struct panthor_job,= base); > + struct panthor_group *group =3D job->group; > + struct panthor_queue *queue =3D group->queues[job->queue_idx]; > + struct panthor_device *ptdev =3D group->ptdev; > + struct panthor_scheduler *sched =3D ptdev->scheduler; > + struct panthor_job_ringbuf_instrs instrs; > + struct panthor_job_cs_params cs_params; > + struct dma_fence *done_fence; > + int ret; > =20 > /* Stream size is zero, nothing to do except making sure all previously > * submitted jobs are done before we signal the > @@ -2900,17 +3078,23 @@ queue_run_job(struct drm_sched_job *sched_job) > queue->fence_ctx.id, > atomic64_inc_return(&queue->fence_ctx.seqno)); > =20 > - memcpy(queue->ringbuf->kmap + ringbuf_insert, > - call_instrs, sizeof(call_instrs)); > + job->profiling_slot =3D queue->profiling_info.profiling_seqno++; > + if (queue->profiling_info.profiling_seqno =3D=3D queue->profiling_info.= slot_count) > + queue->profiling_info.profiling_seqno =3D 0; > + > + job->ringbuf.start =3D queue->iface.input->insert; > + > + get_job_cs_params(job, &cs_params); > + prepare_job_instrs(&cs_params, &instrs); > + copy_instrs_to_ringbuf(queue, job, &instrs); > + > + job->ringbuf.end =3D job->ringbuf.start + (instrs.count * sizeof(u64)); > =20 > panthor_job_get(&job->base); > spin_lock(&queue->fence_ctx.lock); > list_add_tail(&job->node, &queue->fence_ctx.in_flight_jobs); > spin_unlock(&queue->fence_ctx.lock); > =20 > - job->ringbuf.start =3D queue->iface.input->insert; > - job->ringbuf.end =3D job->ringbuf.start + sizeof(call_instrs); > - > /* Make sure the ring buffer is updated before the INSERT > * register. > */ > @@ -3003,6 +3187,24 @@ static const struct drm_sched_backend_ops panthor_= queue_sched_ops =3D { > .free_job =3D queue_free_job, > }; > =20 > +static u32 calc_profiling_ringbuf_num_slots(struct panthor_device *ptdev, > + u32 cs_ringbuf_size) > +{ > + u32 min_profiled_job_instrs =3D U32_MAX; > + u32 last_flag =3D fls(PANTHOR_DEVICE_PROFILING_ALL); > + > + for (u32 i =3D 0; i < last_flag; i++) { > + if (BIT(i) & PANTHOR_DEVICE_PROFILING_ALL) > + min_profiled_job_instrs =3D > + min(min_profiled_job_instrs, calc_job_credits(BIT(i))); > + } Okay, I think this loop deserves an explanation. The goal is to calculate the minimal size of a profile job so we can deduce the maximum number of profiling slots that will be used simultaneously. We ignore PANTHOR_DEVICE_PROFILING_DISABLED, because those jobs won't use a profiling slot in the first place. > + > + drm_WARN_ON(&ptdev->base, > + !IS_ALIGNED(min_profiled_job_instrs, NUM_INSTRS_PER_CACHE_LINE)); We can probably drop this WARN_ON(), it's supposed to be checked in calc_job_credits(). > + > + return DIV_ROUND_UP(cs_ringbuf_size, min_profiled_job_instrs * sizeof(u= 64)); > +} > + > static struct panthor_queue * > group_create_queue(struct panthor_group *group, > const struct drm_panthor_queue_create *args) > @@ -3056,9 +3258,38 @@ group_create_queue(struct panthor_group *group, > goto err_free_queue; > } > =20 > + queue->profiling_info.slot_count =3D > + calc_profiling_ringbuf_num_slots(group->ptdev, args->ringbuf_size); > + > + queue->profiling_info.slots =3D > + panthor_kernel_bo_create(group->ptdev, group->vm, > + queue->profiling_info.slot_count * > + sizeof(struct panthor_job_profiling_data), > + DRM_PANTHOR_BO_NO_MMAP, > + DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC | > + DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED, > + PANTHOR_VM_KERNEL_AUTO_VA); > + > + if (IS_ERR(queue->profiling_info.slots)) { > + ret =3D PTR_ERR(queue->profiling_info.slots); > + goto err_free_queue; > + } > + > + ret =3D panthor_kernel_bo_vmap(queue->profiling_info.slots); > + if (ret) > + goto err_free_queue; > + > + memset(queue->profiling_info.slots->kmap, 0, > + queue->profiling_info.slot_count * sizeof(struct panthor_job_pro= filing_data)); I don't think we need to memset() the profiling buffer. > + > + /* > + * Credit limit argument tells us the total number of instructions > + * across all CS slots in the ringbuffer, with some jobs requiring > + * twice as many as others, depending on their profiling status. > + */ > ret =3D drm_sched_init(&queue->scheduler, &panthor_queue_sched_ops, > group->ptdev->scheduler->wq, 1, > - args->ringbuf_size / (NUM_INSTRS_PER_SLOT * sizeof(u64)), > + args->ringbuf_size / sizeof(u64), > 0, msecs_to_jiffies(JOB_TIMEOUT_MS), > group->ptdev->reset.wq, > NULL, "panthor-queue", group->ptdev->base.dev); > @@ -3354,6 +3585,7 @@ panthor_job_create(struct panthor_file *pfile, > { > struct panthor_group_pool *gpool =3D pfile->groups; > struct panthor_job *job; > + u32 credits; > int ret; > =20 > if (qsubmit->pad) > @@ -3407,9 +3639,16 @@ panthor_job_create(struct panthor_file *pfile, > } > } > =20 > + job->profile_mask =3D pfile->ptdev->profile_mask; > + credits =3D calc_job_credits(job->profile_mask); > + if (credits =3D=3D 0) { > + ret =3D -EINVAL; > + goto err_put_job; > + } > + > ret =3D drm_sched_job_init(&job->base, > &job->group->queues[job->queue_idx]->entity, > - 1, job->group); > + credits, job->group); > if (ret) > goto err_put_job; > =20 Just add a bunch of minor comments (mostly cosmetic changes), but the implementation looks good to me. Reviewed-by: Boris Brezillon