From: Luben Tuikov <luben.tuikov@amd.com>
To: Matthew Brost <matthew.brost@intel.com>
Cc: robdclark@chromium.org, sarah.walker@imgtec.com,
ketil.johnsen@arm.com, Liviu.Dudau@arm.com, mcanal@igalia.com,
frank.binns@imgtec.com, dri-devel@lists.freedesktop.org,
christian.koenig@amd.com, boris.brezillon@collabora.com,
dakr@redhat.com, donald.robson@imgtec.com, daniel@ffwll.ch,
lina@asahilina.net, airlied@gmail.com,
intel-xe@lists.freedesktop.org, faith.ekstrand@collabora.com
Subject: Re: [Intel-xe] [PATCH v4 02/10] drm/sched: Convert drm scheduler to use a work queue rather than kthread
Date: Thu, 5 Oct 2023 00:13:01 -0400 [thread overview]
Message-ID: <a39eb381-4f2b-439b-b223-c5148167b225@amd.com> (raw)
In-Reply-To: <ZR4upS/Mkh0lkzJ0@DUT025-TGLU.fm.intel.com>
On 2023-10-04 23:33, Matthew Brost wrote:
> On Tue, Sep 26, 2023 at 11:32:10PM -0400, Luben Tuikov wrote:
>> Hi,
>>
>> On 2023-09-19 01:01, Matthew Brost wrote:
>>> In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1
>>> mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
>>> seems a bit odd, but let us explain the reasoning below.
>>>
>>> 1. In XE the submission order from multiple drm_sched_entity is not
>>> guaranteed to match the completion order, even if targeting the same
>>> hardware engine. This is because in XE we have a firmware scheduler, the
>>> GuC, which is allowed to reorder, timeslice, and preempt submissions. If
>>> a shared drm_gpu_scheduler is used across multiple drm_sched_entity, the
>>> TDR falls apart as the TDR expects submission order == completion order.
>>> Using a dedicated drm_gpu_scheduler per drm_sched_entity solves this
>>> problem.
>>>
>>> 2. In XE submissions are done via programming a ring buffer (circular
>>> buffer), and a drm_gpu_scheduler provides a limit on the number of jobs;
>>> if the limit on the number of jobs is set to RING_SIZE / MAX_SIZE_PER_JOB,
>>> we get flow control on the ring for free.
>>>
>>> A problem with this design is that currently a drm_gpu_scheduler uses a
>>> kthread for submission / job cleanup. This doesn't scale if a large
>>> number of drm_gpu_schedulers are used. To work around the scaling issue,
>>> use a worker rather than a kthread for submission / job cleanup.
>>>
>>> v2:
>>> - (Rob Clark) Fix msm build
>>> - Pass in run work queue
>>> v3:
>>> - (Boris) don't have loop in worker
>>> v4:
>>> - (Tvrtko) break out submit ready, stop, start helpers into own patch
>>> v5:
>>> - (Boris) default to ordered work queue
>>>
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>>> drivers/gpu/drm/etnaviv/etnaviv_sched.c | 2 +-
>>> drivers/gpu/drm/lima/lima_sched.c | 2 +-
>>> drivers/gpu/drm/msm/msm_ringbuffer.c | 2 +-
>>> drivers/gpu/drm/nouveau/nouveau_sched.c | 2 +-
>>> drivers/gpu/drm/panfrost/panfrost_job.c | 2 +-
>>> drivers/gpu/drm/scheduler/sched_main.c | 118 ++++++++++-----------
>>> drivers/gpu/drm/v3d/v3d_sched.c | 10 +-
>>> include/drm/gpu_scheduler.h | 14 ++-
>>> 9 files changed, 79 insertions(+), 75 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index e366f61c3aed..16f3cfe1574a 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -2279,7 +2279,7 @@ static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)
>>> break;
>>> }
>>>
>>> - r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
>>> + r = drm_sched_init(&ring->sched, &amdgpu_sched_ops, NULL,
>>> ring->num_hw_submission, 0,
>>> timeout, adev->reset_domain->wq,
>>> ring->sched_score, ring->name,
>>> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
>>> index 345fec6cb1a4..618a804ddc34 100644
>>> --- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
>>> +++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
>>> @@ -134,7 +134,7 @@ int etnaviv_sched_init(struct etnaviv_gpu *gpu)
>>> {
>>> int ret;
>>>
>>> - ret = drm_sched_init(&gpu->sched, &etnaviv_sched_ops,
>>> + ret = drm_sched_init(&gpu->sched, &etnaviv_sched_ops, NULL,
>>> etnaviv_hw_jobs_limit, etnaviv_job_hang_limit,
>>> msecs_to_jiffies(500), NULL, NULL,
>>> dev_name(gpu->dev), gpu->dev);
>>> diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
>>> index ffd91a5ee299..8d858aed0e56 100644
>>> --- a/drivers/gpu/drm/lima/lima_sched.c
>>> +++ b/drivers/gpu/drm/lima/lima_sched.c
>>> @@ -488,7 +488,7 @@ int lima_sched_pipe_init(struct lima_sched_pipe *pipe, const char *name)
>>>
>>> INIT_WORK(&pipe->recover_work, lima_sched_recover_work);
>>>
>>> - return drm_sched_init(&pipe->base, &lima_sched_ops, 1,
>>> + return drm_sched_init(&pipe->base, &lima_sched_ops, NULL, 1,
>>> lima_job_hang_limit,
>>> msecs_to_jiffies(timeout), NULL,
>>> NULL, name, pipe->ldev->dev);
>>> diff --git a/drivers/gpu/drm/msm/msm_ringbuffer.c b/drivers/gpu/drm/msm/msm_ringbuffer.c
>>> index 40c0bc35a44c..b8865e61b40f 100644
>>> --- a/drivers/gpu/drm/msm/msm_ringbuffer.c
>>> +++ b/drivers/gpu/drm/msm/msm_ringbuffer.c
>>> @@ -94,7 +94,7 @@ struct msm_ringbuffer *msm_ringbuffer_new(struct msm_gpu *gpu, int id,
>>> /* currently managing hangcheck ourselves: */
>>> sched_timeout = MAX_SCHEDULE_TIMEOUT;
>>>
>>> - ret = drm_sched_init(&ring->sched, &msm_sched_ops,
>>> + ret = drm_sched_init(&ring->sched, &msm_sched_ops, NULL,
>>> num_hw_submissions, 0, sched_timeout,
>>> NULL, NULL, to_msm_bo(ring->bo)->name, gpu->dev->dev);
>>
>> checkpatch.pl complains here about unmatched open parens.
>>
>
> Will fix and run checkpatch before posting next rev.
>
>>> if (ret) {
>>> diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
>>> index 88217185e0f3..d458c2227d4f 100644
>>> --- a/drivers/gpu/drm/nouveau/nouveau_sched.c
>>> +++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
>>> @@ -429,7 +429,7 @@ int nouveau_sched_init(struct nouveau_drm *drm)
>>> if (!drm->sched_wq)
>>> return -ENOMEM;
>>>
>>> - return drm_sched_init(sched, &nouveau_sched_ops,
>>> + return drm_sched_init(sched, &nouveau_sched_ops, NULL,
>>> NOUVEAU_SCHED_HW_SUBMISSIONS, 0, job_hang_limit,
>>> NULL, NULL, "nouveau_sched", drm->dev->dev);
>>> }
>>> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
>>> index 033f5e684707..326ca1ddf1d7 100644
>>> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
>>> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
>>> @@ -831,7 +831,7 @@ int panfrost_job_init(struct panfrost_device *pfdev)
>>> js->queue[j].fence_context = dma_fence_context_alloc(1);
>>>
>>> ret = drm_sched_init(&js->queue[j].sched,
>>> - &panfrost_sched_ops,
>>> + &panfrost_sched_ops, NULL,
>>> nentries, 0,
>>> msecs_to_jiffies(JOB_TIMEOUT_MS),
>>> pfdev->reset.wq,
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index e4fa62abca41..ee6281942e36 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -48,7 +48,6 @@
>>> * through the jobs entity pointer.
>>> */
>>>
>>> -#include <linux/kthread.h>
>>> #include <linux/wait.h>
>>> #include <linux/sched.h>
>>> #include <linux/completion.h>
>>> @@ -256,6 +255,16 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>> return rb ? rb_entry(rb, struct drm_sched_entity, rb_tree_node) : NULL;
>>> }
>>>
>>> +/**
>>> + * drm_sched_submit_queue - scheduler queue submission
>>
>> There is no verb in the description, and it is not clear what
>> this function does unless one reads the code. Given that this
>> is DOC, this should be clearer here. Something like "queue
>> scheduler work to be executed" or something to that effect.
>>
>
> Will fix.
>
>> Coming back to this from reading the patch below, it was somewhat
>> unclear what "drm_sched_submit_queue()" does, since when reading
>> below, "submit" was being read by my mind as an adjective, as opposed
>> to a verb. Perhaps something like:
>>
>> drm_sched_queue_submit(), or
>> drm_sched_queue_exec(), or
>> drm_sched_queue_push(), or something to that effect. You pick. :-)
>>
>
> I prefer the name as is. In this patch we have:
>
> drm_sched_submit_queue()
> drm_sched_submit_start()
> drm_sched_submit_stop()
> drm_sched_submit_ready()
>
> I like that all these functions start with 'drm_sched_submit', which
> allows for easy searching for the functions that touch the DRM scheduler
> submission state.
>
> With a little better doc, are you fine with the names as is?
Notice the following scheme in the naming:

drm_sched_submit_queue()
drm_sched_submit_start()
drm_sched_submit_stop()
drm_sched_submit_ready()
\---+---/ \--+-/ \-+-/
    |        |     +---> a verb
    |        +---------> should be a noun (something in the component)
    +------------------> the kernel/software component
And although "queue" can technically be used as a verb too, I'd rather it be "enqueue",
like this:
drm_sched_submit_enqueue()
And using "submit" as the noun of the component is a bit cringy,
since "submit" is really a verb, and it's cringy to make it a "state"
or an "object" we operate on in the DRM Scheduler. "Submission" is
a noun, but "submission enqueue/start/stop/ready" doesn't sound
very well thought out. "Submission" really is what the work-queue
does.
I'd rather it be a real object, like for instance,
drm_sched_wqueue_enqueue()
drm_sched_wqueue_start()
drm_sched_wqueue_stop()
drm_sched_wqueue_ready()
Which tells me that the component is the DRM Scheduler, the object is a/the work-queue,
and the last word, the verb, is the action we're performing on the object, i.e. the work-queue.
Plus, all these functions actually do operate on work-queues, directly or indirectly,
and are all new, so it's a win-win naming scheme.
I think that would be most likeable.
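Just to make that concrete, here is a rough sketch of what the renamed
helpers could look like (illustrative only; the submit_wq, work_submit
and pause_submit field names are assumptions on my part, not necessarily
what the patch actually uses):

static void drm_sched_wqueue_enqueue(struct drm_gpu_scheduler *sched)
{
	/* Queue the scheduler's submission work, unless submission is paused. */
	if (!READ_ONCE(sched->pause_submit))
		queue_work(sched->submit_wq, &sched->work_submit);
}

void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched)
{
	/* Pause submission and wait for any in-flight work to finish. */
	WRITE_ONCE(sched->pause_submit, true);
	cancel_work_sync(&sched->work_submit);
}

void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched)
{
	/* Unpause submission and kick the work item to pick up again. */
	WRITE_ONCE(sched->pause_submit, false);
	queue_work(sched->submit_wq, &sched->work_submit);
}

bool drm_sched_wqueue_ready(struct drm_gpu_scheduler *sched)
{
	/* True once drm_sched_init() has brought the scheduler up. */
	return sched->ready;
}

A call site then reads naturally: drm_sched_wqueue_stop() plainly stops
the scheduler's work-queue, and drm_sched_wqueue_enqueue() plainly
queues work on it.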
--
Regards,
Luben
Thread overview: 45+ messages
2023-09-19 5:01 [Intel-xe] [PATCH v4 00/10] DRM scheduler changes for Xe Matthew Brost
2023-09-19 5:01 ` [Intel-xe] [PATCH v4 01/10] drm/sched: Add drm_sched_submit_* helpers Matthew Brost
2023-09-19 5:58 ` Christian König
2023-09-21 3:41 ` Luben Tuikov
2023-09-27 1:07 ` Luben Tuikov
2023-09-19 5:01 ` [Intel-xe] [PATCH v4 02/10] drm/sched: Convert drm scheduler to use a work queue rather than kthread Matthew Brost
2023-09-27 3:32 ` Luben Tuikov
2023-10-05 3:33 ` Matthew Brost
2023-10-05 4:13 ` Luben Tuikov [this message]
2023-10-05 15:19 ` Matthew Brost
2023-10-06 7:59 ` Tvrtko Ursulin
2023-10-06 15:14 ` Matthew Brost
2023-10-06 23:43 ` Matthew Brost
2023-10-09 8:35 ` Tvrtko Ursulin
2023-10-11 23:19 ` Luben Tuikov
2023-10-11 23:11 ` Luben Tuikov
2023-10-11 23:10 ` Luben Tuikov
2023-09-19 5:01 ` [Intel-xe] [PATCH v4 03/10] drm/sched: Move schedule policy to scheduler Matthew Brost
2023-09-24 1:18 ` kernel test robot
2023-09-27 12:13 ` Luben Tuikov
2023-09-19 5:01 ` [Intel-xe] [PATCH v4 04/10] drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy Matthew Brost
2023-09-27 14:36 ` Luben Tuikov
2023-10-05 4:02 ` Matthew Brost
2023-09-19 5:01 ` [Intel-xe] [PATCH v4 05/10] drm/sched: Split free_job into own work item Matthew Brost
2023-09-28 16:14 ` Luben Tuikov
2023-10-05 4:06 ` Matthew Brost
2023-10-11 23:29 ` Luben Tuikov
2023-09-19 5:01 ` [Intel-xe] [PATCH v4 06/10] drm/sched: Add drm_sched_start_timeout_unlocked helper Matthew Brost
2023-09-29 21:23 ` Luben Tuikov
2023-09-19 5:01 ` [Intel-xe] [PATCH v4 07/10] drm/sched: Start submission before TDR in drm_sched_start Matthew Brost
2023-09-29 21:53 ` Luben Tuikov
2023-09-30 19:48 ` Luben Tuikov
2023-10-05 3:11 ` Matthew Brost
2023-10-05 3:18 ` Luben Tuikov
2023-09-19 5:01 ` [Intel-xe] [PATCH v4 08/10] drm/sched: Submit job before starting TDR Matthew Brost
2023-09-29 21:58 ` Luben Tuikov
2023-10-05 4:11 ` Matthew Brost
2023-09-19 5:01 ` [Intel-xe] [PATCH v4 09/10] drm/sched: Add helper to queue TDR immediately for current and future jobs Matthew Brost
2023-09-29 22:44 ` Luben Tuikov
2023-10-05 3:22 ` Matthew Brost
2023-09-19 5:01 ` [Intel-xe] [PATCH v4 10/10] drm/sched: Update maintainers of GPU scheduler Matthew Brost
2023-09-19 5:32 ` [Intel-xe] ✗ CI.Patch_applied: failure for DRM scheduler changes for Xe (rev6) Patchwork
2023-09-19 11:44 ` [Intel-xe] [PATCH v4 00/10] DRM scheduler changes for Xe Danilo Krummrich
2023-09-25 21:47 ` Danilo Krummrich
2023-09-27 7:33 ` Boris Brezillon