Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: Luben Tuikov <luben.tuikov@amd.com>
To: Matthew Brost <matthew.brost@intel.com>
Cc: robdclark@chromium.org, sarah.walker@imgtec.com,
	ketil.johnsen@arm.com, lina@asahilina.net, mcanal@igalia.com,
	Liviu.Dudau@arm.com, dri-devel@lists.freedesktop.org,
	intel-xe@lists.freedesktop.org, boris.brezillon@collabora.com,
	donald.robson@imgtec.com, christian.koenig@amd.com,
	faith.ekstrand@collabora.com
Subject: Re: [Intel-xe] [PATCH v3 09/13] drm/sched: Submit job before starting TDR
Date: Wed, 20 Sep 2023 23:35:13 -0400	[thread overview]
Message-ID: <8e5eab14-9e55-42c9-b6ea-02fcc591266d@amd.com> (raw)
In-Reply-To: <ZQNHZm4HPeyPBPMQ@DUT025-TGLU.fm.intel.com>

On 2023-09-14 13:48, Matthew Brost wrote:
> On Wed, Sep 13, 2023 at 10:56:10PM -0400, Luben Tuikov wrote:
>> On 2023-09-11 22:16, Matthew Brost wrote:
>>> If the TDR is set to a value, it can fire before a job is submitted in
>>> drm_sched_main. The job should be always be submitted before the TDR
>>> fires, fix this ordering.
>>>
>>> v2:
>>>   - Add to pending list before run_job, start TDR after (Luben, Boris)
>>>
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> ---
>>>  drivers/gpu/drm/scheduler/sched_main.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index c627d3e6494a..9dbfab7be2c6 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -498,7 +498,6 @@ static void drm_sched_job_begin(struct drm_sched_job *s_job)
>>>  
>>>  	spin_lock(&sched->job_list_lock);
>>>  	list_add_tail(&s_job->list, &sched->pending_list);
>>> -	drm_sched_start_timeout(sched);
>>>  	spin_unlock(&sched->job_list_lock);
>>>  }
>>>  
>>> @@ -1234,6 +1233,7 @@ static void drm_sched_run_job_work(struct work_struct *w)
>>>  		fence = sched->ops->run_job(sched_job);
>>>  		complete_all(&entity->entity_idle);
>>>  		drm_sched_fence_scheduled(s_fence, fence);
>>> +		drm_sched_start_timeout_unlocked(sched);
>>>  
>>>  		if (!IS_ERR_OR_NULL(fence)) {
>>>  			/* Drop for original kref_init of the fence */
>>
>> So, sched->ops->run_job(), is a "job inflection point" from the point of view of
>> the DRM scheduler. After that call, DRM has relinquished control of the job to the
>> firmware/hardware.
>>
>> Putting the job in the pending list, before submitting it to down to the firmware/hardware,
>> goes along with starting a timeout timer for the job. The timeout always includes
>> time for the firmware/hardware to get it prepped, as well as time for the actual
>> execution of the job (task). Thus, we want to do this:
>> 	1. Put the job in pending list. "Pending list" means "pends in hardware".
>> 	2. Start a timeout timer for the job.
>> 	3. Start executing the job/task. This usually involves giving it to firmware/hardware,
>> 	   i.e. ownership of the job/task changes to another domain. In our case this is accomplished
>> 	   by calling sched->ops->run_job().
>> Perhaps move drm_sched_start_timeout() closer to sched->ops->run_job() from above and/or increase
>> the timeout value?
> 
> I disagree. It is clear race if the timeout starts before run_job() that
> the TDR can fire before run_job() is called. The entire point of this

Then that would mean that 1) the timeout time is too short, and/or 2) the firmware/hardware
took a really long time to complete the job (from the point of view of the scheduler TDR).

> patch is to seal this race by starting the TDR after run_job() is
> called.

Once you call run_job() you're no longer in control of the job and things can
happen, like this job being returned/cancelled due to reasons out of the scheduler's
control. If you started the timeout _after_ submitting the job to the hardware,
you may be racing with what the hardware might want to do to the job as described
in the previous sentence.

The timeout timer should start before we give away the job to the hardware.
This is not that dissimilar to sending a network packet out the interface.

If you need a longer timeout time, then we can do that, but starting the timeout
after giving away the job to the hardware is a no-go.

-- 
Regards,
Luben


  reply	other threads:[~2023-09-21  3:35 UTC|newest]

Thread overview: 53+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-09-12  2:16 [Intel-xe] [PATCH v3 00/13] DRM scheduler changes for Xe Matthew Brost
2023-09-12  2:16 ` [Intel-xe] [PATCH v3 01/13] drm/sched: Add drm_sched_submit_* helpers Matthew Brost
2023-09-12  2:16 ` [Intel-xe] [PATCH v3 02/13] drm/sched: Convert drm scheduler to use a work queue rather than kthread Matthew Brost
2023-09-12  7:29   ` Boris Brezillon
2023-09-12 15:02     ` Matthew Brost
2023-09-14  3:41       ` Luben Tuikov
2023-09-14  3:35   ` Luben Tuikov
2023-09-16 17:07   ` Danilo Krummrich
2023-09-12  2:16 ` [Intel-xe] [PATCH v3 03/13] drm/sched: Move schedule policy to scheduler / entity Matthew Brost
2023-09-12  7:37   ` Boris Brezillon
2023-09-12 15:14     ` Matthew Brost
2023-09-12 14:11   ` kernel test robot
2023-09-12 15:17     ` Matthew Brost
2023-09-14  4:18   ` Luben Tuikov
2023-09-14  4:23     ` Luben Tuikov
2023-09-14 15:48       ` Matthew Brost
2023-09-14 15:49     ` Matthew Brost
2023-09-12  2:16 ` [Intel-xe] [PATCH v3 04/13] drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy Matthew Brost
2023-09-13 12:30   ` kernel test robot
2023-09-12  2:16 ` [Intel-xe] [PATCH v3 05/13] drm/sched: Split free_job into own work item Matthew Brost
2023-09-12  8:08   ` Boris Brezillon
2023-09-12 14:37     ` Matthew Brost
2023-09-12 14:53       ` Boris Brezillon
2023-09-12 14:55         ` Matthew Brost
2023-09-12  2:16 ` [Intel-xe] [PATCH v3 06/13] drm/sched: Add generic scheduler message interface Matthew Brost
2023-09-12  8:23   ` Boris Brezillon
2023-09-12 14:50     ` Matthew Brost
2023-09-12  2:16 ` [Intel-xe] [PATCH v3 07/13] drm/sched: Add drm_sched_start_timeout_unlocked helper Matthew Brost
2023-09-12  2:16 ` [Intel-xe] [PATCH v3 08/13] drm/sched: Start run wq before TDR in drm_sched_start Matthew Brost
2023-09-12  2:16 ` [Intel-xe] [PATCH v3 09/13] drm/sched: Submit job before starting TDR Matthew Brost
2023-09-14  2:56   ` Luben Tuikov
2023-09-14 17:48     ` Matthew Brost
2023-09-21  3:35       ` Luben Tuikov [this message]
2023-09-12  2:16 ` [Intel-xe] [PATCH v3 10/13] drm/sched: Add helper to set TDR timeout Matthew Brost
2023-09-14  2:38   ` Luben Tuikov
2023-09-14 17:36     ` Matthew Brost
2023-09-12  2:16 ` [Intel-xe] [PATCH v3 11/13] drm/sched: Waiting for pending jobs to complete in scheduler kill Matthew Brost
2023-09-12  8:44   ` Boris Brezillon
2023-09-12  9:57   ` Christian König
2023-09-12 14:47     ` Matthew Brost
2023-09-16 17:52       ` Danilo Krummrich
2023-09-18 11:03         ` Christian König
2023-09-18 14:57           ` Danilo Krummrich
2023-09-19  5:55             ` Christian König
2023-09-12 10:28   ` Boris Brezillon
2023-09-12 14:54     ` Matthew Brost
2023-09-12  2:16 ` [Intel-xe] [PATCH v3 12/13] drm/sched/doc: Add Entity teardown documentaion Matthew Brost
2023-09-13 15:04   ` Christian König
2023-09-14  2:06   ` Luben Tuikov
2023-09-16 18:06   ` Danilo Krummrich
2023-09-12  2:16 ` [Intel-xe] [PATCH v3 13/13] drm/sched: Update maintainers of GPU scheduler Matthew Brost
2023-09-12  2:20 ` [Intel-xe] ✗ CI.Patch_applied: failure for DRM scheduler changes for Xe (rev5) Patchwork
2023-09-14  1:45 ` [Intel-xe] [PATCH v3 00/13] DRM scheduler changes for Xe Luben Tuikov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=8e5eab14-9e55-42c9-b6ea-02fcc591266d@amd.com \
    --to=luben.tuikov@amd.com \
    --cc=Liviu.Dudau@arm.com \
    --cc=boris.brezillon@collabora.com \
    --cc=christian.koenig@amd.com \
    --cc=donald.robson@imgtec.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=faith.ekstrand@collabora.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=ketil.johnsen@arm.com \
    --cc=lina@asahilina.net \
    --cc=matthew.brost@intel.com \
    --cc=mcanal@igalia.com \
    --cc=robdclark@chromium.org \
    --cc=sarah.walker@imgtec.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox